Job: develop spiders for multiple Dutch sites.
We are looking for a professional web company to develop spiders for multiple sites.
We want to extract data from a few sites in the Netherlands. We will insert only the basic data from these sites into our database and display it on our website with a link to the original content.
Displaying spidered content in this fashion is legal in the Netherlands.
We will provide you with a URL and a mapping document that shows which positions in the site hold which data, and to which XML data field in the output XML file each should be mapped.
We want you to develop code that extracts several data fields from spidered web pages and returns them in an XML output file. This code has to be written in Java. The code that grabs the data from the pages should preferably be written in XSLT, because XSLT allows easy adaptation to changes in the layout of the spidered web pages. The SAXON non-validating XSLT processor, version 8.7.3, should be used.
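To illustrate the kind of Java and XSLT combination we have in mind, here is a minimal sketch. It is not our actual mapping or output format: the class name XsltExtractor, the td class names and the car/brand/price elements are made up for the example. It uses the standard JAXP API (javax.xml.transform), so it runs with whatever XSLT processor is on the classpath; SAXON can normally be selected by setting the system property javax.xml.transform.TransformerFactory to net.sf.saxon.TransformerFactoryImpl. The sketch also assumes the page has already been converted to well-formed XHTML, which real pages often are not.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltExtractor {
    // Hypothetical stylesheet: maps positions in the page to output XML fields.
    // When the page layout changes, only these XPath expressions need to change.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<car>"
      + "<brand><xsl:value-of select=\"normalize-space(//td[@class='brand'])\"/></brand>"
      + "<price><xsl:value-of select=\"normalize-space(//td[@class='price'])\"/></price>"
      + "</car>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to a well-formed XHTML page and returns the output XML. */
    static String extract(String xhtml, String stylesheet) throws TransformerException {
        // JAXP lookup: picks up SAXON if it is configured, else the JDK default.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Made-up page fragment standing in for a spidered web page.
        String page = "<html><body><table><tr>"
            + "<td class='brand'> Volvo </td><td class='price'>12500</td>"
            + "</tr></table></body></html>";
        System.out.println(extract(page, STYLESHEET));
    }
}
```

In a real spider the stylesheet would live in its own .xsl file per site, so that a layout change can be handled without recompiling the Java code.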
The extracted data has to be delivered in XML format. We will provide you with the format.
There are about 15 data fields to extract per page. Some navigation of the spidered site is required: the spider has to follow links to next pages and walk along lists of data objects whose data fields should be extracted. For example, if a web page contains a list of cars, then each car has to be selected automatically and the specific data fields of that car have to be extracted and converted to XML data fields.
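The navigation described above could look roughly like the sketch below. It is only an illustration under assumptions: the class ListWalker, the a[@class='item'] and a[@class='next'] link markers, and the fetcher function (URL in, well-formed XHTML out) are hypothetical, and resolving relative URLs and handling HTTP errors are left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ListWalker {
    // Hypothetical fetcher: takes a URL and returns the page as well-formed XHTML.
    private final Function<String, String> fetcher;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    ListWalker(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Walks the list pages from startUrl and collects the URL of every data object. */
    List<String> collectDetailUrls(String startUrl) throws XPathExpressionException {
        List<String> detailUrls = new ArrayList<>();
        Set<String> visited = new HashSet<>(); // guards against "next" links that loop
        String url = startUrl;
        while (url != null && !url.isEmpty() && visited.add(url)) {
            String page = fetcher.apply(url);
            if (page == null) {
                break; // page could not be fetched; a real spider would log this
            }
            // Parse the page once, then run both queries against the parsed document.
            Node doc = (Node) xpath.evaluate("/",
                new InputSource(new StringReader(page)), XPathConstants.NODE);
            NodeList links = (NodeList) xpath.evaluate(
                "//a[@class='item']/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < links.getLength(); i++) {
                detailUrls.add(links.item(i).getNodeValue());
            }
            // Yields an empty string when there is no next page, which ends the loop.
            url = xpath.evaluate("string(//a[@class='next']/@href)", doc);
        }
        return detailUrls;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // In-memory stand-in for a site with two list pages.
        Map<String, String> site = new HashMap<>();
        site.put("list1", "<html><body><a class='item' href='car1'>A</a>"
            + "<a class='item' href='car2'>B</a><a class='next' href='list2'>next</a></body></html>");
        site.put("list2", "<html><body><a class='item' href='car3'>C</a></body></html>");
        System.out.println(new ListWalker(site::get).collectDetailUrls("list1"));
    }
}
```

Each collected URL would then be fetched and passed to the extraction step for that site.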
The software has to write a log file about the spidered content so that we can check whether the spider is working properly. Errors about data fields that contain invalid data should also be written to this log file, for example when an amount is expected and no amount can be found.
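As an illustration of the logging we expect, here is a minimal sketch using java.util.logging. The class name SpiderLog, the file name spider.log, the example URL and the message wording are assumptions for the example, not a required format. It also assumes amounts have already been normalised to a plain decimal notation such as 12500.00; Dutch notation like 12.500,00 would need to be converted first.

```java
import java.io.IOException;
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

final class SpiderLog {
    static final Logger LOG = Logger.getLogger("spider");

    /** Returns a problem description if the value is not a valid amount, else empty. */
    static Optional<String> checkAmount(String fieldName, String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.of(fieldName + ": amount expected but no amount found");
        }
        try {
            new BigDecimal(raw.trim()); // assumption: plain decimal notation
            return Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.of(fieldName + ": amount expected but found '" + raw + "'");
        }
    }

    /** Logs one line per invalid field plus a summary line; returns the error count. */
    static int logPage(String url, Map<String, String> amountFields) {
        int errors = 0;
        for (Map.Entry<String, String> field : amountFields.entrySet()) {
            Optional<String> problem = checkAmount(field.getKey(), field.getValue());
            if (problem.isPresent()) {
                LOG.warning(url + " " + problem.get());
                errors++;
            }
        }
        LOG.info(url + " spidered, " + amountFields.size()
            + " amount fields checked, " + errors + " invalid");
        return errors;
    }

    public static void main(String[] args) throws IOException {
        // Append to spider.log in the working directory (hypothetical file name).
        FileHandler handler = new FileHandler("spider.log", true);
        handler.setFormatter(new SimpleFormatter());
        LOG.addHandler(handler);

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("price", "12500.00");
        fields.put("mileage", ""); // missing value, should be logged as an error
        logPage("http://example.com/car1", fields);
    }
}
```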
We will start with one website to be spidered. If the job is done according to specifications, on time and within budget, more sites will follow in additional jobs.