Hello
I need a tool developed that can extract specific data from websites.
Objectives:
The goal is for the tool to import an XML sitemap, crawl the links it lists, check each page for several pieces of information, extract them, and save the results from all pages into a CSV file. This means that the CSV file has the same structure as the sitemap (= the entire site) but is complemented with the results.
The "information to be extracted" is described in detail in the attached Excel file. Only look at the columns in RED colour (the rest is optional for now; we will probably do this in a second job after this first job is finished). Be aware that you must scroll far to the right inside the Excel file to see all points! Please read the red text carefully to fully understand.
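To illustrate the import and export steps, here is a minimal Python sketch, assuming the sitemap follows the standard sitemaps.org format. The function names and the `title` column are placeholders for illustration only; the real columns are the ones defined in the Excel file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_urls(xml_text):
    """Return every <loc> URL of the sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def write_results_csv(rows, fieldnames, out):
    """Write one CSV row per page, with a header row first."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up sample sitemap, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    urls = read_sitemap_urls(SAMPLE)
    # "title" is a placeholder column; it stays empty until extraction runs.
    rows = [{"url": u, "title": ""} for u in urls]
    buffer = io.StringIO()
    write_results_csv(rows, ["url", "title"], buffer)
    print(buffer.getvalue())
```

Because the rows are written in sitemap order, the CSV keeps the same structure as the sitemap, with one extra column per extracted item.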
If there is any point that you find specifically difficult to extract, let me know before bidding on the job!
If an error occurs while scraping, the tool must continue with the next task and must not stop or hang.
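As a rough sketch of what I mean by "go on with the next task" (in Python, with hypothetical names `fetch_page`, `crawl`, and an `error` column that are not part of the specification): each page is handled inside its own try/except, and every request has a timeout so that a slow server cannot hang the run.

```python
import urllib.request


def fetch_page(url, timeout=10.0):
    """Download one page; the timeout keeps a slow server from blocking the run."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch, extract):
    """Process every URL; a failure on one page is recorded, not fatal."""
    results = []
    for url in urls:
        row = {"url": url, "error": ""}
        try:
            html = fetch(url)
            row.update(extract(html))
        except Exception as exc:
            # Note the problem in the row and carry on with the next URL.
            row["error"] = f"{type(exc).__name__}: {exc}"
        results.append(row)
    return results
```

A call such as `crawl(urls, fetch_page, my_extract)` would then return one row per URL, including the failed ones, so that the CSV still covers the entire sitemap. Here `my_extract` stands for whatever function turns a page's HTML into the red-column values.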
Coding:
It can be written in any language (preferably PHP or similar, but please use your professional judgement on what makes most sense in this case!). It should run on a web server without a complicated installation.
The code must be fully compliant with the current standards of the programming language used (PHP 5.5 if written in PHP, etc.).
Interface:
There must be a simple HTML interface with an import (= upload) function for a (local) XML sitemap, plus a run button and (if possible) a stop button. After the script has run through everything, it must present me with a download link for the CSV file (or just save the file in the same directory as the script).
Deliverables:
- Tool/script that I can run on my Linux web server (written in PHP, AJAX, JavaScript, MySQL, Python, Ruby, etc.)