We need an application program that will data mine a web site and generate a local .csv file. We have copyright permission to use material from this web site.
We will have a list of search phrases in a local .txt file. The application will need to construct a URL that includes these search phrases plus some generated index numbers, send out that URL, and receive back an index web page of 0 to 20 items. If zero items are returned, the application should go on to the next search phrase.
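As a rough illustration of the URL construction step, here is a minimal Python sketch. The base URL, the parameter names `q` and `start`, and the idea that the index number is a page offset are all assumptions made for the example; the real URL format is the one described in the attached document.

```python
from urllib.parse import urlencode

# Hypothetical URL layout; the real one is defined in the attached document.
BASE_URL = "http://www.example.com/search"
PAGE_SIZE = 20  # an index page holds 0 to 20 items


def load_phrases(lines):
    """Return the non-empty, stripped search phrases from an iterable of lines."""
    return [line.strip() for line in lines if line.strip()]


def build_index_url(phrase, page):
    """Build the URL of one index page for a search phrase.

    `page` is a zero-based page counter; the generated index number is
    assumed here to be the offset of the first item on that page.
    """
    query = {"q": phrase, "start": page * PAGE_SIZE}
    return BASE_URL + "?" + urlencode(query)
```

In use, the phrases would be read with something like `load_phrases(open("phrases.txt"))` (file name assumed), and the page counter would be advanced until an index page comes back with zero items.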
The application will then need to scan that index page for URLs that point to specific web pages. These specific web pages’ URLs will be sent out and the specific web pages received. These specific pages must then be data mined for 4 variable-length text/number items.
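The index page scan could be sketched as follows, using only the Python standard library. The `/item/` path fragment used to recognize links to specific pages is a placeholder assumption; the actual pattern would come from the index page format in the attached document.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(index_html, base_url, must_contain="/item/"):
    """Return absolute URLs of specific pages found on an index page.

    `must_contain` is a hypothetical marker that separates item links
    from navigation links.
    """
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    return [url for url in collector.links if must_contain in url]
```

An empty result list corresponds to the "zero items" case, where the application moves on to the next search phrase.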
There are definite search points that delineate these items. These number/text items need to be put into a comma-delimited file, Excel .csv compatible, along with the specific page’s file name (from the URL). The search phrase as well as a couple of other text/number items also need to be inserted into fields of each record. There should be one specific page's data per record. All HTML formatting (except table coding) and hyperlinks in the text items must be preserved and inserted in the .csv file. All quote marks need to be handled correctly, and all hard returns and other possible illegal characters need to be appropriately treated so that the text will be a legal comma-delimited file.
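The extraction and .csv steps might look like the sketch below. The marker strings passed to `extract_between` are hypothetical (the real search points are in the attached document), and replacing hard returns with a single space is just one possible treatment. Python's `csv` module, with its default Excel dialect, doubles embedded quote marks and quotes any field that contains a comma.

```python
import csv
import posixpath
import re
from urllib.parse import urlparse

# Matches opening and closing table tags only; other HTML is left alone.
TABLE_TAG = re.compile(r"</?(?:table|thead|tbody|tfoot|tr|th|td)\b[^>]*>", re.IGNORECASE)


def extract_between(page, start_marker, end_marker):
    """Return the text between two search points, or "" if either is missing."""
    start = page.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = page.find(end_marker, start)
    return "" if end == -1 else page[start:end]


def clean_field(text):
    """Drop table coding and replace hard returns with a single space."""
    text = TABLE_TAG.sub("", text)
    return re.sub(r"[\r\n]+", " ", text).strip()


def page_file_name(url):
    """Return the file name part of a specific page's URL."""
    return posixpath.basename(urlparse(url).path)


def write_record(writer, url, phrase, fields):
    """Write one record: file name, search phrase, then the cleaned items."""
    writer.writerow([page_file_name(url), phrase] + [clean_field(f) for f in fields])
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""`, for example `csv.writer(open("out.csv", "w", newline=""))` (file name assumed), so that line endings are not translated twice on Windows.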
There can be between 0 and 2,000 records generated from each search phrase. There should be a timer such that the application is a good internet citizen and only requests web pages from the site at a maximum rate of one page every 5 seconds.
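One simple way to enforce the request rate is a small throttle object that is called before every page request. This is only a sketch; the class name `Throttle` and its injectable `clock` and `sleep` parameters are illustrative choices, not requirements.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each index page and each specific page request keeps the application within the limit. At this rate, fetching 2,000 specific pages for one search phrase takes at least 10,000 seconds, or roughly 2.8 hours, not counting the index pages.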
Attached is an MS Word document that describes the URL's format, the Index page format, and the Specific Web Page's format with the text search points for the fields' data.
1) Complete and fully-functional working program(s) in executable form as well as complete source code of all work done.
2) Deliverables must be in ready-to-run condition, as follows (depending on the nature of the deliverables):
a) For web sites or other server-side deliverables intended to only ever exist in one place in the Buyer's environment--Deliverables must be installed by the Seller in ready-to-run condition in the Buyer's environment.
b) For all others including desktop software or software the buyer intends to distribute: A software installation package that will install the software in ready-to-run condition on the platform(s) specified in this bid request.
3) All deliverables will be considered "work made for hire" under U.S. Copyright law. Buyer will receive exclusive and complete copyrights to all work purchased. (No GPL, GNU, 3rd party components, etc. unless all copyright ramifications are explained AND AGREED TO by the buyer on the site per the coder's Seller Legal Agreement).
The application must run on an XP Pro, MS Internet Explorer V6 platform with a broadband connection to the internet. The application source code needs to be well commented so that I can make small changes and recompile. Java, Perl, or any other interpreted language would be preferred. You are not responsible for teaching me the language. You do need to provide the tools and instructions on how to compile (if needed) and run the application. The intended compiler needs to be disclosed in the bid, as well as methods of acquiring the compiler (e.g., buy from Microsoft). Compiler cost will be taken into account in awarding the bids. I have an MS C compiler available. All drivers and compiling instructions must also be included with the design. I must be able to compile and run the application before the project is declared finished.