1. I must be able to set a starting URL from which the spider will begin crawling the [url removed, login to view] website. To save programming time and cost, I am willing to be limited to only 2 URLs if this keeps the cost to an absolute minimum, such as $100. I will provide the 2 specific categories later.
2. The spider must parse the HTML and extract the business name, city, state, zip code, telephone number, email address (if applicable), and website (if applicable) into a CSV-formatted text file. In addition, a field called "category" must be added before "business name". I will tell you later what goes in this field; it will be one of only 2 different words.
3. The spider must do a general clean-up of the data so the fields are as clean as possible. Most important are the phone numbers, which absolutely must be in the format 999-999-9999 and be totally clean (no extra characters such as semicolons, no extra digits, etc.). The spider program must also merge and purge any records with duplicate phone numbers. If duplicate telephone numbers are found, the records with the least information must be the ones that are deleted. For example, if 2 records have the same telephone number but one lists a fax number and the other doesn't, delete the one without the fax number. Addresses are less important than phone numbers and email addresses. Even if there are more than 2 business names for the same phone number, pick one at random; just make sure one record is left with that phone number.
4. The program must sort by category first, then by state.
5. The program must split the output into files of approximately 10,000 records each so they are manageable.
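
To make the clean-up, merge/purge, sort, and file-splitting requirements above concrete, here is a minimal Python sketch of one way they could work. It is only an illustration, not a required design. The field names, the helper names (normalize_phone, merge_purge, sort_records, chunk, write_chunks), and the output file naming are all assumptions. It also assumes US 10-digit numbers (a leading "1" is stripped), that "least information" means the fewest non-empty fields, and that a record with no usable phone number is dropped.

```python
import csv
import re
from typing import Optional

# Assumed column order: "category" first, then the fields listed in the requirements.
FIELDS = ["category", "business_name", "city", "state", "zip",
          "phone", "email", "website"]

def normalize_phone(raw: str) -> Optional[str]:
    """Return the number as 999-999-9999, or None if it is not 10 digits."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumption: strip a leading US country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def completeness(record: dict) -> int:
    """Count non-empty fields; used to decide which duplicate to keep."""
    return sum(1 for value in record.values() if str(value or "").strip())

def merge_purge(records: list) -> list:
    """Keep one record per phone number, preferring the most complete one."""
    best = {}
    for rec in records:
        phone = normalize_phone(rec.get("phone", ""))
        if phone is None:
            continue  # assumption: records without a usable phone are dropped
        cleaned = {key: str(value or "").strip() for key, value in rec.items()}
        cleaned["phone"] = phone
        kept = best.get(phone)
        # On a tie the first record seen is kept (an arbitrary choice).
        if kept is None or completeness(cleaned) > completeness(kept):
            best[phone] = cleaned
    return list(best.values())

def sort_records(records: list) -> list:
    """Sort by category first, then by state."""
    return sorted(records, key=lambda r: (r.get("category", ""), r.get("state", "")))

def chunk(records: list, size: int = 10000) -> list:
    """Split the records into consecutive lists of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def write_chunks(records: list, prefix: str, size: int = 10000) -> list:
    """Write each chunk to <prefix>_001.csv, <prefix>_002.csv, and so on."""
    paths = []
    for number, part in enumerate(chunk(records, size), start=1):
        path = f"{prefix}_{number:03d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            # Fields outside FIELDS (for example a fax number) are not written.
            writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(part)
        paths.append(path)
    return paths
```

In this sketch the scraped records would pass through merge_purge, then sort_records, then write_chunks, so that each output file is already clean, de-duplicated, and sorted.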
1. You must be easy to contact, either by phone or by e-mail. If by e-mail, you will be required to answer any message I send within 12 hours.
2. You must speak and write English well.
3. I do not need the source code or documentation, only the program.
4. I would like this done and delivered to me no later than March 4th.
5. You need to be available to help ensure the program delivers the data as required.