I have a Python/Django app, and I need to add the following functionality added to it (in Python):
- I want a script that will search for a given term (that I provide), and for each of the first 100 results in Google, crawl the website and look for an email address. If one is found, record it in my db.
- I have link exchange partners see ([url removed, login to view]). I need to have a Python script that crawls my partner pages and verifies that they have added a link to my site (and record the page the link is on).
- I have a db of a bunch of local businesses (35,000) and I want to add the following functionality:
- I am missing the web site for about half the businesses, I need to have a script search Google for the name of the accountant firm and find the URL for their site (if it exists). This should use some simple heuristics for each of the websites on the first page of the Google results, like is the business name in the H1 or title tag of the home page.
- I am missing email addresses for most of the businesses. I need a script to crawl the businesses web site and find an email address.
It is important that you speak very very good English, as communication is important.
If you have any questions, feel free to ask. Keep the following things in mind:
- These scripts will be run periodical (maybe daily) as part of a Django app on a Linux/Apache/MySQL machine.
- I expect the crawler you use to be multithreaded (or else these scripts will take way too long to run) and be polite to the host domain (no flooding them with requests, respect [url removed, login to view], etc).
- I expect high quality code with tests to verify. I'm a developer myself, and I will be reviewing all the code.