My operating system: Windows 7 64-bit (but I would like the program to work on the 32-bit and 64-bit versions of Windows XP, Vista & 7; this isn't a necessity, but I would like it).
I'm going to briefly describe what I need here, and then below, I will describe IN DETAIL what I need.
I need a multi-threaded, proxy-supporting, [url removed, login to view]-supporting Google Adwords data scraper and a Google SERP scraper (not the search results themselves, just the number of results reported for a given query). It needs to be able to do this with HUGE input files (1,000,000 lines or fewer).
It is going to have to scrape the following from Google Adwords Keyword Tool - Traffic Estimator with United States set as the default country & English as the default language:
Local Monthly Searches and Estimated Cost Per Click (CPC).
It is going to have to scrape the following from the Google SERP page itself:
The number of results for any given query (it appears directly beneath the search bar after searching a query as: "About [insert number of results here] results" )
In Detail -
In the user interface, I need to be able to select 2 .txt files.
The first file should be my input queries, which will be a line separated list of keywords/phrases. The program needs to be able to handle input files of 1,000,000 lines or smaller.
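A minimal sketch of how an input file of this size could be read one line at a time, so the full 1,000,000 lines never sit in memory at once. The function name `iter_keywords` and the UTF-8 default encoding are assumptions for illustration, not part of the requirements.

```python
from typing import Iterator


def iter_keywords(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield non-empty keyword phrases one at a time from a line-separated file."""
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:  # the file object is iterated lazily, line by line
            phrase = line.strip()
            if phrase:  # skip blank lines
                yield phrase
```

Because this is a generator, worker threads can pull the next phrase only when they are ready for it.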
The second file should be a list of line separated proxies, in this format:
Also from the user interface, I need to be able to input my username & password that will be used to log into the Google Adwords Keyword Tool area.
I should also be able to select the number of threads to work with from the user interface.
Lastly, from the user interface, I need to be able to input my username & password for my account at [url removed, login to view] ([url removed, login to view] is an automated captcha solving service).
The program has to visit (in the background, I don't want/need to see any of this actually happening) the Google Adwords Keyword Tool (this URL: [url removed, login to view] ).
It then needs to log in using the details that were provided in the UI.
The program then needs to navigate to the "Traffic Estimator" part of the Google Adwords Tool. You do this by clicking on "Traffic Estimator" which is under the "Tools" heading on the left hand side of the screen directly beneath "Keyword Tool".
Once at the "Traffic Estimator" screen, it has to select United States as the default country location and English as the default language (it must ALWAYS make sure those two things are selected).
Then it has to take the input phrase it is currently working with from the input file and input it into the search box (labeled: Word or phrase (one per line) ) in three variations:
1. The phrase as it was in the input file (I.E. phrase here )
2. That same phrase in brackets (I.E. [phrase here] )
3. That same phrase inside quotation marks (I.E. "phrase here" ).
So it will look like this in the search box:
Then it should click "estimate"
Once the data is returned, it needs to scrape the following for each of the three phrases:
Local Monthly Searches & Estimated Cost Per Click (labeled: Estimated Avg. CPC )
The scraped data for the phrase without quotes or brackets needs to be stored/remembered as "Searches (Broad)" & "Adwords CPC (Broad)"
The scraped data for the phrase in brackets [] needs to be stored/remembered as "Searches (Exact)" & "Adwords CPC (Exact)"
The scraped data for the phrase in quotations "", needs to be stored/remembered as "Searches (Phrase)" & "Adwords CPC (Phrase)"
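The three variations and the labels they are stored under could be built roughly as follows. The helper names `match_type_variations` and `search_box_text` are hypothetical; the sketch only assumes that the search box takes one entry per line, as its label says.

```python
from typing import Dict


def match_type_variations(phrase: str) -> Dict[str, str]:
    """Map each match-type label to the text entered for it."""
    return {
        "Broad": phrase,            # phrase as it was in the input file
        "Exact": f"[{phrase}]",     # phrase in brackets
        "Phrase": f'"{phrase}"',    # phrase in quotation marks
    }


def search_box_text(phrase: str) -> str:
    """Join the three variations, one per line, for the search box."""
    return "\n".join(match_type_variations(phrase).values())
```

Keeping the label next to each variation makes it straightforward to file the scraped numbers under "Searches (Broad)", "Searches (Exact)", and "Searches (Phrase)" afterwards.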
The program then needs to visit [url removed, login to view] itself, and search that same input phrase as it was taken from the input file two different ways.
The first is that phrase in quotation marks (I.E. - "phrase here" ). After it searches the input phrase at [url removed, login to view] inside quotation marks, it has to scrape the number of returned results for that phrase. This number appears directly beneath the search bar after searching for something. This data should be stored/remembered as "SEO Comp".
The second way it has to search the input phrase, is like this:
It needs to again scrape the number of returned results for that query. This data should be stored/remembered as "SEO Title Comp".
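Extracting the number from the "About [number] results" line could look something like this sketch. It assumes the text has already been pulled out of the page and uses US-English formatting (comma thousands separators); the function name `parse_result_count` is hypothetical.

```python
import re
from typing import Optional

# Matches "About 1,230,000 results" as well as "1 result" (no "About").
_RESULT_COUNT = re.compile(r"(?:About\s+)?(\d[\d,]*)\s+results?", re.IGNORECASE)


def parse_result_count(text: str) -> Optional[int]:
    """Return the result count as an integer, or None if no count is found."""
    match = _RESULT_COUNT.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

Returning `None` instead of raising lets the caller decide whether a missing count means a blocked request (retry with another proxy) or a query with no results.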
After this has been done for the phrase it is working with, it needs to export that data to a CSV file in real time and save the file. This way, it can remove that data from the program's working memory so the program doesn't continuously use more and more memory trying to "remember" all of the data it has scraped previously.
When it does it for the next phrase, it needs to simply append that newly scraped data to the previously saved file.
It needs to export the data into the CSV file in this format:
keyword phrase as taken from input file,searches (broad),searches (phrase),searches (exact),adwords cpc (broad),adwords cpc (phrase),adwords cpc (exact)
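Appending one row per finished phrase could be sketched as below. The function name `append_row` and the module-level lock are assumptions; the lock is there because several threads would be writing to the same file.

```python
import csv
import threading
from typing import Sequence

_write_lock = threading.Lock()  # one writer at a time across all threads


def append_row(csv_path: str, row: Sequence[object]) -> None:
    """Append a single row to the CSV file and close it again immediately."""
    with _write_lock:
        # Mode "a" creates the file if it is missing and otherwise appends.
        with open(csv_path, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(row)
```

The `csv` module quotes fields that contain commas, so a keyword phrase with a comma in it does not shift the columns. Opening and closing the file for every row is slower than keeping it open, but nothing scraped so far is lost if the program is stopped.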
Occasionally, when querying lots of things in the Adwords keyword tool, it will ask you to solve a captcha. The program should "freeze" and solve the captcha using the [url removed, login to view] credentials given in the user interface in conjunction with the [url removed, login to view] API (it can be downloaded for free at [url removed, login to view] after signing up, which is completely free).
The program should change proxies after EVERY query. This goes for both the queries at the Adwords tool AND the queries at [url removed, login to view] directly when getting the number of returned results.
It should check to make sure the proxy worked, and if it didn't, it should try that same query with another proxy, and do this until it works.
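The rotate-and-retry behaviour could be sketched like this. `ProxyPool`, `fetch_with_rotation`, and the `fetch(query, proxy)` callable are all hypothetical names; the sketch assumes that a failed proxy surfaces as an `OSError` (which covers Python's built-in connection and timeout errors). The optional `max_attempts` cap is an added safety assumption; leaving it as `None` retries until a proxy works, as described above.

```python
import itertools
import threading
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


class ProxyPool:
    """Hands out proxies round-robin; safe to share between threads."""

    def __init__(self, proxies: Iterable[str]) -> None:
        proxy_list = list(proxies)
        if not proxy_list:
            raise ValueError("proxy list is empty")
        self._cycle = itertools.cycle(proxy_list)
        self._lock = threading.Lock()

    def next_proxy(self) -> str:
        with self._lock:
            return next(self._cycle)


def fetch_with_rotation(
    query: str,
    pool: ProxyPool,
    fetch: Callable[[str, str], T],
    max_attempts: Optional[int] = None,
) -> T:
    """Run fetch(query, proxy) with a new proxy on every attempt until one works."""
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        proxy = pool.next_proxy()  # a different proxy for every query and retry
        attempts += 1
        try:
            return fetch(query, proxy)
        except OSError:
            continue  # proxy failed: try the same query with the next proxy
    raise RuntimeError(f"no working proxy found for {query!r}")
```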
Everything that has been described above, has to happen SIMULTANEOUSLY using the number of selected threads in the UI.
So if I select 10 threads, it should simultaneously be working with 10 different input phrases from the input file AT THE SAME TIME. It should always have 10 live threads, meaning I don't want it to complete the current 10 threads and then start a new 10 threads. It needs to always be working with 10 threads, so if it finishes one thread and is down to nine threads, it should start working with another query to make it 10 again.
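One way to keep a fixed number of threads busy at all times is a set of long-lived workers pulling from a shared queue, sketched below. `run_workers` and `handle` are hypothetical names; `handle` stands in for the whole per-phrase job (Adwords lookup, Google searches, CSV append). The bounded queue is an assumption made so that a 1,000,000 line input is fed in gradually rather than loaded up front.

```python
import logging
import queue
import threading
from typing import Callable, Iterable


def run_workers(
    phrases: Iterable[str],
    handle: Callable[[str], None],
    thread_count: int,
) -> None:
    """Process every phrase using exactly thread_count long-lived worker threads."""
    tasks: queue.Queue = queue.Queue(maxsize=thread_count * 2)
    stop = object()  # sentinel telling a worker to exit

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is stop:
                return
            try:
                handle(item)  # as soon as this returns, the worker takes the next phrase
            except Exception:
                logging.exception("failed on phrase %r", item)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for phrase in phrases:
        tasks.put(phrase)  # blocks while the queue is full
    for _ in threads:
        tasks.put(stop)  # one sentinel per worker
    for thread in threads:
        thread.join()
```

Each worker picks up a new phrase the moment it finishes the previous one, so the number of phrases in progress stays at `thread_count` until the input runs out, rather than working in batches.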
IF YOU HAVE ANY QUESTIONS OR DON'T UNDERSTAND SOMETHING, PLEASE ASK BEFORE MAKING A BID. DON'T JUST PLACE A BID EXPECTING TO FIGURE THINGS OUT LATER.