I need to scrape search results data from google scholar and avoid getting blocked. There is a similar python based tutorial here [login to view URL] so this should be a quick and easy project.
This project actually has 2 parts.
The first part is collecting historic data from 2017-2021 for about 45 different searches. Here is an example search - [login to view URL]
I will provide a txt file with a list of the exact search parameters on each line which will look similar to:
"weight loss" OR "obesity"
"Lymphoma" OR "Lymph" OR "Lymphatic"
"Eye" OR "Cataract" OR "Retina" OR "Glaucoma" OR "myopia" OR "hyperopia"
Whatever solution you create should allow me to add to or modify the specific searches either in a text file or directly in the code.
IMPORTANT: The searches need to be filtered to only search the abstracts. The default is to search the entire document. The searches for the first part of this project are historical (from 2017-2021). Therefore, these searches will result in over 10,000 results per search. I need to avoid the limit of 1000 results per search result. I could filter by year but the results will probably still be over 1000 for any given year. So you need to find a solution to capture all historical data. See attached document with further explanation.
The data will need to be saved to a mysql table and it will be appended on a weekly or monthly basis with the second part of the project.
Again, the first part is to collect the historical data and save it in a mysql database table and the second part is to append that table with new results.
When sorting the google scholar results page by date, the results will be defined by the "XX days ago - " text which is seen when you sort results by date (see [login to view URL],47&as_vis=1&q=%22weight+loss%22+OR+%22obesity%22&scisbd=1). For this part, we would simply need the script to only grab results that meet the date criteria (for example if plan to run the script once a week we would need to scrape specific results tothat include one of the following texts: "1 day ago", "2 days ago", "3 days ago", "4 days ago", "5 days ago", "6 days ago", or "7 days ago")
Te attached pdf provides further explanation of the needed extraction.
The final code should capture the data and store the data in a predefined mysql database table. The code must be delivered as either a fully functioning .py file or .php file. This file will be run weekly as a cron.
You will need to work from your own server environment. When the project is complete you can send me a video showing the functioning script and sample output. Be sure to test the script running many searches because the script needs to avoid being blocked by google. You can incorporate a proxy if needed and I will purchase accordingly (the proxy cost should not be more than $5-$10/month).
I will create 2 milestones. The first milestone will be 50% of the total and will be released upon confirmation of the functioning script (video and sample output).
The final milestone will be released after the script has been delivered and is functioning on my server. When delivering the script you can leave the server section with dummy data. I will modify the script with the server details for the storage of the collected data.
hey, I read what you need. I have scraped from google scholar for a freelancer project. (can give you reference to that project as well) I am interested to talk more details in chat.