As part of a software project, we automatically retrieve (by scraping) all PDFs available on a set of institutional websites (which publish public data only). However, because the sites vary widely, in some cases we miss documents. A diagnosis has allowed us to identify these failures, and we would like to correct them.
At this stage, we do not have a precise typology of the causes of these failures. We know that on some sites, PDFs are only reachable through a search engine. In other cases, the site in question is a single-page application (SPA), which our scraper does not handle well.
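For context, our current approach is plain link-following: fetch a page's raw HTML, collect links ending in `.pdf`, and recurse. This explains the SPA failure mode above: links injected by JavaScript never appear in the static HTML, so this kind of crawler cannot see them. A minimal sketch of the link-collection step, using only the standard library (function and class names are ours, for illustration; our production code uses Scrapy):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href targets from <a> tags in a raw HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_pdf_links(html, base_url):
    """Return absolute URLs of all links pointing to .pdf files.

    Only sees links present in the static HTML -- links rendered
    client-side by JavaScript (SPAs) are invisible to this approach.
    """
    parser = LinkCollector()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    # Ignore query strings and case when testing the extension.
    return [u for u in absolute if u.lower().split("?")[0].endswith(".pdf")]
```

A spider built on this logic would then enqueue the non-PDF links it finds and repeat, which is roughly what Scrapy's link extractors do for us.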
The goal of the mission is twofold: on the one hand, to recover all PDFs available on the sites we will specify to you; on the other hand, to provide us with the code you used to carry out this recovery. We will then integrate this code into our code base to scrape the websites weekly.
Please see the attached file for technical instructions.
You must be proficient in Python, the Scrapy package, and Git.
Technical support will be provided if needed.
To evaluate your work, we will use two indicators. The first is the number of PDFs scraped: after the mission, this number must have increased significantly compared with the count before the mission. We have not been able to count the PDF files on each site manually, so we do not know the exact target.
It is important to understand that we want to scrape all PDFs from the indicated sites, BUT we are especially interested in one particular type of PDF: administrative documents.
While we were not able to count the PDF files on each site manually, we were able to count the administrative documents, so we do know that target. Of course, we would not expect to see exactly the same number after the mission; we just want to be in the same order of magnitude.
Beyond these raw numbers, we will randomly select a few administrative-document URLs and make sure you have actually scraped them.
One very important point: administrative documents are sometimes stored on one or several subdomains. Of course, we want these documents too. We will tell you which subdomains to explore if necessary.
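Because of this, any domain filter in the crawler has to accept subdomains as well as the main domain. A small stdlib-only helper for that check is sketched below (the function name is ours; in a Scrapy spider, listing the bare domain in `allowed_domains` has a similar effect, since Scrapy's offsite filtering also matches subdomains):

```python
from urllib.parse import urlsplit


def in_scope(url, allowed_domain):
    """True if the URL's host is the allowed domain itself or any of
    its subdomains (e.g. docs.example.org for example.org).

    Comparing against "." + allowed_domain avoids false positives on
    hosts that merely end with the same characters (notexample.org).
    """
    host = urlsplit(url).hostname or ""
    return host == allowed_domain or host.endswith("." + allowed_domain)
```

This is the kind of check that decides whether a discovered link is followed or dropped during the crawl.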
Since we do not know the exact causes of the problems, we do not know how long it will take you to fix them. This is why we want to hire you initially for a single working day; you should therefore quote a price for that single day. We will submit enough problematic URLs to fill your day. The number of cases you treat will serve as a reference for renewing the mission, under terms yet to be specified.
In fact, we estimate the total number of sites to be corrected at several hundred. If this mission is a success, it could therefore lead to many more for you. This should lead you to consider this day of work for us (and the cost of entry it represents) as an investment in the future (provided, of course, that you want to take on this type of mission again).
We hope that this mission will mark the beginning of a long and fruitful collaboration with us!