I have a web site with around 2000 copyrighted documents posted publicly on it. I keep finding that these documents are being reposted on the following web site ([url removed, login to view]) without my permission.
I would like you to create an automated search script that will perform queries of "signatures" of our files on [url removed, login to view] and will automatically generate a text file that lists the URLs on [url removed, login to view] that are duplicates of the documents on my site.
The script you write will need to take about 4 random text samples from each of my documents/HTML pages (around 8-10 words per sample). These will be the "signatures".
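To illustrate the sampling step, here is a minimal Python sketch of one possible reading of this requirement. The function name `sample_signatures`, its parameters, and the crude regex used to strip HTML tags are all assumptions made for illustration; a real script would likely need a proper HTML parser and might want to avoid overlapping samples.

```python
import random
import re


def sample_signatures(text, count=4, min_words=8, max_words=10, rng=None):
    """Pick `count` random runs of 8-10 consecutive words from `text`.

    Hypothetical helper: HTML tags are removed with a simple regex, which
    is only an approximation of real HTML-to-text conversion.
    """
    if rng is None:
        rng = random.Random()
    # Drop anything that looks like a tag, then split on whitespace.
    words = re.sub(r"<[^>]+>", " ", text).split()
    if len(words) < min_words:
        return []  # document too short to yield a usable signature
    signatures = []
    for _ in range(count):
        length = rng.randint(min_words, min(max_words, len(words)))
        start = rng.randint(0, len(words) - length)
        signatures.append(" ".join(words[start:start + length]))
    return signatures
```

Each returned string would then be wrapped in quotation marks before being submitted as a search query. Samples may overlap or repeat in short documents, so a stricter version could reject duplicates.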
Then the script will need to perform a search for each "signature" (in quotation marks) using the web site's search form located at the top right corner at [url removed, login to view]. If all 4 signatures match exactly, then the search script can assign a confidence level of 100% that the document has been copied onto that site; 3 matches = 75%, 2 matches = 50%, and 1 match = 25%.
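The scoring rule can be sketched as follows. This is only an illustration under stated assumptions: the search itself is left out, and `confidence_by_url` is a hypothetical helper that takes, for each signature, the set of result URLs that the site's search returned for it.

```python
from collections import Counter


def confidence_by_url(hits_per_signature):
    """Map each result URL to a confidence percentage.

    `hits_per_signature` is assumed to be a list with one entry per
    signature, each entry being the URLs returned for that signature.
    Confidence = (signatures that matched the URL) / (signatures searched).
    """
    total = len(hits_per_signature)
    counts = Counter()
    for urls in hits_per_signature:
        counts.update(set(urls))  # count each URL at most once per signature
    return {url: 100 * matched // total for url, matched in counts.items()}


# Example with 4 signatures and two hypothetical result URLs.
hits = [
    {"http://copy.example/a", "http://copy.example/b"},
    {"http://copy.example/a"},
    {"http://copy.example/a", "http://copy.example/b"},
    {"http://copy.example/a"},
]
scores = confidence_by_url(hits)
# "a" matched 4 of 4 signatures (100), "b" matched 2 of 4 (50).
```

Dividing by the number of signatures actually searched keeps the percentages meaningful even if a short document yields fewer than 4 samples.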
The script will need to produce a list giving each matching URL, its confidence level, and the source URL from my site (the source of the signature).
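One possible layout for the output file is a tab-separated text file with one line per match. The sketch below is an assumption about the format, not a requirement; the column names and the `write_report` helper are hypothetical.

```python
import csv
import io


def write_report(rows, out):
    """Write (copied_url, confidence, source_url) rows as tab-separated text.

    `out` is any writable text stream. Rows are listed with the highest
    confidence first so the strongest matches appear at the top.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["copied_url", "confidence_percent", "source_url"])
    for copied_url, confidence, source_url in sorted(rows, key=lambda r: -r[1]):
        writer.writerow([copied_url, confidence, source_url])


# Example using an in-memory buffer and made-up URLs.
buf = io.StringIO()
write_report(
    [
        ("http://copy.example/1", 50, "http://mine.example/a"),
        ("http://copy.example/2", 100, "http://mine.example/b"),
    ],
    buf,
)
report = buf.getvalue()
```

To write to disk instead, the same function could be called with a file opened via `open("report.txt", "w", newline="")`.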
I will then be able to take the text file the script produces and email it to that company so that they can remove the documents from their site.
If you are interested in this project, please note I am very price sensitive, so please bid accordingly. Please include the word "Excel" in your bid comments so that I know you read these requirements and speak English. If someone bids the right amount, I will very likely end bidding within 12-24 hours from now.