I need code (in Java, .NET or PHP) that will choose 10,000 random URLs from the CommonCrawl dataset ([url removed, login to view]). For each URL you need to extract:
1) the page title (from the <title> tag)
2) the most frequent anchor text used to link to that URL, excluding one-word anchor texts and URL anchor texts (anchor texts containing http:// or www)
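The two extraction rules above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class and method names (AnchorTextPicker, extractTitle, isUsable, mostFrequent) are made up for this example, the list of anchor texts is assumed to come from whatever backlink API is eventually chosen, and a real crawler would more likely use an HTML parser than a regular expression.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class AnchorTextPicker {
    // Simplified matcher for the first <title> element; a regex is an
    // assumption made to keep the sketch dependency-free.
    private static final Pattern TITLE = Pattern.compile(
            "<title(?:\\s[^>]*)?>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title text with whitespace collapsed, if a <title> tag exists.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find()
                ? Optional.of(m.group(1).trim().replaceAll("\\s+", " "))
                : Optional.empty();
    }

    // Rejects one-word anchors and URL-like anchors.
    // Assumption: "https://" and "www." are treated like the "http://" and
    // "www" cases named in the requirement.
    static boolean isUsable(String anchor) {
        String text = anchor.trim().toLowerCase(Locale.ROOT);
        if (text.contains("http://") || text.contains("https://") || text.contains("www.")) {
            return false;
        }
        return text.split("\\s+").length >= 2;
    }

    // Counts the usable anchors and returns the most frequent one.
    // Ties are broken arbitrarily.
    static Optional<String> mostFrequent(List<String> anchors) {
        Map<String, Integer> counts = new HashMap<>();
        for (String a : anchors) {
            if (isUsable(a)) {
                counts.merge(a.trim(), 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Anchor texts are compared case-sensitively here; whether "Read More" and "read more" should count as the same text is a decision for the client.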
The results should be exported to an Excel or CSV file. The file will have these columns:
URL, TITLE, ANCHOR TEXT.
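For the CSV option, a minimal Java sketch of the export might look like this. The class name CsvExport and its methods are hypothetical, and the quoting follows the common RFC 4180 convention (fields containing commas, quotes or line breaks are wrapped in double quotes, and embedded quotes are doubled), which is an assumption since the posting does not name a CSV dialect.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

// Hypothetical helper, not part of any library.
class CsvExport {
    // Quotes a field only when it contains a comma, quote or line break.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Builds one line with the three requested columns.
    static String row(String url, String title, String anchorText) {
        return String.join(",", escape(url), escape(title), escape(anchorText));
    }

    // Writes the header followed by one line per {url, title, anchorText} triple.
    static void write(Writer out, List<String[]> rows) throws IOException {
        out.write("URL,TITLE,ANCHOR TEXT\r\n");
        for (String[] r : rows) {
            out.write(row(r[0], r[1], r[2]) + "\r\n");
        }
    }
}
```

Page titles often contain commas and quotes, so writing the fields unescaped would shift columns when the file is opened in Excel.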
For step 2) you will probably need to use an external API like [url removed, login to view], [url removed, login to view] or similar. The cost of these APIs for 1 month will be paid by me.
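The random selection of 10,000 URLs in the first requirement could be done without loading the whole URL list into memory. Below is a sketch using reservoir sampling (Algorithm R), which is one possible technique and not something the posting prescribes. The class name UrlSampler is hypothetical, and the iterator is assumed to stream URLs from whatever CommonCrawl index files are used.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of any library.
class UrlSampler {
    // Picks up to k URLs uniformly at random from a stream of unknown length.
    static List<String> sample(Iterator<String> urls, int k, Random rng) {
        List<String> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (urls.hasNext()) {
            String url = urls.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k URLs.
                reservoir.add(url);
            } else {
                // Pick a slot in [0, seen); the new URL replaces an old one
                // only when the slot falls inside the reservoir.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, url);
                }
            }
        }
        return reservoir;
    }
}
```

Calling UrlSampler.sample(urlIterator, 10000, new Random()) would return 10,000 URLs, or all of them if the stream is shorter than that.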
Hi, I am an expert crawler maker, so this project won't be any problem for me. I will use [url removed, login to view] for the API, and I will use .NET for coding. Thanks
10 freelancers are bidding an average of $162 for this job
Hi. This can be done. You need 10,000 random pages from the public dataset and don't care which pages? Also, how many anchor texts do you need? Only the most popular one, or more? As far as I can see, the 5 most popular are available for free on [url removed, login to view]
Hi, Please feel free to discuss the project with me. Thanks, Murtaza
Hi! I am interested in your project. I work on similar projects (web spiders), so I strongly believe that my abilities fit your requirements. I look forward to working with you!