I am looking for someone to build a very simple crawler (Linux command-line program / script).
The crawler should crawl a hostname / domain and just write the URLs of the website to a text file.
- check the site's robots.txt and crawl just the allowed URLs
- check meta robots noindex / index - only record URLs that are marked index (or carry no directive)
- check meta robots nofollow / follow - only follow links from pages marked follow
- check rel nofollow - don't add links with rel="nofollow" to the queue
- multiple threads for a crawling boost ;) (a rough sketch covering these checks follows this list)
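To make this concrete, here is a rough Python sketch of the checks above, using only the standard library. It is an illustration, not a finished program: the user-agent string, the thread count and names like crawl() are my placeholders, and pages without a meta robots tag are treated as index,follow (the usual default).

import sys
import urllib.robotparser
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

USER_AGENT = "SimpleCrawler"  # placeholder bot name
MAX_WORKERS = 4               # "multiple threads" crawling boost

class LinkParser(HTMLParser):
    # collects hrefs (skipping rel=nofollow) and the meta robots value
    def __init__(self):
        super().__init__()
        self.links, self.meta_robots = [], ""

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            # rel nofollow: don't add these links to the queue
            if "nofollow" not in (a.get("rel") or "").lower():
                self.links.append(a["href"])
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.meta_robots = (a.get("content") or "").lower()

def fetch(url):
    # returns (meta_robots, links) for one page, or (None, []) on error
    try:
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None, []
    parser = LinkParser()
    parser.feed(html)
    return parser.meta_robots, [urljoin(url, h) for h in parser.links]

def crawl(start_url, outfile):
    host = urlparse(start_url).netloc
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()
    seen, frontier = {start_url}, [start_url]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool, \
         open(outfile, "w") as out:
        while frontier:
            # robots.txt check: crawl just the allowed URLs
            batch = [u for u in frontier if robots.can_fetch(USER_AGENT, u)]
            frontier = []
            for url, (meta, links) in zip(batch, pool.map(fetch, batch)):
                if meta is None or "noindex" in meta:
                    continue              # only write indexable pages
                out.write(url + "\n")
                if "nofollow" in meta:
                    continue              # page says: don't follow its links
                for link in links:
                    if urlparse(link).netloc == host and link not in seen:
                        seen.add(link)
                        frontier.append(link)

if __name__ == "__main__":
    crawl(sys.argv[1], "urls.txt")  # e.g. python crawler.py https://example.com/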
To save traffic, please:
- only load HTML / plain-text files, i.e. readable file formats - no exe, doc, xls, gif, jpg ... (stop downloading if the Content-Type header is not html, plain text, rss, xml ...)
- stop downloading if the file size is over 2 MB (ignore these files; see the sketch after this list)
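A small sketch of how these two traffic rules could fit together, again standard library only; the allowed-types list, the 64 KB chunk size and the function name are my assumptions:

from urllib.request import Request, urlopen

ALLOWED_TYPES = ("text/html", "text/plain", "text/xml",
                 "application/xml", "application/rss+xml")
MAX_BYTES = 2 * 1024 * 1024  # the 2 MB cap

def fetch_if_allowed(url):
    # returns the page body, or None if the type/size rules reject it
    req = Request(url, headers={"User-Agent": "SimpleCrawler"})
    with urlopen(req, timeout=10) as resp:
        ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
        if ctype not in ALLOWED_TYPES:
            return None                # not html / plain text / rss / xml
        clen = resp.headers.get("Content-Length")
        if clen and int(clen) > MAX_BYTES:
            return None                # header already says it's over 2 MB
        body = b""
        while len(body) <= MAX_BYTES:
            chunk = resp.read(65536)   # stream in 64 KB chunks
            if not chunk:
                return body            # finished under the cap
            body += chunk
        return None                    # crossed 2 MB mid-download: stop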
This is a low-budget project.
You can use an already-built crawler and adapt it to my requirements.
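For example, an existing framework like Scrapy already covers a lot of this out of the box: it obeys robots.txt (ROBOTSTXT_OBEY) and can abort oversized downloads (DOWNLOAD_MAXSIZE). The sketch below is only a starting point - the spider name, domain and settings are placeholders, and the content-type whitelist from above would still need a small download middleware on top:

import scrapy

class UrlSpider(scrapy.Spider):
    name = "urlspider"
    allowed_domains = ["example.com"]         # placeholder domain
    start_urls = ["https://example.com/"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,               # crawl just allowed URLs
        "CONCURRENT_REQUESTS": 4,             # the "multiple threads" boost
        "DOWNLOAD_MAXSIZE": 2 * 1024 * 1024,  # abort files over 2 MB
    }

    def parse(self, response):
        meta = response.xpath('//meta[@name="robots"]/@content') \
                       .get(default="").lower()
        if "noindex" not in meta:
            yield {"url": response.url}       # collected URL
        if "nofollow" in meta:
            return                            # don't follow this page's links
        for a in response.css("a"):
            if "nofollow" in (a.attrib.get("rel") or "").lower():
                continue                      # skip rel=nofollow links
            yield response.follow(a, callback=self.parse)

Running it with something like "scrapy runspider urlspider.py -o urls.csv" would write the collected URLs to a file.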