I need a web crawler that is particularly scalable and simple to use.
You can use any technology you want, as long as there is a secured API (JSON or XML over SSL) to drive it from PHP (authentication, launch/pause/resume/end a crawl, get its status, etc.).
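To be concrete, here is a rough PHP sketch of how I imagine driving that API with cURL; the base URL, endpoint names, payload fields and token are only placeholders, not a defined contract:

```php
<?php
// Rough sketch of an API client in PHP (cURL over SSL).
// Base URL, endpoints, fields and token are placeholders.
function apiCall($endpoint, array $payload, $token)
{
    $ch = curl_init('https://crawler.example.com/api/' . $endpoint);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => json_encode($payload),
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            'Authorization: Bearer ' . $token, // authentication
        ],
    ]);
    $raw = curl_exec($ch);
    curl_close($ch);
    return json_decode($raw, true) ?: [];
}

$token  = 'my-api-token'; // placeholder credential
$crawl  = apiCall('crawls/launch', ['seed' => 'https://www.example.fr/'], $token);
$status = apiCall('crawls/status', ['id' => $crawl['id']], $token);
```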
I have read a little on this subject and was thinking of using these well-regarded technologies:
- Apache or nginx
- PHP (with Symfony2, perhaps Goutte; a minimal sketch follows this list) or Python
- a cache/storage server (AWS S3 for HTTP responses, and MySQL/PostgreSQL/NoSQL such as MongoDB for URL lists)
- multi-threading (AWS EC2, Hadoop/MapReduce, Gearman)
- multiple spiders with specific characteristics: spider location management (based in France), politeness, and specific exclusion/inclusion limits (hosts, HTTP and HTTPS protocols, URL patterns, handling of 304 HTTP statuses, spider traps, document types, etc.)
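For the PHP/Goutte option above, a single polite crawl step could look roughly like this; the seed URL, the allowed-host list and the one-second delay are assumptions:

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

// Sketch of one polite crawl step with Goutte: fetch a page,
// extract its links, and keep only those on an allowed host.
$client  = new Client();
$allowed = ['www.example.fr']; // inclusion limit (assumption)

$crawler = $client->request('GET', 'https://www.example.fr/');
$links   = $crawler->filter('a')->each(function ($node) {
    return $node->attr('href');
});

$queue = array_filter($links, function ($href) use ($allowed) {
    $host = parse_url((string) $href, PHP_URL_HOST);
    return $host === null || in_array($host, $allowed, true); // relative URLs pass
});

sleep(1); // politeness delay before the next request
```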
Even though I will use PHP to connect to the API, a graphical interface could be used for:
- alerting and extending the infrastructure (more threads, EC2 instances, storage, etc.)
- managing crawls (priority, close/pause/resume crawls, crawled resource types, crawl limits or crawl recursion, sent HTTP headers) and recrawls (crontab, etc.)
- (automatically) managing multi-threaded parsers
- robots.txt / sitemap.xml / HTML parsers based on XPath (a sitemap sketch follows this list)
- proxy management (with automatic list updates)
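To illustrate the XPath-based parsing from the list above, here is a minimal sitemap.xml reader in plain PHP; the sitemap URL is an assumption:

```php
<?php
// Minimal sketch of a sitemap.xml parser based on XPath.
// The sitemap URL is a placeholder.
$doc = new DOMDocument();
$doc->load('https://www.example.fr/sitemap.xml');

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

$urls = [];
foreach ($xpath->query('//sm:url/sm:loc') as $loc) {
    $urls[] = trim($loc->textContent);
}
// $urls now holds the page URLs to feed into the crawl queue.
```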
Ultimately, different applications will connect to this tool (via the API), launch crawls, and retrieve the results in order to parse them (heavy parsing could be done across multiple threads).
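Since Gearman is on my candidate list, the multi-threaded parsing could be sketched like this; the server address, the 'parse_page' job name and the payload shape are assumptions:

```php
<?php
// Sketch of distributing heavy parsing over Gearman workers.
// Server address, job name and payload shape are assumptions.

// Client side (e.g. called by the crawler for each fetched page):
$url    = 'https://www.example.fr/'; // placeholder
$html   = '<html></html>';           // placeholder document body
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('parse_page', json_encode(['url' => $url, 'body' => $html]));

// Worker side (run one per thread / EC2 instance, in its own script):
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('parse_page', function (GearmanJob $job) {
    $doc = json_decode($job->workload(), true);
    // XPath extraction on $doc['body'] would go here.
});
while ($worker->work());
```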
Of course, I would like documentation for installing it :-) and finally, if possible, an idea of how much it would cost to maintain/run it on AWS.
Is it possible? Can you help me?
4 freelancers are bidding on average €368 for this job
I specialize in web scraping jobs like this. You can see on my profile that I have completed many similar jobs. I can have this project done properly and in a timely manner.
Hello, I am an expert in web scraping and I am also interested in your project. Please contact me to discuss the details of your project. Thanks!