I need a web crawler that is highly scalable and simple to use.
You can use any technology you want, as long as there is a secured API (JSON or XML over SSL) that I can call from PHP (authentication, launch/pause/resume/end a crawl, get its status, etc.).
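To make the expected crawl lifecycle more concrete, here is a rough sketch in Python of the states I have in mind behind the launch/pause/resume/end/status calls. The names `CrawlState`, `Crawl`, `TRANSITIONS` and `apply` are only my assumptions for illustration, not a required design, and the sketch leaves out authentication and the HTTP layer entirely.

```python
from enum import Enum


class CrawlState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    ENDED = "ended"


# Hypothetical transition table: action -> {current state: next state}.
TRANSITIONS = {
    "launch": {CrawlState.PENDING: CrawlState.RUNNING},
    "pause": {CrawlState.RUNNING: CrawlState.PAUSED},
    "resume": {CrawlState.PAUSED: CrawlState.RUNNING},
    "end": {
        CrawlState.RUNNING: CrawlState.ENDED,
        CrawlState.PAUSED: CrawlState.ENDED,
    },
}


class Crawl:
    """In-memory stand-in for one crawl job (illustrative only)."""

    def __init__(self, seed_url):
        self.seed_url = seed_url
        self.state = CrawlState.PENDING

    def apply(self, action):
        """Apply an API action, rejecting transitions that make no sense."""
        allowed = TRANSITIONS.get(action)
        if allowed is None:
            raise ValueError(f"unknown action: {action}")
        if self.state not in allowed:
            raise ValueError(f"cannot {action} a crawl that is {self.state.value}")
        self.state = allowed[self.state]
        return self.status()

    def status(self):
        """Return the kind of payload a status call could serialize as JSON."""
        return {"seed_url": self.seed_url, "state": self.state.value}
```

Each API endpoint would then map to one `apply(...)` call, and an invalid request (for example pausing a crawl that has already ended) would be answered with an error instead of being silently ignored.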
I have read a little on the subject and am thinking of using these well-regarded technologies:
- Apache or nginx
- PHP (with Symfony2 (Goutte?)) or Python
- cache/storage server (AWS S3 for HTTP responses, and MySQL, PostgreSQL or a NoSQL database such as MongoDB for URL lists)
- multi-threading / parallel processing (AWS EC2, Hadoop/MapReduce, Gearman)
- multiple spiders with specific characteristics: management of spider location (based in France), politeness, configurable inclusion/exclusion limits (hosts, HTTP and HTTPS protocols, URL patterns, handling of 304 HTTP status, spider-trap detection, document types, etc.)
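As an illustration of the inclusion/exclusion limits mentioned above, here is a minimal sketch in Python using only the standard library. The class name `UrlFilter` and its parameters are hypothetical; a real spider would also need 304 handling, spider-trap detection and document-type checks, which this sketch does not cover.

```python
import re
from urllib.parse import urlsplit


class UrlFilter:
    """Decide whether a URL is inside the crawl limits (illustrative only)."""

    def __init__(self, allowed_hosts, allowed_schemes=("http", "https"),
                 exclude_patterns=()):
        self.allowed_hosts = {host.lower() for host in allowed_hosts}
        self.allowed_schemes = set(allowed_schemes)
        # Regular expressions for URLs that must never be crawled.
        self.exclude = [re.compile(pattern) for pattern in exclude_patterns]

    def accepts(self, url):
        parts = urlsplit(url)
        # urlsplit lowercases the scheme and the hostname for us.
        if parts.scheme not in self.allowed_schemes:
            return False
        if (parts.hostname or "") not in self.allowed_hosts:
            return False
        return not any(pattern.search(url) for pattern in self.exclude)


url_filter = UrlFilter(
    allowed_hosts=["example.fr"],
    exclude_patterns=[r"/calendar/", r"\.pdf$"],
)
print(url_filter.accepts("https://example.fr/page"))     # inside the limits
print(url_filter.accepts("https://example.fr/doc.pdf"))  # excluded by pattern
```

Excluding a pattern such as `/calendar/` is one simple way to avoid a common kind of spider trap (endless generated date pages), though it is not a general solution.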
Even though I will use PHP to connect to and use the API, a graphical interface would be useful for:
- alerting and scaling the infrastructure (more threads, EC2 instances, storage, etc.)
- managing crawls (priority, close/pause/unpause crawls, crawled-resource types, crawl limits or crawl recursion depth, HTTP headers sent) and recrawls (crontab, etc.)
- (automatically) managing multi-threaded parsers
- robots.txt, sitemap XML and XPath-based HTML parsers
- managing proxies (updating the list automatically)
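For the robots.txt part and the politeness requirement, something along these lines would already be a start. It is only a sketch using Python's standard `urllib.robotparser`; the helper `load_rules`, the user agent `my-spider` and the sample rules are assumptions for illustration, and sitemap or XPath parsing is not shown.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text that was already downloaded (e.g. read from S3)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A polite spider checks every URL before fetching it...
print(rules.can_fetch("my-spider", "https://example.fr/public/page"))
print(rules.can_fetch("my-spider", "https://example.fr/private/page"))

# ...and waits between requests when the site asks for it.
print(rules.crawl_delay("my-spider"))
```

Parsing from a string rather than fetching the file inside the parser keeps the download step in the crawler itself, where proxies and caching are handled.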
Ultimately, different applications will connect to this tool through the API, launch crawls and retrieve the results in order to parse them (heavy parsing could be done in multiple threads).
Of course, I would like documentation on how to install it :-) and finally, if possible, an estimate of how much it costs to maintain and run on AWS.
Is it possible? Can you help me?
3 freelancers are bidding an average of €315 for this job
Hi. We have already built such an infrastructure, and it has many of the features you listed and even others. If you are interested, please contact me. Note that the price you are asking for such a product is too low. T…
Hi, I am interested in your project and will be happy to do it for you. I have rich experience in scraping with cURL, regular expressions, DOM and Selenium RC. I worked for [login to view URL] and [login to view URL] search…
Hello, I am an expert in web scraping and am also interested in your project. Please contact me to discuss more details of your project. Thanks!