Scalable web crawler

Hello everyone,

I need a web crawler that is particularly scalable and simple to use.

You can use any technology you want, as long as there is a secured API (JSON or XML over SSL) to control it from PHP (authentication, launch/pause/resume/end a crawl, get its status, etc.).
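To make the API requirement concrete, here is a minimal sketch of what one control call could look like. Everything here is an assumption for illustration: the host, the `/crawls` endpoint names, the bearer-token scheme, and the `seeds` field are all invented, not an existing API.

```python
# Hypothetical sketch of a crawl-control API surface (JSON over HTTPS).
# The base URL, endpoint paths, token scheme, and field names are invented.
import json

API_BASE = "https://crawler.example.com/api/v1"  # hypothetical host

def build_request(action, crawl_id=None, token="SECRET", **params):
    """Build the (url, headers, body) triple for one control-API call."""
    paths = {
        "launch": "/crawls",
        "pause": f"/crawls/{crawl_id}/pause",
        "resume": f"/crawls/{crawl_id}/resume",
        "end": f"/crawls/{crawl_id}/end",
        "status": f"/crawls/{crawl_id}",
    }
    url = API_BASE + paths[action]
    headers = {"Authorization": f"Bearer {token}",
               "Content-Type": "application/json"}
    body = json.dumps(params) if params else None
    return url, headers, body

# Example: launch a crawl with one seed URL.
url, headers, body = build_request("launch", seeds=["https://example.fr/"])
```

A PHP client (cURL or Guzzle) would issue the same requests; only the request-building step is shown here.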

I have read a little on this subject and was thinking of using these well-regarded technologies:

- Apache or nginx

- PHP (with Symfony2 (Goutte?)) or Python

- a cache/storage server (AWS S3 for HTTP responses, and MySQL/PostgreSQL/NoSQL, e.g. MongoDB, for URL lists)

- multi-threading (AWS EC2, Hadoop/MapReduce, Gearman)

- multiple spiders with specific characteristics: spider location management (based in France), politeness, configurable inclusion/exclusion limits (hosts, HTTP and HTTPS protocols, URL patterns, 304 HTTP status handling, spider traps, document types, etc.)
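The inclusion/exclusion limits from the last bullet could be checked per URL before it enters the frontier. A minimal sketch, assuming the limits live in a plain dict; the key names (`schemes`, `hosts`, `url_pattern`) are illustrative, not a real configuration format:

```python
# Hedged sketch of per-URL inclusion/exclusion checks; the dict keys
# and rules are invented for illustration.
import re
from urllib.parse import urlparse

def allowed(url, limits):
    """Return True if `url` passes the configured crawl limits."""
    parts = urlparse(url)
    # Protocol limit (HTTP / HTTPS)
    if parts.scheme not in limits.get("schemes", {"http", "https"}):
        return False
    # Host whitelist
    hosts = limits.get("hosts")
    if hosts and parts.hostname not in hosts:
        return False
    # URL pattern limit
    pattern = limits.get("url_pattern")
    if pattern and not re.search(pattern, url):
        return False
    return True

# Example limits: HTTPS only, one host, only .html URLs.
limits = {"schemes": {"https"}, "hosts": {"example.fr"}, "url_pattern": r"\.html$"}
```

Politeness (crawl delay per host) and spider-trap detection (e.g. a maximum URL depth) would be further checks of the same shape.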

Even though I will use PHP to connect to the API, a graphical interface could be useful for:

- alerting and scaling the infrastructure (more threads, EC2 instances, storage, etc.)

- managing crawls (priority, closing/pausing/resuming crawls, crawled resource types, crawl limits and recursion depth, HTTP headers sent) and recrawls (crontab, etc.)

- (automatically) managing multi-threaded parsers

- robots.txt / sitemap.xml / HTML parsers based on XPath

- proxy management (automatically updated list)
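For the robots.txt and sitemap.xml parsers mentioned above, the Python standard library already covers a lot; a minimal sketch, with an invented robots.txt and sitemap as sample input (the HTML/XPath part would more likely use lxml, which is not shown):

```python
# Sketch of robots.txt and sitemap.xml handling with only the stdlib.
# The sample robots.txt and sitemap contents are invented for illustration.
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.fr/</loc></url>
  <url><loc>https://example.fr/page.html</loc></url>
</urlset>"""

# robots.txt: per-agent fetch rules and crawl delay (politeness).
rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
delay = rp.crawl_delay("mybot")

# sitemap.xml: extract the listed URLs (note the sitemaps.org namespace).
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = [el.text for el in ET.fromstring(SITEMAP_XML).findall("sm:url/sm:loc", ns)]
```

In production the files would of course be fetched per host rather than inlined, and the crawl delay fed back into the scheduler.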

In the end, different applications will connect to this tool (via the API), launch crawls, and retrieve the results in order to parse them (heavy parsing could be done in multiple threads).
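The multi-threaded parsing step above can be sketched with a thread pool. This assumes crawl results arrive as (url, html) pairs; the `parse` function here is a trivial stand-in (counting links), not the real parser:

```python
# Hedged sketch of parsing crawl results in multiple threads; the input
# format and the stand-in parse function are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor

def parse(result):
    """Stand-in parser: count '<a href=' occurrences in the fetched HTML."""
    url, html = result
    return url, html.count("<a href=")

# Pretend crawl output: (url, body) pairs.
results = [
    ("https://example.fr/", '<a href="/a"></a><a href="/b"></a>'),
    ("https://example.fr/a", '<a href="/"></a>'),
]

with ThreadPoolExecutor(max_workers=4) as pool:
    parsed = dict(pool.map(parse, results))
```

For CPU-heavy parsing, `ProcessPoolExecutor` (or distributing the work via Gearman, as listed above) would be the drop-in alternative.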

Of course, I would like documentation for installing it :-) and finally, if possible, an estimate of how much it costs to maintain/run it on AWS.

Is it possible? Can you help me?

Skills: Amazon Web Services, NoSQL Couch & Mongo, Symfony PHP, Web Scraping


About the Employer:
( 0 reviews ) France

Project ID: #6527416

3 freelancers are bidding on average €315 for this job


Hi. We have already built such an infrastructure, and it has many of the features you noted and more. If you are interested, please contact me. Note that the price you are asking for such a product is too low. T More

€250
(33 reviews)

Hi, I am interested in your project and I will be happy to do that for you. I have rich experience in scraping, cURL, regular expressions, DOM, and Selenium RC. I worked for [login to view URL] and [login to view URL] search More

€140
(5 reviews)

Hello, I am an expert in web scraping and also interested in your project. Please contact me to discuss the details of your project. Thanks!

€555
(12 reviews)