Hello everyone, we are seeking a well-tailored web crawler for our needs in the vertical search market in China. Here are the details:
1. able to run multiple instances simultaneously so that many web pages can be downloaded in parallel
2. able to build an index in the form of per-page word-count statistics; the hyperlink structure of each page must also be preserved for programmatic access (see the data-model sketch after this list)
3. proper archiving with suitable backup and recovery
4. able to scale to a large cluster of computers
5. the distributed layer should automatically optimize both parallel and serial data-mining algorithms, regardless of the details of the algorithm itself; if that function is too hard to deliver, you can instead leave an interface for us and we will do the rest of the job ourselves. Details such as which interfaces are to be exposed remain to be negotiated if you accept the contract
6. able to provide preprocessing and postprocessing hooks to filter out unwanted data; we can supply the details of the algorithm behind an agreed interface (a rough sketch of the kind of hooks we have in mind follows this list)
7. extensible to our potential uses through loosely coupled interfaces, for example duplicate-page filtering, indexing by text summarization rather than word count, distributed workload scheduling, and possibly others
8. a good user experience: a clean look and feel, and full manageability. Management shall cover functional details such as how many instances run at the same time, the total workload or number of pages downloaded, and where data is backed up and restored; these controls shall be delivered through web page views such as ASP or PHP, with ASP currently preferred
9. remote access is highly preferred
10. good security in design and coding, particularly if you use a language like C or C++; other industry-wide security best practices shall also be applied. Issues such as data privacy, integrity, and authorization are essential
11. integration with Google MapReduce or Bigtable is highly preferred but not essential (a sketch of the kind of MapReduce word-count job we imagine follows below)
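
To make requirement 2 concrete, here is a rough sketch in Java of the per-page record we imagine: word counts plus the preserved link structure. All class, field, and method names are our own illustrations, not a prescribed design.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative per-page index record for requirement 2
    // (all names are our assumptions, not a mandated design).
    public class PageRecord {
        public final String url;                                        // canonical URL of the page
        public final Map<String, Integer> wordCounts = new HashMap<>(); // word -> occurrence count
        public final List<String> outLinks = new ArrayList<>();         // hyperlink structure, kept for programmatic access

        public PageRecord(String url) { this.url = url; }

        // Tokenize the extracted page text and accumulate word counts.
        public void countWords(String pageText) {
            for (String w : pageText.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) wordCounts.merge(w, 1, Integer::sum);
            }
        }
    }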
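
For requirements 6 and 7, this is roughly the shape of the loosely coupled hook interface we have in mind; the interface and method names are only assumptions to be negotiated, and PageRecord is the record type sketched above.

    // Illustrative pre/postprocessing hooks for requirements 6 and 7
    // (interface name and signatures are assumptions, open to negotiation).
    public interface PageFilter {
        // Called before a fetched page is indexed: return null to drop the page
        // (e.g. duplicate-page filtering), or a transformed body to keep it.
        String preprocess(String url, String rawBody);

        // Called after indexing: may rewrite or discard the index entry
        // (e.g. replace word counts with a text summarization).
        PageRecord postprocess(PageRecord record);
    }

A null-means-drop contract keeps the hooks simple; we would plug in our own implementations of this interface without touching the crawler core.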
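
Regarding requirement 11: Google's MapReduce and Bigtable are internal systems, so purely as an illustration of intent, here is the canonical word-count job written for Hadoop, the open-source MapReduce implementation. This shows how we imagine the index of requirement 2 being built at cluster scale; it is a sketch, not a mandated implementation.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Canonical Hadoop word count: maps each token to (word, 1),
    // then sums the counts per word in the reducer.
    public class WordCountIndex {
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);   // emit (word, 1) for every token
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));  // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count index");
            job.setJarByClass(WordCountIndex.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);  // local pre-aggregation to cut shuffle traffic
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // crawled page text
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // word-count index
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }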
Best wishes