I need a web robot that gathers links from web pages and stores the information in a database.
Links to HTML documents and RSS feeds will be followed, and data will be collected about each new document.
The following will be collected from each page:
- Document Type
- Domain Name
- URL String (relative to the domain)
- Query String
- Date / Time visited
- All outgoing links from the page, indicating the Media Type of the target page (including type and version where applicable)
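Since a MySQL database is required (see below), the fields above could be stored in a layout like the following. This is only a sketch — all table and column names, types, and lengths are assumptions, not part of the spec:

```sql
-- Illustrative schema only; names and types are assumptions.
CREATE TABLE page (
    id            INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    document_type VARCHAR(32)   NOT NULL,   -- e.g. 'html', 'rss', 'image'
    domain        VARCHAR(255)  NOT NULL,   -- domain name
    url_path      VARCHAR(2048) NOT NULL,   -- URL string, relative to the domain
    query_string  VARCHAR(2048) NULL,
    visited_at    DATETIME      NOT NULL    -- date/time visited
);

CREATE TABLE outgoing_link (
    id             INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    source_page_id INT UNSIGNED  NOT NULL,
    target_url     VARCHAR(2048) NOT NULL,
    media_type     VARCHAR(128)  NULL,      -- type and version where applicable
    FOREIGN KEY (source_page_id) REFERENCES page(id)
);
```

One page row per visit, with its outgoing links in a child table, keeps the revisit-tracking requirement simple: the most recent `visited_at` per URL tells the robot whether the revisit interval has elapsed.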
Document Types are:
- HTML Document, RSS Feed, Image, Video, Email Address, File Download, etc.
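One way to assign these document types is to classify each fetched resource by its Content-Type response header. A minimal C# sketch follows — the enum values and mappings are illustrative assumptions (for instance, an email address would actually be detected from a mailto: link rather than a Content-Type):

```csharp
// Illustrative classifier; enum values and mappings are assumptions, not spec.
using System;

public enum DocumentType { Html, RssFeed, Image, Video, EmailAddress, FileDownload, Unknown }

public static class DocumentTypes
{
    // Maps a raw Content-Type header value to one of the document types above.
    public static DocumentType FromContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType)) return DocumentType.Unknown;

        // Strip parameters such as "; charset=utf-8" and normalize case.
        var mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();

        return mediaType switch
        {
            "text/html" or "application/xhtml+xml" => DocumentType.Html,
            "application/rss+xml" or "application/atom+xml" => DocumentType.RssFeed,
            "application/octet-stream" => DocumentType.FileDownload,
            _ when mediaType.StartsWith("image/") => DocumentType.Image,
            _ when mediaType.StartsWith("video/") => DocumentType.Video,
            _ => DocumentType.Unknown
        };
    }
}
```

Links with a mailto: scheme would be tagged EmailAddress directly from the URL, since they are never fetched.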
Must follow 'courteous' robot etiquette:
- Adheres to [url removed, login to view] inclusions / exclusions
- Browses pages at a comfortable pace - does not overload a single site with multiple simultaneous hits
- Tracking - knows when a page was last visited, and re-visits a site only after a given interval
- Identifies itself properly (configurable agent name)
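The pacing and identification rules above can be sketched in C#. This is a minimal, hedged example of per-host rate limiting plus a configurable agent name — class and method names are illustrative, and robots.txt checking and revisit tracking are only noted in comments:

```csharp
// Sketch only: per-host pacing and a configurable agent name.
// All names here are illustrative, not part of the spec.
using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Threading.Tasks;

public sealed class PoliteFetcher
{
    private readonly HttpClient _http = new HttpClient();
    private readonly TimeSpan _delayPerHost;
    private readonly ConcurrentDictionary<string, DateTime> _lastHit =
        new ConcurrentDictionary<string, DateTime>();

    public PoliteFetcher(string agentName, TimeSpan delayPerHost)
    {
        // Identifies itself properly: configurable agent name, e.g. "MyCrawler/1.0".
        _http.DefaultRequestHeaders.UserAgent.ParseAdd(agentName);
        _delayPerHost = delayPerHost;
    }

    // How long to wait before hitting the same host again.
    public static TimeSpan ComputeWait(DateTime lastHit, TimeSpan delayPerHost, DateTime now)
    {
        var wait = lastHit + delayPerHost - now;
        return wait > TimeSpan.Zero ? wait : TimeSpan.Zero;
    }

    public async Task<string> FetchAsync(Uri url)
    {
        // NOTE: a real crawler must first honour the site's robots.txt
        // inclusions/exclusions before fetching at all.
        var last = _lastHit.GetOrAdd(url.Host, DateTime.MinValue);
        var wait = ComputeWait(last, _delayPerHost, DateTime.UtcNow);
        if (wait > TimeSpan.Zero)
            await Task.Delay(wait);   // pace requests to this host
        _lastHit[url.Host] = DateTime.UtcNow;
        return await _http.GetStringAsync(url);
    }
}
```

Keeping the delay calculation in a separate pure method makes the pacing logic easy to unit-test without touching the network.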
- Must be built using C#.NET.
- Runs as a service.
- Must use MySQL database.
- May use / refine existing Open Source Software - developer to provide reference to source location and licenses.
Deliverables:
- Working application
- All source code.