Summary: The spider should take a list of websites from a database table with the fields siteID and siteURL. For each URL in that list, it should be able to spider the website in its entirety and store the content in the database in a way that allows offline browsing (i.e., internal links are converted to browsable local links). Once fully spidered, a website is considered 'archived'.
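A minimal sketch of the crawl-and-rewrite idea (the spec's target stack is a .NET service; Python is used here purely for illustration). The local path scheme (`host/path`, with `index.html` appended for directory URLs) and the root-relative rewritten links are assumptions about how the viewing site would serve the archive; `fetch` is injected so transport and storage stay pluggable.

```python
import re
from urllib.parse import urljoin, urlparse

def url_to_local(url: str) -> str:
    """Map an absolute URL to a relative local path for offline storage.
    Assumption: directory URLs map to an index.html file."""
    p = urlparse(url)
    path = p.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return p.netloc + path

def rewrite_internal_links(html: str, page_url: str) -> str:
    """Rewrite href/src attributes that point inside the site so the
    archived copy browses locally. External links are left untouched."""
    site_host = urlparse(page_url).netloc

    def repl(m):
        attr, quote, target = m.group(1), m.group(2), m.group(3)
        absolute = urljoin(page_url, target)
        if urlparse(absolute).netloc != site_host:
            return m.group(0)  # external link: leave as-is
        # Assumption: the viewer serves archived files from its web root.
        return f'{attr}={quote}/{url_to_local(absolute)}{quote}'

    return re.sub(r'(href|src)=(["\'])(.*?)\2', repl, html, flags=re.I)

def crawl(start_url: str, fetch):
    """Breadth-first spider of one site. `fetch(url)` returns the page
    HTML or None; injecting it keeps the crawler testable offline."""
    host = urlparse(start_url).netloc
    seen, queue, pages = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url_to_local(url)] = rewrite_internal_links(html, url)
        for m in re.finditer(r'(?:href|src)=(["\'])(.*?)\1', html, flags=re.I):
            link = urljoin(url, m.group(2))
            if urlparse(link).netloc == host and link not in seen:
                queue.append(link)
    return pages
```

A real implementation would also fetch non-HTML assets (images, CSS, scripts) and write `pages` into the database keyed by siteID, but the crawl loop and link conversion are the core of the archiving step.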
* Windows Service, written with [url removed, login to view]
* Windows application, tied to the service (or a form that is part of the service), that allows starting/stopping the spider and gives a view into the spider history
* All file types needed for offline browsing should be spidered
* .htaccess rules may have to be taken into account
* Each re-spider of a currently 'archived' website will store only updated pages, as a new version (the old version is retained). A table with the following fields will record changes in a central location: UID (unique ID), OldVersion, NewVersion, Date.
* Code must be well structured into classes so that I can easily reuse it in other projects.
* All source code must be delivered; compiled binaries alone are not acceptable.
* Basic website (PHP) to allow viewing of the spidered content. All links should be internal to the viewing site and should NOT go out to the live website. For an example of what is expected, see the Wayback Machine's archive of Scriptlance at: [url removed, login to view]*/http://www.scriptlance.com. That is what I would expect for the content listing; for browsing an individual archive, I would expect what you see there when you click one of the archive links.
NOTE: I need this functionality, but not a polished interface - just the basics.
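The versioning behaviour described above (store only changed pages; log old/new version pairs in a central table) can be sketched against SQLite. The UID, OldVersion, NewVersion and Date columns come from the spec; the `pages` table, its columns, and the SHA-1 change test are assumptions, and Python again stands in for the .NET code.

```python
import hashlib
import sqlite3
from datetime import date

def open_archive(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        -- Assumed storage table: one row per archived version of a page.
        CREATE TABLE IF NOT EXISTS pages (
            versionID INTEGER PRIMARY KEY AUTOINCREMENT,
            siteID    INTEGER NOT NULL,
            localPath TEXT NOT NULL,
            content   BLOB NOT NULL,
            sha1      TEXT NOT NULL
        );
        -- Central change log, fields as specified: UID, OldVersion, NewVersion, Date.
        CREATE TABLE IF NOT EXISTS changes (
            UID        INTEGER PRIMARY KEY AUTOINCREMENT,
            OldVersion INTEGER,
            NewVersion INTEGER NOT NULL,
            Date       TEXT NOT NULL
        );
    """)
    return db

def store_page(db, site_id, local_path, content: bytes):
    """Store a page only if it differs from its latest archived version;
    unchanged pages keep the old version, changed ones are logged."""
    digest = hashlib.sha1(content).hexdigest()
    row = db.execute(
        "SELECT versionID, sha1 FROM pages WHERE siteID=? AND localPath=? "
        "ORDER BY versionID DESC LIMIT 1", (site_id, local_path)).fetchone()
    if row and row[1] == digest:
        return row[0]  # content unchanged: retain old version only
    cur = db.execute(
        "INSERT INTO pages (siteID, localPath, content, sha1) VALUES (?,?,?,?)",
        (site_id, local_path, content, digest))
    new_id = cur.lastrowid
    db.execute(
        "INSERT INTO changes (OldVersion, NewVersion, Date) VALUES (?,?,?)",
        (row[0] if row else None, new_id, date.today().isoformat()))
    return new_id
```

With this shape, the PHP viewer can list a site's archives by reading the `changes` table and serve any historical version from `pages`, without ever touching the live website.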