Given a search term, use Google Web Search to generate search results, download the content referenced by each result into a directory along with everything displayed on that page, and rewrite the page's references so they point to local copies in a sub-directory.
This is a very simple and quick project for an experienced Linux/BSD programmer/admin who can use standard utilities and software to perform much of the work. It can be written as a series of shell scripts (Bourne/Korn/zsh) calling open-source utilities such as wget and Qt/WebKit. Much of it can be done with htdig and wget.
You will be creating a program that runs from the Linux/BSD command line, takes command-line arguments, and creates a directory containing a number of files. My preference is that the code be written in Perl or C; however, other languages such as PHP or Python may be acceptable - check with me first.
1) You will be invoked with the following arguments:
-d DirectoryName [Which is the name of the base directory into which you should store all of your data]
-s "Search String" [A boolean search string]
-n NumberOfResults [The maximum number of results we're looking for. You may have to iterate searches with the search engine to get enough results. If this is 0, or is not present, the maximum number of results available from the search engine should be fetched.]
-c CustomSearchName [If the search is to be restricted to a custom set of sites, such as with Google's Custom Search Engine, its name should be provided here.]
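The argument handling above can be sketched with standard `getopts` in a Bourne-compatible shell; the function name and the example invocation below are illustrative, not part of the spec.

```shell
#!/bin/sh
# parse_args: hypothetical parser for the four options named in the spec.
# Populates DIR, SEARCH, NUM, CUSTOM; returns non-zero if -d or -s is missing.
parse_args() {
    DIR="" SEARCH="" NUM=0 CUSTOM="" OPTIND=1
    while getopts "d:s:n:c:" opt; do
        case "$opt" in
            d) DIR=$OPTARG ;;      # base directory for all output
            s) SEARCH=$OPTARG ;;   # boolean search string
            n) NUM=$OPTARG ;;      # max results; 0 means "all available"
            c) CUSTOM=$OPTARG ;;   # optional Custom Search Engine name
            *) return 2 ;;
        esac
    done
    [ -n "$DIR" ] && [ -n "$SEARCH" ]
}

# Example invocation (illustrative values):
parse_args -d results -s 'linux AND bsd' -n 25
```

Omitting `-n` leaves `NUM` at 0, which the spec defines as "fetch everything the search engine will return".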
2) Store the search query into a file called [url removed, login to view]
3) Store the search results into a file called [url removed, login to view]
4) For each URL in the results, perform the following:
a) Create a subdirectory using the following for the directory name:
- A four digit number with leading zeros
- The number should be the number of the search result, with zero being the first result, and incrementing by 1 for each additional search result.
b) Create a file called "TIMESTAMP" containing the creation date of this directory in yyyy-mm-dd hh:mm:ss format.
c) Grab the entire contents of the web page or the referenced file (such as a .pdf, .mov, .jpg, etc).
d) If it's a web page: Store its html as [url removed, login to view] in the directory.
e) Download all of the content used to display the page, such as graphics, videos, pdfs, etc. (Using wget or htdig?)
f) Create a sub-directory called "links" where all of the downloaded media used to satisfy the file's references are placed. Retain the original naming conventions, including subdirectories, as in the source page.
g) A .png image of the web page must be generated using WebKit (preferably using Qt) and saved in the directory called [url removed, login to view]
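Steps a) through f) above can be sketched as a single shell function built on wget, whose `--page-requisites` and `--convert-links` flags do most of the mirroring and link-rewriting work. The function name and layout details are assumptions for illustration; step g)'s screenshot is only noted in a comment, since the spec leaves the tool choice (Qt/WebKit) and output filename open.

```shell
#!/bin/sh
# fetch_result: sketch of steps (a)-(f) for one search result.
# $1 is the zero-based result number, $2 is the result URL.
fetch_result() {
    n=$1 url=$2
    dir=$(printf '%04d' "$n")                     # (a) four-digit, zero-padded name
    mkdir -p "$dir/links"                         # (f) downloaded media lands here
    date '+%Y-%m-%d %H:%M:%S' > "$dir/TIMESTAMP"  # (b) creation timestamp

    # (c)-(f): fetch the page plus everything needed to display it, and
    # rewrite references to point at the local copies.
    #   --page-requisites  : also download images, CSS, etc.
    #   --convert-links    : rewrite references to the local files
    #   --no-parent        : don't climb above the page's directory
    wget --page-requisites --convert-links --no-parent \
         --directory-prefix="$dir/links" "$url"

    # (g) would follow here: render a .png of the page with a Qt/WebKit
    # tool; left out since the spec does not fix the tool or filename.
}
```

Whether wget's directory layout under `links/` exactly matches the required naming convention should be verified against a few real pages; `--no-host-directories` and `--cut-dirs` can adjust it if needed.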
The downloaded pages should not contain any links that cannot be resolved from a local source. The ultimate test of the downloaded URLs is to open the URL's directory's [url removed, login to view] file with Firefox, Safari, and Microsoft IE and see the URL with all of its content. Any link to a full web page should take the browser to a sub-directory for that page containing the entire page with all of its content. Links off that page may go to the web and are not cached locally.
You may need to register with the search engine to get a key to use while developing. If there are any fees charged by the search engine during the development phase, let me know, so we can decide how to cover them.