I need a program to spider [url removed, login to view] daily, download all media files found, and keep records in an MS SQL 2005 database. (I’ll give you the table.)
This project is intended for developers who have built similar spiders/scrapers in the past.
I’m providing a flow chart to code against; together with the database table and the rest of this description, it is intended to be a complete guide for development. Read everything before bidding. Answer the questions at the end when posting your bid; bids without answers will not be considered.
First of all, go to [url removed, login to view] and look at the format of the site. There are five sections: movies, pictures, games, animations and links. Each section contains a multi-page list of files. Each item in the list contains a thumbnail, a title, a description and a link to a page that contains a ‘download’ link to the actual file. The program must be able to grab all of these items, so you will need to make full use of RegEx.
The following is a flow chart of the code logic:
For each section (movie, pic, game, animation, link):
    Set matches = 0
    Start at page 1
    For each item in the page:
        If ‘url’ exists in db:
            matches + 1
            If matches > 10:
                Move on to the next section
        Else:
            Add item to db and set all columns (url, type, etc.)
            Download the thumbnail and save as [id] + [textension]
            If type is mlink or link:  // test by checking whether the domain is external
                Mark record as downloaded

After all of the above is done:
    For each record in db not marked as downloaded:
        Go to the ‘url’ of the record
        Grab the ‘url’ of the file (from the ‘download’ link)
        Download the file and save as [id] + [extension]
        Mark as downloaded
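The section-scan loop above can be sketched as follows. This is a rough illustration only, in Python rather than the C# the project prefers; the names `scan_section` and `MATCH_LIMIT` are made up for the sketch, and a real version would fetch pages over HTTP and extract items with the configurable RegExes.

```python
MATCH_LIMIT = 10  # configurable: known links before a section scan stops


def scan_section(pages, known_urls):
    """Scan one section, page by page, per the flow chart.

    pages: pages in order, each a list of item dicts with a 'url' key.
    known_urls: set of urls already present in the database.
    Returns the new items found before the match limit is exceeded.
    """
    matches = 0          # reset at the start of each section
    new_items = []
    for page in pages:   # start at page 1 and increment
        for item in page:
            if item["url"] in known_urls:
                matches += 1
                if matches > MATCH_LIMIT:
                    return new_items   # move on to the next section
            else:
                new_items.append(item)  # real version: insert db row,
                                        # download thumbnail, etc.
    return new_items
```

The early return is what keeps each run cheap: once enough already-seen links pile up, the rest of the section is assumed unchanged since the last execution.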
Program must maintain the following MS SQL table:
id (int, auto identity column)
type (movie = 1, pic = 2, game = 3, flash = 4, mlink = 5, link = 6)
title (title of the file, grabbed from the page)
description (description of the file, also grabbed from the page)
date (date the file was downloaded)
extension (file extension, eg .avi, .swf, .mpeg, etc)
textension (file extension of the thumbnail, eg .jpg, .jpeg, .gif)
url (the url the thumbnail points to, or for text links, the url the text link points to. Normally this is the page that contains the ‘download’ link.)
downloaded (bit, default is 0. set to 1 after the file has been successfully downloaded.)
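The column list above translates to roughly the following DDL. This is a sketch only: the table name and the exact column types and lengths are assumptions, since the poster will supply the actual table.

```sql
CREATE TABLE files (                        -- table name is assumed
    id          INT IDENTITY(1,1) PRIMARY KEY,
    type        INT NOT NULL,               -- movie=1, pic=2, game=3,
                                            -- flash=4, mlink=5, link=6
    title       NVARCHAR(255) NULL,         -- grabbed from the page
    description NVARCHAR(MAX) NULL,         -- grabbed from the page
    date        DATETIME NULL,              -- date the file was downloaded
    extension   NVARCHAR(10) NULL,          -- e.g. .avi, .swf, .mpeg
    textension  NVARCHAR(10) NULL,          -- thumbnail ext, e.g. .jpg
    url         NVARCHAR(2000) NOT NULL,
    downloaded  BIT NOT NULL DEFAULT 0      -- 1 after a successful download
);
```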
ALL of the following requirements must be met:
Program must run either as a Windows service or be executed at a daily interval by the Windows Task Scheduler. If the program is not a Windows service, it must begin work immediately upon execution (so it can be safely triggered by Task Scheduler).
Program must grab all items from dumpalink's five sections: movies, pictures, games, animations, links. Files and thumbnails are downloaded; titles, descriptions, and urls are saved into the database. [see database table and flow chart]
Thumbnails are always downloaded, for all items. Most files can be downloaded from the ‘download’ link. For pictures, the actual picture is downloaded. In the links section, text links from the linkdump must be added to the database like everything else, but nothing needs to be downloaded.
Each time the program runs, it must spider ONLY pages that are new since its last execution. One way of doing this is to start checking each section at page 1 and increment until you hit at least 10 links that already exist in the database. [see flow chart] Also make sure the same file is never downloaded twice, by checking against the database.
All files must be downloaded into a single folder and named [id] + [extension]. For example, [url removed, login to view] would become 123.mpeg. Thumbnails must be saved into a separate folder, and named [id] + [textension].
Program must check on every execution for any files that have not been downloaded successfully and attempt to download them. [see flow chart]
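The retry pass for not-yet-downloaded records might look like the sketch below. Again this is illustrative Python, not the requested C#; `retry_undownloaded`, `fetch_file_url`, and `download_file` are hypothetical names standing in for the real page-fetching and file-saving code.

```python
def retry_undownloaded(records, fetch_file_url, download_file):
    """Second pass from the flow chart: retry records not marked downloaded.

    records: dicts with at least 'url' and 'downloaded' keys.
    fetch_file_url(page_url): follows the page's 'download' link and
        returns the direct file url (hypothetical callable).
    download_file(file_url, record): saves the file as [id] + [extension]
        and returns True on success (hypothetical callable).
    """
    for rec in records:
        if rec["downloaded"]:
            continue
        try:
            file_url = fetch_file_url(rec["url"])
            if download_file(file_url, rec):
                rec["downloaded"] = True  # real version: update the db row
        except Exception:
            continue  # leave the record for the next run to retry
```

Swallowing the exception and moving on is deliberate: a single dead link must not abort the pass, since the flow chart has every failed record retried on the next execution anyway.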
Program must have a config file or another way to configure the following parameters:
1) path to file folder, path to the thumbnail folder
2) sql server connection string
3) all regular expressions used in the code
4) scan frequency if the program is a service (how often it spiders, in hours)
5) a filter regex to ignore certain links (before a record is added to the database, the filter regex is run against the ‘url’ string; if it matches, that file or link is ignored)
6) number of matches before section scan is terminated (default 10)
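A config file covering the six parameters above might look like this INI-style sketch. Every key name and value here is illustrative; in a C# implementation the same settings could just as well live in an App.config appSettings section.

```ini
; All key names and values are examples, not a required format.
[paths]
file_folder       = C:\dumpalink\files
thumbnail_folder  = C:\dumpalink\thumbs

[database]
connection_string = Server=.\SQLEXPRESS;Database=dumpalink;Trusted_Connection=True;

[spider]
scan_frequency_hours = 24      ; only used when running as a service
match_limit          = 10      ; known links before a section scan stops
filter_regex         = \.exe$  ; example filter: links matching this are skipped

[regex]
; one entry per expression used to parse listing and download pages;
; the actual patterns depend on the site's markup
item_regex     = ...
download_regex = ...
```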
Answer the following questions along with your bid, bids without answers will not be considered:
1. Can you work in C#? If not, what language do you plan to use for development?
2. Do you intend to develop a Windows service or an executable triggered by Task Scheduler?
3. Have you done similar projects before?
4. Do you have MS SQL 2005? An express version is available for free from Microsoft.
The flow chart got completely mangled by GOF. See attached doc.