In Progress

488486 Database Scraper Wanted

I need a small application that will scrape this site: [url removed, login to view]

and

[url removed, login to view]

and return data in a clean tab delimited format.

A typical query results in one of three different records:

Salesperson - [url removed, login to view]

Broker - [url removed, login to view]

Corporation - [url removed, login to view]

SPECIFICATION:

Each record should contain a field populated with the record type (Salesperson, Broker or Corporation).

All data for each record type should be delimited into matching fields (see examples at bottom).

Null or empty records should be flagged by populating the "Record Validity" field with "Null Record".

All records should include a field populated with all characters in the source URL that follow the "=" character, i.e. the record ID. This is usually 8 digits (e.g. 01172800).
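The record ID rule above can be sketched as follows. This is a minimal illustration, not a required implementation: the function name and the example host are made up, and it assumes that when a URL contains more than one "=", the ID is whatever follows the last one. The ID is kept as a string so leading zeros survive.

```python
def record_id_from_url(url: str) -> str:
    """Return everything after the last '=' in the URL (the record ID)."""
    _, sep, tail = url.rpartition("=")
    if not sep:
        # Assumption: a URL without '=' has no usable record ID.
        raise ValueError(f"no '=' found in URL: {url!r}")
    return tail.strip()


if __name__ == "__main__":
    # Hypothetical URL; only the License_id parameter name comes from the posting.
    print(record_id_from_url("http://example.invalid/lookup?License_id=01172800"))
```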

Certain records have an addendum to the "Mailing Address" field: "(Above address is marked unreliable in DRE database)" (for example, see [url removed, login to view]). This line should be removed from the address and placed in a separate field named "Address Validity", populated with that exact string, i.e. "Above address is marked unreliable in DRE database".
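One possible way to separate the addendum from the address is sketched below. The function name is hypothetical, and it assumes the note always appears in parentheses exactly as quoted in this posting; records without the note get an empty "Address Validity" value.

```python
UNRELIABLE_NOTE = "Above address is marked unreliable in DRE database"


def split_address_validity(mailing_address: str) -> tuple[str, str]:
    """Return (address without the note, Address Validity value)."""
    note = f"({UNRELIABLE_NOTE})"
    if note in mailing_address:
        # Drop the parenthesised note and any whitespace it leaves behind.
        cleaned = mailing_address.replace(note, "").strip()
        return cleaned, UNRELIABLE_NOTE
    return mailing_address.strip(), ""
```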

Certain records have multi-line fields. This data should be stored with one CR/LF between lines and two CR/LF between apparent records (see [url removed, login to view] and [url removed, login to view] & [url removed, login to view] License_id=00511018 for examples).

The last field, "COMMENT", is the exception: its lines are concatenated into a single line with a comma between each.
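The two joining rules above could look like this in practice. This is a sketch under stated assumptions: the function names are invented, each "apparent record" is assumed to arrive as a list of lines already extracted from the page, and the comment separator is taken to be a comma followed by a space, as in the example data further down.

```python
CRLF = "\r\n"


def join_multiline(records: list[list[str]]) -> str:
    """One CR/LF between lines, two CR/LF between apparent records."""
    return (CRLF * 2).join(
        CRLF.join(line.strip() for line in record) for record in records
    )


def join_comment(lines: list[str]) -> str:
    """Concatenate COMMENT lines into one line, comma separated."""
    return ", ".join(line.strip() for line in lines if line.strip())
```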

No leading or trailing spaces in fields.

No escaped strings in fields ( "ABC, Inc." should read ABC, Inc. ).
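A field cleanup step covering the two rules above might look like the sketch below. The posting does not define "escaped" precisely, so this assumes it means wrapping double quotes and HTML entities left over from the page source; the function name is hypothetical.

```python
import html


def clean_field(raw: str) -> str:
    """Trim whitespace, decode HTML entities, and drop wrapping double quotes."""
    text = html.unescape(raw).strip()
    # Assumption: a field fully wrapped in double quotes is an escaped string.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1].strip()
    return text
```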

The application should be able to read a text file containing URL record identifier values and use it to cycle through the specified records. The instruction file format should support both a list of individual numbers separated by CR/LF, such as:

00511000

00511055

00511098

00511025

00511005

and a range with a delimiter, such as:

00511000 [TAB CHARACTER] 00709999
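Reading an instruction file in both formats could be sketched as follows. The function name is hypothetical, and several details are assumptions not stated in the posting: ranges are treated as inclusive at both ends, blank lines are skipped, and generated IDs are zero-padded to 8 digits.

```python
from typing import Iterator


def iter_record_ids(text: str, width: int = 8) -> Iterator[str]:
    """Yield record IDs from instruction-file text, expanding TAB ranges."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        if "\t" in line:
            # "start<TAB>end" is expanded as an inclusive range (assumption).
            start, end = (part.strip() for part in line.split("\t", 1))
            for number in range(int(start), int(end) + 1):
                yield str(number).zfill(width)
        else:
            yield line
```

Because this is a generator, a large range such as the one in the example is never held in memory all at once.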

The application should identify itself as an IE/Mozilla browser and be indistinguishable from a normal browser.
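At a minimum, identifying as a browser means sending browser-like request headers. The sketch below uses Python's standard urllib; the header values are example values chosen for illustration, and the function name is hypothetical. Headers alone do not guarantee that traffic is indistinguishable from a real browser, so treat this as a starting point only.

```python
import urllib.request

# Example header values (assumptions); substitute whatever browser
# profile the client actually wants to present.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
        "Gecko/20100101 Firefox/115.0"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying browser-like headers (no network I/O here)."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)
```

The request would then be sent with `urllib.request.urlopen(build_request(url))`.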

First record value: [url removed, login to view]

Last record value: [url removed, login to view]

*******************EXAMPLE***************************************************

HEADER:

License Type [TAB CHARACTER] Name [TAB CHARACTER] Mailing Address [TAB CHARACTER] License ID [TAB CHARACTER] Expiration Date [TAB CHARACTER] License Status [TAB CHARACTER] Salesperson License Issued [TAB CHARACTER] Former Names [TAB CHARACTER] Employing Broker [TAB CHARACTER] Comment

DATA:

SALESPERSON [TAB CHARACTER] Lo Bue, Robert Anthony [TAB CHARACTER] 131 OTTAWA WAY [TAB CHARACTER] FREMONT [TAB CHARACTER] CA [TAB CHARACTER] 94538 [TAB CHARACTER] 00796892 [TAB CHARACTER] 09/15/84 [TAB CHARACTER] EXPIRED [TAB CHARACTER] 09/16/80 (Unofficial -- taken from secondary records) [TAB CHARACTER] NO FORMER NAMES [TAB CHARACTER] NO CURRENT EMPLOYING BROKER [TAB CHARACTER] NO DISCIPLINARY ACTION, NO OTHER PUBLIC COMMENTS

HEADER:

License Type [TAB CHARACTER] Name [TAB CHARACTER] Mailing Address [TAB CHARACTER] License ID [TAB CHARACTER] Expiration Date [TAB CHARACTER] License Status [TAB CHARACTER] Broker License Issued [TAB CHARACTER] Former Names [TAB CHARACTER] Main Office [TAB CHARACTER] DBA [TAB CHARACTER] Branches [TAB CHARACTER] Affiliated Licensed Corporations [TAB CHARACTER] Sales Persons [TAB CHARACTER] Comment

DATA:

BROKER [TAB CHARACTER] Myers, Richard Arthur [TAB CHARACTER] 3444 FREEMAN ROAD [TAB CHARACTER] WALNUT CREEK [TAB CHARACTER] CA [TAB CHARACTER] 94595 [TAB CHARACTER] 00796893 [TAB CHARACTER] 02/06/11 [TAB CHARACTER] LICENSED [TAB CHARACTER] 10/07/88 (Unofficial -- taken from secondary records) [TAB CHARACTER] NO FORMER NAMES [TAB CHARACTER] 3444 FREEMAN RD

WALNUT CREEK [TAB CHARACTER] CA [TAB CHARACTER] 94595 [TAB CHARACTER] NO CURRENT DBAS [TAB CHARACTER] NO CURRENT BRANCHES [TAB CHARACTER] NO CURRENT AFFILIATED CORPORATIONS [TAB CHARACTER] NO DISCIPLINARY ACTION, NO OTHER PUBLIC COMMENTS

*******************EXAMPLE***************************************************

Skills: Anything Goes, ASP, Database Administration, Web Scraping

About the Employer:
( 1 review )

Project ID: #2234397