Manual Web Data Collection Project
9 cents per website processed. Approximately 40 websites can be processed in an hour. I am looking to process about 1500 websites at $0.09 or less per site.
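For bidders who want to check the numbers, the figures above work out as follows. This is only a rough sketch: it assumes the full 1500 sites are processed at the maximum rate of 9 cents and at the estimated pace of 40 sites per hour. The variable names are illustrative.

```python
# Figures taken from the posting; amounts are kept in whole cents to avoid rounding issues.
RATE_CENTS = 9          # maximum price per website
SITES = 1500            # approximate number of websites
SITES_PER_HOUR = 40     # estimated pace

total_cents = RATE_CENTS * SITES              # total budget in cents
hours = SITES / SITES_PER_HOUR                # estimated working time
hourly_cents = RATE_CENTS * SITES_PER_HOUR    # effective hourly rate in cents

print(f"Total budget:   ${total_cents / 100:.2f}")
print(f"Estimated time: {hours} hours")
print(f"Effective rate: ${hourly_cents / 100:.2f} per hour")
```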
Overview of Deliverable:
This is a data collection project. The deliverable is an XLS file from Microsoft Excel which contains the collected text. The data entry worker collects data from web pages and inserts the data into the spreadsheet, one row per website processed.
These are the steps involved:
1. I provide a list of search terms for Google and Yahoo. (See [url removed, login to view] file) For each search term in the input file, perform the following procedure once for Google, and once for Yahoo.
2. Before beginning, the worker needs to set up the search engine to display 10 search results per page.
3. For each search term:
a. Enter it into the search engine and get a page of search results, 10 per page
b. Search results appear, with advertisements. Click through each advertisement that appears. Advertisements appear at the top of the search results and also along the right side of the search results. (See Figure 1.)
c. Click through each ad, one by one. For each ad you click, there is a resulting website. For each website, collect and insert the following information into an Excel spreadsheet:
i. Search Term used- the search string used
ii. Search Engine- the search engine used
iii. Web Address- the web site address that appears in the browser address bar when you click the advertisement.
iv. Phone – the phone number found in the ‘contact us’ page of the web site. The phone number is also often found at the bottom of the home page.
v. Email- the email found in the ‘contact us’ page of the web site. The email is also often found at the bottom of the home page.
vi. State- The state where the advertiser is located in the USA.
vii. Page Number- the results page number where the ad is found. The first 10 ads are on page 1, the next 10 are on page 2, etc. Enter “a-01” to denote page 1, “a-02” to denote page 2, etc.
d. Insert one row into Excel for each advertisement clicked and processed. A sample deliverable XLS file is provided ([url removed, login to view]). All of the results can go on a single worksheet named ‘Sheet1’.
e. Process up to 10 pages of ads which are found to the right, next to search results. Use judgement and stop processing ads when the ads no longer seem to be related to the search term you are currently using.
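The row layout described in steps 3c and 3d can be sketched as below. This is only an illustration of the expected columns and the "a-01" page code, not a required tool. The helper names (page_code, page_for_ad, make_row) are hypothetical, the column headings are inferred from the list in step 3c, and the sample values are placeholders. The sample deliverable file remains the authority on the exact layout.

```python
# Column order inferred from items i to vii in step 3c (assumption).
COLUMNS = ["Search Term", "Search Engine", "Web Address",
           "Phone", "Email", "State", "Page Number"]

def page_code(page: int) -> str:
    """Format a results page number as 'a-01', 'a-02', ... (up to 10 pages)."""
    if not 1 <= page <= 10:
        raise ValueError("only pages 1 to 10 are processed")
    return f"a-{page:02d}"

def page_for_ad(ad_position: int) -> int:
    """Ads 1-10 are on page 1, ads 11-20 on page 2, and so on."""
    return (ad_position - 1) // 10 + 1

def make_row(term, engine, address, phone, email, state, page):
    """Build one spreadsheet row (one per advertisement processed)."""
    values = [term, engine, address, phone, email, state, page_code(page)]
    return dict(zip(COLUMNS, values))

# Placeholder values only; a missing email is left as an empty string.
row = make_row("example search term", "Google", "http://www.example.com",
               "(555) 555-0100", "", "TX", page_for_ad(11))
print(row["Page Number"])
```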
4. Tips on how to collect the data:
a. [Control][C] and [Control][V] are best for copying and pasting in this project. Use [Control][C] to copy text from web pages. Remove the web formatting by pasting with [Control][V] into Notepad. Then copy the unformatted text from Notepad and paste it into the correct cell in your Excel spreadsheet. This procedure can be used for the Phone and Email data items you locate and copy from each web site.
b. The phone and email can be found on the home page, on the “Contact Us” page, or on the “Email Us” page if one exists on the web site. Use judgement to locate these data items on the website. Not all websites provide an email address, so not every row is expected to contain an Email data item. However, about 80% of the rows you create are expected to have the Email data item.
c. Some phone number and email data items are displayed in images rather than as text that you can copy and paste. In such cases you must manually type the Phone number and Email into the spreadsheet deliverable.
d. Do not collect information on ads that seem to go to directories. I am interested in the web sites of actual vendors only. Exercise judgement in this regard.
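As a self-check before delivery, the two expectations in these tips (clean, unformatted text and roughly 80% of rows with an email) can be sketched as below. This is an optional illustration, not part of the required procedure. It assumes each row is held as a dictionary with an "Email" key, and the function names are hypothetical.

```python
def clean_cell(text: str) -> str:
    """Strip stray web whitespace: non-breaking spaces, line breaks, doubled spaces."""
    return " ".join(text.replace("\u00a0", " ").split())

def email_coverage(rows) -> float:
    """Return the fraction of rows that have a non-empty Email value."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get("Email", "").strip())
    return filled / len(rows)

# Placeholder rows: four with an email, one without.
sample_rows = [{"Email": "info@example.com"}] * 4 + [{"Email": ""}]
coverage = email_coverage(sample_rows)
print(f"Rows with an email: {coverage:.0%}")
if coverage < 0.8:
    print("Below the expected 80%; re-check the Contact Us pages.")
```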
1. [url removed, login to view]: Sample input file.
2. [url removed, login to view]: Sample output file
3. [url removed, login to view]: Sample search results with advertisements