Web Crawler Identifying Layout
Budget $250-750 USD
I am looking for a solid web crawler that has one task, and one task only:
Identify the different page layouts on a site.
Some sites, especially webshops, have category pages, subcategory pages, product pages, checkout pages...
This crawler should not identify the purpose of a page, but it should be able to take a site with 500,000 pages and identify how many different page layouts there are.
In the end, it should produce a list of every URL, each with a layout ID (XML).
EXAMPLE XML
<website>
<ws_info>
<ws_url>http://domain.com/</ws_url>
<ws_pages>146.000</ws_pages>
<ws_cats>6</ws_cats>
<ws_scraped>01.07.2010 11:53:07</ws_scraped>
</ws_info>
<cpage>
<cpage_scraped>
<cpage_url>http://domain.com/some-page-url</cpage_url>
<ws_cat>3</ws_cat>
</cpage_scraped>
<cpage_scraped>
<cpage_url>http://domain.com/some-page-url</cpage_url>
<ws_cat>6</ws_cat>
</cpage_scraped>
</cpage>
</website>
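As an illustration of the output format above, here is a minimal Python sketch that writes this XML with the standard library's xml.etree.ElementTree. The function name build_report and its parameters are made up for this example; only the element names come from the sample. It assumes ws_pages is the number of listed URLs and ws_cats is the number of distinct layout IDs.

```python
import xml.etree.ElementTree as ET


def build_report(site_url, scraped_at, pages):
    """Build the report XML.

    pages is a list of (url, layout_id) tuples. The function and
    parameter names are hypothetical; the tag names follow the
    EXAMPLE XML in the posting.
    """
    root = ET.Element("website")

    # Site-level summary block.
    info = ET.SubElement(root, "ws_info")
    ET.SubElement(info, "ws_url").text = site_url
    ET.SubElement(info, "ws_pages").text = str(len(pages))
    # Assumption: ws_cats is the count of distinct layout IDs.
    distinct_layouts = {layout_id for _, layout_id in pages}
    ET.SubElement(info, "ws_cats").text = str(len(distinct_layouts))
    ET.SubElement(info, "ws_scraped").text = scraped_at

    # One <cpage_scraped> entry per crawled URL.
    cpage = ET.SubElement(root, "cpage")
    for url, layout_id in pages:
        entry = ET.SubElement(cpage, "cpage_scraped")
        ET.SubElement(entry, "cpage_url").text = url
        ET.SubElement(entry, "ws_cat").text = str(layout_id)

    return ET.tostring(root, encoding="unicode")
```

For a site with hundreds of thousands of pages, a streaming writer would probably be a better fit than building the whole tree in memory; this version only shows the shape of the output.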
Performance and speed matter, and so does how intelligently the scraper tells one page apart from another; both are main ingredients of this scraper.
Some sites have very similar pages. It would be very welcome if the scraper could identify an element as a menu, submenu or navigation and ignore it when comparing pages.
I don't want to scrape a site with 200,000 pages and have the scraper come up with 110,000 different categories of pages.
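One possible way to approach this, sketched in Python with only the standard library: reduce each page to the set of tag paths that occur in it, skip anything that looks like a menu or navigation, and hash the result. Pages with the same hash get the same layout ID. Everything here is an assumption for illustration, not a required design: the names LayoutParser, layout_fingerprint and assign_layout_ids, the list of skipped tags, and the crude "menu"/"nav" substring test on id and class.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Assumption: these tags and id/class hints mark navigation or noise.
SKIP_TAGS = {"nav", "header", "footer", "script", "style"}
SKIP_HINTS = ("menu", "nav")  # crude substring match, easy to over-match


class LayoutParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.path = []        # currently open tags outside skipped areas
        self.skip_depth = 0   # > 0 while inside a skipped subtree
        self.shape = set()    # distinct tag paths seen on the page

    def _is_navigation(self, tag, attrs):
        if tag in SKIP_TAGS:
            return True
        label = " ".join(v or "" for k, v in attrs if k in ("id", "class"))
        return any(hint in label.lower() for hint in SKIP_HINTS)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if not self.skip_depth:
                self.shape.add("/".join(self.path + [tag]))
            return
        if self.skip_depth or self._is_navigation(tag, attrs):
            self.skip_depth += 1
            return
        self.path.append(tag)
        self.shape.add("/".join(self.path))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.path:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.path.pop() != tag:
                pass


def layout_fingerprint(html):
    """Hash the set of tag paths, so repeated items do not change it."""
    parser = LayoutParser()
    parser.feed(html)
    parser.close()
    joined = "\n".join(sorted(parser.shape))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]


def assign_layout_ids(pages):
    """pages maps url -> html; returns url -> layout ID (1, 2, 3, ...)."""
    known = {}
    result = {}
    for url, html in pages.items():
        fingerprint = layout_fingerprint(html)
        result[url] = known.setdefault(fingerprint, len(known) + 1)
    return result
```

Because the fingerprint is a set of paths, a category page with 10 products and one with 40 products come out the same, and differences inside the skipped menu do not count. Exact hash matching is still strict: on real sites a similarity threshold between path sets would likely be needed to avoid the "110,000 categories" problem, and unclosed tags inside a skipped menu can throw off the depth counter in this simple version.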
7 freelancers are bidding an average of $427 for this job
I have been working as a .NET developer for the last six years. I also have experience with SharePoint. I think I am well suited for this work. My core skill set includes C#, SQL Server, .NET Framework, SharePoint and HTML.
Let me help you out with this task. I did a similar task in a semester project for my BS (CS) degree.
I have an MS in CS and 10 years of working experience in the web and search engine fields, and I am experienced in web crawler development.
Hi, we are a group of people working from both India and the US with knowledge of PHP, C#, ASP.NET, data processing, SQL Server, MSSQL, DB2, Joomla and Drupal. We have done several similar projects and we are really interested in...