We are an educational company and need the project below completed for one of our clients.
Our customer’s details:
During this course, you will develop an information portal on a topic of your choice based on focused crawling technology. A focused crawler is a specialized crawler which "learns" a set of target topics from user-provided training data and is then able to automatically classify the web pages it finds, using both structural and content-related features of those pages. The web pages that are classified into your topics of interest should be indexed with the Apache Lucene search engine and be accessible to a user via regular keyword searches. Optionally, you may want to enable the user to browse the crawled pages according to your topics of interest ("topic exploration"), or further cluster the documents according to their contents ("faceted search"). See the Weka library below for more data mining tools.
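To illustrate the "learn from training data, then classify pages" step, here is a minimal sketch using only the Java standard library. It is an assumption for illustration, not the approach prescribed by the course or by BINGO!: it uses a multinomial Naive Bayes model with add-one smoothing over page text only, and ignores structural features. The class name TopicClassifier and its methods are hypothetical. A real implementation would more likely build on Weka and the provided abstract classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: multinomial Naive Bayes over page text with add-one smoothing. */
class TopicClassifier {
    // topic -> (term -> number of occurrences in that topic's training text)
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    // topic -> total number of tokens seen for that topic
    private final Map<String, Integer> totalTerms = new HashMap<>();
    // topic -> number of training documents
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases the text and splits it into alphanumeric tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds one labeled training document (for example, the text of a sample page). */
    void train(String topic, String text) {
        Map<String, Integer> counts = termCounts.computeIfAbsent(topic, k -> new HashMap<>());
        for (String term : tokenize(text)) {
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(topic, 1, Integer::sum);
            vocabulary.add(term);
        }
        docCounts.merge(topic, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns the topic with the highest log-probability for the given page text. */
    String classify(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("classifier has no training data");
        }
        List<String> tokens = tokenize(text);
        String bestTopic = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String topic : termCounts.keySet()) {
            Map<String, Integer> counts = termCounts.get(topic);
            int total = totalTerms.getOrDefault(topic, 0);
            // Log prior: share of training documents labeled with this topic.
            double score = Math.log((double) docCounts.get(topic) / totalDocs);
            for (String term : tokens) {
                int count = counts.getOrDefault(term, 0);
                // Add-one smoothing; the extra "+ 1" reserves a slot for unseen terms.
                score += Math.log((count + 1.0) / (total + vocabulary.size() + 1));
            }
            if (score > bestScore) {
                bestScore = score;
                bestTopic = topic;
            }
        }
        return bestTopic;
    }
}
```

In a crawler, one would call train once per labeled sample page, then call classify on the text of each fetched page and pass only pages assigned to a topic of interest on to the indexer.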
Detailed descriptions of the architecture of focused crawlers are available via the above research papers.
A demo of the BINGO! focused crawler is available for download from the following URL:
[url removed, login to view]
A suggested topic for building your information portal is the computer science domain. If you choose this domain, you may consider crawling and classifying the homepages of computer-science researchers and their publications (which are usually available as PDF files). DBLP, for example, is a very good source for seed URLs in this domain: [url removed, login to view]~ley/db/
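As a rough sketch of how seed URLs could drive the crawl, the following standard-library Java class keeps a frontier of URLs that have not been visited yet, skips duplicates, and returns the highest-scored URL first. Everything here is an assumption for illustration: the class name CrawlFrontier, the score parameter (which could come from a topic classifier), and the suffix-based PDF check are hypothetical and not part of the provided package structure.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Locale;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical sketch: a de-duplicating, score-ordered queue of URLs to crawl. */
class CrawlFrontier {
    private static final class Entry {
        final String url;
        final double score;
        final long order; // insertion order, used to break ties

        Entry(String url, double score, long order) {
            this.url = url;
            this.score = score;
            this.order = order;
        }
    }

    // Highest score first; among equal scores, earliest inserted first.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Entry e) -> e.score).reversed()
                    .thenComparingLong(e -> e.order));
    private final Set<String> seen = new HashSet<>();
    private long counter = 0;

    /** Strips surrounding whitespace and any "#fragment" so the same page is queued once. */
    static String normalize(String url) {
        String u = url.trim();
        int hash = u.indexOf('#');
        return hash >= 0 ? u.substring(0, hash) : u;
    }

    /** Simple check by suffix; URLs with query strings are not recognized. */
    static boolean isPdf(String url) {
        return normalize(url).toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Queues a URL unless it is empty or was already queued; returns true if added. */
    boolean add(String url, double score) {
        String key = normalize(url);
        if (key.isEmpty() || !seen.add(key)) {
            return false;
        }
        queue.add(new Entry(key, score, counter++));
        return true;
    }

    /** Returns the best remaining URL, or null when the frontier is empty. */
    String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    boolean isEmpty() {
        return queue.isEmpty();
    }
}
```

A crawl loop would first add the seed URLs, then repeatedly call next, fetch and classify the page, and add its outgoing links with a score reflecting how relevant the linking page was.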
JAVA PACKAGE STRUCTURE
The file [url removed, login to view] provides a predefined Java package structure and several abstract classes which should serve as the basis for your implementation of the project. The preferred way to edit and compile the Java sources is probably the Eclipse IDE ([url removed, login to view]). To compile the sources, you need to add the three JAR files in the ./lib directory to your Java classpath.
Please read the full project description in the attached docx file.