I need a Java program that reads a training set of spam and legitimate emails from a directory. It will read all the emails, break them into words, and put all the words into a one-dimensional array together with their frequencies. Each email must then have its own array, with the same length as the previous array, for the words with the highest frequency, and each of these vectors must state whether the email is spam or not. The program will use a stop list containing all the unnecessary words and symbols, such as "and", "on", and punctuation like ( , . " ), and it will remove them from the vectors. We need to decide which of the words to keep.
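The steps described above (tokenizing, removing stop words, counting frequencies, keeping the most frequent words, and building one vector per email) could be sketched roughly as follows. This is only an illustration under assumptions: the class name `SpamFeatures`, the method names, the tiny built-in stop list, and the in-memory sample emails are all hypothetical, and reading the files from the directory and deciding how labels are obtained are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SpamFeatures {
    // Hypothetical stop list; the real one would probably be loaded from a file.
    static final Set<String> STOP_WORDS = Set.of("and", "on", "the", "a", "to");

    /** Lower-cases the text, splits on anything that is not a letter or digit, and drops stop words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Counts how often each remaining word occurs across all emails. */
    static Map<String, Integer> countWords(List<String> emails) {
        Map<String, Integer> freq = new HashMap<>();
        for (String email : emails) {
            for (String w : tokenize(email)) {
                freq.merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** Returns the k most frequent words; ties are broken alphabetically so the order is deterministic. */
    static List<String> topWords(Map<String, Integer> freq, int k) {
        List<String> words = new ArrayList<>(freq.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(freq.get(b), freq.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(k, words.size())));
    }

    /** Builds a count vector for one email, with one position per vocabulary word. */
    static int[] vectorize(String email, List<String> vocabulary) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : tokenize(email)) {
            counts.merge(w, 1, Integer::sum);
        }
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = counts.getOrDefault(vocabulary.get(i), 0);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Made-up sample emails; a real run would read them from the training directory.
        List<String> emails = List.of("Buy cheap pills, buy now!", "Meeting on Monday and lunch");
        boolean[] isSpam = {true, false}; // labels assumed to be known for the training set
        List<String> vocabulary = topWords(countWords(emails), 3);
        System.out.println(vocabulary); // prints [buy, cheap, lunch]
        for (int i = 0; i < emails.size(); i++) {
            String label = isSpam[i] ? "s" : "l";
            System.out.println(label + " " + Arrays.toString(vectorize(emails.get(i), vocabulary)));
        }
    }
}
```

The cutoff `k` is one possible way to "decide which words to keep"; other criteria, such as a minimum frequency, would fit the same structure.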
Output file format:
The left column will have the names of the .txt testing files
The right column will have the predictions
's' for spam
'l' for legitimate
The two columns must be separated by a tab (\t)
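A minimal sketch of writing the output file in the format described above might look like this. The class name `PredictionWriter`, the output file name `predictions.txt`, and the sample file names are assumptions made for illustration; the predictions themselves would come from whatever classifier is used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PredictionWriter {
    /** Formats one output line: file name, a tab, then "s" (spam) or "l" (legitimate). */
    static String formatLine(String fileName, boolean isSpam) {
        return fileName + "\t" + (isSpam ? "s" : "l");
    }

    /** Writes one line per testing file, in the iteration order of the map. */
    static void write(Path outputFile, Map<String, Boolean> predictions) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : predictions.entrySet()) {
            lines.add(formatLine(e.getKey(), e.getValue()));
        }
        Files.write(outputFile, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical predictions; LinkedHashMap keeps the insertion order of the files.
        Map<String, Boolean> predictions = new LinkedHashMap<>();
        predictions.put("msg1.txt", true);
        predictions.put("msg2.txt", false);
        write(Path.of("predictions.txt"), predictions);
    }
}
```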
I will give you the training set