We have a constantly updated database of news articles and are looking for a way to group them by similarity, so that all the news stories on the same subject are listed together as one group rather than as separate items, similar to the way Google News does it. The articles are in Romanian, if that is of any relevance.
The project entails creating a PHP script that will go through the database and compare articles based on their text. One possible solution is to extract word bi-grams/tri-grams and compare the articles by those, but any working solution is welcome.
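To make the bi-gram idea concrete, here is a minimal sketch of one way it could work: each article is turned into a set of word n-grams, two articles are compared by Jaccard similarity (shared n-grams divided by all n-grams), and articles are grouped greedily against the first article of each group. It is written in Python only for brevity and would need porting to PHP. The function names, the 0.2 threshold, the greedy grouping strategy and the sample headlines are all assumptions made for illustration, not requirements of the project.

```python
import re

def word_ngrams(text, n=2):
    """Return the set of word n-grams (as tuples) found in text."""
    # In Python 3, \w matches Unicode letters, so Romanian diacritics are kept.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_articles(articles, n=2, threshold=0.2):
    """Greedy grouping: an article joins the first group whose first member
    it resembles closely enough, otherwise it starts a new group.
    The threshold is an assumed value and would need tuning on real data."""
    groups = []  # each entry: (n-grams of the group's first article, [article ids])
    for article_id, text in articles:
        grams = word_ngrams(text, n)
        for rep_grams, members in groups:
            if jaccard(grams, rep_grams) >= threshold:
                members.append(article_id)
                break
        else:
            groups.append((grams, [article_id]))
    return [members for _, members in groups]

# Made-up sample data: (id, text) pairs standing in for database rows.
articles = [
    (1, "Guvernul a aprobat bugetul de stat pentru anul viitor"),
    (2, "Bugetul de stat pentru anul viitor a fost aprobat de guvern"),
    (3, "Echipa națională a câștigat meciul de aseară"),
]

print(group_articles(articles))  # [[1, 2], [3]]
```

Articles 1 and 2 share five bi-grams out of thirteen distinct ones, so they land in the same group, while article 3 shares none with them. On full article texts, tri-grams, stop-word removal or stemming for Romanian might give better results; that is left open.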
You're welcome to contact me by PMB with your idea. Please don't place a bid until you have done so.