I need a mini-app (compiled C on Linux) that groups similar sentences together.
I have 100,000 sentences (say in a PostgreSQL DB, Unicode text). It must perform VERY fast - by indexing each root word to a 16-bit integer (which would reduce its memory footprint), then creating a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar sentence length.
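To make the indexing step concrete, here is a minimal sketch of mapping each word to a 16-bit ID. Everything named here (WordDict, dict_new, dict_intern, encode_sentence, the FNV-1a hash, the table size) is my own assumption and not part of the brief. It also assumes words are already reduced to their roots, that the text is UTF-8 split on ASCII spaces, and that there are fewer than 65,535 distinct words.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DICT_SLOTS 131072u  /* power of two, twice the number of possible IDs */
#define MAX_WORDS  65535u   /* 0xFFFF is reserved to signal "dictionary full" */

typedef struct {
    char    *keys[DICT_SLOTS];  /* owned copies of words; NULL marks an empty slot */
    uint16_t ids[DICT_SLOTS];   /* ID stored alongside each key */
    uint32_t count;             /* number of distinct words seen so far */
} WordDict;

/* Allocate an empty dictionary on the heap (it is too large for the stack). */
WordDict *dict_new(void) {
    return calloc(1, sizeof(WordDict));
}

/* FNV-1a string hash (an arbitrary choice for this sketch). */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Return the 16-bit ID for a word, assigning the next free ID on first sight.
 * Returns 0xFFFF if the dictionary is full or memory runs out. */
uint16_t dict_intern(WordDict *d, const char *word) {
    assert(d != NULL && word != NULL);
    uint32_t i = fnv1a(word) & (DICT_SLOTS - 1);
    while (d->keys[i] != NULL) {            /* linear probing */
        if (strcmp(d->keys[i], word) == 0)
            return d->ids[i];
        i = (i + 1) & (DICT_SLOTS - 1);
    }
    if (d->count >= MAX_WORDS)
        return 0xFFFF;
    size_t n = strlen(word) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return 0xFFFF;
    memcpy(copy, word, n);
    d->keys[i] = copy;
    d->ids[i] = (uint16_t)d->count++;
    return d->ids[i];
}

/* Split text on spaces and write one ID per word into out.
 * Returns the word count; words longer than 255 bytes are truncated. */
size_t encode_sentence(WordDict *d, const char *text,
                       uint16_t *out, size_t max_out) {
    char buf[256];
    size_t n = 0;
    while (*text != '\0' && n < max_out) {
        while (*text == ' ')
            ++text;
        size_t len = 0;
        while (text[len] != '\0' && text[len] != ' ')
            ++len;
        if (len == 0)
            break;                          /* only trailing spaces were left */
        size_t copy = len < sizeof buf - 1 ? len : sizeof buf - 1;
        memcpy(buf, text, copy);
        buf[copy] = '\0';
        out[n++] = dict_intern(d, buf);
        text += len;
    }
    return n;
}
```

The count returned by encode_sentence could double as the key for the length buckets: sentences whose counts fall in the same range would go into the same bucket.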
Then iterate through doing word-by-word comparisons (16bit comparisons).
Two algorithms are acceptable:
1. Simple - Take a source sentence and iterate through, XORing word by word (irrespective of word order or word frequency). If more than x words are outstanding, then it is NOT a similar sentence. Here x would be 25% of the total number of words.
We leave such a large gap so that we don't need to worry about word roots.
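A rough sketch of this first pass, under my own assumptions: sentences are already arrays of 16-bit word IDs, "outstanding" means IDs that appear in only one of the two sentences, and "total words" means the word count of the source sentence. Note that it counts the symmetric difference with a mark table instead of a literal bitwise XOR; the function names outstanding_words and is_candidate are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One flag per possible word ID. Static, so this sketch is not thread-safe. */
static uint8_t mark[65536];

/* Count word IDs that occur in exactly one of the two sentences.
 * Word order and repeated words are ignored. */
size_t outstanding_words(const uint16_t *a, size_t na,
                         const uint16_t *b, size_t nb) {
    assert(a != NULL && b != NULL);
    size_t diff = 0;

    for (size_t i = 0; i < na; ++i)
        mark[a[i]] = 1;                 /* 1 = seen in a only (so far) */

    for (size_t j = 0; j < nb; ++j) {
        if (mark[b[j]] == 0) {          /* only in b: count it once */
            mark[b[j]] = 3;
            ++diff;
        } else if (mark[b[j]] == 1) {   /* in both sentences */
            mark[b[j]] = 2;
        }
    }

    for (size_t i = 0; i < na; ++i) {
        if (mark[a[i]] == 1) {          /* only in a: count it once */
            mark[a[i]] = 2;
            ++diff;
        }
    }

    /* Reset only the touched entries so the next call starts clean. */
    for (size_t i = 0; i < na; ++i)
        mark[a[i]] = 0;
    for (size_t j = 0; j < nb; ++j)
        mark[b[j]] = 0;

    return diff;
}

/* Non-zero if the outstanding words are at most 25% of the source length
 * (assumed reading of "25% of the number of total words"). */
int is_candidate(const uint16_t *src, size_t nsrc,
                 const uint16_t *other, size_t nother) {
    return outstanding_words(src, nsrc, other, nother) * 4 <= nsrc;
}
```

Pairs that pass is_candidate would form the smaller data set handed to the character-level comparison.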
From the smaller data set, we then proceed to do a classic Levenshtein comparison, but with an upper bound of x deviation: once it detects more than, say, 10% deviation, it exits that comparison. This is a character-by-character comparison.
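The bounded comparison could look something like the sketch below. It is a standard two-row Levenshtein with an early exit once every cell in the current row exceeds the bound. As assumptions of mine: it compares bytes, so a multi-byte UTF-8 character counts as several edits, and the name bounded_levenshtein is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Edit distance between a and b, or max_dist + 1 as soon as the distance
 * is known to exceed max_dist. Also returns max_dist + 1 if memory runs out,
 * which treats the pair as "not similar". */
size_t bounded_levenshtein(const char *a, const char *b, size_t max_dist) {
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);

    /* The length difference alone is a lower bound on the distance. */
    size_t gap = la > lb ? la - lb : lb - la;
    if (gap > max_dist)
        return max_dist + 1;

    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *cur  = malloc((lb + 1) * sizeof *cur);
    if (prev == NULL || cur == NULL) {
        free(prev);
        free(cur);
        return max_dist + 1;
    }

    for (size_t j = 0; j <= lb; ++j)
        prev[j] = j;

    for (size_t i = 1; i <= la; ++i) {
        cur[0] = i;
        size_t row_min = cur[0];
        for (size_t j = 1; j <= lb; ++j) {
            size_t cost = (a[i - 1] != b[j - 1]);
            size_t best = prev[j - 1] + cost;       /* substitute or match */
            if (prev[j] + 1 < best)
                best = prev[j] + 1;                 /* delete from a */
            if (cur[j - 1] + 1 < best)
                best = cur[j - 1] + 1;              /* insert into a */
            cur[j] = best;
            if (best < row_min)
                row_min = best;
        }
        /* Row minima never decrease, so the final distance is out of bound. */
        if (row_min > max_dist) {
            free(prev);
            free(cur);
            return max_dist + 1;
        }
        size_t *tmp = prev;
        prev = cur;
        cur = tmp;
    }

    size_t d = prev[lb];
    free(prev);
    free(cur);
    return d <= max_dist ? d : max_dist + 1;
}
```

For the 10% figure, one possible choice is max_dist = strlen(source) / 10, with the pair treated as similar when the result is no greater than max_dist.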
The app should read from a folder of .gz files that contain the text, and it could use a text boundary to distinguish each sentence.
The output needs to be a new text file that sorts every sentence into groups of similar sentences, with the groups separated by a text boundary.
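One possible way to produce that output, sketched under assumptions of mine: similarity is treated as transitive (if A matches B and B matches C, all three land in one group), groups are tracked with a union-find array, and the boundary is whatever marker line gets chosen. The names uf_find, uf_union and write_groups are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Find the representative of x's group, shortening the path as we go. */
static size_t uf_find(size_t *parent, size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Merge the groups of sentences a and b (call this for each similar pair). */
void uf_union(size_t *parent, size_t a, size_t b) {
    size_t ra = uf_find(parent, a);
    size_t rb = uf_find(parent, b);
    if (ra != rb)
        parent[rb] = ra;
}

/* Write each group's sentences one per line, followed by the boundary line.
 * Returns 0 on success, -1 on a write error.
 * This scan is O(n * groups); sorting indices by root would scale better. */
int write_groups(FILE *out, const char *const *sentences,
                 size_t *parent, size_t n, const char *boundary) {
    assert(out != NULL && sentences != NULL);
    assert(parent != NULL && boundary != NULL);
    for (size_t r = 0; r < n; ++r) {
        if (uf_find(parent, r) != r)
            continue;                       /* only roots start a group */
        for (size_t i = 0; i < n; ++i) {
            if (uf_find(parent, i) == r &&
                fprintf(out, "%s\n", sentences[i]) < 0)
                return -1;
        }
        if (fprintf(out, "%s\n", boundary) < 0)
            return -1;
    }
    return 0;
}
```

The parent array starts with parent[i] = i for every sentence, so a sentence with no similar partner ends up as a group of one.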
I need something in 36 hours. A mediocre algorithm is fine.
12 freelancers are bidding an average of $440 for this job
You can trust my expertise, I can finish in time, thanks a lot! I am very proficient in c and c++. I have 16 years c++ developing experience now, and have worked for more than 7 years. My work is online game developin
Hello, I'm c developer with 6+ years of experience and mathematician with a number of publications. Also I'm participant and problem writer of many algorithm competitions (Topcoder, ACM ICPC, etc). Just 2 weeks
Hi, I have 4 years of experience in C/C++ development in a Linux environment. Looking forward to your response to discuss further. Regards, Akram