Analysis of n-grams (simple text analysis) in SEQUENTIAL, PARALLEL AND DISTRIBUTED programming + LaTeX Report
N-grams are sequences of n words. Knowing which n-grams occur in a given
text, and how frequently, is useful in various search-related problems and in
text analysis. Given an input corpus (text), construct a list of frequencies of
n-grams, where n is a program parameter. We update the list by weighting
the elements by the probability of occurrence of parts of n-grams.
Counting n-grams. Write a program that, given an input n (the size of the n-gram)
and a corpus, returns a list of all n-grams with their corresponding frequencies (number of occurrences in the given corpus).
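A minimal sketch of the counting step, under stated assumptions: words are lowercased and split on whitespace (the brief does not define tokenization), and the function name `count_ngrams` is hypothetical. It holds the whole word list in memory, so a 100 MB corpus would likely need a streaming variant.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count every run of n consecutive words in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Assumption: lowercase everything and split on whitespace.
    words = text.lower().split()
    # Zipping n staggered views of the word list yields each n-gram as a tuple.
    return Counter(zip(*(words[i:] for i in range(n))))


counts = count_ngrams("the cat sat on the mat the cat ran", 2)
for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

Tuples are used as keys so the same function works for any n without string joining.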
Relative frequencies. Some n-grams can occur frequently simply because one
of their words is very frequent. An interesting statistic can be obtained by
dividing the number of occurrences of the n-gram "A B" by the total number
of occurrences of all n-grams that begin with the word "A". This gives P(B|A),
the probability of seeing "B" following the word "A". The program should print
the n-grams together with this metric.
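The relative-frequency metric could be computed roughly as follows. This is a sketch, not the required implementation: the name `relative_frequencies` is hypothetical, and for n > 2 it assumes that "A" means the first n-1 words of the n-gram, which the brief does not spell out. For bigrams this matches the definition above exactly.

```python
from collections import Counter


def relative_frequencies(ngram_counts):
    """Map each n-gram to count(n-gram) / count(all n-grams with the same prefix)."""
    prefix_totals = Counter()
    for gram, freq in ngram_counts.items():
        # Assumption: the prefix "A" is everything except the last word.
        prefix_totals[gram[:-1]] += freq
    return {
        gram: freq / prefix_totals[gram[:-1]]
        for gram, freq in ngram_counts.items()
    }


bigram_counts = Counter({
    ("the", "cat"): 2,
    ("the", "mat"): 1,
    ("cat", "sat"): 1,
    ("cat", "ran"): 1,
})
probs = relative_frequencies(bigram_counts)
for gram, p in sorted(probs.items()):
    print(" ".join(gram), round(p, 3))
```

Here "the" starts three bigrams in total, so P(cat|the) is 2/3 and P(mat|the) is 1/3.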
1. Running the program:
• The program can be run in different modes (sequential, parallel,
distributed) by specifying a parameter.
• The user can specify n (the n-gram size) and the input file (text) to analyze.
• The program measures the run time needed to complete.
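One possible shape for the command-line interface described in the list above, as a sketch only. The flag names (`--mode`, `-n`), the positional `corpus` argument, and the helper `timed` are all assumptions; the brief only requires that the mode, n, and the input file be selectable and that run time be measured.

```python
import argparse
import time

MODES = ("sequential", "parallel", "distributed")


def build_parser():
    """Build a parser for the three user-facing parameters."""
    parser = argparse.ArgumentParser(description="Count n-grams in a text corpus.")
    parser.add_argument("corpus", help="path to the input text file")
    parser.add_argument("-n", type=int, default=2, help="size of the n-gram")
    parser.add_argument("--mode", choices=MODES, default="sequential",
                        help="which implementation to run")
    return parser


def timed(func, *args):
    """Run func(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Demo with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--mode", "parallel", "-n", "3", "corpus.txt"])
result, seconds = timed(sum, [1, 2, 3])  # stand-in for the real n-gram job
print(args.mode, args.n, args.corpus, f"{seconds:.6f}s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock meant for measuring elapsed time.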
2. Problem-specific implementation requirements:
• Every version (sequential, parallel, and distributed) measures cycles passed. Every update of positions of all particles is considered
a cycle. All three implementations run the simulation until they
reach a specified number of cycles.
• The implementation must adapt automatically to the hardware it
is being run on (physical CPUs, cores, memory, etc.).
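A sketch of how a parallel version might size itself to the machine, under stated assumptions. The function names are hypothetical, the worker count comes from `os.cpu_count()` (which reports logical CPUs, not physical cores), and each chunk carries n-1 overlapping words so that n-grams straddling a chunk boundary are counted exactly once. Memory limits are not handled here.

```python
import os
from collections import Counter
from multiprocessing import Pool


def count_chunk(task):
    """Count n-grams in one chunk; task is a (words, n) pair."""
    words, n = task
    return Counter(zip(*(words[i:] for i in range(n))))


def split_with_overlap(words, n, parts):
    """Split words into about `parts` slices, each extended by n-1 words."""
    size = max(1, -(-len(words) // parts))  # ceiling division
    return [words[start:start + size + n - 1]
            for start in range(0, len(words), size)]


def parallel_count(words, n, workers=None):
    """Count n-grams using one process per available CPU by default."""
    workers = workers or os.cpu_count() or 1
    tasks = [(chunk, n) for chunk in split_with_overlap(words, n, workers)]
    with Pool(processes=workers) as pool:
        partials = pool.map(count_chunk, tasks)
    return sum(partials, Counter())  # merge the per-chunk counters


# Check the chunking and merging logic without starting any processes.
words = "the cat sat on the mat the cat ran".split()
chunks = split_with_overlap(words, 2, 3)
merged = sum(map(count_chunk, [(c, 2) for c in chunks]), Counter())
print(merged == count_chunk((words, 2)))
```

In a script, `parallel_count(words, n)` would be called under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that spawn rather than fork worker processes. The same split-count-merge pattern would carry over to a distributed version, with machines in place of processes.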
The report must include extensive testing and explanation of results (numeric
and graphical). All three versions must be tested. The parameters that
influence the runtime are the size of the input corpus and n. Consequently,
both need to be tested independently to show how the implementation scales.
Present the results with informative charts/figures and explain them in detail.
To obtain a corpus, use the Internet. Project Opus is
a good source of interesting texts.
• Testing by limiting n
Start with n = 2 and increase it by 1 to obtain a new configuration.
For every configuration, test all three versions with different input files
(corpora). There should be at least 5 different text files. The smallest
corpus is 100 MB in size, with each following corpus 100 MB larger. Measure the runtime for every version, for every configuration,
and for every text file.
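The test matrix described above (every n, every corpus, every version) could be driven by a small harness like the one below. It is a sketch with assumed names: `benchmark` is hypothetical, only a sequential stand-in is registered, and two tiny in-memory word lists replace the real 100 MB+ files. The timings it produces are therefore not meaningful results.

```python
import csv
import io
import time
from collections import Counter


def count_ngrams(words, n):
    """Sequential stand-in for one of the three versions."""
    return Counter(zip(*(words[i:] for i in range(n))))


def benchmark(versions, corpora, n_values):
    """Time every version on every corpus for every n; return one row per run."""
    rows = []
    for n in n_values:
        for corpus_name, words in corpora.items():
            for version_name, func in versions.items():
                start = time.perf_counter()
                func(words, n)
                elapsed = time.perf_counter() - start
                rows.append({"n": n, "corpus": corpus_name,
                             "version": version_name, "seconds": elapsed})
    return rows


# Tiny placeholders; real runs would load the corpus files from disk.
corpora = {"tiny_a": "a b a b a".split(), "tiny_b": "x y z x y z x".split()}
versions = {"sequential": count_ngrams}
rows = benchmark(versions, corpora, n_values=[2, 3])

# Emit CSV so the numbers can be charted for the report.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["n", "corpus", "version", "seconds"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Writing one row per (n, corpus, version) run keeps the data in a shape that plotting tools can group by any of the three factors.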
3 freelancers are bidding an average of $83 for this job
Hi, I'm an expert in text analysis and Java programming. I'm sure that I can easily do this project for you. We can have a chat about it. Thanks.
Hello. I read your requirements carefully. I'm very interested in your project. I'm a talented Java developer. If you assign this project to me, you can get a cool result. Please contact me. Thanks. Best regards.