# Analysis of n-grams (simple text analysis)

Analysis of n-grams (simple text analysis) in sequential, parallel, and distributed programming, plus a LaTeX report

N-grams are sequences of n words. Knowing which n-grams appear in a given text, and how frequently, is useful in various search-related problems and in text analysis. Given an input corpus (text), construct a list of frequencies of n-grams, where n is a program parameter. We update the list by weighting the elements by the probability of occurrence of parts of the n-grams.

Counting n-grams. Write a program that, given an input n (the size of the n-gram) and a corpus, returns a list of all n-grams with their corresponding frequencies (number of occurrences in the given corpus).
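The counting step could be sketched as follows. This is a minimal illustration, not the required design: the class name `NGramCounter`, whitespace tokenization, and lowercasing are all assumptions, since the brief does not say how words are to be separated or normalized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class; the name and tokenization rules are assumptions.
class NGramCounter {

    // Assumption: words are separated by whitespace and compared case-insensitively.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Slides a window of n tokens over the corpus and counts each window.
    static Map<String, Integer> count(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(tokenize("to be or not to be"), 2);
        counts.forEach((gram, c) -> System.out.println(gram + "\t" + c));
    }
}
```

For the sample sentence, the bigram "to be" is counted twice and the other three bigrams once each. The print order is unspecified because a `HashMap` is used.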

Relative frequencies. Some n-grams can occur frequently simply because one of their words is very frequent. An interesting statistic can be obtained by dividing the number of occurrences of the n-gram "A B" by the total number of occurrences of all n-grams that begin with the word "A". This gives P(B|A), the probability of seeing "B" following the word "A". The program should print the n-grams together with this metric.
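One possible way to compute this metric from an existing table of counts is sketched below. The class name `RelativeFrequency` is hypothetical. The brief only spells out the bigram case, so for n > 2 this sketch assumes that "A" means the first n - 1 words of the n-gram; a different reading (for example, only the first word) would need a different `prefixOf`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; assumes n-grams are stored as space-separated strings.
class RelativeFrequency {

    // Assumption: the conditioning part "A" is everything before the last word.
    static String prefixOf(String gram) {
        int cut = gram.lastIndexOf(' ');
        return cut < 0 ? "" : gram.substring(0, cut);
    }

    // Divides each n-gram count by the total count of n-grams sharing its prefix.
    static Map<String, Double> relative(Map<String, Integer> counts) {
        Map<String, Integer> prefixTotals = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            prefixTotals.merge(prefixOf(e.getKey()), e.getValue(), Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int total = prefixTotals.get(prefixOf(e.getKey()));
            result.put(e.getKey(), e.getValue() / (double) total);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the cat", 3);
        counts.put("the dog", 1);
        counts.put("a cat", 2);
        relative(counts).forEach((g, p) -> System.out.printf("%s\t%.2f%n", g, p));
    }
}
```

With these made-up counts, "the cat" gets 3/4 = 0.75, "the dog" gets 1/4 = 0.25, and "a cat" gets 2/2 = 1.00.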

Implementation guidelines

1. Running the program:

• The program can be run in different modes (sequential, parallel, distributed) by specifying a parameter.

• The user can specify n (the n-gram size) and the input file (text) to be used as the corpus.

• The program measures the run time needed to complete.

2. Problem-specific implementation requirements

• Every version (sequential, parallel, and distributed) measures cycles passed. Every update of positions of all particles is considered a cycle. All three implementations run the simulation until they reach a specified number of cycles.

• The implementation must adapt automatically to the hardware it is being run on (physical CPUs, cores, memory, etc.).
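As an illustration of the parallel mode, run-time measurement, and hardware adaptation mentioned above, here is a rough sketch. It is one possible approach and not something prescribed by the brief: the class `ParallelNGramCounter`, the fixed thread pool, and the strategy of splitting n-gram start positions evenly across workers are all assumptions. The worker count is taken from `Runtime.getRuntime().availableProcessors()`, and the elapsed time from `System.nanoTime()`. The distributed mode is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a parallel counter; names and chunking strategy are assumptions.
class ParallelNGramCounter {

    // Counts the n-grams whose first token index lies in [from, to).
    static Map<String, Integer> countRange(List<String> tokens, int n, int from, int to) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = from; i < to; i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> countParallel(List<String> tokens, int n, int workers)
            throws Exception {
        // Number of valid n-gram start positions; windows may read past a chunk boundary.
        int starts = Math.max(0, tokens.size() - n + 1);
        int chunk = Math.max(1, (starts + workers - 1) / workers);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> parts = new ArrayList<>();
            for (int from = 0; from < starts; from += chunk) {
                final int lo = from;
                final int hi = Math.min(starts, from + chunk);
                Callable<Map<String, Integer>> task = () -> countRange(tokens, n, lo, hi);
                parts.add(pool.submit(task));
            }
            // Merge the per-chunk tables on the calling thread.
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> part : parts) {
                part.get().forEach((gram, c) -> total.merge(gram, c, Integer::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> tokens = Arrays.asList("to be or not to be".split(" "));
        int workers = Runtime.getRuntime().availableProcessors(); // adapts to the machine
        long t0 = System.nanoTime();
        Map<String, Integer> counts = countParallel(tokens, 2, workers);
        long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(counts + " in " + elapsedMs + " ms using " + workers + " threads");
    }
}
```

Splitting by start position rather than by token range means no n-gram that straddles a chunk boundary is lost or counted twice. Each worker fills its own map, so no locking is needed until the final merge.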

Testing

The report must include extensive testing and explanation of results (numeric and graphical). All three versions must be tested. The parameters that influence the runtime are the size of the input corpus and n. Consequently, both need to be tested independently to show how the implementation scales. Present the results with informative charts/figures and explain them in detail.

Use the Internet to obtain a corpus. Project Opus [login to view URL] is a good source of interesting texts.

• Testing by limiting n

Start with n = 2 and increase it by 1 to obtain each new configuration. For every configuration, test all three versions with different input files (corpora). There should be at least 5 different text files. The smallest corpus is 100 MB in size, with each following corpus being 100 MB bigger. Measure the runtime for every version, for every configuration, and for every text file.
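The test procedure above could be driven by a small harness along these lines. Everything here is an assumption made for illustration: the class `BenchmarkHarness`, the CSV output format, the upper bound `maxN = 5` (the brief gives no upper limit for n), and the stand-in `countSequential` method, which would be replaced by calls to the real sequential, parallel, and distributed versions. Corpus file paths are taken from the command line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical benchmark driver; names, CSV layout, and maxN are assumptions.
class BenchmarkHarness {

    // Wall-clock time of one task, in milliseconds.
    static long timeMillis(Runnable task) {
        long t0 = System.nanoTime();
        task.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    // Stand-in for the version under test.
    static Map<String, Integer> countSequential(String[] tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    static String csvRow(String version, int n, String file, long millis) {
        return version + "," + n + "," + file + "," + millis;
    }

    public static void main(String[] args) throws IOException {
        int maxN = 5; // assumption: the brief does not state where to stop
        System.out.println("version,n,file,millis");
        for (String name : args) {
            byte[] raw = Files.readAllBytes(Paths.get(name));
            String[] tokens = new String(raw, StandardCharsets.UTF_8).trim().split("\\s+");
            for (int n = 2; n <= maxN; n++) {
                final int size = n;
                long ms = timeMillis(() -> countSequential(tokens, size));
                System.out.println(csvRow("sequential", n, name, ms));
            }
        }
    }
}
```

The CSV rows can then be loaded into a plotting tool to produce the charts for the report. A single timed run per configuration is noisy on the JVM, so repeating each measurement and discarding warm-up runs would likely give more reliable numbers.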

Skills: Java

About the employer:
(2 reviews) Koper, Slovenia

Project ID: #21135250

## 3 freelancers are bidding an average of \$83 for this job

utkarshkatiyar19

Hi, I'm an expert in text analysis and Java programming. I'm sure that I can easily do this project for you. We can have a chat about it. Thanks

\$200 USD in 7 days
(301 reviews)
7.2
AndreiPPB

Hello. I read your requirements carefully. I'm very interested in your project. I'm a talented Java developer. If you assign this project to me, you can get a cool result. Please contact me. Thanks. Best regards.

\$20 USD in 7 days
(7 reviews)
4.4
taimour7

Hello, I have gone through your project details and am really interested in building this program for you. I am a Java developer with over a year of industry experience in developing web applications. Ping me

\$30 USD in 7 days
(0 reviews)
0.0