I have some rather large JSON .jl files, each containing thousands of lines of HTML. Each blob can be anywhere from 100 MB to 3 GB. I would like a script or command that I can point at one of these files; it should parse the file as quickly as possible and output a histogram of word usage, i.e. I want to know the most-used words in, say, a 1 GB file. The output can be a CSV file. Obviously we can't output every word, so we need an adjustable lower threshold.
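To make the request concrete, here is a minimal single-process sketch of the kind of output I mean. It assumes each line of the .jl file is one JSON value, that the HTML sits in string values somewhere inside it, and that a simple regex is good enough to strip tags. The names `count_words`, `write_histogram`, and `min_count` are hypothetical, chosen only for illustration.

```python
import csv
import io
import json
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")     # crude HTML tag stripper (assumption: no real HTML parsing needed)
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def iter_strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def count_words(lines):
    """Count lowercase words across an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for text in iter_strings(json.loads(line)):
            counts.update(WORD_RE.findall(TAG_RE.sub(" ", text).lower()))
    return counts


def write_histogram(counts, out, min_count=2):
    """Write word,count rows to `out`, most frequent first, dropping words below min_count."""
    writer = csv.writer(out)
    writer.writerow(["word", "count"])
    for word, n in counts.most_common():
        if n < min_count:
            break  # most_common() is sorted, so everything after this is below the threshold
        writer.writerow([word, n])
```

Usage would be along the lines of `write_histogram(count_words(open("data.jl", encoding="utf-8")), open("words.csv", "w", newline=""), min_count=100)`, where `data.jl` and `words.csv` are placeholder file names.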
The hard part of this project is dealing with a single large JSON file. We need to utilize every core of the machine we run this on, and we need to be highly efficient in how we analyze the text.
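One possible way to use every core, sketched under the same assumptions as before (one JSON value per line, HTML inside string values): split the file into byte ranges that end on newline boundaries, count each range in a separate process, and merge the partial counts. This is only an illustration of the idea, not a tuned implementation; `chunk_ranges`, `count_range`, and `parallel_count` are hypothetical names.

```python
import json
import os
import re
from collections import Counter
from multiprocessing import Pool

TAG_RE = re.compile(r"<[^>]+>")     # crude tag stripper (assumption)
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def iter_strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def chunk_ranges(path, n_chunks):
    """Split the file into roughly n_chunks contiguous byte ranges ending at line breaks."""
    size = os.path.getsize(path)
    step = max(1, size // n_chunks)
    ranges, start = [], 0
    with open(path, "rb") as f:
        while start < size:
            f.seek(min(start + step, size))
            f.readline()  # move forward to the end of the line we landed in
            end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges


def count_range(task):
    """Worker: count words in the lines stored between two byte offsets."""
    path, start, end = task
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    counts = Counter()
    for line in data.splitlines():
        if line.strip():
            for text in iter_strings(json.loads(line)):
                counts.update(WORD_RE.findall(TAG_RE.sub(" ", text).lower()))
    return counts


def parallel_count(path, workers=None):
    """Count words using one process per core and merge the partial results."""
    workers = workers or os.cpu_count() or 1
    tasks = [(path, s, e) for s, e in chunk_ranges(path, workers * 4)]
    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_range, tasks):
            total.update(partial)
    return total
```

On Windows, the call to `parallel_count` has to sit under an `if __name__ == "__main__":` guard, because worker processes are started by re-importing the script. Using more chunks than workers (four per worker here, an arbitrary choice) helps balance the load when some records are much larger than others.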
If needed, you can assume the machine has enough memory to hold the entire JSON file: for a 3 GB file, we are sure to have 3 GB of memory available. Most likely we will run this on a server with 8+ GB of free space.
Ideally this can run on a Windows machine, but I am open to other platforms if you can make a case for one being better.
15 freelancers are bidding an average of $196 for this job
Hello Sir, I have a lot of experience with JSON data parsing and data mining, and I can help you with this. Please ping me and give me more details. Thanks!
Hi! I am an experienced Python coder. I can develop a script to read and interpret your files quickly and efficiently. Please contact me for further discussion. Thanks
Sounds like a great project. Multiprocessing/multithreading for the load, and the analysis is really easy. How do you want the output: another JSON file, a text file, or a database? I can do all of them perfectly.