I have some rather large JSON .jl files with thousands of lines of HTML in them. Each file can be anywhere from 100 MB to 3 GB. I would like a script or command-line tool that I can point at one of these files; it should parse the file as quickly as possible and output a histogram of word usage, i.e. I want to know the most-used words in, say, a 1 GB file. The output can be a CSV file. Obviously we can't output every word, so we need an adjustable lower threshold for which words make it into the output.
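
As a rough illustration of the requested output (a single-process sketch, not a required design), the Python below counts words in one record and writes a thresholded CSV. It assumes that .jl means one JSON record per line and, since the record schema is not given here, scans every string value in each record. The names `count_words_in_line`, `write_histogram` and `min_count`, the tag-stripping regex, and the definition of a "word" are all hypothetical.

```python
import csv
import json
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag stripper (assumption)
WORD_RE = re.compile(r"[a-z0-9']+")  # assumed definition of a "word"


def iter_strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def count_words_in_line(line):
    """Count words in one JSON record (one line of the .jl file)."""
    counts = Counter()
    line = line.strip()
    if not line:
        return counts  # tolerate blank lines
    record = json.loads(line)
    for text in iter_strings(record):
        text = TAG_RE.sub(" ", text).lower()
        counts.update(WORD_RE.findall(text))
    return counts


def write_histogram(counts, out_path, min_count=1):
    """Write word,count rows in descending order, dropping rare words."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        for word, n in counts.most_common():
            if n < min_count:
                break  # most_common() is sorted, so the rest are rarer
            writer.writerow([word, n])
```

Summing `count_words_in_line` over every line of a file and passing the total to `write_histogram(total, "words.csv", min_count=100)` would give the kind of CSV described above. The regex tag stripper also counts text inside script and style elements, so a real solution may want a proper HTML parser.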

The hard part of this project is dealing with a single large JSON file. We need to utilize every core of the machine we run this on, and we need to be highly efficient in how we analyze the text.
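
One possible way to use every core on a single file, sketched in Python under stated assumptions: split the file into byte ranges, let each worker process count the lines that start inside its range, and merge the partial counts. This assumes one JSON record per line, and the helper names (`chunk_ranges`, `count_range`, `count_file_parallel`) and the tokenizer are hypothetical stand-ins, not a tuned implementation.

```python
import json
import os
import re
import sys
from collections import Counter
from multiprocessing import Pool

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag stripper (assumption)
WORD_RE = re.compile(r"[a-z0-9']+")  # assumed definition of a "word"


def iter_strings(value):
    """Yield every string inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def words_in_line(line):
    """Return the list of words found in one JSON record."""
    text = " ".join(iter_strings(json.loads(line)))
    return WORD_RE.findall(TAG_RE.sub(" ", text).lower())


def chunk_ranges(size, parts):
    """Split [0, size) into at most `parts` contiguous byte ranges."""
    step = max(1, -(-size // parts))  # ceiling division
    return [(s, min(s + step, size)) for s in range(0, size, step)]


def count_range(path, start, end):
    """Count words in every line that starts inside [start, end)."""
    counts = Counter()
    with open(path, "rb") as f:
        if start > 0:
            # Move to the first line that starts at or after `start`;
            # a line straddling `start` belongs to the previous range.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if line.strip():
                counts.update(words_in_line(line))
    return counts


def count_file_parallel(path, workers=None):
    """Count words across the whole file using a pool of processes."""
    workers = workers or os.cpu_count() or 1
    # More ranges than workers, so uneven record sizes balance out.
    ranges = chunk_ranges(os.path.getsize(path), workers * 4)
    total = Counter()
    with Pool(workers) as pool:
        jobs = [(path, s, e) for s, e in ranges]
        for partial in pool.starmap(count_range, jobs):
            total.update(partial)
    return total


# The guard is required on Windows, where worker processes re-import
# this module instead of forking.
if __name__ == "__main__" and len(sys.argv) > 1:
    for word, n in count_file_parallel(sys.argv[1]).most_common(20):
        print(word, n)
```

Processes rather than threads are used here because CPython's global interpreter lock would otherwise keep the JSON decoding and regex work on one core. Each worker sends a full `Counter` back to the parent, which costs time when the vocabulary is large, so this is a starting point for measurement rather than a claim about the fastest approach.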

If needed, you can assume the machine has enough memory to fit the entire JSON file, so for a 3 GB file we are sure to have 3 GB of available memory. We will likely run this on a server with 8+ GB free.

Ideally this can run on a Windows machine, but I am open to other platforms if you can make a case for them being better.

