PRELIMINARY NOTE: this project requires parsing content from Wikipedia. Wikipedia content is released under a Creative Commons license, so reusing it is not only perfectly legal, but encouraged. Wikipedia has a page advising users on how to do exactly this. We won't be scraping the website; we will use a downloadable version that Wikipedia itself provides.
We need to break up every page on Wikipedia into multiple articles.
For instance, this article: [url removed, login to view] is already divided into sections, including:
2.2 Bronze Age
2.3 Iron Age
2.4 Migration period
2.5 Viking Age
2.6 Kalmar Union
2.7 Union with Denmark
2.8 Union with Sweden
2.9 Dissolution of the union
2.10 First and Second World Wars
2.11 Post-World War II history
4 Politics and government
4.1 Administrative divisions
4.2 Judicial system and law enforcement
4.3 Foreign relations
6.1.1 Oil fields
7.3 Largest cities of Norway
8.1 Human rights
9 International rankings
10 See also
On Wikipedia, these links point to a section within the same page. Instead, we need each section of the page to become a standalone article, so that we can import it as a module.
For each page, we need to generate a JSON file or database entry with metadata such as the page title and the category the page was filed under, plus an array of the articles generated from that page (including the article introduction, which is not listed under "Contents").
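To illustrate the kind of split we have in mind, here is a minimal Python sketch. It assumes the page HTML marks sections with plain <h2> to <h4> heading elements, which may not match the markup inside the ZIM file, and it uses a regular expression where a real HTML parser would be more robust. The function name split_sections and the dictionary keys are hypothetical.

```python
import re

# Assumption: section headings appear as <h2>, <h3> or <h4> elements.
HEADING_RE = re.compile(r"<h([2-4])\b[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def split_sections(html):
    """Split page HTML into an introduction plus one entry per heading.

    Each entry is a dict with "level", "title" and "html". The introduction
    is everything before the first heading and has title None.
    """
    matches = list(HEADING_RE.finditer(html))
    intro_end = matches[0].start() if matches else len(html)
    sections = [{"level": 1, "title": None, "html": html[:intro_end].strip()}]
    for i, match in enumerate(matches):
        # A section body runs until the next heading of any level.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        sections.append({
            "level": int(match.group(1)),
            "title": TAG_RE.sub("", match.group(2)).strip(),
            "html": html[match.end():end].strip(),
        })
    return sections
```

With this approach a subsection becomes its own article rather than being nested inside its parent section, which is one possible reading of the requirement.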
If we opt for JSON files, we could have one folder per page, with the articles saved as individual HTML files (for instance, "1 [url removed, login to view]", "2 [url removed, login to view]", "[url removed, login to view]" for the introduction).
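As a rough idea of the JSON option, the sketch below writes one HTML file per article and a metadata file next to them. Everything named here is an assumption for illustration: the function write_page, the metadata.json file name, the numeric file names and the JSON keys are placeholders, not the naming scheme described above.

```python
import json
from pathlib import Path


def write_page(dest, page_title, category, sections):
    """Write one HTML file per section plus a metadata.json for the page.

    `sections` is a list of dicts with "title" and "html" keys; the first
    entry is expected to be the introduction (title None).
    """
    out_dir = Path(dest) / page_title.replace("/", "_")
    out_dir.mkdir(parents=True, exist_ok=True)

    articles = []
    for index, section in enumerate(sections):
        file_name = f"{index}.html"  # hypothetical naming scheme
        (out_dir / file_name).write_text(section["html"], encoding="utf-8")
        articles.append({"title": section["title"], "file": file_name})

    metadata = {
        "page_title": page_title,
        "category": category,
        "articles": articles,
    }
    metadata_path = out_dir / "metadata.json"
    metadata_path.write_text(
        json.dumps(metadata, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return metadata_path
```

Passing ensure_ascii=False keeps accented Italian characters readable in the JSON output.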
We also need to generate a JSON file with the tree of all categories on Wikipedia.
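A possible shape for the category tree is sketched below in Python. It assumes the parent/child category pairs have already been extracted from the dump (how to obtain them is not covered here), and that the output is a nested structure of "name" and "children" keys; both the function build_tree and that layout are hypothetical. Wikipedia categories can have several parents and can even form loops, so the sketch skips any child that is already one of its own ancestors.

```python
import json
from collections import defaultdict


def build_tree(edges, root):
    """Build a nested dict from (parent, child) category pairs.

    A category with several parents appears once under each parent.
    Children that would create a loop are skipped.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    def expand(name, ancestors):
        node = {"name": name, "children": []}
        for child in sorted(children[name]):
            if child == name or child in ancestors:
                continue  # loop in the category graph, do not recurse
            node["children"].append(expand(child, ancestors | {name}))
        return node

    return expand(root, frozenset())


def tree_to_json(edges, root):
    """Serialise the category tree as a JSON string."""
    return json.dumps(build_tree(edges, root), ensure_ascii=False, indent=2)
```

This version is recursive, so a very deep category hierarchy might need an iterative rewrite.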
Because Wikipedia is CC-licensed, anyone can download it.
The software will need to parse the ZIM file containing all articles. We will be using the Italian version (downloadable here: [url removed, login to view], file [url removed, login to view]). While the locale shouldn't matter, we will ultimately need to populate Imparato with content from the Italian version.
The software should ideally run from the command line on Unix systems, something like:
zim-extract-categories --zim-file [url removed, login to view] --dest .
zim-extract-articles --zim-file [url removed, login to view] --dest . --category 22
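For reference, the options of the second command could be parsed roughly as follows in Python. This is only a sketch of the command-line interface, not of the extraction itself. The option names come from the example above; treating --category as an integer ID and defaulting --dest to the current directory are our assumptions.

```python
import argparse


def build_parser():
    """Argument parser mirroring the zim-extract-articles example."""
    parser = argparse.ArgumentParser(
        prog="zim-extract-articles",
        description="Split the pages of a ZIM file into standalone articles.",
    )
    parser.add_argument("--zim-file", required=True,
                        help="path to the Wikipedia ZIM file")
    parser.add_argument("--dest", default=".",
                        help="output directory (assumed default: current directory)")
    parser.add_argument("--category", type=int,
                        help="category to extract (assumed to be a numeric ID)")
    return parser


def main(argv=None):
    """Parse the arguments; the extraction logic would be called from here."""
    args = build_parser().parse_args(argv)
    return args
```

Calling main(["--zim-file", "wiki.zim", "--dest", ".", "--category", "22"]) returns a namespace with zim_file, dest and category attributes; "wiki.zim" is just a placeholder file name.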