This project is for translating a large number of articles from foreign-language Wikipedias (Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Russian, Polish, Chinese, Japanese and Korean) into English.
It must run automatically and continuously without human intervention.
What needs to be done, in order of processing:
P1. Set up Systran Web Translator on a server to accept article translations automatically. (We will buy a copy.)
P2. Download each non-English language Wikipedia from [url removed, login to view]
P3. Identify which language Wikipedias have been updated.
P4. Store each article for each language in a database (but ignore the User: namespace). MediaWiki already has tools to import from XML to SQL (see [url removed, login to view]:Importing_XML_dumps)
P5. Import the langlinks table [url removed, login to view] for the English Wikipedia. This already has the language links between articles.
P6. For each article follow the language links, retrieving the wiki article in languages other than English.
P7. Using an installation of MediaWiki, convert this article to HTML.
P8. Machine translate the article into English and store it as HTML in a database (see fields below). You will need to find a way to automate this, as I believe the program was designed to be run manually.
P9. Retrieve the images for each article.
P10. Build a very simple frontend that shows all translations into English for a particular article, with the name of the source language in the title of each. It must be able to display the images stored in P9 from the same box.
P11. Repeat steps P2-P9 for each language. Only pause if all the work is done. Only retranslate an article if more than 3% of its words have changed.
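The 3% rule in P11 could be checked along the following lines. This is only a sketch: the posting does not say how "changed" words should be counted, so it assumes the article text is split on whitespace and that a word counts as changed if it falls outside the matching runs found by Python's difflib. The function names and the threshold parameter are made up for illustration.

```python
import difflib


def changed_word_fraction(old_text: str, new_text: str) -> float:
    """Return the fraction of words that differ between two revisions.

    Assumption: words are whitespace-separated tokens, and the fraction is
    measured against the longer of the two revisions.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    longest = max(len(old_words), len(new_words))
    if longest == 0:
        return 0.0  # two empty revisions: nothing has changed
    # autojunk=False stops difflib from ignoring very frequent words
    # in long articles.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (longest - matched) / longest


def needs_retranslation(old_text: str, new_text: str,
                        threshold: float = 0.03) -> bool:
    """True if more than `threshold` of the words changed (3% by default)."""
    return changed_word_fraction(old_text, new_text) > threshold
```

For example, needs_retranslation(stored_wikitext, dump_wikitext) would be called for each article in a fresh dump, and only the articles returning True would be sent to the translator again. Comparing the source wikitext rather than the rendered HTML is also an assumption here.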
Fields to include in the translation table:
F1. source language
F2. article (just the translated version is fine)
F3. title in english (indexed)
F4. title in original language
F5. size of original language article
F6. the revision ID (this is just sitting in the dump; it may be useful later, for other features, to know which revision it is)
F7. date (the timestamp field in the XML)
F8. date of langlink
F9. date article first created
F10. date article was last updated
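As a rough illustration, fields F1-F10 could map onto a table like the one below. The deliverable is a MySQL table; this sketch uses Python's built-in sqlite3 module only so that it runs without a server. The table name, column names, column types and the choice of primary key are all assumptions, not part of the specification.

```python
import sqlite3

# Hypothetical schema: one row per (source language, original title).
SCHEMA = """
CREATE TABLE translation (
    source_language    TEXT    NOT NULL,  -- F1
    article_html       TEXT    NOT NULL,  -- F2 (translated version only)
    title_en           TEXT    NOT NULL,  -- F3 (indexed below)
    title_original     TEXT    NOT NULL,  -- F4
    original_size      INTEGER NOT NULL,  -- F5
    revision_id        INTEGER NOT NULL,  -- F6
    revision_timestamp TEXT,              -- F7
    langlink_date      TEXT,              -- F8
    first_created      TEXT,              -- F9
    last_updated       TEXT,              -- F10
    PRIMARY KEY (source_language, title_original)
);
CREATE INDEX idx_translation_title_en ON translation (title_en);
"""


def create_translation_table(conn: sqlite3.Connection) -> None:
    """Create the translation table and the index on the English title."""
    conn.executescript(SCHEMA)


def translations_for(conn: sqlite3.Connection, title_en: str) -> list:
    """Return (source_language, title_original) pairs for one English title,
    which is the lookup the frontend in P10 would need."""
    cursor = conn.execute(
        "SELECT source_language, title_original FROM translation "
        "WHERE title_en = ? ORDER BY source_language",
        (title_en,),
    )
    return cursor.fetchall()
```

The index on the English title is there because the frontend looks up every translation of one article by that title. In MySQL the same layout would need explicit lengths for the indexed text columns.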
Suggested milestones. I list these in order of development:
M1: Importation of data into HTML and language links - (P2 to P5) - 10% payment
M2: Create machine translated HTML for one language with frontend just to demonstrate it - (P1, P6, P7, P8, P10) - 10% payment
M3: Retrieve images for just these articles (P9) - 10% payment
M4: Run for all languages (images and articles) - (P1 - P11) - 40% payment
M5: Demonstrate that it's running on an ongoing basis - 30% payment
Deliverables:
D1: MySQL table with all data described in F1-F10.
D2: Images stored on disk
D3: Any code you required to do this.
D4: Documentation with step-by-step instructions on how to install and run the whole system
In your bid please include:
* description of your experience working with large databases and files (over 10GB)
* whether you agree with the proposed milestone schema, or what you would suggest instead.
* what dates you can deliver each milestone by (be conservative if you're unsure).