For Advanced CAT, we trained a new dictionary-based RBMT engine using Microsoft Custom Translator. Because we were working with Russian and English, it soon became evident that Russian poses special challenges for this kind of engine: Russian nouns and adjectives decline across six cases, whereas English nouns and adjectives do not. A simple list of Russian and English word pairs would be insufficient for a grammatically complex text, because the same English word can require a different Russian ending depending on its role in the sentence. We therefore chose a domain where the selected words generally stay in the nominative case in Russian, and settled on nutrition labels.
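To make the case problem concrete, here is a minimal Python sketch of an exact-match glossary lookup. It is only an illustration of the limitation described above, not how Microsoft Custom Translator works internally. The names GLOSSARY and lookup are hypothetical, and the three sample entries are chosen for the example rather than taken from our dictionary file.

```python
from typing import Optional

# Toy English>Russian glossary holding only nominative (dictionary) forms.
# Hypothetical sample entries, not the contents of the project's file.
GLOSSARY = {
    "fat": "жир",
    "protein": "белок",
    "sugar": "сахар",
}


def lookup(term: str) -> Optional[str]:
    """Return the nominative Russian form for an English term, if listed."""
    return GLOSSARY.get(term.lower())


# A standalone label heading uses the nominative, so a plain lookup suffices.
print(lookup("Fat"))

# In running text the noun changes case: "5 g of fat" is "5 г жира",
# with "жир" in the genitive. The glossary stores no such form, so a
# word-for-word substitution cannot produce it.
genitive_form = "жира"
print(genitive_form in GLOSSARY.values())
```

The first call finds a usable translation because a label heading needs only the dictionary form. The second check shows that the inflected form required inside a sentence is simply absent from a nominative-only list.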
Our ultimate goal is to train the engine to translate nutrition labels from various brands. Brands differ in how they present information, sometimes even from item to item: some abbreviate words to fit on the box, and although the labels generally say the same thing, certain information can be worded in slightly different ways, such as one brand's note that the numbers are based on a 2,000-calorie diet. We started with a few English-language items to collect the text we needed to build our engine, and we hope to add a variety of other items to cover as many variants as possible. We think the success rate could be very high, since the texts are low in complexity and variation is minimal.
For our source texts, we used regular food items (Fruit by the Foot Variety Pack, Market Pantry Rigatoni, and Special K Chocolatey Delight), as well as some information from the Agricultural Information Management Standards of the Food and Agriculture Organization of the United Nations. To translate the terms into Russian, we consulted the dictionaries and text corpora Linguee, Reverso Context, and Multitran. Our resulting dictionary file had 85 English>Russian pairs: nutrition_en-ru1.xlsx.
Training Process and Progress
Although our dictionary contained only 85 word pairs, training still took about half an hour to complete.