All things local

Tag: CAT Tools

CAT Tool and Non-CAT Tool Musings

Automation in the localization process comes in many forms. Be it a new tool or a feature within a CAT tool, finding that new addition to your automation arsenal could save you time and money. One question I kept asking myself over the course of this semester was: “How can this be automated?”

Tips for Trados QA (using Regex)

Trados has a labyrinth of features that can be deployed to automate the localization process. The challenge is visualizing how these features apply to your specific workflow, language locale, and content. One feature worth looking into is regular expressions (regex). A regex is a special text string that describes a search pattern. Here are some examples of how regex, in tandem with Trados, can enhance the localization process:

  • Search for a pattern, such as a web address, a date, or a combination of a number and a measurement unit
  • Change date formats or the sequence of elements within them
  • Use regex in the QA checker to find specific issues, such as numbers and measurement units that are not separated by a non-breaking space
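
The second bullet – changing date formats or element order – can be sketched outside Trados with an ordinary regex substitution. A minimal Python illustration (the pattern and format choices here are mine, not from a Trados profile):

```python
import re

# Hypothetical example: reorder a US-style date (MM/DD/YYYY) into
# ISO format (YYYY-MM-DD) by capturing each element and changing
# their sequence in the replacement.
date_pattern = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")

def reorder_dates(text):
    # \3, \1, \2 reuse the captured year, month, and day.
    return date_pattern.sub(r"\3-\1-\2", text)

print(reorder_dates("The launch is scheduled for 05/20/2019."))
# The launch is scheduled for 2019-05-20.
```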

Using Trados, I sought to test these enhancements with English and Chinese.

Commas and Periods for Chinese

In the QA checker I added [.,] in the RegEx target field to detect non-Chinese commas and periods. When the translator runs the QA verification, any non-Chinese commas and periods in the target will show up as warnings. These can then be replaced using the same regular expression in the “Find what” field, as shown in the screenshot above.
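
The same [.,] rule can be re-created outside Trados to see what the checker is doing. A rough Python sketch (the segment and the replacement logic are illustrative; in practice the pattern would need refining to skip decimal points and URLs):

```python
import re

# The same [.,] pattern flags Latin periods and commas in a Chinese target.
latin_punct = re.compile(r"[.,]")

def check_segment(target):
    """Return the positions of non-Chinese commas/periods,
    mirroring the warnings raised by the QA verification."""
    return [m.start() for m in latin_punct.finditer(target)]

def fix_segment(target):
    # Replace each hit with its full-width Chinese counterpart.
    return target.replace(",", "，").replace(".", "。")

seg = "你好,世界."
print(check_segment(seg))  # [2, 5]
print(fix_segment(seg))    # 你好，世界。
```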

Time

In Chinese, time is typically expressed using a 24-hour clock. The QA checker will now flag any expressions of time that follow the AM/PM format in the source and compare them to the target. If target and source match (meaning time is expressed using AM/PM in the Chinese rather than converted to the 24-hour clock), the segment is marked as a potential translation error.
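
A sketch of how such a check might work under the hood – the regex and the source/target comparison are my own approximation, not Trados internals:

```python
import re

# Match times like "3:30 PM" or "11 AM" in the English source.
ampm = re.compile(r"\b\d{1,2}(?::\d{2})?\s?[AP]M\b", re.IGNORECASE)

def flag_ampm(source, target):
    # Flag any AM/PM time from the source that survives verbatim in the
    # target, where a 24-hour form (e.g. 15:30) would be expected.
    return [t for t in ampm.findall(source) if t in target]

print(flag_ampm("The talk starts at 3:30 PM.", "演讲于3:30 PM开始。"))
# ['3:30 PM']
print(flag_ampm("The talk starts at 3:30 PM.", "演讲于15:30开始。"))
# []
```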

“The best defense is a good offense.” Using regex to preempt issues that may arise during translation will save a lot of unneeded repair work and streamline the localization process.

Utility Demo/Training Video

Apart from CAT tools, what other methods are out there waiting to be tapped by unsuspecting PMs? I took the liberty of digging deeper into one whose presence has become hard to ignore: Monday.com. On their website they claim to be “a tool that simplifies the way teams work together – Manage workload, track projects, move work forward, communicate with people.” Assuming the role of a PM, I sought to explore how Monday.com could help automate the life of a localization project.

Click here to watch my review

Let’s Train TED with SMT

Overview

This semester I dove deeper into the world of Computer-Assisted Translation (CAT). Building upon the general foundation I established during my first semester, this course provided the platform to gain greater insight into the different tools used by today’s language professionals.

Having worked with a wide range of CAT tools such as SDL Trados, Memsource, and memoQ, I have no question that CAT tools streamline the translation process and lower its associated costs. Coupled with ever-improving machine translation (MT), the possibilities for an even faster translation process are undeniable. However, MT has not reached a point where it can be the sole form of translation, largely due to its inaccuracies vis-à-vis the nuances of human language.

Understanding the strengths of MT also requires understanding its limitations. For one, there are several forms of MT: Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), Hybrid Machine Translation (HMT), and Neural Machine Translation (NMT). This semester I had the opportunity to work on a team with the goal of building an SMT engine. We learned first-hand the process and thinking needed to train our own SMT engine.

To train or not to train?

For starters, how does SMT work? SMT uses statistical models built from the analysis of large volumes of bilingual text. From these texts, it attempts to find correspondences between words in the source language and words in the target language. Because the engine learns from the subject matter of its training text, an SMT engine is best suited to documents in that same subject area.
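
The word-correspondence idea can be illustrated with a toy version of IBM Model 1, the classic SMT alignment model (a teaching sketch, not what Microsoft's engine actually runs). A few EM iterations over a tiny made-up parallel corpus are enough to push probability mass toward co-occurring word pairs:

```python
from collections import defaultdict

# Toy IBM Model 1: learn word-translation probabilities t(f | e) from a
# tiny English-German parallel corpus via a few EM iterations.
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

t = defaultdict(lambda: 1.0)  # start from uniform probabilities

for _ in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in corpus:
        for f in tgt:
            # Expectation: distribute each target word over source words.
            z = sum(t[(e, f)] for e in src)
            for e in src:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[e] += c
    # Maximization: renormalize the counts into probabilities.
    for (e, f), c in count.items():
        t[(e, f)] = c / total[e]

# The model should now prefer aligning "haus" with "house", not "the",
# because "the" also co-occurs with "buch".
best = max(["the", "house"], key=lambda e: t[(e, "haus")])
print(best)  # house
```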

In our first group meeting we brainstormed possible subjects and styles of documents for our SMT engine to translate. Given that we’d be using Microsoft Custom Translator to build our SMT engine, we referred to the Microsoft Translator User Guide to determine how best to select a topic. It stated the following:

“If your project is domain (category) specific, your documents, both parallel and monolingual, should be consistent in terminology with that category. The quality of the resulting translation system depends on the number of sentences in your document set and the quality of the sentences.”

We decided that TED talks provided ample material that would be consistent in terminology across our source and target languages. Below is our project overview.

Let the training begin!

In order to effectively manage our time and roles we came up with the following process steps:

The goal of this project was to meet the following criteria:

  • Efficiency: PEMT roughly 30% faster than human translation.
  • Cost: PEMT roughly 27.5% lower than human translation.
  • Quality: no critical errors in any category, no major errors for accuracy, fluency, or terminology, and a total score <= 15 per 1,000 words.
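
As a quick sanity check, the efficiency and cost thresholds can be expressed in a few lines (the hours and costs below are hypothetical numbers, not our project’s figures):

```python
# Hypothetical numbers: check whether a measured PEMT round meets the
# efficiency (>= 30% faster) and cost (>= 27.5% lower) targets above.
def meets_targets(ht_hours, pemt_hours, ht_cost, pemt_cost):
    faster = (ht_hours - pemt_hours) / ht_hours >= 0.30
    cheaper = (ht_cost - pemt_cost) / ht_cost >= 0.275
    return faster and cheaper

print(meets_targets(ht_hours=10, pemt_hours=6.5, ht_cost=400, pemt_cost=280))
# True
```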

Images above display our quality metrics

As previously mentioned, MT is not infallible and thus requires editing, hence the PEMT (Post-Editing Machine Translation) shown in our Efficiency and Cost criteria. Apart from our own criteria, our training produced BLEU scores (Bilingual Evaluation Understudy). The BLEU score measures how closely the machine translation output matches a human reference translation. While a low BLEU score may indicate poor output quality, tracking the score across rounds of training provides a mechanism for measuring overall improvement.
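
For intuition, here is a simplified, stdlib-only sketch of the BLEU calculation – modified n-gram precisions up to 4-grams, combined geometrically and scaled by a brevity penalty. (Real BLEU is computed at corpus level, often with smoothing, so treat this as a teaching approximation.)

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count all n-grams of length n in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c & r).values())      # clipped n-gram matches
        total = max(sum(c.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_avg)

ref = "the cat sat on the mat"
print(round(bleu("the cat sat on the mat", ref), 2))  # 1.0
print(round(bleu("the cat sat on a mat", ref), 2))    # 0.54
```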

Is TED worth training?

The short answer is no. Speeches are inherently filled with an array of lexical minefields, and the TED talks we used to build our corpora were no different. Simply put, SMT is not built for the translation of speeches. That being said, we did make interesting findings regarding our training, tuning, and testing data.

Data cleaning before training and tuning

The Microsoft Translator Hub User Guide states that “a sentence length of 8 to 18 words will produce the best results,” yet more than half of our tuning data sentences fell outside that range (under 8 or over 18 words). We believed that shortening the sentence lengths in a new set of tuning data (500 words) would rectify this issue; however, the opposite occurred – our BLEU score dropped by 1.04.

Why?
  • The sentence lengths in the new tuning data set were too different from those in the previous training and tuning data
  • Conclusion: if you train with short sentences from the beginning, tune with short sentences too!
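
The length screen we applied after the fact could be expressed as a simple pre-filter on the tuning data (the sentence pairs below are invented TED-style examples):

```python
# Pre-filter: keep only pairs whose English side is 8-18 words long,
# per the guide's recommendation.
def filter_by_length(pairs, lo=8, hi=18):
    return [(src, tgt) for src, tgt in pairs if lo <= len(src.split()) <= hi]

pairs = [
    ("So here we are.", "那么我们开始吧。"),  # 4 words: dropped
    ("Every one of you has the power to change how you think about failure.",
     "你们每个人都有能力改变你们对失败的看法。"),  # 14 words: kept
]
kept = filter_by_length(pairs)
print(len(kept))  # 1
```
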

Testing

The challenging nature of EN/ZH translation

Translating from English to Chinese is HARD. This is especially true when dealing with speeches, where language isn’t confined to structured or predictable patterns. The biggest drawback of SMT is that it does not factor in context, which is crucial to making sense of the TED talks we used. As our BLEU scores show, significant improvements were few and far between (see below). We conducted nine rounds of training and were unable to surpass or even match our original BLEU score.

Conclusions

If you’re aiming for high-quality translations, be prepared to invest time and money in training an SMT engine. SMT requires a very large bilingual corpus – and not only large, but high-quality. Using low-quality data to train your engine will only lead to disappointment. While this project has reinforced my belief that SMT shouldn’t be used to translate speeches, I am not completely against the use of SMT with PEMT.

This is because, after nine rounds of training and editing the raw MT output from our SMT engine, there is enough evidence to suggest PEMT could be 30% faster and 27.5% cheaper than human translation. Where we erred during our project was in creating quality metrics specific to the raw MT output rather than the PEMT itself. While all our raw MT output failed to meet the standards designated by our quality metrics, it was never incomprehensible. Quality metrics designed around PEMT would better determine the amount of post-editing needed; the fewer post-edits, the better. I could envision a scenario in which TED Talks uses SMT to mass-translate all its content into other languages. Translators could then take the raw MT output and edit it to a level TED Talks deems fit.

Ultimately, training a useful SMT engine takes time. The key is to spend that time in a manner that aligns with what SMT is effective at. To do this, you need to ask yourself what you want to translate and what translation quality you want to achieve.

Click here to see the slides from my group’s final presentation.

Trados Translation Project

In our Intro to Computer-Assisted Translation (CAT) course we’ve had hands-on practice creating translation memories, managing terminology, reusing previous translations, and editing translations.

Our final project gave us the opportunity to simulate the experience of translating in a small, in-house translation team or in a small group of associated freelancers. Using SDL Trados Studio, my team was tasked with providing the following CAT Project Files to a client of our choice:

  • Proposal/SOW
  • Deliverables

Proposal/SOW:

Our client, the Chengdu Tourist Bureau, requested that we translate one of their Chinese blog posts into English. Our proposal outlines the major goals and scope of the project: costs, resources required, and an outline of the preparation, production, and finalization phases. In short, the proposal ensures that the client understands our workflow and that the project’s specifications are understood by all parties.

Deliverables:

The deliverables are what will ultimately be given to the client at the conclusion of the project.

For this project, they were as follows:

Source

  • The original source text (Chinese)

Target

  • The translated target text (English)

Translation Memory (TM)

  • Creation of a new TM for the client

Glossary

  • Creation of a glossary for the client containing terms relevant to the source text

Pseudo Translation 

  • Used to surface and resolve localization issues before translation begins
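
For readers unfamiliar with pseudo-translation, here is a minimal sketch of the idea – replace ASCII letters with accented look-alikes and pad the segment so that truncation, encoding, and hard-coded-string problems surface before any real translation is done (the bracket-and-tilde scheme is just one common convention):

```python
# Pseudo-translate a segment: accent the vowels, pad ~30% for text
# expansion, and wrap in brackets so untranslated or truncated strings
# stand out in the localized UI.
ACCENTED = str.maketrans("aeiouAEIOU", "àéîöûÀÉÎÖÛ")

def pseudo_translate(segment, expansion=0.3):
    swapped = segment.translate(ACCENTED)
    padding = "~" * max(1, int(len(segment) * expansion))
    return f"[{swapped}{padding}]"

print(pseudo_translate("Welcome to Chengdu"))
# [Wélcömé tö Chéngdû~~~~~]
```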


Click here to see the deliverables for this project.

Final Thoughts

After completing this project, I feel confident that I know the appropriate uses for CAT tools. Furthermore, this project gave me hands-on experience with the different components of Trados. If in the future there is a need to learn a new CAT tool, I know I can do it on my own.

Presentation of Lessons Learned

Here is a video of my team describing the lessons we learned during this project.

