Make Neural Machine Translation better, faster!
- a new way to measure NMT quality!


A new, human-based NMT quality index can dramatically improve NMT performance and adoption. Replacing the current automatic quality scores is essential to advance the use and performance of NMT systems.
One Hour Translation is releasing a new human-based NMT quality index to solve the NMT quality-evaluation problem and enable faster NMT improvements.

By: Ofer SHOSHAN, CEO, One Hour Translation
April 2018

In brief
Neural Machine Translation (NMT) systems are producing very high-quality translations and have the potential to radically change the professional translation industry. These systems require quality feedback/scores on an ongoing basis. Today, NMT quality is tested using computer programs such as BLEU, and these programs are not good enough. A better way to assess NMT quality is simply to have enough native speakers rate each translation. OHT is doing just that: our new NMT index will be released in 2–3 weeks for the benefit of the translation community.
(If you work for an LSP I also recommend reading the last section — “A word about the future”).

New Neural Machine Translation Quality Score

NMT marks a new age in automatic machine translation. Unlike previous technologies developed over the past 60 years, a well-trained and tested NMT system has the potential to replace human translators, using systems that are already available today.

Without going into technical details, the main factors affecting the performance of an NMT system, besides processing power, are the amount and quality of the initial training material and the ongoing quality-feedback process. This means that for an NMT system to work well, it first needs to be properly trained, i.e. “fed” with hundreds of thousands (and in some cases millions) of correct translations. Afterwards, it requires feedback on the quality of the translations it produces.

In short, NMT is the future of translation — it is already much better than all of the previous machine translation technologies — but training and quality assessments are major delaying factors.

Neural Machine Translation (NMT) is a “disruptive technology” that is going to change the way most translations are performed. For the first time in over 50 years, machine translation can replace human translators in many cases.

So where is the problem?!

The main issue with NMT systems today is the ability (or lack thereof) to test the quality of the translations they produce. Therefore, while these systems have real potential to revolutionize the translation market, their development and adoption are slowed down by two issues: the amount of quality input used for training and the ability to provide feedback on the translations produced.

Another limiting factor is the processing power available for these systems. I expect the processing power issue to be solved in the next few years, thanks to two factors. First, per Moore’s law, processing power increases exponentially. Second, as more companies realize how much money can be saved using NMT, more resources will be allocated for NMT systems.

Measuring quality is a different issue, and a more problematic one. Today, NMT systems use computer programs like BLEU, METEOR and TER to try to guess automatically what a human being would say about the quality of a given machine translation. While these tests are fast, easy and inexpensive to run (because they are simply software applications), their value is very limited. They do not provide an accurate quality score for the translation, and they fail to estimate what a human reviewer would say about the translation quality (a simple search will reveal the issues with the existing quality tests).
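To make the limitation concrete, here is a minimal sketch (assuming Python with the NLTK library) of how a word-overlap metric like BLEU scores a candidate translation against a reference. A perfectly fluent paraphrase can score poorly simply because it does not reuse the reference's exact wording, which is one reason these automatic scores often disagree with human reviewers.

```python
# Minimal sketch: how a word-overlap metric such as BLEU scores translations.
# Assumes Python with NLTK installed; the example sentences are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the meeting was postponed until next week".split()
literal = "the meeting was postponed until next week".split()
paraphrase = "the session was delayed to the following week".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], literal, smoothing_function=smooth))     # 1.0: identical to the reference
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # much lower, although a human would accept it
```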

Simply put, a translation quality score generated by computer programs that try to guess what a human would say about the translation is just not good enough.

With more major corporations, including Google, Amazon, Facebook, Bing, Systran, Baidu and Yandex, joining the game, producing an accurate quality score for NMT translations becomes a major problem that has a direct negative impact on the adoption of NMT systems.

There must be a better way!

There is a need for a better way to evaluate the performance and quality of NMT systems — a way that is closer to the original intention, i.e. what a human would say about the translation. The solution is actually very simple: instead of having some software try to guess what a human would say about the translation, why not simply have enough people rate the quality of each translation? While this solution is simple, direct and intuitive, doing it right in a way that is statistically significant requires running many evaluation projects at a time.

NMT systems are highly specialized: if a system has been trained on travel and tourism content, testing it with technical material will not produce the best results. Thus, each type of material has to be tested and scored separately. In addition, the rating must be done for every major language pair, as different NMT engines perform better in different languages. Furthermore, to be statistically significant, at least 40 people need to rate each project, per language, per type of material, per engine, and each project should include at least 30 strings.

Checking one language pair with one type of material translated by one engine is relatively easy: 40 reviewers each check and rate the same neural machine translation of about 30 strings. This approach produces relatively solid (statistically significant) results, and repeating it over time will also produce a trend, i.e. whether the NMT system is getting better or not.
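As an illustration only (the 1–5 rating scale and the aggregation method below are my assumptions, not a description of OHT's internal procedure), here is one simple way a batch of roughly 40 reviewer ratings could be turned into a score with a confidence interval:

```python
# Illustrative sketch: aggregating ~40 human ratings (assumed scale 1-5) for one
# engine / language pair / content type into a mean score with a 95% confidence interval.
import math
import statistics

def aggregate_ratings(ratings):
    """Return (mean, 95% CI half-width) for a list of per-reviewer ratings."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    stdev = statistics.stdev(ratings) if n > 1 else 0.0
    ci95 = 1.96 * stdev / math.sqrt(n)  # normal approximation, reasonable for n around 40
    return mean, ci95

# e.g. 40 reviewers each rated the same 30-string sample from one engine
ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4] * 4
mean, ci = aggregate_ratings(ratings)
print(f"engine score: {mean:.2f} +/- {ci:.2f} (n={len(ratings)})")
```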

The key to doing such an evaluation right is selecting the right reviewers and making sure they do their job properly. As one might expect, using freelancers for the task requires solid quality-control procedures to make sure the answers are not “fake” or “random”.
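One common safeguard, sketched below purely as an illustration (it is not a description of OHT's actual procedure), is to mix a few control strings with already-known ratings into each project and flag reviewers whose answers on those controls are far off or suspiciously uniform:

```python
# Illustrative sketch: flag reviewers whose ratings on "gold" control strings
# deviate too much from the known ratings, or who gave every item the same score.
def flag_unreliable(reviewer_ratings, gold_ratings, max_avg_error=1.0):
    """reviewer_ratings / gold_ratings: {string_id: rating} for the control items."""
    errors = [abs(reviewer_ratings[s] - gold_ratings[s]) for s in gold_ratings]
    avg_error = sum(errors) / len(errors)
    all_identical = len(set(reviewer_ratings.values())) == 1  # e.g. rated everything "5"
    return avg_error > max_avg_error or all_identical

gold = {"s1": 5, "s2": 2, "s3": 4}
print(flag_unreliable({"s1": 5, "s2": 3, "s3": 4}, gold))  # False: close to the known ratings
print(flag_unreliable({"s1": 5, "s2": 5, "s3": 5}, gold))  # True: likely careless or random
```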

At that magnitude (one language, one type of material, one NMT engine, etc.), the task is feasible, even when managed manually. It becomes more difficult when an NMT vendor, user or LSP wants to test 10 languages and 10 different types of material with 40 reviewers each. In this case, to be conducted correctly, each test will require between 400 reviewers (1 NMT engine x 1 type of material x 10 language pairs x 40 reviewers) and 4,000 reviewers (1 NMT engine x 10 types of material x 10 language pairs x 40 reviewers).

Running a human-based quality score is a major task, even for just one NMT vendor. It requires up to 4,000 reviewers working on thousands of projects.

The main challenge is, of course, finding, testing, screening, training and monitoring thousands of reviewers in various countries and languages, and monitoring their work while they handle tens of thousands of projects in parallel.

The above procedure is relevant for every NMT vendor that would like to know the real value of its system and obtain real human feedback on the translations it produces.

The greater good — industry level quality score

Looking at the greater good, at the industry level, what is really needed is a new NMT quality score measuring all the systems with the same benchmark, same strings and same reviewers, in order to compare one with another on the same ground. As discussed above, since the performance of NMT systems can vary dramatically between different types of materials and languages, doing a real human-based comparison using the same group of linguists, and the same source material, is the only way to produce real comparative results. Such scores will be useful both for the individual NMT vendor or user and for the end customer or LSP trying to decide which engine to use where.

Using the same numbers as above for one NMT vendor, running the tests at the industry level to produce a comparative quality index will require a slightly larger project:
● Assuming the top 10 language pairs are evaluated, i.e. EN > SP, FR, DE, PT-BR, AR, RU, CN, JP, IT and KR;
● 10 types of material — general, legal, marketing, finance, gaming, software, medical, technical, scientific and tourism;
● 10 leading (web-based) engines — Google, Microsoft (Bing), Amazon, DeepL, Systran, Baidu, Promt, IBM Watson, Globalese and Yandex;
● 40 reviewers rating each project;
● 30 strings per project; and
● 12 words on average per string.

This makes 40,000 projects (10 language pairs x 10 types of material x 10 NMT engines x 40 reviewers), each with at least 30 strings, i.e. 1,200,000 strings of 12 words each, or approximately 14.4 million words to evaluate. All of this is needed to create just one instance (!) of a real, comparative, human-based NMT quality index.
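For readers who want to verify the scale, the same arithmetic takes only a few lines of Python (following the counting above, where each reviewer's pass over a 30-string project counts as 30 evaluated strings):

```python
# Back-of-the-envelope check of the figures above.
language_pairs, material_types, engines, reviewers_per_project = 10, 10, 10, 40
strings_per_project, words_per_string = 30, 12

projects = language_pairs * material_types * engines * reviewers_per_project
strings_evaluated = projects * strings_per_project
words_evaluated = strings_evaluated * words_per_string

print(projects)           # 40,000 projects
print(strings_evaluated)  # 1,200,000 strings
print(words_evaluated)    # 14,400,000 words, i.e. roughly 14.4 million
```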

The problem is clear:
to produce just one instance of a real, viable and useful NMT score, 4,000 linguists need to evaluate 1,200,000 strings, well over 14 million words in total!

The magnitude of the project, the number of people involved, and the need to recruit, train and monitor all the reviewers, while making sure in real time that they do their job right, make this a daunting task, even for large NMT players, and certainly for traditional translation agencies.

Completing the entire process within a reasonable time (e.g. less than one day), so that the results are “fresh” and relevant, makes it even harder.

There are not many translation agencies with the capacity, technology and operational capability to run a project of that magnitude and do it on a regular basis.

Obviously, as the CEO of One Hour Translation (OHT), I am writing this post because we are doing just that! We have recruited, trained and tested thousands of linguists in over 50 languages, and have already run well over 1,000,000 NMT rating and testing projects for our customers. By the end of the month (April 2018), we will publish the first human-based NMT quality index (initially covering several engines and domains, and later expanding), with the goal of promoting the use of NMT.

OHT is already one of the first (and few) translation agencies to deploy a “hybrid” model, combining NMT and human post-editing to reduce the cost and time it takes to deliver high-quality business translations. We think that a revolution in the traditional human translation sector is already happening, and creating the NMT index is our modest contribution.

A word about the future

In the future, a better NMT quality index could be built using the same technology NMT itself is built on, i.e. deep-learning neural networks. Building a neural quality system is just like building an NMT system: the required ingredients are high-quality translations, high volume, and quality ratings/feedback. With these elements, it is possible to build a deep-learning, neural-network-based quality-control system that reads a translation and scores it the way a human does.
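To make the idea concrete, here is a minimal sketch of such a neural quality-estimation model, assuming Python with PyTorch and the Hugging Face transformers library. The encoder checkpoint, the pooling choice and the regression head are illustrative assumptions describing the general technique, not an existing OHT system; training it on the human ratings collected for the index (for example with a mean-squared-error loss) is what would make its scores meaningful.

```python
# Illustrative sketch: a neural quality-estimation model that reads a source
# sentence plus its machine translation and predicts a human-style quality score.
# Assumes PyTorch and Hugging Face transformers; the checkpoint choice is an assumption.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class QualityEstimator(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)      # multilingual encoder
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)   # regression to a score

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]                  # first-token ("CLS"-style) sentence representation
        return self.head(pooled).squeeze(-1)   # predicted quality score

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = QualityEstimator()

# Score one (source, machine translation) pair; before training on human ratings
# (e.g. minimising an MSE loss against reviewer scores) the output is meaningless.
batch = tokenizer("Das Treffen wurde verschoben.", "The meeting was postponed.",
                  return_tensors="pt", truncation=True)
score = model(batch["input_ids"], batch["attention_mask"])
print(float(score))
```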

At any rate, a reliable, human-based quality score/feedback mechanism is needed first; once the NMT systems are working smoothly, it becomes possible to develop a neural quality score.

After a neural quality score is available, it is further possible to have engines improve each other, creating a self-learning and self-improving translation system by linking the neural quality score with the NMT (obviously, it does not make sense to run a fully closed-loop system, as it cannot improve without additional external data).

With additional external translation data, the system can “teach itself” and improve without the need for human feedback. Google has done it already: its AI subsidiary, DeepMind, developed AlphaGo, a neural-network computer program that beat the world’s (human) Go champion. AlphaGo is now improving, becoming better and better, by playing against itself again and again, with no people involved.

More about how this will affect the translation industry in the next post.

If you are interested in the NMT quality score you are invited to visit https://fast.onehourtranslation.com/nmt-quality-score/
