The goal of this report is to compare translation quality between old and new language models. The new models improve not only translation quality but also performance and memory usage. We use the BLEU and COMET metrics and primarily the Flores 101 test set.
BLEU is the most widely used metric for machine translation evaluation. The Flores 101 test set was released by Facebook Research and offers the broadest language pair coverage.
QUALITY METRICS DESCRIPTION
BLEU
BLEU is an automatic metric based on n-grams. It measures the precision of n-grams of the machine translation output against the reference, combined with a brevity penalty that punishes overly short translations. We use a particular implementation of BLEU, called sacreBLEU, which outputs corpus-level scores rather than segment-level scores.
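As an illustration, here is a minimal sketch of computing a corpus-level score with the sacreBLEU Python API; the hypothesis and reference sentences are invented placeholders, not data from this report.

```python
# Minimal sketch: corpus-level BLEU with sacreBLEU.
# The sentences below are illustrative placeholders.
import sacrebleu

hypotheses = [
    "The cat sat on the mat.",
    "It is raining heavily today.",
]
# One reference stream: the i-th reference corresponds to the i-th hypothesis.
references = [[
    "The cat is sitting on the mat.",
    "It rains heavily today.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")  # a single corpus score, not per-segment scores
```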
References
- Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: a Method for Automatic Evaluation of Machine Translation.” ACL (2002).
- Post, Matt. “A Call for Clarity in Reporting BLEU Scores.” WMT (2018).
COMET
COMET (Crosslingual Optimized Metric for Evaluation of Translation) is a metric for automatic evaluation of machine translation that calculates the similarity between a machine translation output and a reference translation using token or sentence embeddings. Unlike other metrics, COMET is trained to predict different types of human judgments, such as post-editing effort, direct assessment, or translation error analysis.
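As a hedged illustration, the sketch below scores a segment with the Unbabel COMET package, assuming the wmt22-comet-da reference-based checkpoint; the example sentences are placeholders, not data from this report.

```python
# Minimal sketch: scoring with a reference-based COMET model.
# Assumes the `unbabel-comet` package and the Unbabel/wmt22-comet-da checkpoint;
# the example sentences are illustrative placeholders.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellt.",      # source sentence
    "mt": "The dog is barking.",   # machine translation output
    "ref": "The dog barks.",       # human reference translation
}]

output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.system_score)  # corpus-level score averaged over segments
```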
References
- COMET - https://machinetranslate.org/comet
- COMET: High-quality Machine Translation Evaluation - https://unbabel.github.io/COMET/html/index.html#comet-high-quality-machine-translation-evaluation
ON-PREMISE PRIVATE SOFTWARE UPDATES
New version: 1.30.0.
Changes in functionality:
- Added the ability to accelerate the speech recognizer's denoiser using the GPU.
- Added the ability to configure the default speech recognizer parameters for each language separately.
- Added the ability to translate a document into multiple languages sequentially without having to re-upload it on the demo page.
- Fixed document translation errors.
LANGUAGE PAIRS
Note: A smaller model size on disk means lower GPU memory consumption, which reduces deployment costs; smaller models also translate faster. Approximate GPU memory usage is calculated as on-disk model size × 1.2 (see the sketch after the table).
| Language Pair | Current Model's Size, MB | Test Data | Previous Model's BLEU | Current Model's BLEU | BLEU Difference | Previous Model's COMET | Current Model's COMET | COMET Difference |
|---|---|---|---|---|---|---|---|---|
| English - Japanese | 190.63 | Flores 101 | 36.77 | 39.62 | +2.85 | 90.33 | 91.56 | +1.23 |
| English - Lithuanian | 113.91 | Flores 101 | 30.84 | 31.28 | +0.44 | 89.61 | 90.11 | +0.50 |
| English - Czech | 113.91 | Lingvanex | 47.73 | 48.94 | +1.21 | 91.66 | 92.09 | +0.43 |
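To make the note above concrete, here is a small sketch that applies the × 1.2 estimate to the model sizes from the table; the multiplier is the approximation given in the note, not a measured value.

```python
# Sketch: estimating GPU memory from on-disk model size,
# using the approximation from the note above (disk size * 1.2).
model_sizes_mb = {
    "English - Japanese": 190.63,
    "English - Lithuanian": 113.91,
    "English - Czech": 113.91,
}

for pair, size_mb in model_sizes_mb.items():
    gpu_mb = size_mb * 1.2
    print(f"{pair}: ~{gpu_mb:.0f} MB of GPU memory")
# e.g. English - Japanese: ~229 MB of GPU memory
```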