Lingvanex's Approach to Data Selection

Whether it's translating a business document or chatting online with someone from another country, machine translation (MT) has become an essential tool. But to provide users with accurate, contextually sound translations, it’s critical to carefully select and refine the data used to train and evaluate these models.

At Lingvanex, we use a multi-level approach to selecting test data, focusing on maximising representativeness and adapting to real customer requests. The goal is to create models that can translate texts accurately both lexically and grammatically while preserving context and style. To achieve this, we develop advanced neural network architectures and use unique methods for selecting and analysing test data.

In this article, we will take a closer look at how the Lingvanex team selects test datasets that ensure high software performance and will discuss the limitations of existing standards.

Data Segmentation: Training, Validation, and Testing

The training process starts with properly dividing data into training, validation, and test sets. This helps avoid overfitting and ensures that the model is capable of generalising to new information rather than just memorising examples.
 

  • Training Set: We create training corpora consisting of millions of sentence pairs in different languages, extracted from parallel texts. These data undergo a cleaning procedure: duplicates, incorrect translations, and misleading sentences are removed. Preprocessing tools are used for tokenization, text normalisation, and syntactic structure tagging.
  • Validation Set: This data set is used to monitor the training process. Regular checks on the validation set are applied to measure the model's accuracy at intermediate stages of training. This allows us to adjust model hyperparameters, such as learning rate, regularisation parameters, and neural network architecture. It’s worth noting that validation data help prevent overfitting and improve the model's quality as training progresses.
  • Test Set: In the final stage, the test data are used for an objective evaluation of the model's performance on new, previously unseen texts. This data set is never mixed with the training or validation data, eliminating the risk of memorisation.
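The segmentation described above can be sketched as follows. The split ratios, deduplication rule, and function name here are illustrative assumptions for a minimal example, not Lingvanex's actual pipeline:

```python
import random

def split_corpus(pairs, valid_frac=0.01, test_frac=0.01, seed=42):
    """Split a parallel corpus into train/validation/test sets.

    Exact duplicate sentence pairs are removed first, so the test set
    can never contain an example memorised from training.
    """
    unique = list(dict.fromkeys(pairs))  # drop exact duplicates, keep order
    random.Random(seed).shuffle(unique)
    n = len(unique)
    n_valid = max(1, int(n * valid_frac))
    n_test = max(1, int(n * test_frac))
    test = unique[:n_test]
    valid = unique[n_test:n_test + n_valid]
    train = unique[n_test + n_valid:]
    return train, valid, test

corpus = [("Hello", "Привет"), ("Thank you", "Спасибо"),
          ("Hello", "Привет"), ("Good morning", "Доброе утро"),
          ("See you", "Увидимся")]
train, valid, test = split_corpus(corpus)
```

In a real pipeline the deduplication step would also catch near-duplicates (e.g. casing or punctuation variants), which is harder than the exact-match check shown here.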


While data segmentation is vital, the effectiveness of these models also depends on the quality and diversity of the test sets used.

Limitations of Standard Test Sets

Standardised datasets such as Flores 101 and NTREX provide a baseline for testing but have several limitations that reduce their applicability in real-world scenarios:
 

  • Limited Genre Coverage: NTREX and Flores 101 primarily contain texts from general sources, such as news articles or Wikipedia, with few domain-specific texts (e.g., legal, medical, or technical). Models evaluated only on these sets are never tested on specialised terminology, so weaknesses in those domains go unnoticed.
  • Lack of Conversational Texts: Standard sets rarely include conversational speech, examples from messengers, or social media. However, in real life, such texts are common, and the model needs to handle slang, abbreviations, and even emojis.
  • Insufficient Complex Grammar Structures: Complex grammar constructions, idioms, and polysemous words are rarely found in standard sets, limiting the model's ability to handle such challenges.
  • Low Representation of Languages: Standard sets often lack sufficient examples for rare languages or dialects, limiting their usability for multilingual models.


Having identified the limitations of standard test sets, we now turn to Lingvanex’s innovative methods that effectively bridge these gaps.

Lingvanex Test Data Selection Methodology

To overcome these limitations, Lingvanex has developed a custom test data selection methodology that better aligns with the complexities and demands of real-world translation tasks. Our methodology is based on three key aspects: text diversity, analysis of rare terms and polysemous words, and the use of both automatic and human evaluations.

Textual Diversity

For each language, we select approximately 3,000 sentences from authoritative sources covering the following criteria:
 

  • Sentence Length: We test the model's ability to handle both short sentences (e.g., “See you!”) and longer ones (e.g., “I would greatly appreciate it if we could change our appointment to March 6 at 3:00 pm.”), containing complex syntactic structures and nested clauses.
  • Special Characters and Unicode: We use texts with various formats such as HTML tags, special characters, mathematical formulas, and Unicode symbols to evaluate how the model handles web content and technical documentation. We test how well the model deals with emojis, ASCII emoticons, and mixed-language text. For example:
    - Emojis: “Hello my friend ^_^ :)”
    - Formulas: “The formula is: Cr2(SO4)3 + CO2 + H2O + K2SO4 + KNO3.”
    - Tags: “I want to buy XXXX items,” where XXXX is a tag that should not be translated.
  • Lexical Features: Test data include sentences with various figures of speech, verb tenses, idioms, slang expressions, direct and indirect speech, as well as examples of different parts of speech and proper names. It is crucial for the model to adapt to different speech types and translate both scientific texts and informal expressions accurately. For example:
    - Idioms: “Break a leg!”
    - Slang: “Hey dude, wanna hang out?”
    - Polysemous Words: the word “bank” could refer to a financial institution or a river bank.
  • Proper Names, Abbreviations, and Numbers: The test sets include sentences with proper names, abbreviations, brand names, and numerical data. We apply specific rules to handle these elements, ensuring the model doesn’t translate proper names as regular words but keeps them in their original form or adapts them when possible.
    - Proper Names: “I love the song ‘Купалинка’.”
    - Abbreviations: “The model was named 15.BVcX-10.”
    - Numbers: “It was in the XII century.”
  • Multilingual Sentences: Lingvanex checks how the model handles sentences that contain words from multiple languages. For example: “The word cat can be written as 'кот', '猫', or 'Γάτα' depending on the language.”
  • Text Stylistics: Sentences vary in style, from formal to conversational:
    - Formal: “Dear sir, we would like to inform you...”
    - Informal: “Yo, what's up?”
  • Errors and Typos: Test data may include sentences with typos or errors, which often occur after optical character recognition (OCR). This ensures the model can handle imperfect input.
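One common way to enforce the “tags should not be translated” behaviour from the examples above is to mask protected spans with numbered placeholders before translation and restore them afterwards. The regular expressions and placeholder scheme below are illustrative assumptions for a sketch, not the exact rules Lingvanex applies:

```python
import re

# Spans a translator should leave untouched: simple HTML-style tags
# and model numbers like "15.BVcX-10" (both patterns are illustrative).
PROTECTED = re.compile(r"</?\w+>|\b\d+\.\w+-\d+\b")

def mask(text):
    """Replace protected spans with numbered placeholders before translation."""
    spans = PROTECTED.findall(text)
    masked = text
    for i, span in enumerate(spans):
        masked = masked.replace(span, f"⟪{i}⟫", 1)
    return masked, spans

def unmask(text, spans):
    """Restore the protected spans in the translated output."""
    for i, span in enumerate(spans):
        text = text.replace(f"⟪{i}⟫", span)
    return text

masked, spans = mask("I want to buy <b>15.BVcX-10</b> items")
# ... translate `masked` here, then restore the protected spans:
restored = unmask(masked, spans)
```

The placeholder characters are chosen to be unlikely to occur in real input; a production system would also need to verify that the translation engine passes them through unchanged.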


In addition to varied text structures, topic diversity is equally important in ensuring that our models can handle a wide range of real-life translation scenarios.

Topic Diversity

Lingvanex places a strong emphasis on the diversity of topics included in the test data. This ensures that the model is prepared to translate texts from various fields: medicine, technology, construction, politics, economics, law, culinary arts, sports and gaming, military affairs, religion and culture, scientific texts, conversational speech and slang, and idiomatic expressions. This classification helps the model cover numerous real-life use cases, ensuring accurate translations in various domains.

Combining Automatic and Human Evaluations

For precise performance assessment, we use not only automatic metrics such as BLEU and COMET but also human evaluations. Our methodology involves professional linguists who assess translations based on:
 

  • Meaning accuracy.
  • Grammatical correctness.
  • Text flow and naturalness.
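To make the automatic side of this evaluation concrete, here is a minimal pure-Python corpus BLEU: the geometric mean of modified n-gram precisions multiplied by a brevity penalty, for a single reference per segment. Production evaluation would normally use an established implementation such as sacreBLEU; this sketch only illustrates what the metric measures:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per segment."""
    match = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(0, len(h) - n + 1)
    if min(match) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

Identical hypothesis and reference score 100; any segment set with no 4-gram overlap scores 0, which is why smoothing variants exist for short or hard test sets.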


This comprehensive evaluation approach helps us identify the strengths and weaknesses of our models and make timely improvements.

Regular Data Updates and Continuous Model Improvement

The world of language is constantly evolving, and at Lingvanex, we ensure that our models remain up-to-date. We regularly refresh both training and test data, taking into account new trends, slang, idioms, and technical terms. This is especially important in dynamic fields like IT and social media, where new words and expressions appear daily. Data updates help models maintain high translation accuracy and adapt to new challenges.

Conclusion

By continuously refining data selection, Lingvanex ensures that its models stay ahead of linguistic trends, delivering accurate, versatile translations for users across all domains. Standard datasets like NTREX and Flores 101 provide only basic coverage, so we complement them with more complex and diverse texts that better reflect real-world scenarios. This approach allows our machine translation models to demonstrate high accuracy and adaptability, making them suitable for a wide range of tasks, from professional texts to conversational speech on social media.


Frequently Asked Questions (FAQ)

What is a good BLEU score for machine translation?

A good BLEU (Bilingual Evaluation Understudy) score typically ranges from 30 to 40 for machine translation, indicating fair translation quality. Scores above 40 are considered strong, with 50 or higher indicating very high-quality translation. However, the score depends on the complexity of the text and the language pair.

What is the COMET metric?

COMET is a neural evaluation metric for machine translation that assesses both accuracy and fluency. It is built on pretrained language models fine-tuned on human quality judgments, which makes its scores correlate with human assessments more closely than traditional metrics like BLEU. On the commonly used 0-100 scale, a good score typically ranges from 60 to 80, though the range depends on the specific COMET model.

What is the Flores 101 dataset translation?

The Flores 101 dataset is a standardised multilingual test set used for evaluating machine translation models. It consists of sentences drawn from Wikipedia and professionally translated into 101 languages, making it possible to assess MT performance across a very large number of language pairs, including low-resource ones.

What is the NTREX dataset translation?

The NTREX dataset (NTREX-128) is a standardised test set for evaluating machine translation, consisting of news documents professionally translated into more than a hundred languages. Because it is drawn almost entirely from the news domain, it is a useful baseline but limited for specialised domains like legal or medical translation.
