Whether it's translating a business document or chatting online with someone from another country, machine translation (MT) has become an essential tool. But to provide users with accurate, contextually sound translations, it’s critical to carefully select and refine the data used to train and evaluate these models.
At Lingvanex, we use a multi-level approach to selecting test data, focusing on maximising representativeness and adapting to real customer requests. The goal is to create models that can translate texts accurately both lexically and grammatically while preserving context and style. To achieve this, we develop advanced neural network architectures and use unique methods for selecting and analysing test data.
In this article, we will take a closer look at how the Lingvanex team selects test datasets that ensure high software performance and will discuss the limitations of existing standards.
Data Segmentation: Training, Validation, and Testing
The training process starts with properly dividing data into training, validation, and test sets. This helps avoid overfitting and ensures that the model is capable of generalising to new information rather than just memorising examples.
- Training Set: We create training corpora consisting of millions of sentence pairs in different languages, extracted from parallel texts. These data undergo a cleaning procedure: duplicates, incorrect translations, and misleading sentences are removed. Preprocessing tools are used for tokenization, text normalisation, and syntactic structure tagging.
- Validation Set: This data set is used to monitor the training process. Regular checks on the validation set are applied to measure the model's accuracy at intermediate stages of training. This allows us to adjust model hyperparameters, such as learning rate, regularisation parameters, and neural network architecture. It’s worth noting that validation data help prevent overfitting and improve the model's quality as training progresses.
- Test Set: In the final stage, the test data are used for an objective evaluation of the model's performance on new, previously unseen texts. This data set is never mixed with the training or validation data, which eliminates the risk of the model scoring well through memorisation.
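The three-way split described above can be sketched in a few lines. This is a minimal illustration rather than Lingvanex's production pipeline: the split fractions, the random seed, and the exact-duplicate deduplication rule here are assumptions for the example.

```python
import random

def split_corpus(pairs, val_frac=0.01, test_frac=0.01, seed=42):
    """Shuffle and split parallel sentence pairs into train/validation/test.

    Exact source-target duplicates are dropped first so that no pair
    seen during training can leak into the held-out sets.
    """
    unique = sorted(set(pairs))          # exact-duplicate removal
    rng = random.Random(seed)
    rng.shuffle(unique)
    n = len(unique)
    n_test = max(1, int(n * test_frac))
    n_val = max(1, int(n * val_frac))
    test = unique[:n_test]
    val = unique[n_test:n_test + n_val]
    train = unique[n_test + n_val:]
    return train, val, test

corpus = [
    ("Hello!", "¡Hola!"),
    ("Good morning.", "Buenos días."),
    ("Hello!", "¡Hola!"),            # duplicate, will be dropped
    ("See you!", "¡Hasta luego!"),
    ("Thank you.", "Gracias."),
]
train, val, test = split_corpus(corpus)
```

Keeping the split deterministic via a fixed seed makes evaluation runs reproducible, which matters when comparing model versions over time.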
While data segmentation is vital, the effectiveness of these models also depends on the quality and diversity of the test sets used.
Limitations of Standard Test Sets
Standardised datasets such as Flores 101 and NTREX provide a baseline for testing but have several limitations that reduce their applicability in real-world scenarios:
- Limited Genre Coverage: NTREX and Flores 101 primarily contain texts from general sources, such as news articles or Wikipedia, with few domain-specific texts (e.g., legal, medical, or technical). Models trained on these sets may struggle with translating specialised terminology.
- Lack of Conversational Texts: Standard sets rarely include conversational speech, examples from messengers, or social media. However, in real life, such texts are common, and the model needs to handle slang, abbreviations, and even emojis.
- Insufficient Complex Grammar Structures: Complex grammar constructions, idioms, and polysemous words are rarely found in standard sets, limiting the model's ability to handle such challenges.
- Low Representation of Languages: Standard sets often lack sufficient examples for rare languages or dialects, limiting their usability for multilingual models.
Having identified the limitations of standard test sets, we now turn to the methods Lingvanex uses to bridge these gaps.
Lingvanex Test Data Selection Methodology
To overcome these limitations, Lingvanex has developed a custom test data selection methodology that better aligns with the complexities and demands of real-world translation tasks. Our methodology is based on three key aspects: text diversity, analysis of rare terms and polysemous words, and the use of both automatic and human evaluations.
Textual Diversity
For each language, we select approximately 3,000 sentences from authoritative sources that cover the following criteria:
- Sentence Length: We test the model's ability to handle both short sentences (e.g., “See you!”) and longer ones (e.g., “I would greatly appreciate it if we could change our appointment to March 6 at 3:00 pm.”), containing complex syntactic structures and nested clauses.
- Special Characters and Unicode: We use texts with various formats such as HTML tags, special characters, mathematical formulas, and Unicode symbols to evaluate how the model handles web content and technical documentation. We test how well the model deals with emojis, ASCII characters, and mixed languages. For example:
- Emojis: “Hello my friend ^_^ :)”
- Formulas: “The formula is: Cr2(SO4)3 + CO2 + H2O + K2SO4 + KNO3.”
- Tags: “I want to buy XXXX items,” where XXXX is a tag that should not be translated.
- Lexical Features: Test data include sentences with various figures of speech, verb tenses, idioms, slang expressions, direct and indirect speech, as well as examples of different parts of speech and proper names. It is crucial for the model to adapt to different speech types and translate both scientific texts and informal expressions accurately. For example:
- Idioms: “Break a leg!”
- Slang: “Hey dude, wanna hang out?”
- Polysemous Words: The word “bank” could refer to a financial institution or a river bank.
- Proper Names, Abbreviations, and Numbers: The test sets include sentences with proper names, abbreviations, brand names, and numerical data. We apply specific rules to handle these elements, ensuring the model doesn’t translate proper names as regular words but keeps them in their original form or adapts them when possible.
- Proper Names: “I love the song ‘Купалинка’.”
- Abbreviations: “The model was named 15.BVcX-10.”
- Numbers: “It was in the XII century.”
- Multilingual Sentences: Lingvanex checks how the model handles sentences that contain words from multiple languages. For example: “The word cat can be written as 'кот', '猫', or 'Γάτα' depending on the language.”
- Text Stylistics: Sentences vary in style—from formal to conversational:
- Formal: “Dear sir, we would like to inform you...”
- Informal: “Yo, what's up?”
- Errors and Typos: Test data may include sentences with typos or errors, which often occur after optical character recognition (OCR). This ensures the model can handle imperfect input.
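One common way to keep non-translatable spans, such as the XXXX tag or a numeric code above, intact is to mask them with placeholders before translation and restore them afterwards. The sketch below is a hypothetical illustration of that idea; the ⟦n⟧ placeholder format and the regex are assumptions for the example, not Lingvanex's actual rules.

```python
import re

# Spans that must survive translation untouched (illustrative pattern:
# the literal tag XXXX and runs of digits).
PROTECTED = re.compile(r"XXXX|\d+")

def mask(text):
    """Replace protected spans with numbered placeholders like ⟦0⟧."""
    spans = []
    def repl(match):
        spans.append(match.group(0))
        return f"⟦{len(spans) - 1}⟧"
    return PROTECTED.sub(repl, text), spans

def unmask(text, spans):
    """Restore the original spans after translation."""
    for i, span in enumerate(spans):
        text = text.replace(f"⟦{i}⟧", span)
    return text

masked, spans = mask("I want to buy XXXX items for 25 dollars.")
# masked: "I want to buy ⟦0⟧ items for ⟦1⟧ dollars."
restored = unmask(masked, spans)
```

Because the placeholders carry no lexical meaning, a translation model is far less likely to alter them, and the original spans can be restored verbatim in the output.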
In addition to varied text structures, topic diversity is equally important in ensuring that our models can handle a wide range of real-life translation scenarios.
Topic Diversity
Lingvanex places a strong emphasis on the diversity of topics included in the test data. This ensures that the model is prepared to translate texts from various fields, such as: medicine, technology, construction, politics, economics, law, culinary arts, sports and gaming, military, religion and culture, scientific texts, conversational speech and slang, and idiomatic expressions. This classification helps the model cover numerous real-life use cases, ensuring accurate translations in various domains.
Combining Automatic and Human Evaluations
For precise performance assessment, we use not only automatic metrics such as BLEU and COMET but also human evaluations. Our methodology involves professional linguists who assess translations based on:
- Meaning accuracy.
- Grammatical correctness.
- Text flow and naturalness.
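For the automatic side of this evaluation, production systems rely on standard toolkits rather than hand-rolled metrics. Still, to illustrate what BLEU actually measures, here is a simplified sentence-level version: the geometric mean of n-gram precisions multiplied by a brevity penalty. Real implementations add clipping subtleties, smoothing, and corpus-level aggregation that this sketch omits.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of a given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. For illustration only."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(1, sum(hyp_ngrams.values()))
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(1, len(hyp)))
    return brevity * math.exp(log_avg)

score = sentence_bleu("the cat sat on the mat", "the cat sat on the mat")
# identical sentences score 1.0
```

A metric like this rewards surface overlap only, which is exactly why neural metrics such as COMET and human evaluation are needed alongside it: a translation can be fluent and accurate while sharing few n-grams with the reference.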
This comprehensive evaluation approach helps us identify the strengths and weaknesses of our models and make timely improvements.
Regular Data Updates and Continuous Model Improvement
The world of language is constantly evolving, and at Lingvanex, we ensure that our models remain up-to-date. We regularly refresh both training and test data, taking into account new trends, slang, idioms, and technical terms. This is especially important in dynamic fields like IT and social media, where new words and expressions appear daily. Data updates help models maintain high translation accuracy and adapt to new challenges.
Conclusion
By continuously refining data selection, Lingvanex ensures that its models stay ahead of linguistic trends, delivering accurate, versatile translations for users across all domains. Standard datasets like NTREX and Flores 101 provide only basic coverage, so we complement them with more complex and diverse texts that better reflect real-world scenarios. This approach allows our machine translation models to demonstrate high accuracy and adaptability, making them suitable for a wide range of tasks, from professional texts to conversational speech on social media.