Everything you need to create machine translation
system for any domain
Machine Translation Toolkit
Parse, filter, markup parallel and
monolingual corpora. Create blocks for
test and validation data
Train custom neural architecture with
parallel job lists, GPU analytics and
When model training finishes it can be
automatically deployed as API or
available to download for offline use
From Novice to Expert
Lingvanex dashboard combines the latest linguistic and statistical techniques that are used to train
the software to customer domains and improve translation quality. In the picture below: on the right
is a list of tasks and GPU servers on which models are being trained. In the center are the parameters
of the neural network, and below are the datasets that will be used for training.
Work with Parallel Data
Working on a new language began with datasets preparation. The dashboard has many predefined
datasets from open sources such as Wikipedia, European Parliament, Paracrawl, Tatoeba and others.
To reach an average translation quality, 5M translated lines are enough..
Dictionary and Tokenizer Tuning
Datasets are lines of text translated from one language to another. Then the tokenizer splits the text
into tokens and creates dictionaries from them, sorted by the frequency of meeting the token. The
token can be either single characters, syllables, or whole words. With Lingvanex Data Studio you can
control the whole process of creating SentencePiece token dictionaries for every language separately.
Data Filtering and Quality Estimation
More than 20 filters are available to filter parallel and monolingual corpora to get the quality dataset
from opensource or parsed data. You can markup named entities, digits and any other tokens to train
system to leave some words untranslated or translated in specific way..