Machine Translation Toolkit

Data Preparation

Parse, filter, and mark up parallel and
monolingual corpora. Create blocks for
test and validation data.

Model Training

Train a custom neural architecture with
parallel job lists, GPU analytics, and
quality estimation.


When model training finishes, the model can
be deployed automatically as an API or made
available to download for offline use.

From Novice to Expert

The Lingvanex dashboard combines the latest linguistic and statistical techniques used to adapt
the software to customer domains and improve translation quality. In the picture below, the right
side shows a list of tasks and the GPU servers on which models are being trained. In the center are
the parameters of the neural network, and below them are the datasets that will be used for training.

Work with Parallel Data

Work on a new language begins with dataset preparation. The dashboard has many predefined
datasets from open sources such as Wikipedia, European Parliament, ParaCrawl, Tatoeba, and others.
To reach average translation quality, 5M translated lines are enough.

Dictionary and Tokenizer Tuning

Datasets are lines of text translated from one language to another. The tokenizer splits the text
into tokens and creates dictionaries from them, sorted by token frequency. A token can be a single
character, a syllable, or a whole word. With Lingvanex Data Studio you can control the whole
process of creating SentencePiece token dictionaries for each language separately.
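
The frequency-sorted dictionary described above can be sketched in a few lines of Python. This is a simplified word-level stand-in, not SentencePiece itself and not the Lingvanex Data Studio implementation. The function name `build_dictionary`, the `max_size` parameter, and the sample lines are assumptions made for illustration.

```python
from collections import Counter


def build_dictionary(lines, max_size=None):
    """Return (token, count) pairs sorted by descending frequency.

    Tokens here are lowercased, whitespace-separated words (an assumption);
    a subword tokenizer such as SentencePiece would split text differently.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Sort by count (descending), then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_size] if max_size is not None else ranked


# Hypothetical sample lines for one side of a parallel dataset.
sample = ["the cat sat", "the dog sat", "the bird flew"]
dictionary = build_dictionary(sample, max_size=3)
```

Capping the dictionary with `max_size` mirrors the usual practice of keeping only the most frequent tokens; the right size depends on the language and the tokenizer in use.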

Data Filtering and Quality Estimation

More than 20 filters are available for parallel and monolingual corpora, so you can build a quality
dataset from open-source or parsed data. You can mark up named entities, digits, and any other tokens
to train the system to leave some words untranslated or to translate them in a specific way.
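
As an illustration of the kind of filtering and markup described above, the following Python sketch applies three common parallel-corpus filters (empty lines, length ratio, duplicates) and replaces digits with a placeholder token. The choice of filters, the `max_ratio` threshold, the `<num>` placeholder, and all function names are assumptions for illustration, not the toolkit's actual filters.

```python
import re

DIGITS = re.compile(r"\d+")


def mark_up_digits(text, placeholder="<num>"):
    """Replace each run of digits with a placeholder token."""
    return DIGITS.sub(placeholder, text)


def filter_pairs(pairs, max_ratio=2.0):
    """Keep (source, target) pairs that pass three simple checks."""
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Filter 1: drop pairs with an empty side.
        if not source or not target:
            continue
        # Filter 2: drop pairs whose word counts differ too much.
        n_src, n_tgt = len(source.split()), len(target.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Filter 3: drop exact duplicates.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        kept.append((source, target))
    return kept


# Hypothetical sample data.
pairs = [
    ("Room 42 is free", "La salle 42 est libre"),
    ("Room 42 is free", "La salle 42 est libre"),  # duplicate
    ("Hello", ""),                                 # empty target
    ("Yes", "Oui bien sûr que oui absolument"),    # length mismatch
]
clean = [(mark_up_digits(s), mark_up_digits(t)) for s, t in filter_pairs(pairs)]
```

Only the first pair survives the filters, and both of its sides carry the placeholder in place of the number, which is one way a system can be taught to copy such tokens through unchanged.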

Create your Translation System
in 1 DAY

with good quality and features.

