History of Lingvanex
How Lingvanex came into being
and why Google isn’t our competitor
After graduating from university, I was young and naive: I tried out various ideas, hoping that sooner or later one of them would make me a lot of money. But the market didn’t validate any of them, and each time I ran out of resources to keep going.
The scenario went like this: I work at some company, earn money and invest it in a project; it doesn’t take off; I move back to my hometown, Novopolotsk (because living in the capital, Minsk, is expensive), and try to launch the project there; the money runs out, and I go back to Minsk and get a job. This happened three times in five or six years.
The third time coincided with a crisis: nobody would hire me, I had crashed a leased car, and I needed to pay rent. It couldn’t go on like this, so I closed all the old projects and started thinking about what to do next. I had to be honest with myself: I didn’t know which niches to choose. Building projects just because you like them is a road to nowhere. I had worked as an iOS programmer for most of my career, and all I could do was create iPhone apps.
I decided to make 500 simple iPhone applications on every possible topic (games, healthy lifestyle, music, drawing, language learning) and see which of them could earn money.
The applications were simple: a beautiful image and a couple of buttons with a single function. For example, the running application simply tracked a person’s speed, measured the distance and calculated the calories burned. The goal was to test the market. In a month I had 20 applications ready, in three months about 100; after that the work went faster thanks to experience.
I bought an App Store developer account (only $99 per year), uploaded the applications and saw that some were downloaded far more often than others: the statistics for some applications were 50 times better. Music apps and translators were among the leaders.
I had about 40 translator apps, a separate one for each language pair. I chose the pairs by looking at the specifics of each country: for example, France has many immigrants from Arab countries, so an Arabic-French translator is needed; Indonesia and Malaysia are two large neighboring countries, so their residents presumably visit each other.
Basically, each one was just Google Translate in my own wrapper: you pay for the Google API and connect it to your translator app. Despite this, they were popular: in the first month alone I got a million downloads.
My applications were free with ads, yet they made a good profit. The business rested on the difference between what you pay to attract a client and how much money that client brings in. Acquiring a client cost me nothing except my time. I had plenty of time because I didn’t have a job, and the profit from ads and paid apps was huge.
After 3-4 months of testing I realized that translators were an interesting niche. Even in 100th position, behind Google, Microsoft and many other competitors, you get a lot of downloads. How many will you get if you reach the top ten?
Music applications were also profitable. There are many musical instruments, but to earn money in this niche you have to acquire users through search keywords. You can’t call a guitar anything other than “guitar” (or maybe “ukulele”). A piano can be called a pianoforte or a fortepiano. Because there are so few synonyms, users concentrate on a handful of high-frequency queries, and sooner or later user acquisition becomes expensive. You end up paying $1 to acquire a user on the keyword “guitar”; that person downloads your application and brings you $1 of profit in a year, so you earn nothing.
With translator apps everything is different. There are thousands of language pairs, so people search for a solution to their problem in thousands of different ways. You don’t need to acquire users through a single search phrase, and ASO (App Store Optimization) is often enough. Even where there is competition in some niches, the sheer number of language pairs means that translators don’t compete with each other that much.
At that time there wasn’t much competition in the music app market either. But I knew that sooner or later companies like Gismart would come to this niche and spend huge advertising budgets, and I wouldn’t be able to compete with them. So four years ago I decided to focus on translation apps and put all the others aside.
The second release of my translators was more functional: they had ads, in-app purchases and voice translation. The statistics improved, so I decided to make paid versions at about 7 euros per application.
It was 2015. I had enough money to move to Minsk, buy a flat and a new car, and feel comfortable. The growth period lasted about a year. By then I had 50-70 translation applications (each for a pair of languages, plus one universal, more expensive translator) and 5 million downloads. None of them had a name or a brand.
Revenue grew, but so did the cost of the Google translation API my apps used. Google charges by the number of characters translated through its API: $20 per million characters. If a user bought your application for 7 euros and translated 500,000 characters, you made a loss. At first this wasn’t obvious: more than half of user translations were single words. But once you have many users, you notice that some of them use the application regularly and translate large blocks of text, driving you into debt within six months. At first the margin across all apps was 90%; as the user base grew, it fell to 30%. If you add more advertising to cover the costs, users leave for competitors. I had to keep a balance. I also had to decide to stop depending on the Google API and to move into the new markets of Android, Mac OS and Windows.
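To make the loss concrete, here is a minimal Python sketch of the per-user arithmetic. The $20 per million characters and the 500,000-character heavy user come from the story above; the function names and the $5 net revenue per sale (what might be left of a 7-euro purchase after store commission and currency conversion) are assumptions for illustration only.

```python
GOOGLE_PRICE_PER_MILLION_USD = 20.0  # API price quoted in the text


def api_cost_usd(characters, price_per_million=GOOGLE_PRICE_PER_MILLION_USD):
    """Cost of translating `characters` characters through a paid API."""
    return characters * price_per_million / 1_000_000


def break_even_characters(net_revenue_usd, price_per_million=GOOGLE_PRICE_PER_MILLION_USD):
    """Characters one user can translate before API fees exceed what they paid."""
    return int(net_revenue_usd * 1_000_000 / price_per_million)


# A heavy user from the example in the text: 500,000 translated characters.
heavy_user_cost = api_cost_usd(500_000)

# Hypothetical net revenue from one paid download (an assumption, not a figure from the text).
assumed_net_revenue_usd = 5.0
limit = break_even_characters(assumed_net_revenue_usd)

print(f"API cost for 500,000 characters: ${heavy_user_cost:.2f}")
print(f"Break-even at ${assumed_net_revenue_usd:.2f} net revenue: {limit:,} characters")
```

Any user who translates more than the break-even volume turns a one-time purchase into a loss, which is why a handful of heavy users can erase the margin.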
There were no good open-source projects on which we could build our own translator. I started talking to people in the mobile app business and attended several startup events, showing that I had earned $250,000 just from testing the market, but I met no interest: nobody understood why anyone would enter this market when Google Translate exists.
I believed in my idea. I asked a few big companies to sell me licenses for their translation solutions. I needed 20 languages and was ready to pay $30,000 for them. They answered that this was too low a price for a translator and quoted me 60-100 thousand euros per year. That was a lot for the translation quality they provided, and I wasn’t sure mobile app users would be happy with it.
Then I told a friend about the project. He has his own outsourcing company in Minsk, and he offered to check whether we could build such a translator ourselves. He gave me a team of seven senior engineers, and we started running tests. It was the end of 2016.
We found several open-source projects, among them Joshua and Moses (“Jesus” and “Moses”). Joshua was built by a couple of people from Canada and the United States. It was plain statistical machine translation. The quality was primitive, but at least it was something we could test. We deployed it, but people complained that the quality was bad, and we uninstalled it. Moses was more complex and was supported by 30-40 people, but its translation quality wasn’t good enough either. So Moses was not our chance.
We tried a few more projects but didn’t get good results. At the beginning of 2017 we realized that we couldn’t build a quality translator ourselves on top of any existing open-source project. And Google and Microsoft don’t reveal their source code.
Nevertheless the work continued.
Two guys kept experimenting with open-source translators, while five others improved the translator applications that used the Google API. My task was to extend the translators so that they would translate not only text but also pictures, websites, files and voice: everything. We had to implement every feature the market leaders had. Revenue from the applications was growing, and I was enthusiastic. It seemed to me that by the time we had improved the applications, we would somehow have built our own quality translator.
In March 2017 we came across a project called OpenNMT, a joint development of Systran, one of the leaders in the machine translation market, and Harvard University. The project had just started. Systran, too, faced a shortage of enthusiasts, startups and open-source projects from which to get new ideas and hire people. Modern machine translation technology belongs to big companies and is closed. Small teams, realizing how difficult it is to break into this world, don’t even try. This slows down the development of the technology.
That’s why Systran made an absolutely revolutionary move: it open-sourced its work so that enthusiasts like me could get involved. It created a forum where its experts began helping newcomers for free, and a private chat for quick assistance. It had a good effect: new small companies began to appear, as well as scientific papers on translation. Systran became the leader of this community. Then Ubiqus and other translation companies joined.
At that time neural machine translation wasn’t widespread, and OpenNMT offered open research in this area. I and others all over the world could take the latest technology and ask the best specialists for advice. They were happy to share their experience, which helped me understand which direction to move in.
In March 2018 Systran invited the whole community to Paris to share experience and organized a free master class on the main problems that translation startups face. Everyone was interested in meeting each other in real life. Everybody had different projects.
One person was making a chatbot for learning English that can talk like a person. Another used OpenNMT for text summarization. Many of the startups had made plugins for SDL Trados Studio focused on translation in a specific domain (medicine, construction, metallurgy, etc.) to save translators time when editing translated text.
Besides the enthusiasts, people from eBay and Booking came to Paris. They build translators optimized for auction listings and hotel descriptions.
I wondered: why does Systran help its competitors? Over time, as more and more companies began open-sourcing their natural language work, I understood the rules of the game.
Even if everybody has the computing power to process large data sets, finding specialists in NLP (natural language processing) is hard, even for large companies. The field is much less developed than image and video processing: there are fewer data sets, research papers, specialists, frameworks and so on. There are even fewer people able to build a business on it. Both top companies like Google and smaller players like Systran need to gain a competitive advantage over the players in their own category.
How do they deal with that?
At first sight it seems strange, but in order to compete with each other they decide to support new players (competitors) smaller than they are. The barrier to entry is still high, and the demand for natural language processing technology is growing (voice assistants, chatbots, translation, speech recognition, text analysis, etc.).
The number of startups that can be bought to strengthen their positions is still very small. So the teams at Google, Facebook and Alibaba publish their scientific work openly. Frameworks and data sets are open-sourced too. New forums appear with answers to questions about how to achieve better translation.
Big companies are interested in buying startups like ours. They want us to occupy new niches and show maximum growth. They are happy to buy NLP startups to strengthen their teams and expertise.
Because even if you have all the data sets and algorithms in your hands, it doesn’t mean you will build a quality translator or any other NLP startup. And even if you do, there is no guarantee you will capture a big piece of the market. That’s why we need help, and they help us.
I spent the whole of 2018 trying to solve the problem of quality translation for European languages. I thought it would take another six months and everything would be fine. The solution seemed simple. But the final moment didn’t come: I wasn’t satisfied with the translations produced by the neural networks I was training. About $450,000 had already been spent, and I had to decide what to do next. After a while I realized how many management mistakes I had made by launching this project alone and without investment. But that is experience. The decision was made: go to the end, all money into R&D.
Realizing that I didn’t have enough money to advertise the project and attract users, I analyzed how other companies did marketing on a small budget.
Soon I decided to launch an internal project to save on advertising: Backenster (www.backenster.com), which advertises some applications inside others as cross-marketing. Through this system I am going to move users of my old translation applications to the new ones at the right time. By then the test applications had been downloaded 27 million times, and about 10 million installs remained on users’ phones. When the API and the applications are ready, I will only need to press “start”, and attracting the same number of users will be much cheaper. Backenster was built to manage subscriptions, updates, configuration, notifications and so on, and to advertise mobile applications in browser extensions, chatbots, desktop apps and voice assistants, and vice versa. Using it cut marketing costs by 97%.
I tried to design it to head off all the problems that had come up over three years of running the applications.
Time was passing, money was running out… In three years I had invested $600,000 in development. The old applications that used the Google API still brought in income, but now I had to pay a team of 10 people. The updated apps were almost ready, and there was still no quality translation… At the end of 2018 and the beginning of 2019, I was in a panic.
At that time I noticed that the community was talking about Google’s new Transformer neural network architecture. Everyone rushed to train neural networks based on the Transformer model and began moving to TensorFlow from the old Lua-based Torch. I decided to try it too.
For the translator we use a sequence-to-sequence model made up of an encoder and a decoder, each built from one or several layers. In earlier versions we used simple encoder and decoder components: recurrent LSTM or GRU layers, the Google encoder, simple attention layers, etc. In the current version we use the so-called Transformer architecture, in which self-attention layers take over the role that recurrent and convolutional layers played before.
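As a rough illustration of what a self-attention layer computes, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is not our production code: it deliberately omits the learned query, key and value projections, multiple heads and positional encodings of a real Transformer, and uses the token vectors themselves as queries, keys and values. The function names and toy vectors are made up for the example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (a simplification).

    tokens: list of equal-length vectors, one per word.
    Returns (outputs, weights): each output is a weighted average of all tokens.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        w = softmax(scores)
        # Mix all token vectors according to the attention weights.
        mixed = [sum(wj * value[i] for wj, value in zip(w, tokens)) for i in range(dim)]
        weights.append(w)
        outputs.append(mixed)
    return outputs, weights


toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three made-up 2-d "word" vectors
outputs, weights = self_attention(toy_sentence)
for w in weights:
    print([round(x, 3) for x in w])
```

Every word looks at every other word in a single step, which is what lets this architecture handle long sentences without passing information through a recurrent chain.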
Training the neural network required a powerful computer. We used 20 computers (each with a GTX 1080) and ran 20 simple tests on them simultaneously; each test took a week. To get better quality we had to run with other parameters that required more resources. It was slow. We needed cloud computing and a lot of tests, so we decided to rent Amazon’s cloud service. It was fast but very expensive: we launched a test overnight, and in the morning I got a $1,200 bill. I had to give up the idea and look for cheaper options. Maybe I could buy such a computer?
Nobody in Minsk sold such powerful computers. Calls to various retailers ended with a request to send them a detailed configuration, which they would then build. Nobody could tell us which configuration had the best performance for the price for our tasks. Then we tried to buy one in Moscow but ran into problems. The company’s website was good and the sales department competent, but they did not accept bank transfers; the only payment option was to send money to their accountant’s credit card. It was strange and risky. We consulted within the team and decided that we could assemble a computer ourselves from several powerful GPUs for less than $10,000, which would solve our problems and pay for itself in a month. We bought the components in different places: we called Moscow, ordered some things in China and some in Amsterdam. In a month everything was ready.
In March 2019 I finally assembled the computer at home and started running lots of tests without worrying about rental bills. The tests were fast. I noticed that the Spanish translation was close to Google’s in quality: the BLEU score measured against Google’s translations was 70. But I don’t understand Spanish, so I set an English-Russian model to train overnight to see where I stood. The computer buzzed and radiated heat all night; sleeping was impossible. I had to keep checking that there were no errors in the dev console. In the morning I ran a test translating 100 sentences with lengths from 1 to 100 words and screamed: the translation was good, including the long sentences. That night changed everything! I saw the light at the end of the tunnel.
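For readers unfamiliar with BLEU, here is a simplified Python sketch of the idea: count how many word n-grams of our translation also appear in a reference translation (here, Google’s output would play the role of the reference). This is a sentence-level toy version with a single reference and no smoothing, written for illustration; real evaluations use corpus-level BLEU from an established tool, and the sample sentences are made up.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (one reference, no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matches == 0:
            return 0.0  # without smoothing, any empty n-gram level zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"            # stands in for the reference system's output
ours = "the cat sat on the mat today"           # stands in for our model's output
print(round(simple_bleu(ours, reference), 1))
```

A score of 100 means the output is identical to the reference, so a score of 70 against Google’s output says the two systems agree on most phrases, not that either one is correct.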
Of course, it wasn’t just the Transformer model. There were a lot of small steps: we adopted a new tokenizer, wrote a text parser, started filtering and labeling the data differently, and added post-processing of the translated text. The 10,000-hour rule worked: there were many steps to the goal, and at some point I realized that the translation quality was good enough to sell my translation API. The Transformer model wasn’t a magic pill; it just added the 10-20% of quality that makes people keep using the product instead of leaving for competitors.
Then we started connecting various tools that improved translation quality further: a named entity recognizer, transliteration, dictionaries and a spelling correction system. After five months of this work, translation quality in some languages began to approach Google’s. When we replaced the Google API with our own for several languages, people on the whole didn’t complain. It was a turning point.
Now we can sell the translator app, and because we have our own translation API, we can price it below competitors. We can grow subscription sales or the number of free users, because our only costs are the API servers.
Today we have translators for six languages (Spanish, Portuguese, French, German, Italian and Russian), each paired with English. In the future we plan to build direct models without English, for example Arabic-French.
We have built not only a translator but also a big platform around it that filters data, parses and processes it, trains neural models and deploys them to servers automatically, much like Google’s API.
Now we are reworking the platform so that it doesn’t fall apart under the load of a growing user base. That will take another two weeks. Then we plan to move away from the Google API completely, across more than 100 languages.
Over the years the project has grown many times over. There are applications not only for mobile platforms but also for desktop PCs, wearable devices, messengers, browsers and voice assistants. In addition to text, we translate voice, images, files, websites and phone calls. The API covers not only online translation but also dictionaries, voice recognition, speech synthesis, vocabulary study and offline translation. We plan to earn money from subscriptions to the mobile and desktop applications and from B2B sales of the translation and voice recognition API.
My goal is to sell the company for $350 million in three years, having gained 0.5% of the world’s translation market, with revenue of $35 million per year and a team of 40 people. This is not a big number. According to analysts, the global translation market will reach $70 billion per year by 2023. That figure covers all types of translation, above all translation done by humans. My calculation covers only the B2B segment and assumes that machine translation will capture 10% of the entire market by 2023. The share of machine translation is small now because machines cannot yet translate perfectly. But neural translation is improving, and its quality is approaching human level.
One example is the DeepL translator (www.deepl.com), which started as a small startup; more and more people now choose it over Google because of its translation quality.
And when companies get better translation results, machine translation will take a big step into the human translation market. Many sources predict, for example, that by 2024 the machine translation market will be worth $1.5 billion. But that forecast extrapolates annual growth from a long period when translation wasn’t neural and quality wasn’t high enough. In 2018 many things changed. I am sure the machine translation market will grow faster, and that it will happen in the next 2-3 years. Because the achievements of big companies are being made public and the field is attracting more scientific work and attention, small companies will be able to take part too. It is very important to have technology that can translate like a human by that moment. Perhaps I will be at the center of events then, because Systran, Ubiqus and all 200 people in our OpenNMT chat room are sharing their ideas and helping us get there.
Google has 500 million monthly users, and itranslate.com, the second most popular mobile translator, has 5 million, 100 times fewer than the leader. I’m 5,000 times behind Google now, and so what? My goal isn’t to beat the corporation and capture a large volume. If I capture 0.5% of the world’s translation market ($70 billion in 2023) and replace it with machine translation, Google will not even notice, but for a team of 40 people it will be excellent money.
A lot of work has been done in four years. Right now I have a team of experienced engineers.
Linguanex LLC has recently been registered. Documents have been submitted for joining the High-Tech Park. After joining the HTP, employees will be given company shares as options in addition to their salaries.
Right now designers are developing a new UX, QA is writing automated tests, and DevOps is moving the whole infrastructure to Kubernetes. We continue to look for experienced and motivated professionals. We need computational linguists and data science engineers.
Frequently Asked Questions
1) Why are you better than Google Translate?
Our current goal is to reach the quality of Google’s translator for the main European languages and then offer the following:
- Translation of large blocks of text through our API at a third of competitors’ prices (Google, Amazon, Microsoft), with better support and easy integration. Our own cost of translation is currently $1 per million characters, while Google sells its translation API at $20 per million characters. A good sales department will let us quickly win part of this huge market. Even if Google ever notices us, it is more likely to let us grow and then buy us than to undercut us on price. Our startup could be a good ally for Google in the machine translation war against Microsoft, Amazon and others. That’s why we have been building the company for sale from the beginning.
- Voice translation in mobile applications (between two languages simultaneously) without an internet connection. Chats with voice translation for tourist groups. Many other features focused on language learning (a single account that syncs work across all devices, augmented reality features, etc.).
- Accurate translation of specialized documents (medicine, metallurgy, law, etc.) on computers running Mac OS and Windows, with integration into tools for professional translators (such as SDL Trados).
- Integration into enterprise business processes, running our translation tools on the customer’s own servers under our license. This keeps data private, removes any dependence on the volume of translated text and lets the translation be tuned to the specifics of a particular company.
- Phone call translations.
2) You have so many applications. There is no single focus.
Actually, there is only one product: the translator. It just has a lot of features and runs on many platforms. You have to keep up with your competitors, and in some niches you have to beat them.
3) Why do you think you will achieve Google’s quality?
Because more and more companies are open-sourcing their work and growing the natural language processing (NLP) community. Machine translation competitions are held every year, and everyone exchanges ideas at them (http://www.statmt.org/wmt18/).
Members of our OpenNMT community help each other (http://opennmt.net/). Our startups share a chat room where employees of Systran and Ubiqus, people from Harvard University’s NLP group and many others help newcomers solve problems.
We know what to do to improve quality; we just need to find more data science engineers to delegate the tasks to.
4) Where are the models trained and the API deployed?
Mainly on Hetzner dedicated GPU servers; for peak load, AWS; for neural network tests, our own machine built in the office. In the near future we plan to lease a DGX-2.
5) Where do you get the data sets for training? And what if there are not enough data sets for a certain language?
Mostly we take them from http://opus.nlpl.eu/, but they need to be filtered. There are also marketplaces where you can buy high-quality data sets outright. If there isn’t enough data, you can use back-translation on monolingual corpora, which are easy to assemble with parsers.
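Back-translation can be sketched in a few lines of Python. The idea: take real sentences in the target language, translate them back into the source language with an existing reverse model, and use the resulting (synthetic source, real target) pairs as extra training data. In the sketch below, `reverse_translate` stands in for a trained target-to-source model; the toy dictionary translator and the sample sentences are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) training pairs from target-language text."""
    pairs = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip empty lines left over from scraping
        synthetic_source = reverse_translate(target).strip()
        if synthetic_source:
            # The target side is real human text; only the source side is machine-made.
            pairs.append((synthetic_source, target))
    return pairs


# Toy stand-in for a trained French-to-English model (hypothetical).
TOY_FR_TO_EN = {"bonjour": "hello", "monde": "world"}


def toy_reverse_translate(sentence: str) -> str:
    return " ".join(TOY_FR_TO_EN.get(word, word) for word in sentence.lower().split())


monolingual_french = ["Bonjour monde", "", "bonjour"]
synthetic_pairs = back_translate(monolingual_french, toy_reverse_translate)
print(synthetic_pairs)
```

The synthetic source side may be imperfect, but the target side is fluent human text, which is the side the model learns to produce.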
6) What’s the best translator from your point of view right now?
In terms of translation quality, in my opinion it is DeepL (www.deepl.com). It is a small startup with an interesting story; you can read about it on the Internet.
7) Do you have an API to test?
At the moment the API offers text translation (40 languages) and voice recognition (English only for now).
Instructions at https://old-lingvanex.nordicwise.com/en/apireference/
8) You said you have 27 million downloads, but the Nordicwise account stats show about 1,000.
The 27 million downloads are for the old apps that I used to test market niches. About 15 million of them are translators. They are spread across 8 accounts. The new applications are on a separate account; for now they have 5 thousand users for testing and bug finding. The design is not finished yet. When it is ready, the apps will be promoted. If anyone is interested in seeing the old translators, send me a private message and I will send the links.
9) Who is on your team?
There are now 23 people taking part in the project. Twelve of them are engineers, each with 10+ years of experience in their field. There is one founder.
10) Have you taken investment?
Not yet. The project is self-funded. The goal now is to raise revenue and hire more people in order to go straight to a Series A round.
11) Achieving 85% of Google’s translation quality may be easier and cheaper than going from 85% to 90%. And even if Google ever open-sources its work, there is no guarantee it can be reproduced. Still, translation quality is not the biggest problem; distribution and monetization are a much bigger challenge. It is like browsers: there are Chrome and Firefox, and there are niche builds.
Raising quality from 85% to 100% is the hardest part, and every next percentage point is harder. The main thing is to get past the threshold at which customers keep using the product. Distribution and monetization are exactly our strongest points; several years of experiments went into them. This is what we are betting on.
12) Have you contributed to OpenNMT as open source?
OpenNMT is a community project, although only two companies started it. Many people have contributed to it, including us. For example, we ported it to mobile platforms and open-sourced the port; then, based on it, we wrote a scientific paper on how to obtain small neural network models and translate on phones without an Internet connection. This work is also publicly available.
13) Does computational linguistics have practical cases other than machine translation and information retrieval, where there are successful startups?
Yes, sure! For example: voice assistants, chatbots, semantic text analysis. There are about a dozen such startups in Minsk alone. For example, I like the Andy project (an application for learning a language by chatting with a bot).
Something similar is being done by people in our OpenNMT community. They train the chatbot’s conversations on movie subtitles.
In fact, the NLP market is growing very fast as speech recognition, semantic analysis and other niches improve. The real hype will start in 2-3 years, when big companies’ current promotion of the NLP market bears fruit. A series of mergers and acquisitions will begin. The main thing at that point is to have a company ready for sale while everyone else is busy with other niches like image and video processing.
14) As for everything depending on computing power: I even used an IBM quantum computer for NLP, and there were no major improvements.
Modern NLP algorithms are quite limited, and the limitation is built in at the level of the theoretical approach: they are all based on statistics. They rest on one idea, that statistics can be used to make artificial intelligence. It came from Bell Labs.
I agree that not everything depends on computational power. But up to a certain point, computational power matters a great deal for translation quality. Neural networks are now being supplemented with rule-based systems; the systems are always hybrid, not purely neural. DeepL beats Google precisely in this part, for example in translating idioms correctly.
You said, “Modern NLP algorithms are quite limited”.
I completely agree, and I have described this problem many times already. All the giants (Google, Microsoft, Amazon) have run into it, even with 100 highly qualified NLP specialists on staff.
High-quality translation and 100% accurate speech recognition are very complicated engineering tasks, even for very big companies. For a long time Google, Microsoft and Amazon worked without any major progress. Scientists who develop NLP don’t go to the giants because they feel comfortable at their universities. Everyone works in isolation, keeping their achievements closed from others.
New professionals who might be interested in NLP don’t enter the niche because of the high barrier to entry (money and technology). If I hadn’t earned $600,000 from my applications beforehand, I wouldn’t have entered this industry either.
Even that modest amount was enough only because OpenNMT was open source. Otherwise I would have needed at least $3 million to develop a similar project from scratch. It is the same for others. That’s why only a few people and startups took part in this niche and, as a result, there was no breakthrough for a long time.
Everything changed once large and small companies started open-sourcing their work and building communities. Everyone understood that without this there would be no breakthrough for a long time.
But now everything is changing. The result is noticeable.
15) If it’s not a secret, did you consider other countries such as Cyprus, Ireland, etc. when registering a company? If you considered it, why did you choose the Republic of Belarus?
Yes, the sales office was registered in Cyprus. The development center is in Belarus.
16) The figure (by 2024, the machine translation market will be estimated at $1.5 billion) seemed too optimistic to me at first. But recently I was helping my girlfriend choose a car from Germany on mobile.de, saw how many times I had to translate everything from German to Russian, and realized that the market exists and is very real.
In fact, it is even more interesting! If $1.5 billion is machine translation alone, the entire translation market is $70 billion (according to forecasts for 2023).
Why is the difference about 50 times?
Let’s imagine that the best machine translator now translates 80% of a text well. The remaining 20% needs to be edited by a human. The biggest cost in translation is proofreading, i.e. people’s salaries.
An increase in translation quality of even 1-2% (up to 82% in our example) can reduce the cost of proofreading by 3-5%.
3-5% of the difference between the whole translation market and machine translation (70 – 1.5 = $68.5 billion) is roughly $2–3.5 billion. That alone would at least double the machine translation market. The figures above are approximate and only illustrate the idea.
In other words, even a 1% improvement in quality allows large translation agencies to save a lot of money.
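As a rough check of the arithmetic above, here is a minimal Python sketch. It uses only the figures quoted in the text ($70 billion, $1.5 billion, savings of 3-5%); the function and constant names are illustrative, and the model is as simplified as the example itself.

```python
# Figures quoted in the text, in billions of US dollars.
TOTAL_TRANSLATION_MARKET = 70.0
MACHINE_TRANSLATION_MARKET = 1.5


def proofreading_savings(saving_rate: float) -> float:
    """Money saved if the non-machine part of the market gets cheaper by saving_rate."""
    human_part = TOTAL_TRANSLATION_MARKET - MACHINE_TRANSLATION_MARKET
    return human_part * saving_rate


# How many times the whole market is larger than machine translation alone.
market_ratio = TOTAL_TRANSLATION_MARKET / MACHINE_TRANSLATION_MARKET

# Savings for the 3% and 5% scenarios from the text.
low = proofreading_savings(0.03)
high = proofreading_savings(0.05)

print(f"Whole market vs machine translation: {market_ratio:.1f}x")
print(f"Savings: {low:.2f} to {high:.2f} billion USD")
print(f"Savings relative to the MT market: {low / MACHINE_TRANSLATION_MARKET:.1f}x "
      f"to {high / MACHINE_TRANSLATION_MARKET:.1f}x")
```

The savings come out larger than today's machine translation market itself, which is the point of the example.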
100% quality, or perfect machine translation in all areas, is not achievable in the near future. And each further percentage point of quality will be harder to achieve.
However, this doesn’t prevent the machine translation market from reaching 10% of the total translation market by 2023 (similar to how DeepL took 10% of Google’s market), as large companies test the APIs of different translators every day. Improving the quality of one of them by 1% (for some language) will allow them to save hundreds of millions of US dollars.
17) The main objective of a VC is to make 20-40x on the initial investment. A 0.5% market share will not sound so good in a pitch to potential investors.
Thank you for your advice!
I have never been an expert in pitches and presentations. Also, I couldn’t fit the text of the article and all the comments into a 2-minute pitch, and no one wanted to listen for longer.
After the phrase “We are making a translator like Google”, people lost interest. Because of this, it was difficult to hire a team in the beginning, and I had to find outsourced people at commercial rates.
When the project started to be successful, everything became much easier.
It’s a story that can’t be told in an hour, because there are a lot of different issues that only people in the translation industry can understand.
18) I work in computer vision, but NLP has always interested me. I would like to ask a few questions on the subject:
What is the minimum size of a data set for a language pair to get a good translation?
How do you filter data sets?
Do you have a separate model for each language pair, or do you train multilingual models?
How long does it take (and on what hardware) to train one production model?
Have you ever tried using tensor processors like Google TPUs instead of shared GPUs?
- From our experience, 5 million high-quality translated lines are enough for a good dataset for general topics (like news). For each separate topic (medicine, real estate, metallurgy, etc.) the model needs to be retrained. This can be done iteratively, starting from the base model.
- We have written our own filters, about 20 of them.
- A separate model for each pair. We tried a multilingual model but did not achieve a good result.
- From 1 to 2 weeks on 2 x RTX 2080 Ti for one language pair.
- We looked into it. It’s expensive and not worth doing.
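The filters themselves are not described here. Purely as an illustration, below is a minimal Python sketch of what a few common parallel-corpus filters can look like. The function names, the thresholds and the choice of checks are assumptions made for this example, not the actual Lingvanex filters.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def not_empty(pair: Pair) -> bool:
    """Reject pairs where either side is blank."""
    src, tgt = pair
    return bool(src.strip()) and bool(tgt.strip())


def not_copied(pair: Pair) -> bool:
    """Reject pairs where the 'translation' is just a copy of the source."""
    src, tgt = pair
    return src.strip().lower() != tgt.strip().lower()


def length_ratio_ok(pair: Pair, max_ratio: float = 3.0) -> bool:
    """Reject pairs whose word counts differ too much (assumed threshold)."""
    src, tgt = pair
    a, b = len(src.split()), len(tgt.split())
    return max(a, b) <= max_ratio * max(min(a, b), 1)


def not_too_long(pair: Pair, max_words: int = 200) -> bool:
    """Reject very long lines, which are often misaligned (assumed threshold)."""
    src, tgt = pair
    return len(src.split()) <= max_words and len(tgt.split()) <= max_words


# Hypothetical filter chain; a real pipeline would have many more checks.
FILTERS: List[Callable[[Pair], bool]] = [
    not_empty, not_copied, length_ratio_ok, not_too_long,
]


def filter_corpus(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield unique pairs that pass every filter in FILTERS."""
    seen = set()
    for pair in pairs:
        if pair in seen:
            continue  # exact duplicate
        seen.add(pair)
        if all(check(pair) for check in FILTERS):
            yield pair


sample = [
    ("Hello world", "Hallo Welt"),
    ("Hello world", "Hallo Welt"),  # duplicate
    ("", "leer"),                   # empty source
    ("Same text", "same text"),     # copied, not translated
    ("Hi", "Dies ist ein sehr langer Satz ohne Bezug"),  # bad length ratio
]
clean = list(filter_corpus(sample))
print(clean)
```

Each filter is a small function that returns True or False for one sentence pair, so new checks (language detection, character-set checks and so on) can be appended to the list without touching the rest.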
19) You can’t compete with companies like Google because they have huge R&D departments, supercomputers and billions of dollars. Building a commercial project on open-source tools is a bad idea.
Our cash flow allows us to get the world’s top AI station, the Nvidia DGX-2, with a capacity of 2 petaFlops (FP16), and to have enough power to train a model with better quality than Google’s.
What is the computational power of 2 petaFlops?
For example, let’s look at the history of the DeepL startup.
In 2017 DeepL was a small company and offered translations into only 6 languages.
DeepL was positioned as a tool that lets professional translators spend less time proofreading documents after machine translation. Even a small change in translation quality saves translation companies a lot of money. They constantly monitor machine translation APIs from different providers, because quality differs across language pairs and there is no single leader. In number of languages, Google supports the most.
To demonstrate its translation quality, DeepL decided to run tests in some languages (link below).
Quality was rated by blind testing, in which professional translators chose the best translation among Google, Microsoft, DeepL and Facebook. According to the results, DeepL won; the jury judged its translation the most “literary”.
How did it happen?
DeepL has a very interesting story. For many years they have owned the Linguee startup, the largest database of links to translated texts. Most likely, they have a huge number of datasets collected by parsers, and training on them requires a lot of computing power.
In 2017 DeepL published an article saying that they had built a 5 petaFlops supercomputer in Iceland (at that time it was the 23rd most powerful in the world). Training the best-quality model was only a matter of time. At that moment it seemed that even if we bought quality datasets, we would never be able to compete with them without such a computer.
In March 2018 Nvidia released the DGX-2, a computer the size of a nightstand with a performance of 2 petaFlops (FP16), which can now be leased from $5000 / month (first 6 months, then $15000 / month).
With such a computer you can train neural models on big datasets quickly, and also sustain a high load on the API. This radically changes the balance of power across the entire machine learning market and allows small companies to compete with the giants in the field of big data.
This is now the best price/performance offer on the market.
If DeepL beat Google with 10 people and a 5 petaFlops computer in 2017, and now I can rent a 2 petaFlops computer and have 23 people on my team, why can’t I theoretically beat Google?
The OpenNMT (TensorFlow) project is not an academic project. It is the first open-source machine translation project to become the basis for many commercial startups, and it is also used by Ebay and Booking to translate auctions and hotel descriptions. It will be well supported because it already has a large community.
In 2018 we all met in Paris. See the list of participants:
It’s no surprise that we did a great job with a small team. I simply gathered skilled engineers and had the money to implement the plan (similar to the story of DeepL). In total, more than 30 people from all over the world took part in my project, including the NLP department of the Polytechnic University of Valencia, Spain (the Sciling company). Our company is NordicWise LLC (Cyprus).
We gave a presentation at the Machine Translation Association in Alicante, Spain, and were among the first to port neural network translation to phones. We wrote a scientific paper about it, and one member of the team got a PhD.
People from the Harvard NLP department and other top specialists helped us.
And whether Google, Amazon and others really have more opportunities is already questionable. As for computing capacity, they don’t have any advantage in 2019 (for the reasons described above). All new ideas are being made publicly available.
For example, here:
We are not competitors to Google, Amazon, FB, Microsoft and other giants. They have approximately the same quality of translation, and they’ve all struggled with it. Each of them needs an advantage. They themselves are speeding up this NLP market for startups like DeepL. If they don’t help companies like ours, nothing will happen, because the market entry barrier is very high. And they make no progress, even if each of them has 100 specialists. None of this is obvious, which is why I wrote such a long article.
20) What do you need?
If you have some vision of how we can cooperate to win together – feel free to contact me.
Founder of Lingvanex