We are constantly training language models for our work. Our team uses dozens of different video cards chosen for different tasks: somewhere we need a powerful DGX station, and somewhere an old gaming card like RTX 2080Ti is enough. Choosing the optimal GPU for model training can significantly impact both the speed and cost-effectiveness of the process.
Interestingly, there are quite a few articles on the internet comparing GPUs for machine learning, but very few focus on the speed of language model training; most report only inference tests. When the new H100 chip was released, NVIDIA's report stated that it was up to nine times faster than the A100 in training, but for our tasks the new card was only 90% faster than the old one. Meanwhile, our cloud providers charged about twice as much for the H100 as for the A100, so switching to the new H100 would not have saved us any money.
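The reasoning behind this conclusion is simple arithmetic, sketched below. The function name `relative_cost_per_job` is our own illustration, and the inputs are the rounded figures from the paragraph above (a 2x price ratio and a 90% speed gain), not exact provider prices.

```python
def relative_cost_per_job(price_ratio: float, speedup: float) -> float:
    """Cost of one training job on the new GPU relative to the old one.

    price_ratio: hourly price of the new GPU divided by that of the old GPU.
    speedup: training speed of the new GPU divided by that of the old GPU.
    """
    return price_ratio / speedup


# H100 vs A100 for our workload: about 2x the price, 90% faster (1.9x speed).
h100_vs_a100 = relative_cost_per_job(price_ratio=2.0, speedup=1.9)

# A value above 1.0 means each training job costs more on the new card.
print(f"Relative cost per job: {h100_vs_a100:.3f}")
```

With these figures the result is slightly above 1, so the faster card finishes sooner but each job costs a little more.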
In addition, we tested a DGX station, which consists of 8 A100 80GB graphics cards and costs 10 thousand dollars per month. The test made it clear that its price/performance ratio does not suit us at all: for the same money we can get 66 x RTX 3090, which in total are much more useful.
Our translation language models have up to 500 million parameters (100 million to 300 million on average). It is possible that if we increased the number of parameters significantly, the price/performance ratio of the DGX would be better. Currently, we do not train large language models that can translate between all languages in all variations at once, but use a separate language model for each language pair, e.g. English-German. Each such model takes from 120 to 300 MB.
It is worth noting that different languages have different amounts of data available on the Internet. For example, for Spanish you can find 500 million sentences with translations, while for rarer languages like Tibetan there is far less, so the GPU has to be chosen based on the available data. To create a translation model from English to Spanish, we use a server with 4 x RTX 4500 and 256GB RAM. At the same time, a Tibetan model can be trained on an RTX 2080 Ti with 16GB RAM, since with a small amount of data it makes no sense to increase the complexity of the neural network and, as a result, to take a more powerful server.
Selecting graphics processors and theoretical figures
Language model training took place on our internal Data Studio platform using the OpenNMT-tf framework. This phase included data preparation, model training, and model comparison with a reference translation. Using FP16 instead of FP32 during training allowed us to significantly reduce the training time of language models without degrading translation quality, but not all of our GPUs supported that.
When choosing a graphics processor, it is standard to consider such metrics as processing power (TFLOPS), video memory (VRAM), GPU benchmark results, library and framework support, budget, and other factors (graphics card size and form factor, power requirements, cooling, and compatibility with your system). When training text generation models, you should also keep in mind that different languages consume different amounts of resources. For example, in UTF-8 one character takes 1 byte for basic Latin letters, 2 bytes for Cyrillic letters, and 3 bytes for most logographic characters such as Chinese. Knowing the characteristics of your graphics card in advance helps you predict the speed of the training process.
For the comparison, the graphics cards were divided into two groups according to the period of use: cards used earlier, on which the first measurements of training speed were made, and cards currently in use. The main characteristics of these graphics cards can be found in Table 1 and Table 2, respectively.
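The byte counts mentioned above can be checked directly in Python. This is a minimal sketch that assumes the text is stored as UTF-8; the sample characters are our own choice for illustration.

```python
# One sample character per script; the choice of characters is illustrative.
samples = {
    "Latin": "a",
    "Cyrillic": "я",
    "Chinese": "中",
}

# Number of bytes each character occupies when encoded as UTF-8.
utf8_bytes = {script: len(char.encode("utf-8")) for script, char in samples.items()}

for script, size in utf8_bytes.items():
    print(f"{script}: {size} byte(s) per character")
```

The same sentence therefore takes more raw bytes in some scripts than in others, which affects the size of the training data on disk and in memory.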
Table 1 - Previously used graphics processors and their technical parameters
Number of GPUs | GPU | VRAM (type, total GB) | CUDA compute capability | FP16, TFLOPS | FP32, TFLOPS |
---|---|---|---|---|---|
1 | Tesla V100-SXM2 | HBM2, 16 | 7.0 | 31.33 | 15.67 |
2 | Tesla V100-SXM2 | HBM2, 32 | 7.0 | 31.33 | 15.67 |
1 | RTX 4060 Ti | GDDR6, 8 | 8.9 | 22.06 | 22.06 |
1 | Nvidia A40 | GDDR6, 48 | 8.6 | 37.42 | 37.42 |
2 | Nvidia A40 | GDDR6, 96 | 8.6 | 37.42 | 37.42 |
1 | Nvidia A100 | HBM2, 40 | 8.0 | 77.97 | 19.49 |
1 | Nvidia A100 | HBM2, 80 | 8.0 | 77.97 | 19.49 |
1 | Nvidia RTX A6000 | GDDR6, 48 | 8.6 | 38.71 | 38.71 |
1 | Nvidia A10 | GDDR6, 24 | 8.6 | 31.24 | 31.24 |
8 | Nvidia A10 | GDDR6, 192 | 8.6 | 31.24 | 31.24 |
1 | Nvidia H100 | HBM3, 80 | 9.0 | 204.9 | 51.22 |
Notes
1. With CUDA compute capability 7.0 or higher, using FP16 gives a boost in training speed; the size of the boost depends on the compute capability and on the characteristics of the graphics card itself.
2. If the graphics card specification lists an FP16 to FP32 performance ratio greater than 1:1, mixed precision can be expected to speed up training by roughly that ratio. For example, the Quadro RTX 6000 has an FP16 value of 32.62 TFLOPS (2:1), which should speed up training by at least a factor of two (2.4 times in practice).
Table 2 - Currently used GPU models and their main characteristics
Number of GPUs in use | GPU | VRAM (type, total GB) | CUDA compute capability | FP16, TFLOPS* | FP32, TFLOPS* |
---|---|---|---|---|---|
1 | Quadro RTX 6000 | GDDR6, 24 | 7.5 | 32.62 | 16.31 |
2 | Quadro RTX 6000 | GDDR6, 48 | 7.5 | 32.62 | 16.31 |
4 | Quadro RTX 6000 | GDDR6, 96 | 7.5 | 32.62 | 16.31 |
2 | Nvidia TITAN RTX | GDDR6, 48 | 7.5 | 32.62 | 16.31 |
4 | Nvidia RTX A4500 | GDDR6, 96 | 8.6 | 23.65 | 23.65 |
1 | Nvidia GeForce RTX 3090 | GDDR6X, 24 | 8.6 | 35.58 | 35.58 |
1 | Nvidia GeForce RTX 3070 | GDDR6, 8 | 8.6 | 20.31 | 20.31 |
* FP16 and FP32 TFLOPS values are taken from the specifications and are given per GPU.
GPU training and testing process
The models were trained using a set of 18 GPUs. In the process of neural network training, we used numerous language pairs (more than a hundred languages). The GPU tests have helped identify which hardware performs best for specific tasks. During the training of our language pairs, the following neural network parameters were taken as a basis:
- vocab size = 30 000
- num units = 768
- layers = 6
- heads = 16
- inner dimension = 4 096
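To give a feeling for the model size these parameters imply, here is a rough back-of-the-envelope estimate in Python. It is a sketch under several assumptions that are ours, not stated in the article: a standard encoder-decoder Transformer with 6 encoder and 6 decoder layers, separate source embeddings, target embeddings, and output projection (each vocab size x num units), and biases and layer normalization ignored. The real models may differ.

```python
# Hyperparameters from the list above.
vocab_size = 30_000
num_units = 768
num_layers = 6          # assumed: 6 encoder layers and 6 decoder layers
ffn_inner_dim = 4_096

# One multi-head attention block: query, key, value, and output projections.
# The number of heads does not change the parameter count.
attention = 4 * num_units * num_units

# Position-wise feed-forward block: two dense layers.
feed_forward = 2 * num_units * ffn_inner_dim

# An encoder layer has self-attention; a decoder layer adds cross-attention.
encoder_layer = attention + feed_forward
decoder_layer = 2 * attention + feed_forward

# Assumed: unshared source embedding, target embedding, and output projection.
embeddings = 3 * vocab_size * num_units

total_params = num_layers * (encoder_layer + decoder_layer) + embeddings
print(f"Estimated parameters: {total_params / 1e6:.1f} million")
```

Under these assumptions the estimate lands inside the 100 to 300 million range mentioned earlier for our typical models.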
First, let's characterize the GPUs of the first group (Table 1). The basis for comparison is the time, in minutes and seconds, needed to train the model for 1,000 steps with an effective batch size of 100,000 units.
We emphasize that for the first group, the speed measurements were performed with the alignment mechanism enabled and using FP32 only. Without this mechanism, training on some servers can be much faster.
The alignment mechanism matches substrings in the source text with substrings in the translated text. It is needed to translate formatted text, such as web pages, where a substring in a sentence may be highlighted in a different font and should keep that highlighting in the translation.
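The relationship between the effective batch size and the per-GPU batch size reported in the tables can be sketched as follows. We assume here that the effective batch size is reached through gradient accumulation and that the number of accumulation steps is rounded up; both the helper name `accumulation_steps` and the rounding rule are illustrative assumptions rather than a description of the framework's internals.

```python
def accumulation_steps(effective_batch: int, batch_per_gpu: int, num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach the effective batch size.

    Each step processes batch_per_gpu * num_gpus units; the result is
    rounded up so that the effective batch size is reached or exceeded.
    """
    per_step = batch_per_gpu * num_gpus
    return -(-effective_batch // per_step)  # integer ceiling division


EFFECTIVE_BATCH = 100_000

# (number of GPUs, batch size per GPU) pairs as they appear in the measurements.
configs = [(8, 6_250), (1, 25_000), (2, 12_500), (2, 4_167)]

for gpus, batch in configs:
    steps = accumulation_steps(EFFECTIVE_BATCH, batch, gpus)
    print(f"{gpus} GPU(s) x batch {batch}: {steps} accumulation step(s)")
```

A card with less memory uses a smaller batch and more accumulation steps, so every configuration does the same amount of work per 1,000 training steps and the times stay comparable.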
Given the neural network parameters above, the best single-GPU time in the first group was shown by the Nvidia H100, with a training time of 22 minutes. The GeForce RTX 4060 Ti was in the middle of the range with 72 minutes, and last place was taken by a single Tesla V100-SXM2 with 140 minutes.
The tests also included multi-GPU configurations: eight Nvidia A10 cards with a training time of 20 minutes and 28 seconds, two Nvidia A40 cards with 56 minutes, and two Tesla V100-SXM2 cards with 86 minutes. Using several cards of the same series at once speeds up training and can bring the time close to that of more powerful GPUs, but this approach may not be rational financially or operationally. The results of the training speed measurements are shown in Table 3.
Table 3 - Training time measurements on the previously used graphics cards
Conditions: alignment mechanism enabled, effective batch size = 100 000, FP32
Number of GPUs in use | GPU | Approximate time for 1,000 steps (min. sec) | Batch size in use |
---|---|---|---|
8 | Nvidia A10 | 20,28 | 6 250 |
1 | Nvidia H100 | 22 | 25 000 |
1 | A100 (80 Gb) | 40 | 25 000 |
1 | A100 (40 Gb) | 56 | 15 000 |
2 | Nvidia A40 | 56 | 12 500 |
1 | RTX A6000 | 68,25 | 12 500 |
1 | GeForce RTX 4060 Ti | 72 | 4 167 |
1 | Nvidia A40 | 82,08 | 12 500 |
2 | Tesla V100-SXM | 86 | 4 167 |
1 | Nvidia A10 | 104,50 | 5 000 |
1 | Tesla V100-SXM2 | 140 | 4 167 |
Next, let's carry out a comparative analysis of the graphics accelerators currently in use (Table 2). For this group, speed measurements were performed with the alignment mechanism enabled, using both FP32 and FP16. The FP32 results are presented in Table 4 and the FP16 results in Table 5.
In the FP32 measurements, first place was taken by the RTX A4500 with a training time of 31 minutes, but it should be emphasized that this result was obtained with four GPUs. With fewer units, the training time of this GPU would be much longer, which would put it in the penultimate place in the final table.
The Quadro RTX 6000 is in second place with a training time of 47 minutes. This result also depends on the number of GPUs used, which is four. A single GPU of this type is about 3.2 times slower, at approximately 153 minutes, which puts it in last place.
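The effect of adding GPUs can be quantified as a speedup and a scaling efficiency. The sketch below uses our FP32 measurements for one, two, and four Quadro RTX 6000 cards (153, 88, and 47 minutes per 1,000 steps); the variable names and the efficiency definition (speedup divided by the number of GPUs) are our own illustration.

```python
# FP32 training time in minutes per 1,000 steps, keyed by number of GPUs.
quadro_rtx_6000_minutes = {1: 153.0, 2: 88.0, 4: 47.0}

baseline = quadro_rtx_6000_minutes[1]

scaling = {}
for num_gpus, minutes in quadro_rtx_6000_minutes.items():
    speedup = baseline / minutes      # how many times faster than one GPU
    efficiency = speedup / num_gpus   # 1.0 would be perfect linear scaling
    scaling[num_gpus] = (speedup, efficiency)
    print(f"{num_gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Four cards give a little over three times the speed of one, so scaling is good but not linear, which matters when comparing the price of one large GPU with several smaller ones.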
Third place went to the TITAN RTX with a time of 75.85 minutes. This result was obtained with two GPUs, which reduced the training time of the model.
Among single-GPU configurations, the clear leader in training speed is the GeForce RTX 3090 with a time of 78.26 minutes. Increasing the number of these GPUs would accelerate training further and would likely overtake all the GPU models mentioned above. The training time measurements are shown in Table 4.
Table 4 - Comparative analysis of language model training speed on currently used GPUs (FP32)
Conditions: alignment mechanism enabled, effective batch size = 100 000, FP32
Number of GPUs in use | GPU | Approximate time for 1,000 steps (min. sec) | Batch size in use |
---|---|---|---|
4 | Nvidia RTX A4500 | 31 | 5 000 |
4 | Quadro RTX 6000 | 47 | 6 250 |
2 | Nvidia TITAN RTX | 75,85 | 6 250 |
1 | GeForce RTX 3090 | 78,26 | 6 250 |
2 | Quadro RTX 6000 | 88 | 6 250 |
1 | GeForce RTX 3070 | 104,17 | 2 000 |
1 | Quadro RTX 6000 | 153 | 6 250 |
The following training speed measurements were performed using FP16. Compared to FP32, half precision reduces the amount of memory consumed during model training and accelerates computation on the GPU. The accuracy of the number representation is lower than with FP32.
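The trade-off between memory and accuracy can be illustrated with Python's standard `struct` module, which supports IEEE 754 half precision through the `e` format code. This is only a sketch of the number formats themselves, not of how the training framework applies mixed precision; the helper name `roundtrip` is ours.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Store a number in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


# Storage size per number: "<e" is half precision, "<f" is single precision.
fp16_bytes = struct.calcsize("<e")
fp32_bytes = struct.calcsize("<f")

# Rounding error when storing the same value in each format.
value = 0.1
fp16_error = abs(roundtrip(value, "<e") - value)
fp32_error = abs(roundtrip(value, "<f") - value)

print(f"FP16: {fp16_bytes} bytes per number, error {fp16_error:.2e}")
print(f"FP32: {fp32_bytes} bytes per number, error {fp32_error:.2e}")
```

Half precision stores each number in half the space, at the cost of a noticeably larger rounding error.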
Compared with the FP32 results in Table 4, using FP16 reduced the training time of the neural network by roughly a factor of two or more. The results in Table 5 show that the positions of the GPUs remained largely unchanged. The two-GPU Quadro RTX 6000 configuration moved up from fifth position to fourth, finishing ahead of the GeForce RTX 3090 (37.93 against 38.89). The final numbers are shown in Table 5.
Table 5 - Comparative analysis of language model training speed on currently used GPUs (FP16)
Conditions: alignment mechanism enabled, effective batch size = 100 000, FP16
Number of GPUs in use | GPU | Approximate time for 1,000 steps (min. sec) | Batch size in use |
---|---|---|---|
4 | Nvidia RTX A4500 | 15,81 | 10 000 |
4 | Quadro RTX 6000 | 20,34 | 12 500 |
2 | Nvidia TITAN RTX | 32,68 | 6 250 |
2 | Quadro RTX 6000 | 37,93 | 10 000 |
1 | GeForce RTX 3090 | 38,89 | 10 000 |
1 | GeForce RTX 3070 | 48,51 | 2 500 |
1 | Quadro RTX 6000 | 52,56 | 10 000 |