Deep Learning GPU Benchmarks

We are constantly training language models for our work. Our team uses dozens of different video cards chosen for different tasks: some jobs need a powerful DGX station, while for others an old gaming card such as the RTX 2080 Ti is enough. Choosing the optimal GPU for model training can significantly impact both the speed and cost-effectiveness of the process.

Interestingly, there are quite a few articles on the internet comparing GPUs for machine learning, but very few focus on the speed of language model training; most report inference tests only. When the new H100 chip was released, NVIDIA's report stated that it was up to nine times faster than the A100 in training, but for our tasks the new card was only 90% faster than the old one. Meanwhile, our cloud providers charged twice as much for the H100 as for the A100, so there was no point in switching to the new H100 to save money.
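The trade-off can be made concrete with a little arithmetic. The sketch below assumes that the cost of one training run scales as rental price divided by relative training speed; the function name is illustrative, and the inputs are the 2x price difference and the 90% speedup quoted above.

```python
def relative_cost_per_run(price_ratio: float, speed_ratio: float) -> float:
    """Cost of one training run on the new GPU relative to the old one.

    price_ratio: new rental price / old rental price.
    speed_ratio: new training speed / old training speed.
    """
    return price_ratio / speed_ratio


# Figures from the text: the H100 costs 2x as much and is 90% faster (1.9x).
h100_vs_a100 = relative_cost_per_run(price_ratio=2.0, speed_ratio=1.9)
print(f"H100 cost per run relative to A100: {h100_vs_a100:.2f}")
```

A value above 1 means the faster card is still more expensive per run. With these figures it comes out at about 1.05, so the H100 finishes sooner but does not save money.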

In addition, we tested a DGX station, which consists of 8 A100 80GB graphics cards and costs 10 thousand dollars per month. The test made it clear that the price/performance ratio of this station does not suit us at all: for the same money we can get 66 x RTX 3090, which in total will be much more useful.

Our translation language models have up to 500 million parameters (100 million to 300 million on average). It is possible that if we increase the number of parameters significantly, the price/performance ratio of the DGX will be better. Currently, we do not train large language models that can translate between all languages in all variations at once, but use a separate language model for each language pair, e.g. English-German. Each such model takes up from 120 to 300 MB.

It is worth noting that different languages have different amounts of data available on the Internet. For example, for Spanish you can find 500 million sentences with translations, while for rarer languages like Tibetan far less is available, so the GPU has to be chosen based on the amount of data. To create a translation model from English to Spanish, we use a server with 4 x RTX A4500 and 256GB RAM. At the same time, a Tibetan model can be trained on an RTX 2080 Ti with 16GB RAM, as with a small amount of data it makes no sense to increase the complexity of the neural network and, as a result, to take a more powerful server.

Selecting graphics processors and theoretical figures

Language model training took place on our internal Data Studio platform using the OpenNMT-tf framework. This phase included data preparation, model training, and model comparison with a reference translation. Using FP16 instead of FP32 during training allowed us to significantly reduce the training time of language models without degrading translation quality, but not all of our GPUs supported that.

When choosing a graphics processor, it is standard to consider metrics such as processing power (TFLOPS), video memory (VRAM), GPU benchmark results, library and framework support, budget, and other factors (graphics card size and form factor, power requirements, cooling, and compatibility with your system). When training text generation models, you should also keep in mind that different languages consume different amounts of resources. For example, one character takes 1 byte for Latin-script languages, 2 bytes for Cyrillic languages, and 3 bytes for languages written with hieroglyphs. Understanding the characteristics of your graphics card matters, because they have a significant impact on the speed of the training process.
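The byte counts above match UTF-8, although the text does not name the encoding, so treat that as an assumption. A minimal sketch that checks the sizes for one sample character per script (the sample characters are chosen only for illustration):

```python
# One illustrative character per script; UTF-8 is assumed as the encoding.
samples = {
    "Latin": "a",
    "Cyrillic": "я",
    "CJK": "語",
}

utf8_bytes = {script: len(ch.encode("utf-8")) for script, ch in samples.items()}

for script, size in utf8_bytes.items():
    print(f"{script}: {size} byte(s) per character")
```

Note that this holds for basic letters only: accented Latin letters such as "é" already take 2 bytes in UTF-8.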

For the purposes of this comparison, the video cards were divided into two groups according to when they were used: earlier cards, on which the first training speed measurements were made, and cards currently in use. The main characteristics of these graphics cards can be found in Table 1 and Table 2, respectively.

Table 1 - Previously used graphics processors and their technical parameters

Number of GPUs	GPU	VRAM (type, GB)	CUDA compute capability	FP16, TFLOPS	FP32, TFLOPS
1	Tesla V100-SXM2	HBM2, 16	7.0	31.33	15.67
2	Tesla V100-SXM2	HBM2, 32	7.0	31.33	15.67
1	RTX 4060 Ti	GDDR6, 8	8.9	22.06	22.06
1	Nvidia A40	GDDR6, 48	8.6	37.42	37.42
2	Nvidia A40	GDDR6, 96	8.6	37.42	37.42
1	Nvidia A100	HBM2, 40	8.0	77.97	19.49
1	Nvidia A100	HBM2, 80	8.0	77.97	19.49
1	Nvidia RTX A6000	GDDR6, 48	8.6	38.71	38.71
1	Nvidia A10	GDDR6, 24	8.6	31.24	31.24
8	Nvidia A10	GDDR6, 192	8.6	31.24	31.24
1	Nvidia H100	HBM3, 80	9.0	204.9	51.22

Notes
1. With a CUDA compute capability greater than 7.0, using FP16 will give a boost in training speed, depending on the compute capability and the characteristics of the graphics card itself.
2. If the specification of the graphics card indicates an FP16 to FP32 performance ratio greater than 1:1, then using mixed precision should increase the training speed by at least that ratio. For example, for the Quadro RTX 6000, the FP16 value of 32.62 TFLOPS (2:1) speeds up training by at least two times (2.4 times in practice).

Table 2 - Currently used GPU models and their main characteristics

Number of GPUs in use	GPU	VRAM (type, GB)	CUDA compute capability	FP16, TFLOPS	FP32, TFLOPS
1	Quadro RTX 6000	GDDR6, 24	7.5	32.62	16.31
2	Quadro RTX 6000	GDDR6, 48	7.5	32.62	16.31
4	Quadro RTX 6000	GDDR6, 96	7.5	32.62	16.31
2	Nvidia TITAN RTX	GDDR6, 48	7.5	32.62	16.31
4	Nvidia RTX A4500	GDDR6, 96	8.6	23.65	23.65
1	Nvidia GeForce RTX 3090	GDDR6X, 24	8.6	35.58	35.58
1	Nvidia GeForce RTX 3070	GDDR6, 8	8.6	20.31	20.31

Note: the FP16 and FP32 TFLOPS values are taken from the specifications and are given per GPU.

GPU training and testing process

The models were trained using a set of 18 GPUs. In the process of neural network training, we used numerous language pairs (more than a hundred languages). The GPU tests have helped identify which hardware performs best for specific tasks. During the training of our language pairs, the following neural network parameters were taken as a basis:

vocab size = 30 000
numunits = 768
layers = 6
heads = 16
inner dimension = 4 096
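To give a feel for the model size these settings imply, here is a rough weight-count estimate for a standard encoder-decoder Transformer. It is a sketch under several assumptions that the text does not confirm: the same 30 000-entry vocabulary on both sides, embedding size equal to the number of units, untied embeddings, and biases and layer norms ignored. The function name is hypothetical.

```python
def transformer_param_estimate(vocab_size, num_units, num_layers, ffn_inner_dim,
                               tie_embeddings=False):
    """Rough weight count for an encoder-decoder Transformer.

    Biases and layer-norm parameters are ignored, so this is a lower bound.
    """
    attention = 4 * num_units * num_units         # Q, K, V and output projections
    feed_forward = 2 * num_units * ffn_inner_dim  # two dense layers
    encoder_layer = attention + feed_forward      # self-attention + FFN
    decoder_layer = 2 * attention + feed_forward  # self- and cross-attention + FFN
    embedding = vocab_size * num_units
    # Source embedding, target embedding and output projection (one matrix if tied).
    embeddings = embedding if tie_embeddings else 3 * embedding
    return num_layers * (encoder_layer + decoder_layer) + embeddings


# Parameters listed above; the number of heads does not change the weight count.
estimate = transformer_param_estimate(30_000, 768, 6, 4_096)
print(f"~{estimate / 1e6:.0f}M parameters")
```

Under these assumptions the estimate is about 187 million weights, which falls inside the 100 to 300 million range mentioned earlier.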

First, let's characterize the GPUs of the first group, listed in Table 1. The basis for comparison is the time, in minutes and seconds, needed to train the model for 1,000 steps with an effective batch size of 100,000 units.
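Because the cards differ in memory, each setup in Table 3 runs with its own batch size while the effective batch size stays at 100 000. One common way to reach a fixed effective batch size is gradient accumulation; the sketch below assumes that this is what happens here and that "batch size in use" is per GPU, neither of which the text states explicitly. The function name is illustrative.

```python
import math


def accumulation_steps(effective_batch_size, batch_size, num_gpus=1):
    """Gradient-accumulation steps needed to reach the effective batch size."""
    return math.ceil(effective_batch_size / (batch_size * num_gpus))


# (number of GPUs, batch size in use) pairs taken from Table 3.
setups = {
    "8 x A10": (8, 6_250),
    "1 x H100": (1, 25_000),
    "1 x V100": (1, 4_167),
}

steps = {name: accumulation_steps(100_000, batch, gpus)
         for name, (gpus, batch) in setups.items()}
print(steps)
```

Under this reading, a card with little memory simply accumulates gradients over more small batches before each weight update, which is one reason it needs more time per 1,000 steps.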

We emphasize that for the first group, the speed measurements were performed with the alignment mechanism enabled and using FP32 only. Without this mechanism, training on some servers can be much faster.

The alignment mechanism allows matching substrings in the base and translated text. It is needed to translate formatted text, such as web pages, when a substring in a sentence may be highlighted in a different font and should be translated with the highlighting.

With the above neural network parameters, the best single-GPU time in the first group was shown by the Nvidia H100, which took 22 minutes; the GeForce RTX 4060 Ti was in the middle with 72 minutes; and the Tesla V100-SXM2 came last with 140 minutes.

The GPU test also included multi-card setups: eight Nvidia A10 cards with a training time of 20 minutes and 28 seconds, two Nvidia A40 cards with 56 minutes, and two Tesla V100-SXM2 cards with 86 minutes. Using several cards of the same series at once speeds up training and can bring the time close to that of more powerful GPUs, but this approach may not be rational financially or operationally. The results of the training speed measurements are shown in Table 3.

Table 3 - Training time measurements on the previously used graphics cards

Using the alignment mechanism
Effective batch size = 100 000
FP32
Number of GPUs in use	GPU	Approximate time for 1,000 steps (min,sec)	Batch size in use
8	Nvidia A10	20,28	6 250
1	Nvidia H100	22	25 000
1	Nvidia A100 (80 GB)	40	25 000
1	Nvidia A100 (40 GB)	56	15 000
2	Nvidia A40	56	12 500
1	Nvidia RTX A6000	68,25	12 500
1	GeForce RTX 4060 Ti	72	4 167
1	Nvidia A40	82,08	12 500
2	Tesla V100-SXM2	86	4 167
1	Nvidia A10	104,50	5 000
1	Tesla V100-SXM2	140	4 167

Next, let's carry out a comparative analysis of the graphics accelerators currently in use (Table 2). For this group of graphics processors, speed measurements were performed using the alignment mechanism with both FP32 and FP16. The results are presented below in Tables 4 (FP32) and 5 (FP16), respectively.

Having measured the speed of the GPUs from this table, we can say that first place was taken by the RTX A4500 with a training time of 31 minutes, but it should be emphasized that this speed was obtained by using four of these cards. With a single card, the training time would be much longer, which would put it in the penultimate place in the final table.

The Quadro RTX 6000 is in second place with a training time of 47 minutes. This result also depends on the number of cards used, which is four. Using only one such GPU slows training by about 3.2 times, to approximately 153 minutes, which puts it in last place.

Third place was taken by the TITAN RTX with a time of 75.85 minutes. This result is due to the use of two cards, which reduced the training time of the model.

The leader among single-card setups is the GeForce RTX 3090 with a time of 78.26 minutes. Increasing the number of these cards should accelerate training further and would likely overtake all the GPU configurations mentioned above. The training time measurements can be seen in Table 4.

Table 4 - Comparative analysis of language model training speed on currently used GPUs

Using the alignment mechanism
Effective batch size = 100 000
FP32
Number of GPUs in use	GPU	Approximate time for 1,000 steps, minutes	Batch size in use
4	Nvidia RTX A4500	31	5 000
4	Quadro RTX 6000	47	6 250
2	Nvidia TITAN RTX	75.85	6 250
1	GeForce RTX 3090	78.26	6 250
2	Quadro RTX 6000	88	6 250
1	GeForce RTX 3070	104.17	2 000
1	Quadro RTX 6000	153	6 250
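The Quadro RTX 6000 appears in Table 4 with one, two and four cards, which makes it possible to estimate how well training scales with the number of GPUs. The sketch below uses the usual definitions (speedup = single-card time / multi-card time, efficiency = speedup / number of cards); the variable and function names are illustrative.

```python
# Minutes per 1,000 steps for the Quadro RTX 6000 (FP32, with alignment), from Table 4.
minutes_by_gpu_count = {1: 153.0, 2: 88.0, 4: 47.0}


def scaling(times):
    """Return {gpu_count: (speedup, efficiency)} relative to the single-GPU time."""
    baseline = times[1]
    return {n: (baseline / t, baseline / t / n) for n, t in times.items()}


results = scaling(minutes_by_gpu_count)
for n, (speedup, efficiency) in results.items():
    print(f"{n} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Four cards give a speedup of roughly 3.26x rather than 4x, so each added card contributes a little less than the previous one. This is worth factoring into any price/performance comparison of multi-card servers.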

The following training speed measurements were performed using FP16. Compared to FP32, half precision reduces the amount of memory consumed during model training and accelerates computation on the GPU, at the cost of lower numerical precision than FP32.

Compared with the FP32 training times in the previous table, the training time of the neural network was reduced by a factor of about two or more. The results in Table 5 show that the positions of the GPUs remained largely unchanged. The two-card Quadro RTX 6000 setup moved up from fifth position to fourth, beating the GeForce RTX 3090 by 0.96 minutes (about 58 seconds). The final numbers are shown in Table 5.

Table 5 - Comparative analysis of language model training speed on currently used GPUs

Using the alignment mechanism
Effective batch size = 100 000
FP16
Number of GPUs in use	GPU	Approximate time for 1,000 steps, minutes	Batch size in use
4	Nvidia RTX A4500	15.81	10 000
4	Quadro RTX 6000	20.34	12 500
2	Nvidia TITAN RTX	32.68	6 250
2	Quadro RTX 6000	37.93	10 000
1	GeForce RTX 3090	38.89	10 000
1	GeForce RTX 3070	48.51	2 500
1	Quadro RTX 6000	52.56	10 000
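The gain from FP16 can be read off by dividing each FP32 time in Table 4 by the matching FP16 time in Table 5. The sketch below does this, treating the table values as decimal minutes (an assumption, since some fractional parts exceed 59). Note that most setups also ran with a larger batch size under FP16, so the ratio reflects the combined effect.

```python
# Minutes per 1,000 steps: (FP32 from Table 4, FP16 from Table 5).
timings = {
    "4 x RTX A4500": (31.0, 15.81),
    "4 x Quadro RTX 6000": (47.0, 20.34),
    "2 x TITAN RTX": (75.85, 32.68),
    "2 x Quadro RTX 6000": (88.0, 37.93),
    "1 x GeForce RTX 3090": (78.26, 38.89),
    "1 x GeForce RTX 3070": (104.17, 48.51),
    "1 x Quadro RTX 6000": (153.0, 52.56),
}

speedups = {name: fp32 / fp16 for name, (fp32, fp16) in timings.items()}

# Print the setups from the largest FP16 gain to the smallest.
for name, speedup in sorted(speedups.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {speedup:.2f}x faster with FP16")
```

The Turing-generation cards (Quadro RTX 6000, TITAN RTX), whose specifications list FP16 at twice the FP32 rate, gain the most, while the cards with a 1:1 ratio in Table 2 stay close to 2x.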
