Deep Learning GPU Benchmarks

Victoria Kripets

Linguist

We are constantly training language models for our work. Our team uses dozens of different video cards chosen for different tasks: for some we need a powerful DGX station, while for others an older gaming card like the RTX 2080 Ti is enough. Choosing the optimal GPU for model training can significantly impact both the speed and the cost-effectiveness of the process.

Interestingly, there are quite a few articles on the internet comparing GPUs for machine learning, but very few focus on speed for language model training; mostly, only inference tests are published. When the new H100 chip was released, NVIDIA's report stated that it was up to nine times faster than the A100 in training, but for our tasks the new card was only 90% faster than the old one. By comparison, our cloud providers priced these GPUs about 2x apart, so there was no point in switching to the new H100 to save money.
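
To make the arithmetic explicit (a rough sketch using our figures above; real pricing varies by provider):

```python
# Relative cost per unit of training throughput, H100 vs A100 (our figures).
speedup = 1.9       # H100 was ~90% faster than A100 on our tasks
price_ratio = 2.0   # our cloud providers charged ~2x more for the H100
print(price_ratio / speedup)  # ~1.05: slightly more money per step trained
```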

In addition, we tested a DGX station, which consists of 8 x A100 80GB graphics cards and costs 10 thousand dollars per month. After the test, it became clear that the price/performance ratio of this station does not suit us at all: for the same money we could take 66 x RTX 3090, which in total would be far more useful.

Our translation language models have up to 500 million parameters (100 million to 300 million on average). It is possible that if we increased the number of parameters significantly, the price/performance ratio of the DGX would look better. Currently, we do not train large language models that can translate between all languages in all combinations at once; instead, we use a separate language model for each language pair, e.g. English-German. Each such model takes from 120 to 300 MB.

It is worth noting that different languages have different amounts of data available on the Internet. For example, for Spanish you can find 500 million sentences with translations, whereas for rarer languages like Tibetan there is far less, so the GPU for the task has to be chosen according to the available data. To create a translation model from English to Spanish, we use a server with 4 x RTX 4500 and 256 GB RAM. Tibetan, on the other hand, can be trained on an RTX 2080 Ti with 16 GB RAM, since with so little data it makes no sense to increase the complexity of the neural network and, as a result, to take a more powerful server.

Selecting graphics processors and theoretical figures

Language model training took place on our internal Data Studio platform using the OpenNMT-tf framework. This phase included data preparation, model training, and model comparison against a reference translation. Using FP16 instead of FP32 during training allowed us to significantly reduce training time without degrading translation quality, but not all of our GPUs supported it.

When choosing a graphics processor, it is standard to consider metrics such as processing power (TFLOPS), video memory (VRAM), GPU benchmark results, library and framework support, budget, and other factors (graphics card size and form factor, power requirements, cooling, and compatibility with your system). When training text generation models, you should also keep in mind that different languages consume different amounts of resources: for example, UTF-8 encodes one character in 1 byte for Latin-script languages, 2 bytes for Cyrillic, and 3 bytes for languages using CJK characters. Understanding these characteristics of your hardware has a significant impact on the speed of the training process.
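
As a quick illustration of the encoding point (a minimal Python sketch; the sample strings are ours):

```python
# UTF-8 bytes per character differ by script: ~1 for Latin, 2 for Cyrillic,
# 3 for CJK, so the same corpus size in characters means different byte sizes.
for text in ("Latin text", "Кириллица", "日本語のテキスト"):
    size = len(text.encode("utf-8"))
    print(f"{text}: {len(text)} chars -> {size} bytes ({size / len(text):.1f} B/char)")
```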

For these measurements, the video cards were divided into two groups by period of use: early video cards, which were used to make the first training speed measurements, and the cards currently in use. The main characteristics of these graphics cards can be found in Table 1 and Table 2, respectively.

Table 1 - Previously used graphics processors and their technical parameters
 

Number of GPUs | GPU | VRAM, GB | CUDA | FP16, TFLOPS | FP32, TFLOPS
1 | Tesla V100-SXM2 | HBM2, 16 | 7.0 | 31.33 | 16.31
2 | Tesla V100-SXM2 | HBM2, 32 | 7.0 | 31.33 | 15.67
1 | RTX 4060 Ti | GDDR6, 8 | 8.9 | 22.06 | 22.06
1 | Nvidia A40 | GDDR6, 48 | 8.6 | 37.42 | 37.42
2 | Nvidia A40 | GDDR6, 96 | 8.6 | 37.42 | 37.42
1 | Nvidia A100 | HBM2, 40 | 8.0 | 77.97 | 19.49
1 | Nvidia A100 | HBM2, 80 | 8.0 | 77.97 | 19.49
1 | Nvidia RTX A6000 | GDDR6, 48 | 8.6 | 38.71 | 38.71
1 | Nvidia A10 | GDDR6, 24 | 8.6 | 31.24 | 31.24
8 | Nvidia A10 | GDDR6, 192 | 8.6 | 31.24 | 31.24
1 | Nvidia H100 | HBM3, 80 | 9.0 | 204.9 | 51.22


Notes
1. With CUDA compute capability 7.0 or higher, using FP16 gives a boost in training speed, depending on the CUDA version and the characteristics of the graphics card itself.
2. If the specification of the graphics card indicates an FP16 to FP32 performance ratio greater than 1:1, then using mixed precision will increase training speed by at least the factor specified. For example, for the Quadro RTX 6000, the FP16 TFLOPS value of 32.62 (2:1) speeds up training by at least two times (2.4 times in practice).

Table 2 - Currently used GPU models and their main characteristics
 

Number of GPUs in use | GPU | VRAM, GB | CUDA | FP16, TFLOPS | FP32, TFLOPS
1 | Quadro RTX 6000 | GDDR6, 24 | 7.5 | 32.62 | 16.31
2 | Quadro RTX 6000 | GDDR6, 48 | 7.5 | 32.62 | 16.31
4 | Quadro RTX 6000 | GDDR6, 96 | 7.5 | 32.62 | 16.31
2 | Nvidia TITAN RTX | GDDR6, 48 | 7.5 | 32.62 | 16.31
4 | Nvidia RTX A4500 | GDDR6, 96 | 8.6 | 23.65 | 23.65
1 | Nvidia GeForce RTX 3090 | GDDR6X, 24 | 8.6 | 35.58 | 35.58
1 | Nvidia GeForce RTX 3070 | GDDR6, 8 | 8.6 | 20.31 | 20.31

* FP16 and FP32 TFLOPS values are taken from the per-GPU specifications.

GPU training and testing process

The models were trained using a set of 18 GPUs. In the course of neural network training, we used numerous language pairs (more than a hundred languages). The GPU tests have helped identify which hardware performs best for specific tasks. During the training of our language pairs, the following neural network parameters were taken as a basis (see the configuration sketch after the list):
 

  • vocab size = 30 000
  • num units = 768
  • layers = 6
  • heads = 16
  • inner dimension = 4 096
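
For reference, here is a minimal sketch of how these parameters could be expressed as a custom OpenNMT-tf model definition. The class name and the choice of embedding size equal to num units are our assumptions, not our exact production configuration; the vocab size is fixed separately when the vocabularies are built.

```python
import opennmt

# Hypothetical OpenNMT-tf model definition mirroring the parameters above.
class BenchmarkTransformer(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.WordEmbedder(embedding_size=768),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=768),
            num_layers=6,        # layers = 6
            num_units=768,       # num units = 768
            num_heads=16,        # heads = 16
            ffn_inner_dim=4096,  # inner dimension = 4,096
        )

model = BenchmarkTransformer
```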


First, let's characterize the GPUs that belonged to the first group, based on Table 1. The time in minutes and seconds spent training the model for 1,000 steps with an effective batch size of 100,000 units will be taken as the basis for comparison.
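
As a sanity check on how the per-GPU batch sizes in Table 3 relate to the 100,000-unit effective batch size, here is the arithmetic we assume (the gradient-accumulation counts are inferred, not measured):

```python
# OpenNMT-tf can reach a target effective batch size by accumulating
# gradients over several mini-batches on each GPU.
def accumulation_steps(effective: int, batch_per_gpu: int, num_gpus: int) -> int:
    return effective // (batch_per_gpu * num_gpus)

print(accumulation_steps(100_000, 6_250, 8))   # 8 x Nvidia A10  -> 2
print(accumulation_steps(100_000, 25_000, 1))  # 1 x Nvidia H100 -> 4
```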

We emphasize that for the first group, the speed measurements were performed with the alignment mechanism enabled and only using FP32. Without this mechanism, training speed on some servers can be much faster.

The alignment mechanism matches substrings in the source and the translated text. It is needed for translating formatted text, such as web pages, where a substring in a sentence may be highlighted in a different font and should be translated with its highlighting preserved.
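
For illustration, word alignments of this kind are commonly written in Pharaoh format, i.e. pairs of source-target token indices. The sentence pair below is a made-up example, not from our training data:

```python
# Pharaoh format: "srcIdx-tgtIdx" pairs linking source and target tokens.
source = "The <b>quick</b> fox".split()       # ['The', '<b>quick</b>', 'fox']
target = "Der <b>schnelle</b> Fuchs".split()  # ['Der', '<b>schnelle</b>', 'Fuchs']
alignment = "0-0 1-1 2-2"
for pair in alignment.split():
    s, t = map(int, pair.split("-"))
    print(source[s], "->", target[t])  # the tagged token maps onto its translation
```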

Taking into account the above-mentioned network parameters, the best time in the first table was shown by the Nvidia H100 GPU with a training time of 22 minutes. An intermediate time was shown by the GeForce RTX 4060 Ti with a training time of 72 minutes, and last place went to the Tesla V100-SXM2 with 140 minutes.

There were also eight Nvidia A10 cards in the test with a training time of 20 minutes and 28 seconds, two Nvidia A40 cards with a time of 56 minutes, and two Tesla V100-SXM cards that clocked in at 86 minutes. Applying multiple cards of the same GPU series simultaneously can speed up model training and show almost the same time as GPUs with higher capacities, but such a technique may not be rational enough financially or procedurally. The training speed measurements can be seen in Table 3.

Table 3 - Training time measurements on the previously used graphics cards
 

Using the alignment mechanism
Effective batch size = 100,000
FP32

Number of GPUs in use | GPU | Approximate speed (min.sec), 1,000 steps | Batch size in use
8 | Nvidia A10 | 20.28 | 6,250
1 | Nvidia H100 | 22 | 25,000
1 | A100 (80 GB) | 40 | 25,000
1 | A100 (40 GB) | 56 | 15,000
2 | Nvidia A40 | 56 | 12,500
1 | RTX A6000 | 68.25 | 12,500
1 | GeForce RTX 4060 Ti | 72 | 4,167
1 | Nvidia A40 | 82.08 | 12,500
2 | Tesla V100-SXM | 86 | 4,167
1 | Nvidia A10 | 104.50 | 5,000
1 | Tesla V100-SXM2 | 140 | 4,167


Next, let's carry out a comparative analysis of the graphics accelerators currently in use (Table 2). For this group, speed measurements were performed using the alignment mechanism, with both FP16 and FP32. The measurements with FP32 and with mixed precision are presented below in Tables 4 and 5, respectively.

Having measured the speed of the GPUs from this table, we can say that first place was taken by the RTX A4500 GPU with a training time of 31 minutes, but it should be emphasized that this speed was obtained by increasing the number of GPUs used to four. With a single such GPU, the training time would be much longer, placing it in the penultimate position in the final table.

The Quadro RTX 6000 with a training time of 47 minutes is in second place. Again, this training speed is conditioned by the number of processors used, which is four. Using only one such GPU would lose about 3.2 times the speed, taking approximately 153 minutes and placing it last.

Third place was taken by the TITAN RTX GPU with a time of 75.85. This score is due to the use of two processors, which reduced the training time of the model.

Among single-GPU configurations, the unquestionable leader in training speed is the GeForce RTX 3090 with a time of 78 minutes and 26 seconds. Increasing the number of these GPUs would accelerate model training further and clearly overtake all the GPU models mentioned above. The model training time measurements can be seen in Table 4.

Table 4 - Comparative analysis of language model training speed on the currently used GPUs (FP32)
 

Using the alignment mechanism
Effective batch size = 100,000
FP32

Number of GPUs in use | GPU | Approximate speed (min.sec), 1,000 steps | Batch size in use
4 | Nvidia RTX A4500 | 31 | 5,000
4 | Quadro RTX 6000 | 47 | 6,250
2 | Nvidia TITAN RTX | 75.85 | 6,250
1 | GeForce RTX 3090 | 78.26 | 6,250
2 | Quadro RTX 6000 | 88 | 6,250
1 | GeForce RTX 3070 | 104.17 | 2,000
1 | Quadro RTX 6000 | 153 | 6,250


The following training speed measurements were performed using FP16. Compared to FP32, half precision reduces the amount of memory consumed during model training and accelerates computation on the GPU, at the cost of lower representation accuracy.
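
For reference, this is how mixed precision can be switched on in TensorFlow, on which OpenNMT-tf is built (a generic sketch, not our exact training setup; OpenNMT-tf also exposes its own mixed-precision option on the command line):

```python
import tensorflow as tf

# Compute in float16 while keeping variables in float32 for numerical
# stability; speedups require a GPU with compute capability 7.0+.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```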

Comparing these measurements against the FP32 times in Table 4, we can say that the training time of the neural network was reduced by almost a factor of two. Based on the performance results, we can also observe that the positions of the GPUs remained largely unchanged. The two-unit Quadro RTX 6000 configuration climbed one position, beating the GeForce RTX 3090 by 96 seconds. The final numbers are shown in Table 5.

Table 5 - Comparative analysis of language model training speed on the currently used GPUs (FP16)
 

Using the alignment mechanism
Effective batch size = 100,000
FP16

Number of GPUs in use | GPU | Approximate speed (min.sec), 1,000 steps | Batch size in use
4 | Nvidia RTX A4500 | 15.81 | 10,000
4 | Quadro RTX 6000 | 20.34 | 12,500
2 | Nvidia TITAN RTX | 32.68 | 6,250
2 | Quadro RTX 6000 | 37.93 | 10,000
1 | GeForce RTX 3090 | 38.89 | 10,000
1 | GeForce RTX 3070 | 48.51 | 2,500
1 | Quadro RTX 6000 | 52.56 | 10,000

Frequently Asked Questions (FAQ)

Is it worth buying a GPU for deep learning?

Buying a GPU for deep learning can significantly enhance training speed and efficiency, making it a worthwhile investment for serious projects. However, the decision should consider factors like budget, specific use cases, and whether cloud solutions might be more cost-effective.

Which GPU is best for deep learning?

The NVIDIA A100 is often considered the top choice for deep learning, offering exceptional performance and memory for large models. For budget-conscious users, the NVIDIA RTX 3090 provides strong capabilities for training models effectively.

Is AMD or NVIDIA better for deep learning?

NVIDIA is generally preferred for deep learning due to its robust software ecosystem, which enhances performance and compatibility with popular frameworks. While AMD GPUs have improved, they still lag behind NVIDIA in terms of optimization and support for deep learning applications.

Does GPU help in NLP?

Yes, GPUs significantly accelerate neural network training in natural language processing (NLP) by handling parallel computations efficiently. This speed boost allows for faster experimentation and iteration, leading to improved model performance and reduced training times.

