
I've seen discussions about the 'overhead' of a GPU, and that for 'small' networks it may actually be faster to train on a CPU (or a network of CPUs) than on a GPU.

What is meant by 'small'?

For example, would a single-layer MLP with 100 hidden units be 'small'?

Does our definition of 'small' change for recurrent architectures?

Are there any other criteria that should be considered when deciding whether to train on CPU or GPU?

EDIT 1:

I just found a blog post (possibly outdated? It's from 2014):

"...Most network card[s] only work with memory that is registered with the CPU and so the GPU to GPU transfer between two nodes would be like this: GPU 1 to CPU 1 to Network Card 1 to Network Card 2 to CPU 2 to GPU 2. What this means is, if one chooses a slow network card then there might be no speedups over a single computer. Even with fast network cards, if the cluster is large, one does not even get speedups from GPUs when compared to CPUs as the GPUs just work too fast for the network cards to keep up with them.

This is the reason why many big companies like Google and Microsoft are using CPU rather than GPU clusters to train their big neural networks. "

So at some point, according to this post, it could have been faster to use CPUs. Is this still the case?

EDIT 2: Yes, that blog post may very well be outdated because:

Now it seems that GPUs within a node are connected via the PCIe bus, so communication can happen at about 6 GiB/s. (For example: https://www.youtube.com/watch?v=el1iSlP1uOs, about 35 minutes in.) The speaker implies that this is faster than going from GPU1 to CPU to GPU2. It would mean the network card is no longer the bottleneck.

StatsSorceress

3 Answers


Unlike some of the other answers, I would strongly advise against always training on GPUs without a second thought. That advice is driven by the use of deep learning methods on images and text, where the data is very rich (e.g. a lot of pixels = a lot of variables) and the model similarly has many millions of parameters. For other domains, this might not be the case.

What is meant by 'small'? For example, would a single-layer MLP with 100 hidden units be 'small'?

Yes, that is definitely very small by modern standards. Unless you have a GPU well suited to training (e.g. an NVIDIA GTX 1080 or an NVIDIA Titan), I wouldn't be surprised to find that your CPU was faster.

Note that the complexity of your neural network also depends on the number of input features, not just the number of units in your hidden layer. If your hidden layer has 100 units and each observation in your dataset has 4 input features, then your network is tiny (~400 parameters). If each observation instead has 1M input features, as in some medical/biotech contexts, then your network is pretty big in terms of the number of parameters. For the remainder of my answer I'm assuming you have relatively few input features per observation.
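As a rough back-of-the-envelope sketch (my own illustration, not something from the question), here is how the parameter count of a single-hidden-layer MLP scales with the number of input features:

```python
def mlp_param_count(n_inputs, n_hidden, n_outputs=1):
    """Weights + biases of a single-hidden-layer MLP."""
    hidden_layer = n_inputs * n_hidden + n_hidden
    output_layer = n_hidden * n_outputs + n_outputs
    return hidden_layer + output_layer

print(mlp_param_count(4, 100))          # 601  -- a few hundred parameters: tiny
print(mlp_param_count(1_000_000, 100))  # ~100M parameters: quite big
```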

One good example I've found of comparing CPU vs. GPU performance was when I trained a poker bot using reinforcement learning. For reinforcement learning you often don't want that many layers in your neural network, and we found that we only needed a few layers with few parameters. Moreover, the number of input features was quite low. Initially I trained on a GPU (NVIDIA Titan), but it was taking a long time since reinforcement learning requires a lot of iterations. Luckily, I found that training on my CPU instead made training roughly 10x faster! This is just to say that CPUs can sometimes be better for training.
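To make this kind of comparison concrete, here is a minimal timing sketch. It assumes PyTorch (the answer doesn't name a framework) and the network sizes are made up, so treat it as illustrative only:

```python
import time
import torch
import torch.nn as nn

def time_training(device, n_features=16, n_hidden=100, n_steps=500, batch_size=64):
    """Roughly time a few hundred training steps of a tiny MLP on the given device."""
    model = nn.Sequential(
        nn.Linear(n_features, n_hidden), nn.ReLU(), nn.Linear(n_hidden, 1)
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(batch_size, n_features, device=device)
    y = torch.randn(batch_size, 1, device=device)

    start = time.time()
    for _ in range(n_steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
    return time.time() - start

print("CPU:", time_training(torch.device("cpu")))
if torch.cuda.is_available():
    print("GPU:", time_training(torch.device("cuda")))
```

On a model this small, the per-step kernel launch and host-to-device overhead can easily outweigh the arithmetic the GPU saves, which is why the CPU can come out ahead.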

Are there any other criteria that should be considered when deciding whether to train on CPU or GPU?

It's important to note that while on a GPU you generally want to fill up the entire GPU memory by increasing your batch size, that is not the case on the CPU. On the CPU, an increase in batch size will increase the time per batch. Therefore, if it's important for you to have a very large batch size (e.g. due to a very noisy signal), it can be beneficial to use a GPU. I haven't experienced this in practice though, and normally small batch sizes are preferred.
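As a rough illustration of the batch-size point (again a sketch with made-up sizes, assuming PyTorch), time per batch on the CPU grows roughly with batch size, whereas a GPU is underutilized at small batches:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 100), nn.ReLU(), nn.Linear(100, 1))

# On the CPU, time per batch grows roughly linearly with batch size.
# On a GPU, small batches underutilize the device, so increasing the
# batch size is often nearly free until memory runs out.
for batch_size in (32, 256, 2048):
    x = torch.randn(batch_size, 16)
    start = time.time()
    for _ in range(100):
        model.zero_grad()
        model(x).sum().backward()
    print(f"batch={batch_size}: {time.time() - start:.3f}s for 100 steps on CPU")
```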

pir

The CPU is like the branch manager: it can do a bit of everything, but it is not great at much except delegating tasks. The GPU, however, is a dedicated mathematician hiding in your machine. If you are doing any math-heavy processing, then you should use your GPU. Always.

If you are using any popular programming language for machine learning, such as Python or MATLAB, it is often a one-liner of code to tell your computer that you want the operations to run on your GPU.
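For example, in PyTorch (one framework among several; the answer doesn't name one), moving the model to the GPU is essentially that one-liner:

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(10, 1).to(device)  # the "one-liner": .to(device)
x = torch.randn(32, 10, device=device)     # data must live on the same device
y = model(x)
```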

You should also make sure to use all the cores of your machine. This means making use of parallel computing. Especially for neural networks, where many operations can be done independently, this is going to increase your speed immensely.
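A sketch of what that can look like in PyTorch (my assumption; other frameworks have equivalent knobs):

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Let CPU tensor operations use all available cores.
torch.set_num_threads(os.cpu_count() or 1)

# Load and preprocess batches in parallel worker processes.
dataset = TensorDataset(torch.randn(10_000, 16), torch.randn(10_000, 1))
loader = DataLoader(dataset, batch_size=64, num_workers=4, shuffle=True)
```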

JahKnows

I'll first reference some quotes from similar questions:

When it comes to matrix operations, you don't think twice, you always opt for GPUs. source

The parallel architecture in a GPU is well adapted for vector and matrix operations. source

So if you read through these questions, you'll see that they advise using a GPU regardless of the case; it will always provide some improvement.

The reason you may have read that 'small' networks should be trained with the CPU is that implementing GPU training for just a small network might take more time than simply training with the CPU - that doesn't mean the GPU will be slower.

A 100-hidden-unit network is kind of small; I'd call it a small network relative to the big deep networks out there. Recurrent architectures (mostly) have more synapses than feedforward networks, so a 100-hidden-unit RNN is 'bigger' than a 100-hidden-unit FFN.
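To make 'bigger' concrete, here is a rough parameter count (my own sketch) for a vanilla RNN layer versus a feedforward layer with the same 100 hidden units:

```python
def ffn_layer_params(n_in, n_hidden):
    # input-to-hidden weights plus biases
    return n_in * n_hidden + n_hidden

def rnn_layer_params(n_in, n_hidden):
    # input-to-hidden weights, hidden-to-hidden (recurrent) weights, and biases
    return n_in * n_hidden + n_hidden * n_hidden + n_hidden

print(ffn_layer_params(4, 100))  # 500
print(rnn_layer_params(4, 100))  # 10,500 -- the recurrent weights dominate
```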

Thomas Wagenaar