Neural network hardware considerations

This post presents some principles to consider when choosing the hardware that will run neural network computations, either for training models or for making predictions with an existing model (i.e. inference).

Need more cores

CPUs are usually considered not to perform well enough when it comes to neural network computation, and they are now outperformed by GPUs.

CPUs run at higher clock speeds than GPUs, but they are not designed to perform many parallel operations simultaneously, which is precisely what GPUs are made for.

GPUs are built to perform many operations in parallel quickly. They can perform basic independent operations without branching, such as vector multiplication, addition, … Basically they favour a high degree of parallelism and high memory throughput over clock speed and branching capabilities.

GPUs fit nicely with neural network computation: like images, feature vectors are usually too big to fit into a CPU cache, so high memory throughput is a killer feature. And each neuron’s computation is independent of the others, which limits branching.
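
As a rough illustration (a NumPy sketch, not GPU code; the sizes are made up), a dense layer applied to a batch of feature vectors is one large matrix product in which every output neuron is an independent dot product, so all of them can be computed in parallel:

```python
import numpy as np

batch = np.random.rand(64, 1024).astype(np.float32)     # 64 feature vectors of size 1024
weights = np.random.rand(1024, 512).astype(np.float32)  # a layer with 512 neurons
bias = np.zeros(512, dtype=np.float32)

# Each of the 512 output neurons is an independent dot product: no branching,
# lots of memory traffic, exactly the workload a GPU is built for.
activations = np.maximum(batch @ weights + bias, 0.0)   # linear layer + ReLU
```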

GPU usage for neural network computation really exploded with the advent of GP-GPU, or General-Purpose GPU computing, which allows arbitrary code to run on the GPU and not only rendering subroutines. The most famous platform is NVIDIA’s CUDA, which allows GPU code to be written in a C-like syntax.

GPUs and CPUs have different designs, and that is reflected in the way we program them. CPU code tries to optimise cache usage to avoid memory accesses. GPU code doesn’t have the same caching capabilities, and it may be faster to compute a value twice than to read the previous result from memory. Writing code optimised for GPUs is therefore not trivial.
For example, memory accesses need to be coordinated among threads in order to improve efficiency (several threads’ memory accesses are grouped into a single operation). This is why threads are grouped into warps, and every thread in a warp performs the same operation.
This is how high parallelism is achieved, but if threads need to perform different operations, those operations must be executed sequentially (because of the lack of branching capabilities).
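
To make this concrete, here is a minimal sketch of such a kernel using Numba’s CUDA support (an assumption, the post doesn’t depend on Numba): every thread handles one vector element, so the threads of a warp execute the same instruction and read neighbouring memory locations.

```python
import numpy as np
from numba import cuda

@cuda.jit
def vector_add(a, b, out):
    i = cuda.grid(1)              # absolute index of this thread
    if i < out.size:              # guard: the grid may be larger than the array
        out[i] = a[i] + b[i]      # neighbouring threads read neighbouring elements

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256                                    # a multiple of the warp size (32)
blocks = (n + threads_per_block - 1) // threads_per_block
vector_add[blocks, threads_per_block](a, b, out)           # Numba copies the arrays to/from the GPU
```
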
Fortunately we don’t have to write GPU-specific code. Instead we can use a library or framework that provides primitive operations (convolution, matrix multiplication, …) and write our models using these primitives. This is what Theano, Torch, TensorFlow, … provide. These libraries make the models hardware independent, as they can be configured to run on either CPU or GPU.
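
For instance, with PyTorch (a descendant of Torch; the layer sizes below are made up) the same model definition runs on a CPU or a GPU depending on a single device choice:

```python
import torch
import torch.nn as nn

# The model definition itself is hardware independent.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

# Pick the device at runtime: the same code runs on a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

x = torch.randn(32, 784, device=device)   # a batch of 32 feature vectors
logits = model(x)                          # the forward pass runs on the chosen device
```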

Need more machines

Using GPUs improves things, but it still happens that the resources of a single machine are insufficient. In that case we need to distribute the workload across several machines.
There are two aspects to consider when distributing the workload:
– data parallelism: each machine works on a different data point
– model parallelism: each machine works on a different part of the model
The main algorithm used to train a neural net is gradient descent. It is a sequential algorithm (each iteration depends on the previous one) and is therefore difficult to parallelise. However it scales to large amounts of data (both in terms of features and samples), and some of its variants are easier to parallelise.
E.g. stochastic gradient descent can be run with a shared memory (or a central parameter server) storing the parameters. In this setting several cores (or nodes) read the parameters from the shared memory, compute a gradient step and then update the parameters. All parameter reads/writes are lock free, which means that the work of some cores is occasionally overwritten by others, but overall progress is made.
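
Here is a minimal single-machine sketch of this lock-free scheme (in the spirit of Hogwild!), assuming a linear model trained with squared error; several processes update a shared parameter vector without any lock:

```python
import numpy as np
from multiprocessing import Array, Process

def worker(shared_w, X, y, lr, steps):
    w = np.frombuffer(shared_w, dtype=np.float64)   # view on the shared buffer: no copy, no lock
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(y))                    # one sample at a time (stochastic GD)
        grad = (X[i] @ w - y[i]) * X[i]             # gradient of the squared error on sample i
        w -= lr * grad                              # lock-free write: may race with other workers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = X @ rng.normal(size=10)

    shared_w = Array("d", 10, lock=False)           # parameters shared by every worker
    procs = [Process(target=worker, args=(shared_w, X, y, 0.01, 20000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(np.frombuffer(shared_w, dtype=np.float64))
```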

Infer on small devices

Training a model can usually be done on a big machine (or a cluster of machines). However, running the model is often done on much smaller devices (e.g. a mobile phone), in which case we need a smaller model than the one we trained. This process is called model compression.
A model is sometimes actually made of several models (model composition). Running several models is expensive, so we need to come up with a single model that generalises well enough (e.g. we can achieve this by training a larger model, with more layers, and making it generalise with dropout).
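
For instance (a PyTorch sketch; the layer sizes and dropout rate are illustrative), the single model is simply a larger network with dropout between its layers:

```python
import torch.nn as nn

# Instead of running an ensemble of small models at inference time, train one
# larger network regularised with dropout so that it generalises comparably.
single_model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)
# At inference time, single_model.eval() disables dropout and only this one
# network has to run on the device.
```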

Break it down

A big model can be broken down into several smaller models.

Let’s consider the case where we need to detect relatively rare events. We can build a cascade of small models that are cheap to run. The first models evict the events we are sure are negative (these models have a high recall), and only the remaining events go through the high-precision models.

E.g. if we want to detect fraud, most transactions are not fraudulent, so it’s good to evict most of the events early on using cheap computations. Only the potentially fraudulent transactions have to go through the whole model cascade.
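
A sketch of such a two-stage cascade (the .score interface and the threshold value are assumptions made for the illustration):

```python
import numpy as np

def cascade_predict(events, cheap_model, precise_model, threshold=0.1):
    """Two-stage cascade: the cheap, high-recall model evicts the events it is
    confident are negative; only the survivors reach the expensive model.
    (Both models are assumed to expose a .score(events) method returning the
    probability that each event is positive.)"""
    cheap_scores = cheap_model.score(events)      # fast first pass over everything
    survivors = cheap_scores >= threshold         # keep anything remotely suspicious
    final = np.zeros(len(events))
    if survivors.any():
        final[survivors] = precise_model.score(events[survivors])
    return final
```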

It’s also worth noting that the overall precision of the cascade might be greater than the precision of the best model in the cascade.

Depending on the events, or on the outcome of a first neural net, we may choose to apply different models or use different parameters, a process similar to a decision tree.

Finally, it’s possible to achieve good results (especially for inference) using fixed-point numbers and/or a smaller number of bits (8 or even 6 bits give good enough results), while consuming much less power on a mobile device.
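
As an illustration (a NumPy sketch of a simple symmetric 8-bit quantisation; real deployments use more elaborate schemes), the weights are mapped onto small integers plus a single scale factor:

```python
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0                # one scale factor per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(128, 64).astype(np.float32)
q, scale = quantize_int8(w)
print("max absolute error:", np.abs(dequantize(q, scale) - w).max())
```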