Machine learning applications widespread every day in many domains. One of today’s most powerful techniques is the neural network. This technique is employed in many applications such as image recognition, speech analysis and translation, self-driving cars, etc…

In fact such learning algorithms have been known for decades. But only recently it has become mainstream supported by the increase in computation power (GPU) and memory usage (SSD) which allow us to run these algorithms over billions of samples.

Neural network can represent a wide range of complex functions making it an algorithm of choice in many domains. However training such algorithms is complex and it’s only the recent increase in computation power and fast data access that allowed to exploit the full potential of this technique.

The term “neural network” is inspired from the biological domain as it’s supposed to mimic the way the brain works: A neuron receives some electrical signals as input , processes them and produces an output signal that is in turn fed into other neurons….

The biological metaphor stops here and neural network algorithm are just a way to combine simple functions to build more complex ones.

##### Neuron

Let’s dig a bit into what is a neural network and let’s start with exploring just a single neuron.

Basically a neuron is just a function that takes a set of inputs and transforms it into an output value. It can be seen as a little function:

\( y = \sigma(x^\top w + b)\)

where \(x\) is the input vector, \(w\) is the weight matrix and \(sigma\) is a non-linear function allowing us to model non-linear functions.

\(\sigma\) is called the activation function.

In some cases \(\sigma\) can be the identity function but in these case our neuron applies a linear transformation of the input and a network of such neurons is a linear transformation as well. In fact we need \(\sigma\) to be non linear to be able to model non-linear functions.

Typically \(\sigma\) is often the ReLU (Rectified Linear Unit) function or the sigmoid or softmax function.

- The
**ReLU**function is defined as \(y = 0\) when \(x < 0\), \(y = x\) when \(x >= 0\). - The
**sigmoid**function is defined as \(y = \frac{1}{1 + e^{-x}}\) - The
**softmax**function is defined as \(y_i = \frac{e^x_i}{\sum\limits_{m} e^x_j}\). The interesting fact about softmax is that the output components sum up to 1 allowing us to see the output as a probability.

###### Modelling the XOR function

The XOR function is a function that takes 2 inputs x1, x2 and return an output y. x1, x2 and y are either 0 or 1. When x1 and x2 are the same it returns 0 and when they’re different it returns 1.

x1 | x2 | y | |
---|---|---|---|

0 | 0 | 0 | |

0 | 1 | 1 | |

1 | 0 | 1 | |

1 | 1 | 0 |

If we try to model the XOR function using only linear transformation \(y = w_1 x_1 + w_2 x_2 + b\)we will find out that \(w = [0 0]^\top\) and \(b = 0.5\). Which means that y is constant for any value of x (\(y = 0.5\)).

If we want to model the XOR function we need to introduce a non linear function \(\sigma\). Let’s pick the ReLU function where \(y = 0\) when \(x < 0\) and \(y = x\) when \(x >= 0\).

We still can’t model the XOR function with a single neuron, so let’s introduce a new one neuron in the input layer and another one in the output layer such that our network looks like this:

Each column of neurons is called a layer. The first layer is the input layer and the last layer is the output layer. If a neural network has more than 2 layers, every layer between the input and output layer is called hidden layer. It’s called hidden because we have can’t see what its output is.

Now our input layer will transform our input value x into an output value h which becomes the input of the last layer.

With \(W = \begin{bmatrix}

1 & 1 \\

1 & 1

\end{bmatrix}\) and \(b = \begin{bmatrix}

0 \\

-1

\end{bmatrix}\) and \(X = \begin{bmatrix}

0 & 0 \\

0 & 1 \\

1 & 0 \\

1 & 1

\end{bmatrix}\) (X is the set of all possible inputs) we can see that \(h = x^\top w + b = \begin{bmatrix}

0 & -1 \\

1 & 0 \\

1 & 0 \\

2 & 1

\end{bmatrix}\)

Applying the ReLU function gives \(\begin{bmatrix}

0 & 0 \\

1 & 0 \\

1 & 0 \\

2 & 1

\end{bmatrix}\).

This matrix becomes the input of the output layer. With \(W = \begin{bmatrix}

1 \\

-2

\end{bmatrix}\) and \(b = 0\) it gives the output \(y = \begin{bmatrix}

0 \\

1 \\

1 \\

0

\end{bmatrix}\). Which is exactly what we wanted to achieve.

##### Network topology

A general network is defined by :

- \(L\) the number of layers (\(L-2\) hidden layers)
- \(N_i\) the number of neurons in layer i
- \(a^{(j)}_i\) the output of the neuron j in layer i
- \(\sigma_i\) the activation function used in layer i
- \(J\) the cost function used to train the network

The output of each layers forms the input of the next layer, therefore we can define \(A_{i+1}\) in terms of \(A_i\).

\( A_i = \begin{bmatrix}

a^{(0)}_i & a^{(1)}_i & \cdots & a^{(N_i)}_i

\end{bmatrix}\) of dimension \( 1 \times (N_i + 1) \)

\( W^{(i)} = \begin{bmatrix}

W^{(i)}_{01} & W^{(i)}_{11} & \cdots & W^{(i)}_{N_i 1} \\

W^{(i)}_{02} & W^{(i)}_{12} & \cdots & W^{(i)}_{N_i 2} \\

\vdots & \vdots & \ddots & \vdots \\

W^{(i)}_{0N_{i+1}} & W^{(i)}_{1N_{i+1}} & \cdots & W^{(i)}_{N_i N_{i+1}}

\end{bmatrix}\) of dimension \(N_{i+1} \times N_i + 1\) (\(N_i\) dimensions + 1 for the bias unit)

\(A_{i+1} = \sigma(A_i W^{(i)})\) of dimension \(1 \times N_{i+1}\)

##### Learning

This is great but so far we just “guess” the values which wasn’t too hard for this simple XOR example. In practice it’s not that easy we can have millions of training samples and many hidden layers such that it’s not possible for a human being to guess the values.

We need the machine to learn the correct values for W and b and for that we need a way to measure how good our model performs. This is where the cost function comes in. It’s purpose it’s to tell us how far we are from the expected outcome.

We can use the MSE (Means square error) but in neural network the negative log likelihood gives better results especially when the model uses the sigmoid or softmax activation function (The idea is that the log function “cancels” the exponential function).

We then iterate over the following steps to find learn the model parameters (i.e. the weights):

- Compute the output of the model using the forward propagation
- Compute the loss of the model
- Back propagate the error in the model
- Adjust the weights using gradient descent

###### Back propagation

The back propagation is what allows us to compute the gradient for each neuron in the network and adjust it’s weights accordingly. This is probably the most tricky part of this algorithm.

In fact the tricky part is to compute the gradient for every single node which involves computing the derivative of the cost of a node. And as the cost of a node is a composition of multiple functions this can become very tedious.

To achieve this we rely on the chain rule of derivation

\(\frac{\partial z}{\partial w} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \frac{\partial x}{\partial w}\)

If we then represents the cost function as a computational graph:

We can compute the derivative of the cost function starting from the output layer and going back all the way to the input layer, adjusting the weight as we go.

Let’s call \(G^{(2)}\) the gradient on \(U^{(2)}\) of the cross entropy operation. Then the back propagation will add \(H^\top G^{(2)}\) to the gradient on \(W^{(2)}\).

Then the back propagation algorithm continues walking the graph and computes \(\nabla_H J = G^{(2)} W^{(2)\top}\).

The next operation is the ReLU operation. The back propagation nulls out components of the gradient corresponding to entries in \(U^{(1)}\) that are less than 0. Let’s call this result \(G^{(1)}\).

Finally we reach \(W^{(1)}\) where the back propagation of the matrix multiplication operations yields to \(X^\top G^{(1)}\).

Once we know the gradient of a node we can adjust the weights by applying gradient descent (or any other algorithm) at that node.

Once all the weights have been updated we recompute the cost using the forward propagation, and then re-adjust the weights using the back propagation, ….