Today, to conclude my series on neural networks, I am going to write down some guidelines and a methodology for developing, testing and debugging a neural network.

As we will see (or as you may have already experienced), implementing a neural network is tricky, and there is often a thin line between failure and success: between something that works great and something that makes absurd predictions.

The number of things we need to adjust is large: from choosing the right algorithm, to tuning the model hyper-parameters, to improving the data, …

So we need a good methodology and a solid understanding of how our model works and of the impact of each of its parameters.

So let’s start with the methodology. We implement and improve our neural network by iterating over these 4 simple steps:

- Define your goals:
  - which error metric to use?
  - what is the target value for your error metric?
- Create an end-to-end pipeline early
  - the pipeline should be able to estimate the error metric
- Instrument each component in order to identify bottlenecks
- Iterate:
  - add new data
  - adjust hyper-parameters
  - try new algorithms based on instrumentation

Let’s dig into each of these steps in more detail.

### Define your goals

Before starting any implementation we need to be clear on what we’re trying to achieve. That’s why we need to define our error metric. Accuracy seems an obvious one but there might be more relevant metrics depending on what you’re trying to do.

For instance, if you’re trying to detect something that occurs very rarely, accuracy is not a good fit (you can achieve very good accuracy by predicting that the event never occurs). In this case it’s more common to measure the precision (the fraction of detections that are correct) and the recall (the fraction of events that were detected), and to plot them or combine them into an F-score. (Check this post on the confusion matrix as a reminder.)
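To make this concrete, here is a minimal sketch in plain Python. The function name `precision_recall_f1` and the toy labels are made up for illustration, and returning 0 when a ratio is undefined is a convention I chose, not a standard. It shows a classifier that never predicts a rare event: its accuracy looks great while precision, recall and F1 are all zero.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = event, 0 = no event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correct detections
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed events
    # Assumption: undefined ratios (no detections / no events) are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A rare event: only 2 positives out of 100 examples.
y_true = [1, 1] + [0] * 98
never = [0] * 100  # a "model" that always predicts "no event"

accuracy = sum(t == p for t, p in zip(y_true, never)) / len(y_true)
print(accuracy)                            # 0.98
print(precision_recall_f1(y_true, never))  # (0.0, 0.0, 0.0)
```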

Also keep in mind that not all errors are equal. Incorrectly classifying a spam message as “not spam” is annoying for the user, but incorrectly marking a valid message as spam is more problematic as the user might not see it. The same goes for disease detection, … In such cases we want to weight each type of misclassification according to its cost.
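One simple way to express this is a cost table indexed by (true label, predicted label). The sketch below is only an illustration: the cost values (1 and 5) and the name `weighted_error` are assumptions of mine, and in practice the costs come from your application.

```python
# Hypothetical costs: keys are (true label, predicted label).
# Missing a spam is annoying (cost 1), hiding a valid message is worse (cost 5).
COST = {
    ("spam", "ham"): 1.0,
    ("ham", "spam"): 5.0,
}


def weighted_error(y_true, y_pred, cost=COST):
    """Average misclassification cost; correct predictions cost nothing."""
    total = sum(cost.get((t, p), 0.0) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


y_true = ["spam", "ham", "ham", "spam"]
y_pred = ["ham", "ham", "spam", "spam"]
print(weighted_error(y_true, y_pred))  # (1 + 0 + 5 + 0) / 4 = 1.5
```

Two models with the same accuracy can have very different weighted errors, which is exactly why the metric has to be chosen before comparing them.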

Another interesting metric is coverage. Let’s say that our system refuses to make a decision when its confidence is below a given threshold (e.g. no decision if the predicted probability is below 90%). Our goal might then be to classify a given portion of our dataset, e.g. we want to be able to classify 95% of the data. This is our coverage.
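Here is a small sketch of how coverage could be measured, together with the accuracy on the examples the system accepts to classify. The function name, the confidences and the 0.9 threshold are illustrative assumptions.

```python
def coverage_and_accuracy(confidences, correct, threshold):
    """Coverage = fraction of examples for which a decision is made.

    `confidences` holds the model confidence per example and `correct`
    says whether the prediction would have been right.
    """
    decided = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(decided) / len(confidences)
    # Accuracy is measured only on the examples we agreed to classify.
    accuracy = sum(decided) / len(decided) if decided else 0.0
    return coverage, accuracy


confidences = [0.99, 0.95, 0.80, 0.92, 0.60]
correct = [True, True, False, True, False]
print(coverage_and_accuracy(confidences, correct, threshold=0.9))  # (0.6, 1.0)
```

Lowering the threshold increases the coverage but usually lowers the accuracy on the decided examples, so the two are tuned together.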

Once the error metric is chosen it is equally important to set a target value (95% coverage, …). This value often depends on the domain of your application. You may find reference values in academic papers or by comparing with human performance on the same task.

### Create an end-to-end pipeline

Here is some advice on how to pick default algorithms to get started with a reasonable setup.

The first thing is to evaluate the complexity of your problem. Can it be solved using a simple tool like logistic regression or does it require deep learning?

If it requires deep learning, we need to choose the general category of network:

- Feed-forward (fully connected): for classifying fixed size input vectors
- Convolutional (with ReLU units): for classifying topological data (e.g. images)
- Recurrent (LSTM or GRU): for classifying sequential data (e.g. time series).

SGD is a good optimisation algorithm that fits many cases. It is especially efficient when combined with momentum and a decaying learning rate.
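As an illustration, here is a minimal sketch of SGD with momentum and a decaying learning rate, fitting a straight line with plain Python. Everything in it is an assumption chosen for the example: the `1 / (1 + decay * step)` schedule, the default values and the noise-free toy data. A real network would rely on the optimiser of its framework.

```python
import random


def sgd_momentum(data, lr0=0.05, decay=0.001, momentum=0.9, epochs=200, seed=0):
    """Fit y = w * x + b with per-sample SGD, momentum and learning-rate decay."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0    # parameters
    vw, vb = 0.0, 0.0  # velocities (the momentum terms)
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new order each epoch
        for x, y in data:
            lr = lr0 / (1.0 + decay * step)  # decaying learning rate
            err = (w * x + b) - y
            gw, gb = err * x, err            # gradient of 0.5 * err ** 2
            vw = momentum * vw - lr * gw     # accumulate past gradients
            vb = momentum * vb - lr * gb
            w, b = w + vw, b + vb
            step += 1
    return w, b


# Toy data generated from y = 2x + 1, without noise.
data = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(-10, 11)]
w, b = sgd_momentum(data)
print(round(w, 3), round(b, 3))  # close to 2 and 1
```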

Regularisation can also be introduced right from the start. Early stopping and dropout (or even batch normalisation in some cases) are efficient techniques.
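Early stopping is easy to sketch: keep track of the best validation loss and stop once it has not improved for a few epochs. The `patience` parameter and the list of validation losses below are made up for the example. In a real training loop the losses would be computed one epoch at a time and the weights of the best epoch would be saved.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping after `patience` epochs
    without any improvement of the validation loss."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_loss


# Hypothetical validation losses, one per epoch.
val_losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.4]
print(early_stopping_epoch(val_losses))  # (2, 0.6)
```

Note that training stops before the last value (0.4) is ever seen: a larger patience costs more computation but is less likely to stop too early.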

### Improve your model

#### Add more data

If your model doesn’t perform well enough you might be tempted to try another algorithm. But if you started with a reasonable default model, the first thing to check is whether we need (and can get) more data.

If the performance on the training data is poor it means the model doesn’t make good use of the existing data so there is no need to add more data yet. Instead we should increase the size of the model (add more layers and/or more units per layer) or experiment with different learning rates.

If it doesn’t help and you now have a large model with tuned hyper-parameters (learning rate, regularisation, …) then the problem is probably the quality of the data. In this case we’re better off starting over and collecting cleaner data or data with a richer set of features.

If the performance on the training set is acceptable, we need to look at the performance on the test set.

If they’re both acceptable … congratulations, you’re done. Otherwise if the test performance is worse than the training performance, gathering more data is probably the best thing to do.
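The decision procedure above can be summarised in a few lines. This is only a sketch of the reasoning: the function name, the returned strings and the idea of a single `target_error` are simplifications of mine.

```python
def next_step(train_error, test_error, target_error):
    """Suggest what to try next from the training and test errors."""
    if train_error > target_error:
        # The model does not even fit the data it has seen: more data won't help yet.
        return "increase the model size or tune the learning rate"
    if test_error <= target_error:
        return "done"
    # Good on the training set but worse on the test set.
    return "gather more data"


print(next_step(train_error=0.20, test_error=0.25, target_error=0.05))
print(next_step(train_error=0.03, test_error=0.15, target_error=0.05))
print(next_step(train_error=0.03, test_error=0.04, target_error=0.05))
```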

In order to estimate how much data is needed you can plot the relationship between training set size and generalisation error. The amount of data needed for a given improvement usually grows exponentially, so experiment with training set sizes on a logarithmic scale: try to double the data (if possible) for every experiment.
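If you have measured the error for a few training set sizes, you can roughly extrapolate how much data a target error would require. The sketch below assumes the error follows a power law `error = a * n ** (-b)`, which is a modelling assumption and not something guaranteed for your problem. The errors used here are synthetic, generated from that same law, so they only demonstrate the mechanics.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope


def size_for_target(a, b, target_error):
    """Training set size at which the fitted curve reaches `target_error`."""
    return (a / target_error) ** (1.0 / b)


# Training set doubled at every experiment; synthetic errors (not real results).
sizes = [1000, 2000, 4000, 8000]
errors = [2.0 * n ** -0.5 for n in sizes]

a, b = fit_power_law(sizes, errors)
needed = size_for_target(a, b, target_error=0.01)
print(round(a, 3), round(b, 3), round(needed))  # about 2.0, 0.5 and 40000
```

Treat the extrapolated size as an order of magnitude at best: real learning curves flatten out and rarely follow a clean law.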

#### Tune the hyper-parameters

Hyper-parameters can influence both the model effectiveness and the required computation resources to train and run the model.

E.g. adding more units to the model increases the model capacity but also increases the computational resources needed to run and train the model.

There are 2 possible approaches to tune the hyper-parameters:

- Manual tuning
- Automatic tuning

##### Manual Tuning

Manual tuning requires a good understanding of the relationships between the hyper-parameters and training error, generalisation error and computing resources.

The goal is to adjust the effective capacity of the model to match the complexity of the problem.

The representational capacity of the model should be big enough (a model with more units can represent more complex functions).

The learning algorithm should be able to find good functions to minimise the training error. E.g. weight decay prevents the learning algorithm from exploring some of the possible functions.

The generalisation error as a function of one of the hyper-parameters usually follows a U-shaped curve. On one side the hyper-parameter value reduces the capacity of the model (it can’t find a good enough function: it underfits), and on the other side the model performs very well on the training data but fails to generalise (it overfits). The sweet spot lies in between, at the bottom of the U.

The learning rate is probably the first parameter to tune as it controls the effective capacity of the model in a more complex way than any other hyper-parameter, and the error as a function of the learning rate also follows a U-shaped curve.

To tune the other parameters it is useful to monitor both the training error and the test error to determine whether the model is underfitting or overfitting.

The goal is to first minimise the training error so that the test error consists mainly of the generalisation error. Then try to reduce the generalisation error without increasing the training error too much.

Neural networks require much more hyper-parameter tuning than other algorithms such as SVMs or logistic regression. Manual tuning works great when one has a good enough starting point (from previous work on the same problem or from years of experience in the domain).

##### Automatic tuning

Hyper-parameter tuning can be considered as an optimisation problem where we try to find the values of the hyper-parameters that minimise the validation error.

###### Grid search

When there are just a few hyper-parameters to optimise, grid search is a simple and common approach.

The idea is to select a small set of values to explore for each hyper-parameter, try all possible combinations of these values and pick the best one (values are usually picked on a logarithmic scale).

The drawback is that it’s costly: the number of combinations grows exponentially with the number of hyper-parameters, so it can’t be used for more than a few of them.
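A grid search fits in a few lines of Python. In the sketch below `toy_validation_error` is a made-up stand-in for "train the model and return its validation error", and the grid values are arbitrary. Only the search loop is the point.

```python
import itertools
import math


def grid_search(evaluate, grid):
    """Try every combination of the values in `grid` and keep the best one."""
    names = list(grid)
    best_params, best_error = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_validation_error(learning_rate, weight_decay):
    # Stand-in for a real training run: lowest at lr = 1e-2, weight decay = 1e-4.
    return ((math.log10(learning_rate) + 2) ** 2
            + (math.log10(weight_decay) + 4) ** 2)


# Values picked on a logarithmic scale: 4 x 3 = 12 training runs.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "weight_decay": [1e-5, 1e-4, 1e-3],
}
best_params, best_error = grid_search(toy_validation_error, grid)
print(best_params)  # {'learning_rate': 0.01, 'weight_decay': 0.0001}
```

With 5 hyper-parameters and 4 values each this would already be 1024 training runs, which is where the approach stops being practical.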

###### Random search

Random search is similar to grid search but instead of selecting the values manually we define a marginal distribution for each hyper-parameter and pick the values randomly from the distributions.

This approach is usually more efficient as, for the same number of trials, it tries out more distinct values of each hyper-parameter than grid search does.
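Here is a sketch of random search where each hyper-parameter is drawn from a log-uniform distribution. The ranges, the number of trials and `toy_validation_error` (a stand-in for a real training run) are assumptions made for the example.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def random_search(evaluate, space, n_trials, seed=0):
    """Sample `n_trials` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample_log_uniform(rng, low, high)
                  for name, (low, high) in space.items()}
        error = evaluate(**params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_validation_error(learning_rate, weight_decay):
    # Stand-in for a real training run: lowest at lr = 1e-2, weight decay = 1e-4.
    return ((math.log10(learning_rate) + 2) ** 2
            + (math.log10(weight_decay) + 4) ** 2)


# One (low, high) range per hyper-parameter.
space = {"learning_rate": (1e-5, 1e-1), "weight_decay": (1e-6, 1e-2)}
best_params, best_error = random_search(toy_validation_error, space, n_trials=50)
print(best_params, best_error)
```

Each of the 50 trials uses a different learning rate and a different weight decay, whereas a grid with the same budget would only test about 7 distinct values of each.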

###### Model-based

In the model-based approach we treat the validation error as a function of the hyper-parameters and optimise it directly. We can follow the traditional gradient-based approach when it’s possible to compute a gradient of the validation error with respect to the hyper-parameters.

In many cases the gradient is not available, and a Bayesian regression model is often used to estimate both the expected value of the validation error and the uncertainty around this expectation.

Automatic tuning can be costly in terms of required resources. However it is often a useful first step to obtain a good starting point for manual tuning.

#### Debug your model

Debugging is notoriously difficult in machine learning as it’s not obvious whether it’s the algorithm that performs poorly or there’s a bug in the implementation.

Machine learning is an adaptive process, so if one component is buggy the downstream components can adapt and still achieve roughly acceptable performance.

For all these reasons we need a good strategy to debug a machine learning pipeline. Here are some guidelines:

- design a simple case for which we know the expected outcome
- exercise each part of the network in isolation
- visualise the model in action
- visualise the worst mistakes (cases where the model was highly confident but classified incorrectly)
- reason about the train and test errors
- fit a tiny dataset
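One common way to exercise a part of the network in isolation is a numerical gradient check: compare the gradient computed by your code with a finite-difference estimate. The sketch below does it for a single logistic unit. The function names, the weights and the input are made up for the example, and the `1e-6` tolerance is a judgement call.

```python
import math


def loss(w, x, y):
    """Cross-entropy loss of a single logistic unit on one example."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def analytic_grad(w, x, y):
    """The gradient we derived by hand (and want to verify): (p - y) * x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]


def numeric_grad(f, w, eps=1e-6):
    """Central finite differences: (f(w + eps) - f(w - eps)) / (2 * eps)."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + eps] + w[i + 1:]
        down = w[:i] + [w[i] - eps] + w[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w, x, y = [0.3, -0.2, 0.5], [1.0, 2.0, -1.5], 1
ana = analytic_grad(w, x, y)
num = numeric_grad(lambda v: loss(v, x, y), w)
max_diff = max(abs(a - n) for a, n in zip(ana, num))
print(max_diff < 1e-6)  # True when the hand-written gradient is correct
```

If the two gradients disagree, the bug is in the backward pass of that component and not in the data or the hyper-parameters, which narrows the search considerably.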