Weight decay regularisation


Most machine learning techniques follow a similar strategy:

  1. Fit the best possible model on the training dataset
  2. Check that the model generalises by evaluating it on the test dataset

The test dataset consists of data that is never used during training; it lets us measure how the algorithm will perform on data it has never seen before.

Step 1 can be seen as addressing “underfitting” and step 2 as addressing “overfitting”.

We address “underfitting” by making sure our model can fit the training set with “good enough” accuracy.

If we use a first-degree (straight-line) regression on our training set we might not be able to fit the dataset correctly. In that case we suffer from underfitting.

\(f(x) = wx + b\)

In order to address this issue we may introduce higher-degree terms.

\(f(x) = \sum\limits_{i=0}^{\text{deg}} w_{i}x^i\)

Note: with the higher-degree terms the input-output mapping is no longer linear in \(x\). However, the model remains linear in the parameters \(w\), so the optimisation problem is still a linear least-squares problem.
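As a concrete illustration, here is a minimal sketch in Python/NumPy of fitting such a polynomial. The toy data and the `design_matrix()` helper are made up for illustration; the key point is that the weights are still found by ordinary linear least squares.

```python
import numpy as np

# Toy 1-D training set (made-up values, for illustration only).
x_train = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
y_train = np.sin(2 * np.pi * x_train) + 0.1 * np.random.randn(x_train.size)

def design_matrix(x, degree):
    """Columns are x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

# Fit f(x) = sum_i w_i x^i by ordinary least squares: non-linear in x,
# but still linear in the weights w.
degree = 3
X = design_matrix(x_train, degree)
w, *_ = np.linalg.lstsq(X, y_train, rcond=None)

y_pred = X @ w  # predictions on the training points
```

Increasing `degree` lets the curve hug the training points more tightly, which leads directly to the overfitting problem described next.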

This allows us to model non-linear relations. However, as we increase the polynomial degree we may end up with a model that performs very well on the training dataset but poorly on the test data. That’s “overfitting”.

Using higher-degree terms we can pass through almost all of the datapoints in the training set. However, the “shape” of such a function is very specific to the training set and is not a good representation of the underlying distribution. (This is especially true when the training dataset is small.)

We can alter our algorithm to favour lower-degree terms or, more generally, to favour smaller parameter values.

In linear regression this is done by adding a penalty term, weighted by \(\lambda\), to the cost function \(J(w)\).

\(J(w) = MSE_{train} + \lambda w^\top w\)

The \(\lambda w^\top w\) term is the regularisation term, and \(\lambda\) controls its strength: the bigger the weights, the bigger the cost. As we minimise the cost function we therefore favour smaller weights. This technique is called “weight decay”.
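As a rough sketch of how this penalised cost can be minimised, setting the gradient of \(J(w)\) to zero gives a closed-form solution. The function names are hypothetical, and \(X\) is the polynomial design matrix from the earlier sketch.

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    """Minimise J(w) = MSE + lam * w^T w in closed form.

    The gradient is (2/n) X^T (Xw - y) + 2*lam*w; setting it to zero
    gives (X^T X + n*lam*I) w = X^T y.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def cost(X, y, w, lam):
    """J(w) = MSE_train + lam * w^T w."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * (w @ w)

# Example: refit the degree-3 polynomial from above with weight decay.
# w_reg = weight_decay_fit(design_matrix(x_train, 3), y_train, lam=1e-2)
```

In practice the bias term is often excluded from the penalty, and gradient descent is used instead of the closed form when the number of features is large.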

\(\lambda\) is a “hyperparameter” of the linear regression. It is not a model parameter because its value is not optimised by the training process; instead it is chosen upfront and influences the optimisation process.

Similarly, the degree of the polynomial used for the regression is another hyperparameter. It is chosen upfront and determines the model’s “capacity” (i.e. the ability of the model to fit complex datapoint distributions).
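A common way to choose these hyperparameters (not covered in detail here) is to try several values and keep the combination that performs best on held-out data. A minimal sketch, reusing `design_matrix()` and `weight_decay_fit()` from the sketches above, with a made-up validation split:

```python
import numpy as np

# Made-up validation split; in practice the test set is kept untouched
# and a separate validation set is used for this search.
x_val = np.array([0.1, 0.35, 0.55, 0.75, 0.95])
y_val = np.sin(2 * np.pi * x_val) + 0.1 * np.random.randn(x_val.size)

best = None
for degree in (1, 3, 9):            # capacity hyperparameter
    X_tr = design_matrix(x_train, degree)
    X_va = design_matrix(x_val, degree)
    for lam in (0.0, 1e-3, 1e-1):   # weight-decay hyperparameter
        w = weight_decay_fit(X_tr, y_train, lam)
        val_mse = np.mean((X_va @ w - y_val) ** 2)
        if best is None or val_mse < best[0]:
            best = (val_mse, degree, lam)

print("best (validation MSE, degree, lambda):", best)
```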