## How to split a dataset

In machine learning it is pretty obvious to me that you need to split your dataset into 2 parts:

• a training set that you can use to train your model and find optimal parameters
• a test set that you can use to test your trained model and see how well it generalises.

It is important that the test data is never used during the training phase. Using “unseen” data is what allows us to test how well our model generalises. It makes sure your model doesn’t overfit.
Continue reading “How to split a dataset”

## Weight decay regularisation

Most machine learning techniques follow a similar strategy:

2. Generalise by testing the model on the test dataset

The test dataset consists of data that are never used during training and it allows to test how the algorithm will perform over “not seen before” data.

With gradient descent we try to optimise a function that runs over the entire dataset. $$f$$ represents the “cost” over the entire dataset.

When working with big datasets this yield to complex function optimisation and slow computation time.

This is also a problem when dealing with streaming data as we need to wait for the stream to end (or to select a big enough batch of data from the stream) to run gradient descent.

Stochastic gradient descent is a variation of gradient descent where gradient descent is run over every single data point. For each entry in the dataset the parameters are updated.

If you want to predict something from your data, you need to put a strategy in place. I mean you need a way to measure how good your predictions are … and then try to make the best ones.

This is usually done by taking some data for which you already know the outcome and then measuring the difference from what your system predict and the actual outcome.

This difference is often referred to as the “cost function”. Once we have such a function our machine learning problem comes down to minimising our cost function.

One very simple way to find the minimum value(s) is called gradient descent. The basic idea is to make small steps along the gradient (the derivative of the function) until we reach a minimum.