Gradient descent

If you want to predict something from your data, you need to put a strategy in place: a way to measure how good your predictions are … and then a way to make them better.

This is usually done by taking some data for which you already know the outcome and measuring the difference between what your system predicts and the actual outcome.

This difference is often referred to as the “cost function”. Once we have such a function, our machine learning problem comes down to minimising it.

One very simple way to find the minimum value(s) is called gradient descent. The basic idea is to take small steps in the direction opposite to the gradient (the derivative of the function) until we reach a minimum.

The idea behind this algorithm is that

\(
f(x + \epsilon) \approx f(x) + \epsilon f'(x)
\) when \(\epsilon\) is close to 0.
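
To see what this approximation buys us, here is a small worked example (the numbers are my own, chosen for illustration): take \(f(x) = x^2\) at \(x = 2\) with \(\epsilon = 0.1\). Then

\(
f(2.1) = 4.41 \approx f(2) + 0.1 \times f'(2) = 4 + 0.1 \times 4 = 4.4
\)

so the first-order approximation is already very close to the true value.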

Using the sign of the derivative we can find the best direction to walk in. If the derivative is positive the function is increasing, so we move to the left; when the derivative is negative the function is decreasing, so we move to the right. This gives the following pseudo-code:

while (\(\| f'(x)\| > \delta\)) do
\(x = x - \epsilon f'(x)\)
done

\(\delta\) allows us to stop when the derivative is close enough to 0, i.e. when we are close enough to a minimum.
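
Here is a minimal Python sketch of that loop. The cost function \(f(x) = (x - 3)^2\), its derivative and the parameter values are my own choices for illustration, not part of the algorithm itself:

def f_prime(x):
    # derivative of the toy cost f(x) = (x - 3)**2
    return 2 * (x - 3)

epsilon = 0.1   # step size
delta = 1e-6    # stopping threshold on the size of the derivative

x = 0.0         # arbitrary starting point
while abs(f_prime(x)) > delta:
    x = x - epsilon * f_prime(x)

print(x)        # ends up very close to 3, the minimum of f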

[Figure: By Joris Gillis (Created with Maple 10), via Wikimedia Commons]

When the derivative is 0 we may have reached a minimum (or a maximum, a saddle point, or a flat region). We can usually figure out which situation we are in using the second derivative \(f''(x)\).
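
For example (my own examples): with \(f(x) = x^2\) we have \(f'(0) = 0\) and \(f''(0) = 2 > 0\), so \(x = 0\) is a minimum; with \(f(x) = -x^2\), \(f''(0) = -2 < 0\), so it is a maximum; with \(f(x) = x^3\), \(f''(0) = 0\) and the test is inconclusive (it is in fact an inflection point).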

The main problem with this approach is that it is subject to local minima: we may end up in a local minimum that is not the optimal solution. We can try bigger steps (increase \(\epsilon\)), but too large a value of \(\epsilon\) might overshoot the minimum and cause gradient descent to never converge.
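
To make the overshooting concrete (a toy example of my own): for \(f(x) = x^2\) the update becomes

\(
x_{k+1} = x_k - \epsilon \cdot 2 x_k = (1 - 2\epsilon) x_k
\)

so the iterates shrink towards 0 as long as \(|1 - 2\epsilon| < 1\), i.e. \(0 < \epsilon < 1\). For \(\epsilon > 1\) every step lands further from the minimum than the previous one and the algorithm diverges.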

Moreover, this approach is slow as we need to proceed by small steps. In practice we usually stop the algorithm after a given number of steps and use smaller and smaller steps as we get closer to a minimum.
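
One simple way to do both is to cap the number of iterations and shrink the step size a little on every step. This is just a sketch; the decay factor, the cap and the toy cost \(f(x) = (x - 3)^2\) are arbitrary choices of mine:

def f_prime(x):
    # derivative of the toy cost f(x) = (x - 3)**2
    return 2 * (x - 3)

x = 0.0
epsilon = 0.1      # initial step size
delta = 1e-6       # stopping threshold on the derivative
max_steps = 1000   # hard cap on the number of iterations

for step in range(max_steps):
    grad = f_prime(x)
    if abs(grad) < delta:
        break                  # close enough to a stationary point
    x = x - epsilon * grad
    epsilon = epsilon * 0.99   # use smaller and smaller steps

print(x)  # ends up very close to 3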

The beauty of this algorithm lies in its simplicity: it works on many complex functions (\(f\) only needs to be continuous and differentiable).