When you train several models on a dataset you need a way to compare their performance and choose the one that best suits your needs.

As we will see there are different ways to compare the results and then pick the best one.

Let’s start with what scores we can get out of the training process. Assuming we are running a classification model with two possible outcomes, the model’s performance can be summarised with four figures, which together form the confusion matrix.

These 4 figures are:

- **TP – True Positives**: the number of samples correctly marked as positive
- **TN – True Negatives**: the number of samples correctly marked as negative
- **FP – False Positives**: the number of samples incorrectly marked as positive (aka type I error)
- **FN – False Negatives**: the number of samples incorrectly marked as negative (aka type II error)

To remember it I think about medical trials where we want to test if a patient has a particular disease.

The medical test (e.g. a blood sample analysis) is what our model predicts; it can be positive or negative. Then we find out whether the test result is correct (true) or incorrect (false).

These figures are usually presented in a 2×2 matrix called the confusion matrix:

| | Prediction Positive | Prediction Negative |
|---|---|---|
| Real Positive | TP – True Positive | FN – False Negative (type II error) |
| Real Negative | FP – False Positive (type I error) | TN – True Negative |

This matrix summarises the performance of a model. Next we need a way to compare these matrices to find out the best model.
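To make the four counts concrete, here is a minimal sketch that tallies them directly from a list of true labels and predictions. The labels and predictions are made-up example data, not the output of any real model:

```python
# Counting the four confusion-matrix figures by hand.
# y_true and y_pred are illustrative example data (1 = positive, 0 = negative).

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # real classes
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # model predictions

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, tn, fp, fn)  # 3 4 2 1
```

In practice a library routine would do this for you, but counting by hand makes it clear that the matrix is nothing more than four tallies over the test set.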

If our model is perfect the matrix is diagonal, with only true positive and true negative values.

That is most likely not going to happen… so how do we choose? Should we favour type I or type II errors, or the matrix with the lowest overall error rate?

Well the answer is: it depends. It depends on what you are trying to predict.

E.g. in the case of medical trials you probably want the lowest possible false negative rate (you want to avoid leaving a person with the disease untreated). That might also mean a higher false positive rate (a person without the disease but with a positive test can probably confirm she’s safe with further medical investigation).

Minimising the type II error is easy: predict that all the patients are positive. No more type II errors, but the test becomes useless as every patient now needs further medical investigation.
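A quick sketch with toy numbers shows why this degenerate “always positive” test is useless. The prevalence figures below are hypothetical, chosen only for illustration:

```python
# Sketch: predicting positive for everyone removes all type II errors,
# but every healthy patient now gets a false positive. Toy numbers only.

positives, negatives = 10, 990  # hypothetical: 10 sick patients out of 1000

# "Always positive" predictor:
tp, fn = positives, 0   # no misses: type II error is gone
fp, tn = negatives, 0   # but every healthy patient is flagged

miss_rate = fn / positives   # 0.0 — no type II errors
fall_out = fp / negatives    # 1.0 — every negative is a false alarm
precision = tp / (tp + fp)   # 0.01 — almost every positive result is wrong

print(miss_rate, fall_out, precision)  # 0.0 1.0 0.01
```

The miss rate is perfect, but precision collapses to the disease prevalence: a positive result carries almost no information.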

So how do we deal with this? Well, from the confusion matrix many indicators or scores can be computed:

| Score | Formula |
|---|---|
| Sensitivity – True Positive Rate | \(\frac{TP}{P}=\frac{TP}{TP + FN}\) |
| Specificity – True Negative Rate | \(\frac{TN}{N}=\frac{TN}{TN + FP}\) |
| Precision – Positive Predictive Value | \(\frac{TP}{TP + FP}\) |
| Negative Predictive Value | \(\frac{TN}{TN + FN}\) |
| Fall-out – False Positive Rate | \(\frac{FP}{N}=\frac{FP}{TN + FP}\) |
| False Discovery Rate | \(\frac{FP}{TP + FP}\) |
| Miss Rate – False Negative Rate | \(\frac{FN}{P}=\frac{FN}{TP + FN}\) |
| Accuracy | \(\frac{TP + TN}{Total}\) |
| F1 Score | \(\frac{2TP}{2TP + FP + FN}\) |

Depending on your needs you may pick one of these scores to compare your models. In practice, accuracy, F1 score and sensitivity are the most frequently used.
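The most common of these scores follow directly from the four counts. A minimal sketch, using illustrative counts (TP = 3, TN = 4, FP = 2, FN = 1):

```python
# Computing the usual scores from the four confusion-matrix counts.
# The counts themselves are illustrative example values.
tp, tn, fp, fn = 3, 4, 2, 1

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
precision = tp / (tp + fp)                  # positive predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct predictions
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and sensitivity

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")
# sensitivity=0.75 specificity=0.67 precision=0.60 accuracy=0.70 f1=0.67
```

Note that each score collapses the 2×2 matrix into a single number, which is exactly what makes model comparison possible, at the cost of hiding the trade-off between the two error types.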