My project at Insight ultimately came down to a binary classification problem: predicting the chance of fire at block-by-block resolution in San Francisco. For each block, I had a collection of features describing the characteristics of the buildings within it, land values, crime, and past fire report information. Given these features for one year, my model's task was to predict the chance of fire in that block in the following year.
To assess the performance of such a model, we need to understand classification metrics! These are just a subset of model performance metrics in general, which I will hopefully cover more of in a future post. Classification metrics for machine learning models are a very well-covered topic in the blogging space, and this post leans heavily on what I’ve learned from reading just a few of the many excellent online resources about this subject.
It’s fundamentally important to be able to measure how well a model is performing at the task it’s been assigned. This process is called validation, and it is needed so that a model can be changed or improved if it is not performing to a satisfactory standard. A validation metric is a scoring system used for this. To be useful, a metric must:
- Be easy to interpret and communicate
- Be easy to compare across models and model runs
- Preferably be a single number, to help with the above needs
There are many established metrics, each of which can be
appropriate in some situations but not others.
Classification tasks
Metrics are arguably most diverse and complicated in the case of classification tasks. When scoring a classifier, we compare the model's predictions on a test set to the actual, ground truth results. Let's start with some definitions:
- True positive: The model correctly predicts a positive result for an example that really is positive
- False positive: The model predicts a positive result when the result should be negative (type 1 error)
- True negative: The model correctly predicts a negative result for an example that really is negative
- False negative: The model predicts a negative result when the result should be positive (type 2 error)
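As a concrete illustration, here is a minimal sketch of counting these four quantities for a toy set of binary predictions. It uses scikit-learn's confusion_matrix, which is just my assumed tooling choice; the post itself doesn't depend on any particular library.

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix returns counts ordered [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
```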
This chart from Wikipedia shows all the metrics that can be
created using combinations of the above four definitions. There are so many! We will dissect some of the components in an attempt to understand them.
The general idea is to come up with a metric - preferably a
single number - that can summarize the model performance with reference to the four definitions above.
Exactly which one should be used depends heavily on the
dataset and the questions being asked: In a cancer detection model, for example, the cost of a false negative (i.e. cancer is present but not detected) might be much higher than the cost
of a false positive.
The most intuitive summary metric is the accuracy, which is
defined as follows
accuracy = (true positives + true negatives) / total examples
Accuracy is a good metric if the classes are balanced (i.e. there are roughly equal numbers of examples in each class) and the costs of false positives and false negatives are roughly equal. If this is not the case, accuracy can be a misleading metric. Take for example a severely imbalanced classification problem where we have 1000 negative examples and 10 positives.
A completely ignorant classifier that labels all the
examples as negative will achieve an accuracy score of
accuracy = (1000 + 0)/1010 = 99%
This high score is deceptive because the model is useless at
detecting positive results!
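Here is the same arithmetic as a quick sketch in plain Python; the numbers mirror the 1000-negative / 10-positive example above and nothing here is specific to any library.

```python
# An all-negative "classifier" on a severely imbalanced dataset
true_negatives = 1000   # every negative example is (trivially) labelled negative
true_positives = 0      # every positive example is missed
total_examples = 1010

accuracy = (true_positives + true_negatives) / total_examples
print(f"accuracy = {accuracy:.1%}")   # ~99.0%, despite the model being useless
```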
To cope with this problem, we can choose to penalize false
positives or false negatives. This will generate two alternative metrics
precision = true positives / (true positives + false positives)
Thus, if a model misclassifies many negatives as positives,
its precision will be hampered. A low precision does not necessarily mean bad performance if the cost of false positives is low. A second metric, which instead penalizes false negatives, is as
follows
recall = true positives / (true positives + false negatives)
If the model misclassifies many positives as negatives (meaning that it misses many positive results), then recall will be low. In the case of my fire model, for example, this amounts to not flagging blocks as destined for fire when they actually do end up experiencing fire.
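For reference, here is a minimal sketch of computing both metrics on toy data, using scikit-learn's precision_score and recall_score (an assumption on my part; any implementation of the formulas above would do just as well).

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = fire in the following year, 0 = no fire (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```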
Note that both precision and recall have their limitations
and can also ‘break’ in certain cases, just like accuracy.
Say for example that a model just classified everything as
positive - it would have perfect recall, but it would not be useful.
Similarly, a model that only correctly classifies one out of
many positive examples but does not misclassify any negative examples will have perfect precision but will also not be useful.
Furthermore, recall that a single metric is most useful for classification tasks so that model performance can be reasonably compared. Precision and recall can be combined into a metric called the F1 score, which is written as follows
f1 score = 2 * precision * recall / (precision + recall)
This can be useful, but note that it assumes that precision and recall should be equally weighted, which might not be a good interpretation in certain situations.
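A short sketch of the harmonic-mean formula, written out by hand so the equal weighting is explicit (the precision and recall values are made up for illustration):

```python
# Made-up precision and recall values for illustration
precision = 0.75
recall = 0.50

# F1 is the harmonic mean: it weights precision and recall equally,
# and is dragged down towards whichever of the two is smaller.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2f}")   # 0.60
```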
A further issue is that all of the aforementioned metrics require a threshold, which is a level of certainty above which the model will classify an example as positive.
This is clearest in the binary classification example. If we
predict the probability that an example is true, we are free to choose a probability threshold above which the model will flag that example as true. Examples associated with a lower probability
score will be flagged as false. The most obvious choice of threshold is 0.5, but that might not always be appropriate. What we really need is a way of understanding performance across a range of
thresholds.
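To make the role of the threshold concrete, here is a small sketch (toy probabilities, plain numpy) showing how the same set of predicted probabilities yields different hard labels at different thresholds.

```python
import numpy as np

# Toy predicted probabilities of the positive class (e.g. "fire next year")
probs = np.array([0.10, 0.35, 0.48, 0.52, 0.70, 0.95])

for threshold in (0.3, 0.5, 0.7):
    # An example is flagged positive if its probability meets or exceeds the threshold
    labels = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: {labels}")
```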
Precision-Recall (PR) curves
Thankfully, we can understand the effect of choosing different thresholds by plotting precision vs recall as a function of threshold. This creates a precision-recall (PR) curve, which might look a bit like the following (from this stackoverflow post).
Typically recall is plotted on the x axis and precision on
the y axis. As one moves along the x axis from left to right, the threshold value for which the precision and recall are calculated decreases from 1 to 0. At a threshold of 1 (or very close to 1), precision is high because the number of false
positives is small.
However, the recall is low because many of the positives are
misclassified as negatives. At the other end of the scale, the model is flagging everything as positive and so its recall is 100%. In a balanced classification problem the precision is 50%, as
seen on the graph.
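If you want to build such a curve yourself, scikit-learn's precision_recall_curve (again, just an assumed tooling choice) sweeps the threshold for you given ground-truth labels and predicted probabilities.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Toy balanced problem with mildly informative "probabilities"
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.3 + rng.random(500) * 0.7, 0, 1)

# precision and recall are returned for each candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision[:5], recall[:5])
```

Plotting recall on the x axis against precision on the y axis (for example with matplotlib's plt.plot(recall, precision)) then reproduces the kind of curve described above.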
This is great because it allows us to understand the
performance of the model across the full range of thresholds but it does not provide a single, easy-to-interpret number that summarizes model performance, which is what we ultimately
seek.
ROC curves and the AUC
Just when we thought this whole classification metrics zoo was complex enough, let's throw in two more acronyms! The Receiver Operating Characteristic (ROC) curve is related to the PR curve and actually shows the same data, but in a slightly different way. It plots the true positive rate (also known as sensitivity or recall) against the false positive rate (also known as fallout).
- True positive rate: The ratio of the number of true positives that the model flags at that threshold to the total number of positives in the test dataset
- False positive rate: The ratio of the number of false positives to the total number of negatives in the dataset
Again, these are calculated as a function of
threshold.
This allows us to answer questions like
“If we accept that the model will flag 20% of the total number of
true negatives as positive, what proportion of the true positives will it catch?”
If the answer is more than 20%, then the model is
doing better than random chance.
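Here is a sketch of answering exactly that question with scikit-learn's roc_curve plus a simple interpolation; the toy data and the 20% figure are purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)

# Toy labels and mildly informative scores
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.3 + rng.random(500) * 0.7, 0, 1)

# roc_curve returns the false positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# True positive rate achieved when we tolerate a 20% false positive rate
tpr_at_20 = np.interp(0.2, fpr, tpr)
print(f"TPR at FPR=0.2: {tpr_at_20:.2f}")   # anything above 0.2 beats random chance
```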
The following example of an ROC curve comes from this site, which provides a more in-depth description of how these curves are constructed.
The diagonal line down the center of the ROC curve indicates
the performance of a model that does no better than random chance.
The better the model, the closer the ROC curve becomes to a
box-shaped function. In the best case scenario, a false positive rate of 0 will produce a true positive rate of 100%, meaning that all of the positives are correctly flagged without the model
incorrectly labelling any true negatives.
This leads to the concept of the Area Under the Curve (AUC), which is a single-number measure of model skill. To interpret it, we can see the ROC curve as describing the tradeoff between making both sorts of error. A skillful model will not have to make many mistakes before it correctly classifies all the positive examples, whereas an unskilled model will make many mistakes. The AUC quantifies this skill level. In addition, the distance between the diagonal and the curve at any point can be interpreted as the probability of the model making an informed decision.
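Computing the AUC is a one-liner with scikit-learn's roc_auc_score (once more, an assumed tool rather than anything this project requires); a value of 0.5 corresponds to the diagonal chance line and 1.0 to the perfect box-shaped curve.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7]

# 0.5 = no better than chance, 1.0 = perfect ranking of positives above negatives
print("ROC AUC:", roc_auc_score(y_true, y_score))
```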
The ROC is informative, but must be treated with caution
where class imbalance is high, because the absolute number of false positives may be higher than the curve suggests. The AUC score is insensitive to the absolute number of false positives; it only reflects their proportion of the total number of negatives.
Thus in general it's useful to plot both the PR curve and
ROC curve and then make interpretations based on both of them. It should be clear from the curves and AUC score in combination which models are performing better than others.
That's all for now, but no doubt there will be more about ML metrics in a future post!