Matthieu Vergne's Homepage

Last update: 19/11/2017 11:16:50

Can We Use Accuracy With Unbalanced Datasets?


We start from the confusion matrix in binary classification:

\(p = tp + fn\)
\(n = fp + tn\)
Predicted Positive
\(p' = tp + fp\)
\(tp\) \(fp\)
\(n' = fn + tn\)
\(fn\) \(tn\)

In this situation, accuracy corresponds to the number of correct results divided by the total number of elements in the dataset:

\[ ACC = \frac{tp+tn}{p+n} \]

As such, accuracy provides a nice measure because it allows to express a percentage of correct results. But it is known to be unreliable when the dataset is unbalanced. For example, one can produce a classifier which predicts only positive elements. Such a classifier would be correct for all the positive elements and wrong for all the negative ones:

Positive Negative
Predicted Positive \(p\) \(n\)
Negative \(0\) \(0\)

This is generally not considered to be a good classifier given the obvious bias towards positive elements. In such a condition, the accuracy formula becomes \(ACC = \frac{p}{p+n}\). The problem arises when the number of positive elements in the dataset far exceeds the number of negative elements, so \(p \gg n\), in which case the formula becomes \(ACC \approx \frac{p}{p} = 1\). Thus, if the dataset has many more positive elements, this biased classifier is able to achieve a high accuracy, and thus appear as a really good classifier despite its obvious bias. Similarly, with a dataset having much more negative elements than positive elements, a classifier predicting only negative elements can also achieve a high accuracy. This is what we call the accuracy paradox.


Can accuracy be designed to fit well with unbalanced datasets?


To answer this question, we start by considering that this measure is suited for balanced datasets. In other words, when half of the dataset is made of positive elements and the other half of negative elements, accuracy can be used to evaluate the ratio of correct results. Because the problem arises when this balance in the dataset is broken, we can assume that the initial formula is a particular case (balanced dataset only) of a more generic one (any dataset). Our objective is then to generalise this formula.

First, because the problem arises from the indifference between positive and negative cases, we separate them:

\[\begin{align*} ACC = & \frac{tp+tn}{p+n}\\ = & \frac{tp}{p+n} + \frac{tn}{p+n}\\ = & \frac{p}{p+n} \frac{tp}{p} + \frac{n}{p+n} \frac{tn}{n}\\ = & \frac{p}{p+n} \frac{tp}{p} + \frac{n+p-p}{p+n} \frac{tn}{n}\\ = & \frac{p}{p+n} \frac{tp}{p} + \left(\frac{p+n}{p+n} - \frac{p}{p+n}\right) \frac{tn}{n}\\ ACC = & \alpha \frac{tp}{p} + \left(1 - \alpha\right) \frac{tn}{n} & \left(\alpha = \frac{p}{p+n}\right)\\ \end{align*}\]

Now, we have three elements:

This shape still allows to speak clearly about an accuracy: the two first elements are clearly percentages of correct results, each on its own category, and the third element just applies a weighting policy. Thus, the unit is preserved and the final result is again a percentage of correct results, but this time on the overall dataset.

We see now that the current accuracy formula considers both positive and negative ratios, but applies to them a weight depending on the composition of the dataset: if it is balanced (\(p = n\)) then:

\[\begin{align*} \alpha = & \frac{p}{p+n}\\ = & \frac{p}{p+p}\\ \alpha = & \frac{1}{2}\\ \end{align*}\]

This is why the current formula is nice for balanced datasets: the weight of the positive elements (\(\alpha\)) is \(\frac{1}{2}\), while the weight for negative elements (\(1-\alpha\)) is also \(\frac{1}{2}\). Both are considered equally, which makes it unbiased. However, if \(p \gg n\), we obtain:

\[\begin{align*} \alpha = & \frac{p}{p+n}\\ \approx & \frac{p}{p}\\ \alpha \approx & 1\\ \end{align*}\]

Thus, we consider almost only the positive elements (\(\alpha \approx 1\)), while we almost ignore negative ones (\(1-\alpha \approx 0\)). Similarly, if \(n \gg p\), then \(\alpha \approx 0\) while \(1-\alpha \approx 1\), leading to consider mainly negative elements while ignoring positive ones.

To ensure a good balance between both, one needs to fix the weight such that it remains balanced, which means considering only \(\alpha = \frac{1}{2}\). In such a case, the formula becomes:

\[\begin{align*} ACC = & \frac{1}{2} \frac{tp}{p} + \left(1 - \frac{1}{2}\right) \frac{tn}{n}\\ = & \frac{1}{2} \frac{tp}{p} + \frac{1}{2} \frac{tn}{n}\\ ACC = & \frac{\frac{tp}{p} + \frac{tn}{n}}{2}\\ \end{align*}\]

In other words, a simple balanced average between the two ratios is the way to compute an unbiased accuracy. Indeed, if we use a classifier predicting only positive elements, we obtain:

\[\begin{align*} ACC = & \frac{\frac{tp}{p} + \frac{tn}{n}}{2}\\ = & \frac{\frac{p}{p} + \frac{0}{n}}{2}\\ = & \frac{1 + 0}{2}\\ ACC = & \frac{1}{2}\\ \end{align*}\]

This result is independent from the dataset configuration: if only postive elements are predicted, the classifier achieves an accuracy of \(\frac{1}{2}\). Same thing if the classifier predicts only negative elements. One may argue that \(\frac{1}{2}\) is still a high value for a classifier which is obviously biased. But if we consider random values, such that half of the elements are predicted as positive and the other half as negative, we obtain:

\[\begin{align*} ACC = & \frac{\frac{tp}{p} + \frac{tn}{n}}{2}\\ = & \frac{\frac{p/2}{p} + \frac{n/2}{n}}{2}\\ = & \frac{\frac{p}{2p} + \frac{n}{2n}}{2}\\ = & \frac{\frac{1}{2} + \frac{1}{2}}{2}\\ ACC = & \frac{1}{2}\\ \end{align*}\]

In other words, \(\frac{1}{2}\) is a limit which shows that we perform no better than a random classifier.

It is possible to obtain a value below this limit by using a reversed classifier. In particular, if the classifier is always wrong, so all predictions are false positives (\(fp\)) and false negatives (\(fn\)), then:

\[\begin{align*} ACC = & \frac{\frac{tp}{p} + \frac{tn}{n}}{2}\\ = & \frac{\frac{0}{p} + \frac{0}{n}}{2}\\ = & \frac{0 + 0}{2}\\ ACC = & 0\\ \end{align*}\]

So if a perfect classifier would have an accuracy of 1, a classifier which is always wrong has an accuracy of 0. In this condition, it makes sense to have an uninformative classifier (whether it is random or biased) with an accuracy of \(\frac{1}{2}\).


Accuracy can be designed to fit well with unbalanced datasets. All we need is to separate ratios of positive and negative predictions and give them the same weight, instead of a dataset-dependent weight:

\[ ACC = \alpha \frac{tp}{p} + \left(1 - \alpha\right) \frac{tn}{n} \quad \text{ with } \alpha = \frac{1}{2} \]

I don't know any publication about this formula, which we may call weighted accuracy, but it is shortly mentionned in (Brodersen et al., 2010) who focus on the balanced accuracy (when \(\alpha = \frac{1}{2}\)). Meanwhile, it is worth noting that it can be rewritten in terms of recall (\(R = \frac{tp}{p}\)) and specificity (\(S = \frac{tn}{n}\)), which are both used in ROC curves, as:

\[ ACC = \alpha R + \left(1 - \alpha\right) S \]

Additionally, if using a common weight \(\alpha = \frac{1}{2}\) allows to get a balanced accuracy, one can also use a different weighting policy to increase the impact of the most interesting aspect. For example, one can favor the identification of rare true positives by giving a higher weight to the ratio of positive elements (i.e. a measure closer to a pure recall). But from my point of view, it seems more interesting to use a balanced accuracy in the general case and use a dedicated measure (e.g. recall or specificity) when we want to focus on a specific aspect. At least, we avoid the need to arbitrarily tweak the weight \(\alpha\).

Moreover, if we generalize further by considering accuracy as an average over \(\frac{tp}{p}\) and \(\frac{tn}{n}\), then the current formula, with \(\alpha = \frac{1}{2}\), is an arithmetic average. It is then possible to imagine a geometric average \(ACC_g = \sqrt{\frac{tp}{p} \times \frac{tn}{n}}\). As shown by the figures below, if the former considers fully biased classifiers like uninformative ones (\(ACC = \frac{1}{2}\)), the latter considers them to be wrong (\(ACC_g = 0\)).

<None yet>