Measuring Performance
From FraudWiki
The following discusses how to measure the performance of a binary classifier in terms of the fraud detection problem. However, the principles are the same whatever the binary classification problem.
Binary Classifiers and Thresholds
A binary classifier is simply one that has only two choices, in this case 'fraud' or 'legal'. All such classifiers work by producing a score. This score is sometimes interpreted as a probability if it lies between 0 and 1, but this must be done with care. In general the score is a measure of the classifier's confidence that a particular transaction is fraud rather than legal. To decide which it is, the score is compared to a threshold value. If it is less than the threshold then the transaction is deemed to be legal. If it is greater, then it is deemed to be fraud. Choosing the best threshold is often not easy and raises the question: is there an optimal threshold? Often the answer is 'no', but there are circumstances where there is one; see Managing Alert Rates.
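The thresholding rule can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the example scores and the threshold of 0.5 are hypothetical, and treating a score exactly equal to the threshold as legal is an assumption, since the text does not say how ties are resolved.

```python
def classify(score: float, threshold: float) -> str:
    """Label a transaction by comparing its score to a threshold."""
    # Scores strictly above the threshold are treated as fraud.
    # Assumption: a score equal to the threshold is treated as legal.
    return "fraud" if score > threshold else "legal"


scores = [0.02, 0.40, 0.75, 0.97]  # hypothetical classifier scores
labels = [classify(s, threshold=0.5) for s in scores]
print(labels)
```

Raising the threshold turns more of these labels into 'legal'; lowering it turns more of them into 'fraud'.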
Thresholds and Performance
If we assume our classifier returns a score between 0 and 1, then a threshold of 0 would result in all transactions being classified as fraud, while a threshold of 1 would result in everything being classified as legal. We could argue that at threshold 0 all of the fraud is correctly classified, but at the expense of a huge number of wrongly classified legal transactions. Equally, at threshold 1 all of the legal transactions would be correctly classified while all of the frauds would be wrongly classified as legal.
Fraud is usually very rare. If we assume that, say, 0.1% of the transactions to be classified are fraud, then even with the threshold set to 1, where the classifier misses all of the fraud, we could argue that it is 99.9% accurate. Raw accuracy is therefore misleading, and the choice of good performance metrics is essential when evaluating such systems.
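The arithmetic behind the 99.9% figure can be checked directly. The population size of 100,000 below is a hypothetical number chosen only to make the counts easy to read; the 0.1% fraud rate is the one assumed in the text.

```python
# Hypothetical population: 100,000 transactions, 0.1% of them fraud.
n_total = 100_000
n_fraud = n_total // 1000      # 100 frauds
n_legal = n_total - n_fraud    # 99,900 legal transactions

# With the threshold at 1 nothing is alerted, so every legal transaction
# is classified correctly and every fraud is missed.
correct = n_legal
accuracy = correct / n_total
frauds_caught = 0

print(f"accuracy = {accuracy:.1%}, frauds caught = {frauds_caught} of {n_fraud}")
# prints: accuracy = 99.9%, frauds caught = 0 of 100
```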
Terms and Definitions
If we think of a classifier that raises an alert whenever it classifies a transaction as fraudulent then we can define the sets of transactions below. The relationship between these sets can be nicely illustrated by a Venn diagram:
[Figure: Venn diagram of the sets U, L, F and A]
We have the set U of all transactions divided into the two disjoint sets of Legal L and Fraud F by the line. The set of Alerts A is the inner subset of U.
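So we have
$$U = L \cup F$$
and
$$L \cap F = \emptyset$$
and equally
$$A = (A \cap L) \cup (A \cap F)$$
where $A \cap F$ is the set of true positives (frauds that are alerted) and $A \cap L$ is the set of false positives (legals that are alerted).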
True Positive Fraction
We define the True Positive Fraction (TPF) as the fraction of real frauds that are correctly alerted by our classifier. So we have
$$\mathrm{TPF} = \frac{|A \cap F|}{|F|}$$
In an ideal world we want the TPF to be 1.
False Positive Fraction
We define the False Positive Fraction (FPF) as the fraction of legals that are wrongly alerted by our classifier. So we have
$$\mathrm{FPF} = \frac{|A \cap L|}{|L|}$$
In an ideal world we want the FPF to be 0.
False Positive Ratio
We define the False Positive Ratio (FPR) as simply the ratio of False Positives to True Positives. So we have
$$\mathrm{FPR} = \frac{|A \cap L|}{|A \cap F|}$$
In an ideal world we want the FPR to be zero.
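The three measures can be computed directly from the sets L, F and A introduced earlier. The sketch below uses Python sets of hypothetical transaction ids; the function name `alert_metrics` and the example data are illustrative only, and returning infinity for the FPR when there are no true positives is an assumption, as the text does not cover that case.

```python
def alert_metrics(alerts: set, fraud: set, legal: set) -> dict:
    """Compute TPF, FPF and FPR from sets of transaction ids."""
    true_pos = len(alerts & fraud)   # frauds that were alerted
    false_pos = len(alerts & legal)  # legals that were alerted
    return {
        "TPF": true_pos / len(fraud),
        "FPF": false_pos / len(legal),
        # Assumption: with no true positives the ratio is reported as infinity.
        "FPR": false_pos / true_pos if true_pos else float("inf"),
    }


# Hypothetical data: 990 legal and 10 fraud transactions.
legal = set(range(0, 990))
fraud = set(range(990, 1000))
# The classifier alerts 20 legals and 8 of the 10 frauds.
alerts = set(range(0, 20)) | set(range(990, 998))

metrics = alert_metrics(alerts, fraud, legal)
print(metrics)
```

With this data the TPF is 0.8, the FPF is 20/990 (about 0.02) and the FPR is 2.5, meaning two and a half false alerts for every fraud caught.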
Receiver Operating Characteristic (ROC) Curves
This is a technique for measuring the effectiveness of a classification process. It has its origins in signal processing but is now widely used in statistical medicine to determine the effectiveness of tests and treatments. There is also cited literature on the use of ROC curves in fraud detection.
Consider a classifier of fraud that returns a score between 0 and 1 which is a measure of evidence for fraud. Call this measure 's'. Then to classify a transaction we need to decide on a threshold 't' for 's' above which we decide it is fraud.
If we set 't' to be 0 then every transaction is classified as fraud; conversely, if we set it to 1 then no transactions are classified as fraud. So as we move the threshold from 0 to 1 we can observe the behaviour of various metrics, such as the False Positive Ratio (FPR), and plot graphs of how these metrics change relative to each other.
The ROC curve is just such a graph and is a plot of True Positive Fraction (TPF) against False Positive Fraction (FPF) for different threshold values t. To the right below is a typical ROC curve.
You would not expect a classifier to produce results below graph (a), as this would indicate a bias towards getting it wrong. However, it is not necessary to assume that a classifier's behaviour is always positive (see below).
In general, we can say that the greater the area under the ROC curve the better the classifier is at discriminating fraud.
The ROC curve has many useful properties, not least that it makes no assumptions about the underlying populations. Also, it measures overall performance rather than performance at a specific threshold.
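As a rough illustration of how such a curve and its area might be computed, the sketch below sweeps the threshold over a small hypothetical data set, records one (FPF, TPF) point per threshold, and estimates the area with the trapezium rule. The function names, the scores and the labels are all made up for the example, and a score equal to the threshold is assumed not to raise an alert.

```python
def roc_points(scores, is_fraud, thresholds):
    """Return the (FPF, TPF) points of the ROC curve, sorted left to right."""
    n_fraud = sum(is_fraud)
    n_legal = len(is_fraud) - n_fraud
    points = []
    for t in thresholds:
        # Assumption: a transaction is alerted only if its score exceeds t.
        tp = sum(1 for s, f in zip(scores, is_fraud) if s > t and f)
        fp = sum(1 for s, f in zip(scores, is_fraud) if s > t and not f)
        points.append((fp / n_legal, tp / n_fraud))
    return sorted(points)


def area_under(points):
    """Trapezium-rule estimate of the area under a sorted list of points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores and true labels (True means fraud).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
is_fraud = [False, False, True, False, False, True, True, True]

# Sweep from 0 to 1, stepping through every observed score.
thresholds = [0.0] + sorted(set(scores)) + [1.0]
points = roc_points(scores, is_fraud, thresholds)
auc = area_under(points)
print(points[0], points[-1], auc)
```

For this data the curve runs from (0, 0) to (1, 1) and the area is 0.875, well above the 0.5 that lies under the diagonal.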
The GINI value
The GINI is defined as twice the area under the ROC curve minus one.
This is a more convenient measure as it has the property that when a classifier is very bad (i.e. random, with an area of 0.5 under its ROC curve) the GINI is 0.
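Writing A here for the area under the ROC curve (not the set of alerts) and G for the GINI value, we have
$$G = 2A - 1$$
and as
$$0 \le A \le 1$$
then
$$-1 \le G \le 1$$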
What is a bad classifier in terms of the ROC curve?
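A bad classifier is one that randomly classifies an event. If the overall probability of fraud is 0.001 (1 in 1000 are fraud) then we would expect only about 1 in 1000 of its alerts to be fraud, whatever the threshold, because its alerts are no more likely to be fraud than any other transaction. Equivalently, it alerts the same fraction of frauds as of legals, so TPF = FPF.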
This is represented by the diagonal line on a ROC graph.
If a classifier has a bias towards getting it wrong then it too may be regarded as a good classifier, because it is still discriminating fraud. Its curve will tend to lie below the diagonal.
For a perfectly positive classifier ( A = 1 ) we have G = 1 . For a perfectly negative classifier ( A = 0 ) we have G = -1 .
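However, if G < 0, i.e. A < 0.5, then the probability of fraud as determined by the classifier is not s (say) but 1 - s. (It is actually classifying legal.) Replacing the score s by
$$s' = 1 - s$$
turns each point (FPF, TPF) of the ROC curve into (1 - FPF, 1 - TPF), and hence the new area is
$$A' = 1 - A,$$
and hence
$$G' = 2A' - 1 = 1 - 2A = -G.$$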
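The effect of swapping s for 1 - s can be checked numerically. The sketch below does not integrate the ROC curve; it uses the equivalent pairwise-ranking form of the area (the fraction of fraud/legal pairs in which the fraud has the higher score, with ties counted as one half), which is a swapped-in technique rather than anything stated in this article. The function name and the data are hypothetical.

```python
def auc_by_pairs(scores, is_fraud):
    """Area under the ROC curve as the fraction of correctly ranked pairs."""
    fraud_scores = [s for s, f in zip(scores, is_fraud) if f]
    legal_scores = [s for s, f in zip(scores, is_fraud) if not f]
    wins = 0.0
    for fs in fraud_scores:
        for ls in legal_scores:
            if fs > ls:
                wins += 1.0   # fraud ranked above legal
            elif fs == ls:
                wins += 0.5   # ties count one half
    return wins / (len(fraud_scores) * len(legal_scores))


# Hypothetical classifier with a bias towards getting it wrong:
# frauds mostly receive the low scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_fraud = [False, False, False, True, False, True, True, True]

a = auc_by_pairs(scores, is_fraud)
g = 2 * a - 1

# Use 1 - s as the score instead.
flipped = [1 - s for s in scores]
a_flipped = auc_by_pairs(flipped, is_fraud)
g_flipped = 2 * a_flipped - 1

print(a, g, a_flipped, g_flipped)
```

Here the original scores give A = 0.0625 and G = -0.875, while the flipped scores give A = 0.9375 and G = 0.875, so the strongly negative classifier becomes an equally strong positive one.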
