Confusion Matrix

Recall, precision, specificity, accuracy, and F1 all come from the same confusion matrix, but they answer different questions. The practical question is not “Which metric is best?” It is “Which mistake is most expensive in this problem?”

The most important rule is:

Positive does not mean good. Negative does not mean bad.

In classification metrics, “positive” is just the label for the class we want the model to detect. It does not mean the outcome is desirable.

For example, if we build a disease screening model, the positive class might be “has disease.” If we build a sales model, the positive class might be “will buy a car.” In both cases, recall still means the same thing: of the actual positive cases, how many did the model catch?

Before reading any metric, ask:

What is the positive class?

Everything else follows from that choice.

Four Outcomes

For a binary classifier, every prediction falls into one of four groups. In the tables below, rows are the actual classes and columns are the model predictions.

	Predicted positive	Predicted negative
Actually positive	True positive, `TP`	False negative, `FN`
Actually negative	False positive, `FP`	True negative, `TN`

Term	Plain English
`TP`	The model said positive, and it was actually positive.
`FN`	The model said negative, but it was actually positive.
`FP`	The model said positive, but it was actually negative.
`TN`	The model said negative, and it was actually negative.
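
As a minimal sketch of how the four outcomes are counted, the snippet below tallies them from paired actual and predicted labels. The function name `confusion_counts` and the sample labels are hypothetical, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count the four outcomes for a binary problem.

    `positive` names the class being detected; every other label is negative.
    """
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            counts["FP" if predicted == positive else "TN"] += 1
    return counts


# Hypothetical labels, for illustration only.
y_true = ["sick", "sick", "healthy", "healthy", "healthy", "sick", "healthy"]
y_pred = ["sick", "healthy", "healthy", "sick", "healthy", "sick", "healthy"]

sick_counts = confusion_counts(y_true, y_pred, positive="sick")
healthy_counts = confusion_counts(y_true, y_pred, positive="healthy")
print(sick_counts)     # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 3}
print(healthy_counts)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 2}
```

The same predictions give different counts depending on which label is treated as positive, which is why the positive class has to be named before any metric is read.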

Metric Definitions

Metric	Formula	Plain English
Recall	`TP / (TP + FN)`	Of all actual positives, how many did we catch?
Precision	`TP / (TP + FP)`	Of all predicted positives, how many were truly positive?
Accuracy	`(TP + TN) / (TP + FP + FN + TN)`	Of all cases, how many did we classify correctly?
F1 score	`2 * Precision * Recall / (Precision + Recall)`	A combined score balancing precision and recall.
Specificity	`TN / (TN + FP)`	Of all actual negatives, how many did we correctly reject?

These formulas assume the denominator is not zero. Precision is undefined if the model never predicts positive, recall is undefined if the evaluation set has no actual positives, and specificity is undefined if the evaluation set has no actual negatives. F1 is undefined if 2TP + FP + FN = 0, which means there are no true positives, false positives, or false negatives to evaluate for the positive class. Libraries sometimes return 0 for undefined cases by convention, so check the convention before comparing results.

The equivalent confusion-matrix form of F1 is:

F1 = 2TP / (2TP + FP + FN)
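
The definitions and the zero-denominator cases can be sketched in a few lines. This is one possible convention, not a library's behavior: the hypothetical helpers `safe_div` and `binary_metrics` return `None` for an undefined metric instead of 0, and the counts are made up for illustration.

```python
def safe_div(numerator, denominator):
    """Return the ratio, or None when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else None


def binary_metrics(tp, fn, fp, tn):
    """Compute the five metrics from confusion-matrix counts."""
    return {
        "recall": safe_div(tp, tp + fn),
        "precision": safe_div(tp, tp + fp),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, tp + fn + fp + tn),
        # Count form of F1, undefined only when 2TP + FP + FN = 0.
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts, for illustration only.
typical = binary_metrics(tp=8, fn=2, fp=4, tn=86)
never_positive = binary_metrics(tp=0, fn=10, fp=0, tn=90)

print(typical["recall"])            # 0.8
print(never_positive["precision"])  # None
print(never_positive["f1"])         # 0.0
```

In the second case the model never predicts positive, so precision is undefined, while the count form of F1 is still 0 because there are false negatives to evaluate.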

Recall

Recall is also called sensitivity or the true positive rate.

Recall asks: among the things that were truly positive, how many did the model find? High recall means few false negatives.

Use recall when missing positives is costly.

Positive class	High recall means
Has disease	We catch most sick patients.
Fraudulent transaction	We catch most fraud.
Spam email	We catch most spam.
Will default on a loan	We catch most risky borrowers.
Will buy a car	We identify most real buyers.

Recall is not good or bad by itself. It tells you how completely the model catches the positive class.

Precision

Precision asks: among the cases the model called positive, how many were actually positive? High precision means few false positives among predicted positives.

Use precision when false alarms are costly or when acting on a positive prediction consumes scarce resources.

Positive class	High precision means
Has disease	Most positive test results are truly sick patients.
Fraudulent transaction	Most flagged transactions are actually fraud.
Will default on a loan	Most flagged borrowers really would default.
Will buy a car	Most targeted customers really are buyers.

Recall:    Did we catch the actual positives?
Precision: Can we trust the predicted positives?

Specificity

Specificity is also called the true negative rate.

Specificity asks: among the things that were truly negative, how many did the model correctly leave as negative? High specificity means few false positives among actual negatives.

Positive class	Actual negative class	High specificity means
Has disease	Does not have disease	Healthy people are rarely told they may be sick.
Fraudulent transaction	Legitimate transaction	Legitimate transactions are rarely flagged as fraud.
Will default on a loan	Will not default on a loan	Safe borrowers are rarely flagged as risky.
Will buy a car	Will not buy a car	Non-buyers are rarely targeted as buyers.

Specificity is the negative-class counterpart to recall.

Recall:      Of actual positives, how many did we catch?
Specificity: Of actual negatives, how many did we correctly reject?

Specificity is related to the false positive rate: Specificity = 1 - FPR, where FPR = FP / (FP + TN).

Accuracy

Accuracy asks: what fraction of all predictions were correct? It is simple and useful when classes are reasonably balanced and false positives and false negatives have similar costs.

Accuracy can be misleading when one class is much more common than the other. If only 1% of people have a disease, a model that always predicts “no disease” gets 99% accuracy, but it has 0% recall for disease. The model is accurate in a shallow sense, but useless for finding the disease.
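
The arithmetic behind that example can be checked directly. The sketch below assumes a hypothetical population of 1,000 people, which the text above does not specify; only the 1% prevalence comes from the example.

```python
# Hypothetical screening population: 1,000 people, 1% of whom have the disease.
n_total = 1000
n_sick = 10

# A model that always predicts "no disease" never produces a positive.
tp, fp = 0, 0
fn, tn = n_sick, n_total - n_sick

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.1%}")  # accuracy = 99.0%
print(f"recall   = {recall:.1%}")    # recall   = 0.0%
```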

F1 Score

F1 combines precision and recall using the harmonic mean. It is high only when both precision and recall are high. If either one is poor, F1 drops quickly.

Use F1 when the positive class matters more than the negative class, the data is imbalanced, and you want one number that balances false positives and false negatives. But F1 does not use TN, so it ignores how well the model handles actual negatives. If the negative class matters a lot, also check specificity.

Also remember that F1 treats precision and recall as equally important. If recall matters more than precision, or precision matters more than recall, consider an F-beta score or a cost-based threshold instead of plain F1.
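
For reference, F-beta generalizes F1 with a weight beta: values above 1 favor recall and values below 1 favor precision. The sketch below uses the count form, (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP), which reduces to the F1 formula when beta is 1. The function name `f_beta` and the counts are hypothetical.

```python
def f_beta(tp, fn, fp, beta=1.0):
    """F-beta from counts: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else None


# Hypothetical counts: precision = 60/75 = 0.80, recall = 60/100 = 0.60.
tp, fn, fp = 60, 40, 15

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(tp, fn, fp, beta):.3f}")
# F0.5 = 0.750
# F1 = 0.686
# F2 = 0.632
```

Because recall is the weaker of the two here, the recall-weighted F2 is lower than F1, and the precision-weighted F0.5 is higher.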

Example 1: Disease Testing

Suppose the positive class is “has disease,” and the negative class is “does not have disease.”

	Predicted positive	Predicted negative	Total actual
Actually positive	TP = 90	FN = 10	100
Actually negative	FP = 45	TN = 855	900
Total predicted	135	865	1000

Metric	Calculation	Value
Recall	`90 / (90 + 10)`	90.0%
Precision	`90 / (90 + 45)`	66.7%
Specificity	`855 / (855 + 45)`	95.0%
Accuracy	`(90 + 855) / 1000`	94.5%
F1	`2 * 90 / (2 * 90 + 45 + 10)`	76.6%

Interpretation: recall says the test catches 90% of sick patients. Precision says that when the test says “disease,” it is right about two-thirds of the time. Specificity says the test correctly clears 95% of healthy people. Accuracy is high, but it does not tell the whole story because most people in this dataset are healthy.

Precision in medical testing is also affected by prevalence. The same recall and specificity can produce lower precision when the disease is rare, because false positives may outnumber true positives.
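
To make the prevalence effect concrete, the sketch below derives precision from recall, specificity, and prevalence using Bayes' rule. The recall and specificity are the Example 1 values; the function name `precision_from_rates` and the two lower prevalences are assumptions added for illustration.

```python
def precision_from_rates(recall, specificity, prevalence):
    """Precision implied by recall, specificity, and prevalence (Bayes' rule)."""
    true_positive_share = recall * prevalence
    false_positive_share = (1 - specificity) * (1 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Recall and specificity from Example 1; the lower prevalences are hypothetical.
for prevalence in (0.10, 0.01, 0.001):
    p = precision_from_rates(recall=0.90, specificity=0.95, prevalence=prevalence)
    print(f"prevalence {prevalence:.1%}: precision {p:.1%}")
# prevalence 10.0%: precision 66.7%
# prevalence 1.0%: precision 15.4%
# prevalence 0.1%: precision 1.8%
```

At 10% prevalence this reproduces the 66.7% precision from the table. With the same test applied to rarer disease, most positive results become false alarms.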

Example 2: Car Sales, Positive Means “Will Buy”

Here the positive class is a good outcome: “will buy a car.” The negative class is “will not buy a car.”

	Predicted positive	Predicted negative	Total actual
Actually positive	TP = 70	FN = 30	100
Actually negative	FP = 90	TN = 810	900
Total predicted	160	840	1000

Metric	Calculation	Value
Recall	`70 / (70 + 30)`	70.0%
Precision	`70 / (70 + 90)`	43.8%
Specificity	`810 / (810 + 90)`	90.0%
Accuracy	`(70 + 810) / 1000`	88.0%
F1	`2 * 70 / (2 * 70 + 90 + 30)`	53.8%

Interpretation: recall says the model finds 70% of real buyers. Precision says fewer than half of targeted people actually buy. Specificity says the model correctly ignores 90% of non-buyers. Accuracy looks decent because most people are non-buyers, but the sales team may still care more about precision if their time is limited.

Example 3: Car Loans, Positive Means “Will Default”

Now the positive class is a bad event: “will default on a loan.” The negative class is “will not default on a loan.”

	Predicted positive	Predicted negative	Total actual
Actually positive	TP = 30	FN = 10	40
Actually negative	FP = 20	TN = 940	960
Total predicted	50	950	1000

Metric	Calculation	Value
Recall	`30 / (30 + 10)`	75.0%
Precision	`30 / (30 + 20)`	60.0%
Specificity	`940 / (940 + 20)`	97.9%
Accuracy	`(30 + 940) / 1000`	97.0%
F1	`2 * 30 / (2 * 30 + 20 + 10)`	66.7%

Interpretation: recall says we caught 75% of people who would default. Precision says that among people flagged as risky, 60% really would default. Specificity says the model correctly accepts about 98% of people who would not default. Accuracy is 97%, but defaults are rare, so accuracy alone can hide risk.

This example shows why positive does not mean good. Here the positive class is bad, but recall still simply means: how much of the positive class did we catch?

Choosing the Right Metric

Goal	Metric to watch
Catch as many positives as possible	Recall
Make positive predictions trustworthy	Precision
Correctly reject negatives	Specificity
Measure total correctness	Accuracy
Balance precision and recall	F1

Situation	Usually prioritize	Why
Medical screening where missing disease is dangerous	High recall	False negatives are dangerous.
Confirmatory medical test where false alarms are harmful	High precision and high specificity	Positive results should be trustworthy, and healthy people should not be flagged unnecessarily.
Fraud detection where missed fraud is costly	High recall for fraud	The system should catch as much fraud as possible.
Fraud system where blocking legitimate customers is costly	High specificity, often with high precision	Legitimate transactions should rarely be flagged, and fraud alerts should be credible.
Sales campaign with limited sales team time	High precision	The contacted leads should be likely to convert.
Sales campaign where contacting extra people is cheap	High recall	The campaign can afford extra outreach to find more buyers.
Loan approval where defaults are costly	High recall for default	The lender wants to catch most likely defaults.
Loan approval where rejecting safe borrowers is costly	High specificity for default	The lender wants to avoid wrongly flagging safe borrowers as risky.

Thresholds Create Trade-Offs

Many classifiers output a score or probability. A threshold converts that score into a positive or negative prediction. Changing the threshold changes the confusion matrix, which changes the metrics.

Lower threshold: more predicted positives. Recall cannot decrease, specificity cannot increase, and precision often decreases if the added positives are lower-confidence cases.
Higher threshold: fewer predicted positives. Recall cannot increase, specificity cannot decrease, and precision often increases if the remaining positives are higher-confidence cases.

Recall and specificity pull in opposite directions as the model becomes more or less aggressive about calling things positive: a threshold change that raises one can only hold or lower the other. Precision is not guaranteed to be monotonic with the threshold; it depends on class prevalence and how well the model ranks examples. This is why a model should not be judged only at a single default threshold such as 0.5.
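
A small threshold sweep illustrates these trade-offs. The scores and labels below are hypothetical, and the helper `metrics_at` is a sketch that treats a score at or above the threshold as a positive prediction.

```python
# Hypothetical model scores and true labels (1 = positive), for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]


def metrics_at(threshold):
    """Return (recall, specificity, precision) when score >= threshold means positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum(not p and y == 1 for p, y in zip(predicted, labels))
    tn = sum(not p and y == 0 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else None
    return tp / (tp + fn), tn / (tn + fp), precision


for threshold in (0.85, 0.75, 0.65, 0.50, 0.25):
    recall, specificity, precision = metrics_at(threshold)
    print(f"{threshold:.2f}  recall={recall:.2f}  specificity={specificity:.2f}  precision={precision:.2f}")
# 0.85  recall=0.50  specificity=1.00  precision=1.00
# 0.75  recall=0.50  specificity=0.83  precision=0.67
# 0.65  recall=0.75  specificity=0.83  precision=0.75
# 0.50  recall=0.75  specificity=0.67  precision=0.60
# 0.25  recall=1.00  specificity=0.50  precision=0.57
```

As the threshold drops, recall never decreases and specificity never increases. Precision falls from 1.00 to 0.67 and then rises to 0.75 before falling again, so it is not monotonic on this data.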

Multiclass Note

For multiclass classification, these metrics are usually computed using a one-vs-rest view for each class. For example, in a three-class classifier with classes A, B, and C, you can compute metrics for class A by treating A as positive and everything else as negative. Then repeat for B and C.

Macro average: compute the metric per class, then average each class equally. This is useful when minority classes matter.
Weighted average: compute the metric per class, then weight each class by its share of the actual examples. This follows the dataset distribution.
Micro average: pool the counts across classes first, then compute the metric globally. This can be dominated by common classes.

If rare classes matter, look at per-class metrics and macro averages. Weighted and micro averages can look strong even when rare classes are weak. In single-label multiclass classification, micro-averaged precision, micro-averaged recall, and micro-averaged F1 are equal to accuracy, so they may not add much beyond the accuracy number.
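
The three averages can be compared on a small hypothetical three-class confusion matrix. The sketch below uses F1 as the per-class metric and assumes every class has at least one actual or predicted example, so no per-class F1 is undefined. The matrix values and helper names are made up for illustration.

```python
classes = ["A", "B", "C"]
# Rows are actual classes, columns are predicted classes (hypothetical counts).
matrix = [
    [80, 5, 5],   # actual A
    [4, 10, 6],   # actual B
    [2, 3, 5],    # actual C
]
total = sum(sum(row) for row in matrix)


def one_vs_rest(k):
    """TP, FN, FP for class k, treating every other class as negative."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    return tp, fn, fp


def f1(tp, fn, fp):
    # Assumes 2TP + FP + FN > 0, which holds for the counts above.
    return 2 * tp / (2 * tp + fp + fn)


counts = [one_vs_rest(k) for k in range(len(classes))]
per_class_f1 = [f1(*c) for c in counts]
support = [sum(row) for row in matrix]  # actual examples per class

macro_f1 = sum(per_class_f1) / len(classes)
weighted_f1 = sum(f * n for f, n in zip(per_class_f1, support)) / total
# Micro: pool TP, FN, and FP across classes before computing F1.
micro_f1 = f1(*(sum(c[i] for c in counts) for i in range(3)))
accuracy = sum(matrix[k][k] for k in range(len(classes))) / total

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f} micro={micro_f1:.3f} accuracy={accuracy:.3f}")
# macro=0.607 weighted=0.802 micro=0.792 accuracy=0.792
```

The macro average is pulled down by the weak rare classes B and C, while the weighted and micro averages are dominated by the common class A. Micro F1 matches accuracy, as expected for single-label data.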

Common Traps

Trap 1: Thinking Positive Means Good

Wrong:

Positive = good
Negative = bad

Correct:

Positive = the class we are detecting
Negative = everything outside that class

Trap 2: Confusing Recall and Precision

Recall denominator:

TP + FN = all actual positives

Precision denominator:

TP + FP = all predicted positives

Memory aid:

Recall:    Did we find the real positives?
Precision: Were our positive predictions right?

Trap 3: Trusting Accuracy on Imbalanced Data

If one class is very common, accuracy can look high even when the model performs badly on the important class. Always compare accuracy with recall, precision, and specificity.

Trap 4: Looking at F1 Without Specificity

F1 ignores true negatives. If correctly handling negatives matters, F1 alone is not enough.

Trap 5: Forgetting the Threshold

A threshold is a policy choice. The “best” threshold depends on the cost of a false positive, the cost of a false negative, and the capacity of the people or systems acting on model predictions.
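
One simple way to turn costs into a threshold is to total the cost of mistakes at each candidate threshold and pick the cheapest. The sketch below is only an illustration under stated assumptions: the scores, labels, candidate thresholds, and cost values are hypothetical, and it ignores capacity limits.

```python
# Hypothetical scores, labels (1 = positive), and candidate thresholds.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
candidates = [0.85, 0.75, 0.65, 0.50, 0.25]


def total_cost(threshold, cost_fp, cost_fn):
    """Total cost of mistakes when score >= threshold is called positive."""
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return cost_fp * fp + cost_fn * fn


def cheapest_threshold(cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn))


print(cheapest_threshold(cost_fp=1, cost_fn=5))  # 0.25
print(cheapest_threshold(cost_fp=5, cost_fn=1))  # 0.85
```

When a missed positive costs five times as much as a false alarm, the lowest threshold wins. When the costs are reversed, the highest threshold wins, even though the model and its scores are unchanged.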

One-Screen Summary

Positive = the target class, not necessarily a good thing.

Recall      = TP / (TP + FN)
            = Of actual positives, how many did we catch?

Precision   = TP / (TP + FP)
            = Of predicted positives, how many were right?

Specificity = TN / (TN + FP)
            = Of actual negatives, how many did we reject correctly?

Accuracy    = (TP + TN) / Total
            = Of all cases, how many were correct?

F1          = 2 * Precision * Recall / (Precision + Recall)
            = One score balancing precision and recall.

Best mental model:

Recall:      catch the target
Precision:   trust the alarm
Specificity: avoid false alarms among non-targets
Accuracy:    total correctness
F1:          precision-recall balance

If you remember only one rule, remember this: pick the metric by naming the mistake you cannot afford.