These classification metrics all come from the same confusion matrix, but they answer different questions. The practical question is not “Which metric is best?” It is “Which mistake is most expensive in this problem?”
The most important rule is:
Positive does not mean good. Negative does not mean bad.
In classification metrics, positive is just the target label for the thing we are counting. It means “this is the class we want the model to detect.” It does not mean the outcome is desirable.
For example, if we build a disease screening model, the positive class might be “has disease.” If we build a sales model, the positive class might be “will buy a car.” In both cases, recall still means the same thing: of the actual positive cases, how many did the model catch?
Before reading any metric, ask:
What is the positive class?
Everything else follows from that choice.
Four Outcomes
For a binary classifier, every prediction falls into one of four groups. In the tables below, rows are the actual classes and columns are the model predictions.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | True positive, TP | False negative, FN |
| Actually negative | False positive, FP | True negative, TN |
| Term | Plain English |
|---|---|
| TP | The model said positive, and it was actually positive. |
| FN | The model said negative, but it was actually positive. |
| FP | The model said positive, but it was actually negative. |
| TN | The model said negative, and it was actually negative. |
Metric Definitions
| Metric | Formula | Plain English |
|---|---|---|
| Recall | TP / (TP + FN) | Of all actual positives, how many did we catch? |
| Precision | TP / (TP + FP) | Of all predicted positives, how many were truly positive? |
| Accuracy | (TP + TN) / (TP + FP + FN + TN) | Of all cases, how many did we classify correctly? |
| F1 score | 2 * Precision * Recall / (Precision + Recall) | A combined score balancing precision and recall. |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many did we correctly reject? |
These formulas assume the denominator is not zero.
Precision is undefined if the model never predicts positive, recall is undefined if the evaluation set has no actual positives, and specificity is undefined if the evaluation set has no actual negatives.
F1 is undefined if 2TP + FP + FN = 0, which means there are no true positives, false positives, or false negatives to evaluate for the positive class.
Libraries sometimes return 0 for undefined cases by convention, so check the convention before comparing results.
The equivalent confusion-matrix form of F1 is:
F1 = 2TP / (2TP + FP + FN)
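The definitions above translate almost directly into code. The following is a minimal sketch, not a library API: the names `safe_div` and `binary_metrics` are made up for illustration, the counts are hypothetical, and returning `None` for an undefined metric is one possible convention among several.

```python
from typing import Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the five metrics from raw confusion-matrix counts."""
    return {
        "recall": safe_div(tp, tp + fn),
        "precision": safe_div(tp, tp + fp),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        # Confusion-matrix form of F1: 2TP / (2TP + FP + FN).
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# With no positives anywhere, precision, recall, and F1 are undefined (None here).
empty = binary_metrics(tp=0, fp=0, fn=0, tn=10)
```

Returning `None` keeps undefined cases visible instead of silently reporting 0, which makes it harder to mistake "undefined" for "terrible."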
Recall
Recall is also called sensitivity or the true positive rate.
Recall asks: among the things that were truly positive, how many did the model find? High recall means few false negatives.
Use recall when missing positives is costly.
| Positive class | High recall means |
|---|---|
| Has disease | We catch most sick patients. |
| Fraudulent transaction | We catch most fraud. |
| Spam email | We catch most spam. |
| Will default on a loan | We catch most risky borrowers. |
| Will buy a car | We identify most real buyers. |
Recall is not good or bad by itself. It tells you how completely the model catches the positive class.
Precision
Precision asks: among the cases the model called positive, how many were actually positive? High precision means few false positives among predicted positives.
Use precision when false alarms are costly or when acting on a positive prediction consumes scarce resources.
| Positive class | High precision means |
|---|---|
| Has disease | Most positive test results are truly sick patients. |
| Fraudulent transaction | Most flagged transactions are actually fraud. |
| Will default on a loan | Most flagged borrowers really would default. |
| Will buy a car | Most targeted customers really are buyers. |
Recall: Did we catch the actual positives?
Precision: Can we trust the predicted positives?
Specificity
Specificity is also called the true negative rate.
Specificity asks: among the things that were truly negative, how many did the model correctly leave as negative? High specificity means few false positives among actual negatives.
| Positive class | Actual negative class | High specificity means |
|---|---|---|
| Has disease | Does not have disease | Healthy people are rarely told they may be sick. |
| Fraudulent transaction | Legitimate transaction | Legitimate transactions are rarely flagged as fraud. |
| Will default on a loan | Will not default on a loan | Safe borrowers are rarely flagged as risky. |
| Will buy a car | Will not buy a car | Non-buyers are rarely targeted as buyers. |
Specificity is the negative-class counterpart to recall.
Recall: Of actual positives, how many did we catch?
Specificity: Of actual negatives, how many did we correctly reject?
Specificity is related to the false positive rate: Specificity = 1 - FPR, where FPR = FP / (FP + TN).
Accuracy
Accuracy asks: what fraction of all predictions were correct? It is simple and useful when classes are reasonably balanced and false positives and false negatives have similar costs.
Accuracy can be misleading when one class is much more common than the other. If only 1% of people have a disease, a model that always predicts “no disease” gets 99% accuracy, but it has 0% recall for disease. The model is accurate in a shallow sense, but useless for finding the disease.
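The 1% case is easy to reproduce. This sketch assumes a hypothetical population of 1,000 people, 10 of whom have the disease, and a "model" that always predicts the negative class:

```python
# Hypothetical population: 1,000 people, 1% of whom have the disease (1 = has disease).
actual = [1] * 10 + [0] * 990
# A "model" that always predicts "no disease".
predicted = [0] * len(actual)

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
correct = sum(1 for a, p in zip(actual, predicted) if a == p)

accuracy = correct / len(actual)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
# prints: accuracy = 99.0%, recall = 0.0%
```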
F1 Score
F1 combines precision and recall using the harmonic mean. It is high only when both precision and recall are high. If either one is poor, F1 drops quickly.
Use F1 when the positive class matters more than the negative class, the data is imbalanced, and you want one number that balances false positives and false negatives.
But F1 does not use TN, so it ignores how well the model handles actual negatives.
If the negative class matters a lot, also check specificity.
Also remember that F1 treats precision and recall as equally important.
If recall matters more than precision, or precision matters more than recall, consider an F-beta score or a cost-based threshold instead of plain F1.
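For reference, the F-beta score generalizes F1 with a weight `beta`: values above 1 favor recall, values below 1 favor precision, and `beta = 1` gives plain F1. The sketch below uses the count-based form of the formula; the function name `f_beta` and the counts are illustrative assumptions.

```python
from typing import Optional


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> Optional[float]:
    """F-beta from counts: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Undefined when there are no TP, FP, or FN at all.
    return (1 + b2) * tp / denominator if denominator else None


# Hypothetical counts where recall (0.90) is higher than precision (about 0.67).
tp, fp, fn = 90, 45, 10

f_half = f_beta(tp, fp, fn, beta=0.5)  # leans toward precision
f_one = f_beta(tp, fp, fn, beta=1.0)   # plain F1
f_two = f_beta(tp, fp, fn, beta=2.0)   # leans toward recall

print(f"F0.5 = {f_half:.3f}, F1 = {f_one:.3f}, F2 = {f_two:.3f}")
```

Because recall is the stronger of the two rates in these counts, the recall-leaning F2 comes out highest and the precision-leaning F0.5 lowest.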
Example 1: Disease Testing
Suppose the positive class is “has disease,” and the negative class is “does not have disease.”
| | Predicted positive | Predicted negative | Total actual |
|---|---|---|---|
| Actually positive | TP = 90 | FN = 10 | 100 |
| Actually negative | FP = 45 | TN = 855 | 900 |
| Total predicted | 135 | 865 | 1000 |
| Metric | Calculation | Value |
|---|---|---|
| Recall | 90 / (90 + 10) | 90.0% |
| Precision | 90 / (90 + 45) | 66.7% |
| Specificity | 855 / (855 + 45) | 95.0% |
| Accuracy | (90 + 855) / 1000 | 94.5% |
| F1 | 2 * 90 / (2 * 90 + 45 + 10) | 76.6% |
Interpretation: recall says the test catches 90% of sick patients. Precision says that when the test says “disease,” it is right about two-thirds of the time. Specificity says the test correctly clears 95% of healthy people. Accuracy is high, but it does not tell the whole story because most people in this dataset are healthy.
Precision in medical testing is also affected by prevalence. The same recall and specificity can produce lower precision when the disease is rare, because false positives may outnumber true positives.
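The prevalence effect follows from Bayes' rule: precision = recall × prevalence / (recall × prevalence + (1 − specificity) × (1 − prevalence)). Here is a small sketch that holds the recall (90%) and specificity (95%) from this example fixed and varies only the prevalence. The function name `precision_from_rates` and the lower prevalence values are illustrative assumptions.

```python
def precision_from_rates(recall: float, specificity: float, prevalence: float) -> float:
    """Precision implied by recall, specificity, and prevalence (Bayes' rule).

    Assumes the test flags at least some cases, so the denominator is non-zero.
    """
    true_positive_share = recall * prevalence
    false_positive_share = (1 - specificity) * (1 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Same test quality, three hypothetical prevalence levels.
for prevalence in (0.10, 0.01, 0.001):
    ppv = precision_from_rates(recall=0.90, specificity=0.95, prevalence=prevalence)
    print(f"prevalence {prevalence:.1%}: precision {ppv:.1%}")
```

At 10% prevalence this reproduces the 66.7% precision in the table above. At lower prevalence the same test produces far more false positives than true positives, so precision falls sharply.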
Example 2: Car Sales, Positive Means “Will Buy”
Here the positive class is a good outcome: “will buy a car.” The negative class is “will not buy a car.”
| | Predicted positive | Predicted negative | Total actual |
|---|---|---|---|
| Actually positive | TP = 70 | FN = 30 | 100 |
| Actually negative | FP = 90 | TN = 810 | 900 |
| Total predicted | 160 | 840 | 1000 |
| Metric | Calculation | Value |
|---|---|---|
| Recall | 70 / (70 + 30) | 70.0% |
| Precision | 70 / (70 + 90) | 43.8% |
| Specificity | 810 / (810 + 90) | 90.0% |
| Accuracy | (70 + 810) / 1000 | 88.0% |
| F1 | 2 * 70 / (2 * 70 + 90 + 30) | 53.8% |
Interpretation: recall says the campaign finds 70% of real buyers. Precision says fewer than half of targeted people actually buy. Specificity says the model correctly ignores most non-buyers. Accuracy looks decent because most people are non-buyers, but the sales team may still care more about precision if their time is limited.
Example 3: Car Loans, Positive Means “Will Default”
Now the positive class is a bad event: “will default on a loan.” The negative class is “will not default on a loan.”
| | Predicted positive | Predicted negative | Total actual |
|---|---|---|---|
| Actually positive | TP = 30 | FN = 10 | 40 |
| Actually negative | FP = 20 | TN = 940 | 960 |
| Total predicted | 50 | 950 | 1000 |
| Metric | Calculation | Value |
|---|---|---|
| Recall | 30 / (30 + 10) | 75.0% |
| Precision | 30 / (30 + 20) | 60.0% |
| Specificity | 940 / (940 + 20) | 97.9% |
| Accuracy | (30 + 940) / 1000 | 97.0% |
| F1 | 2 * 30 / (2 * 30 + 20 + 10) | 66.7% |
Interpretation: recall says we caught 75% of people who would default. Precision says that among people flagged as risky, 60% really would default. Specificity says the model correctly accepts most people who would not default. Accuracy is 97%, but defaults are rare, so accuracy alone can hide risk.
This example shows why positive does not mean good. Here the positive class is bad, but recall still simply means: how much of the positive class did we catch?
Choosing the Right Metric
| Goal | Metric to watch |
|---|---|
| Catch as many positives as possible | Recall |
| Make positive predictions trustworthy | Precision |
| Correctly reject negatives | Specificity |
| Measure total correctness | Accuracy |
| Balance precision and recall | F1 |
| Situation | Usually prioritize | Why |
|---|---|---|
| Medical screening where missing disease is dangerous | High recall | False negatives are dangerous. |
| Confirmatory medical test where false alarms are harmful | High precision and high specificity | Positive results should be trustworthy, and healthy people should not be flagged unnecessarily. |
| Fraud detection where missed fraud is costly | High recall for fraud | The system should catch as much fraud as possible. |
| Fraud system where blocking legitimate customers is costly | High specificity, often with high precision | Legitimate transactions should rarely be flagged, and fraud alerts should be credible. |
| Sales campaign with limited sales team time | High precision | The contacted leads should be likely to convert. |
| Sales campaign where contacting extra people is cheap | High recall | The campaign can afford extra outreach to find more buyers. |
| Loan approval where defaults are costly | High recall for default | The lender wants to catch most likely defaults. |
| Loan approval where rejecting safe borrowers is costly | High specificity for default | The lender wants to avoid wrongly flagging safe borrowers as risky. |
Thresholds Create Trade-Offs
Many classifiers output a score or probability. A threshold converts that score into a positive or negative prediction. Changing the threshold changes the confusion matrix, which changes the metrics.
- Lower threshold: more predicted positives. Recall cannot decrease, specificity cannot increase, and precision often decreases if the added positives are lower-confidence cases.
- Higher threshold: fewer predicted positives. Recall cannot increase, specificity cannot decrease, and precision often increases if the remaining positives are higher-confidence cases.
As the threshold moves, recall and specificity pull in opposite directions: a threshold change that raises one can only lower the other or leave it unchanged. Precision is not guaranteed to be monotonic with the threshold; it depends on class prevalence and how well the model ranks examples. This is why a model should not be judged only at the default threshold of 0.5.
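A small threshold sweep makes these behaviors concrete. The scores and labels below are invented for illustration, and the rule "predict positive when score >= threshold" is an assumed convention.

```python
# Hypothetical model scores and true labels (1 = positive), invented for illustration.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]


def metrics_at(threshold: float):
    """Return (recall, specificity, precision) when predicting positive at score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(predicted, labels))
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return recall, specificity, precision


for threshold in (0.25, 0.50, 0.75):
    recall, specificity, precision = metrics_at(threshold)
    print(
        f"threshold {threshold:.2f}: recall {recall:.2f}, "
        f"specificity {specificity:.2f}, precision {precision:.2f}"
    )
```

On this toy data, raising the threshold never increases recall and never decreases specificity, while precision rises from 0.25 to 0.50 and then falls again at 0.75 because a high-scoring negative remains among the few predicted positives.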
Multiclass Note
For multiclass classification, these metrics are usually computed using a one-vs-rest view for each class. For example, in a three-class classifier with classes A, B, and C, you can compute metrics for class A by treating A as positive and everything else as negative. Then repeat for B and C.
- Macro average: compute the metric per class, then take the unweighted mean so every class counts equally. This is useful when minority classes matter.
- Weighted average: compute the metric per class, then weight each class by its number of actual examples. This follows the dataset distribution.
- Micro average: pool the TP, FP, and FN counts across classes first, then compute the metric once on the pooled counts. This can be dominated by common classes.
If rare classes matter, look at per-class metrics and macro averages. Weighted and micro averages can look strong even when rare classes are weak. In single-label multiclass classification, micro-averaged precision, micro-averaged recall, and micro-averaged F1 are equal to accuracy, so they may not add much beyond the accuracy number.
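To see how the averages can diverge, consider a toy three-class example computed by hand for recall. The labels below are hypothetical and deliberately include a rare class C that the model always misses.

```python
# Hypothetical single-label data: 6 examples of A, 3 of B, 1 of C.
actual = list("AAAAAABBBC")
predicted = list("AAAAAABBAA")
classes = sorted(set(actual))


def recall_for(cls: str) -> float:
    """One-vs-rest recall: treat cls as positive and everything else as negative."""
    tp = sum(a == cls and p == cls for a, p in zip(actual, predicted))
    fn = sum(a == cls and p != cls for a, p in zip(actual, predicted))
    return tp / (tp + fn)


per_class = {c: recall_for(c) for c in classes}
support = {c: actual.count(c) for c in classes}

# Macro: every class counts equally.
macro = sum(per_class.values()) / len(classes)
# Weighted: each class counts in proportion to its number of actual examples.
weighted = sum(per_class[c] * support[c] for c in classes) / len(actual)
# Micro: pool the counts first. Each miss is a false negative for its actual class.
total_tp = sum(a == p for a, p in zip(actual, predicted))
total_fn = len(actual) - total_tp
micro = total_tp / (total_tp + total_fn)
accuracy = total_tp / len(actual)

print(per_class)
print(f"macro {macro:.3f}, weighted {weighted:.3f}, micro {micro:.3f}, accuracy {accuracy:.3f}")
```

Here the macro average drops to about 0.56 because class C has zero recall, while the weighted and micro averages stay at 0.80, the same value as accuracy.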
Common Traps
Trap 1: Thinking Positive Means Good
Wrong:
Positive = good
Negative = bad
Correct:
Positive = the class we are detecting
Negative = everything outside that class
Trap 2: Confusing Recall and Precision
Recall denominator:
TP + FN = all actual positives
Precision denominator:
TP + FP = all predicted positives
Memory aid:
Recall: Did we find the real positives?
Precision: Were our positive predictions right?
Trap 3: Trusting Accuracy on Imbalanced Data
If one class is very common, accuracy can look high even when the model performs badly on the important class. Always compare accuracy with recall, precision, and specificity.
Trap 4: Looking at F1 Without Specificity
F1 ignores true negatives. If correctly handling negatives matters, F1 alone is not enough.
Trap 5: Forgetting the Threshold
A threshold is a policy choice. The “best” threshold depends on the cost of a false positive, the cost of a false negative, and the capacity of the people or systems acting on model predictions.
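One way to turn that policy choice into a procedure is to assign a cost to each kind of mistake and pick the threshold with the lowest total cost on validation data. The sketch below is an assumption-laden illustration: the scores, labels, cost values, and the function names `total_cost` and `best_threshold` are all hypothetical, and it ignores capacity limits.

```python
# Hypothetical validation scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]


def total_cost(threshold: float, cost_fp: float, cost_fn: float) -> float:
    """Total cost of errors when predicting positive at score >= threshold."""
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    return cost_fp * fp + cost_fn * fn


def best_threshold(cost_fp: float, cost_fn: float) -> float:
    """Try each observed score as a threshold and return the cheapest one."""
    candidates = sorted(set(scores))
    return min(candidates, key=lambda t: total_cost(t, cost_fp, cost_fn))


# Missing a positive is 5x worse than a false alarm: the threshold goes low.
recall_leaning = best_threshold(cost_fp=1.0, cost_fn=5.0)
# A false alarm is 5x worse than a miss: the threshold goes high.
precision_leaning = best_threshold(cost_fp=5.0, cost_fn=1.0)

print(f"costly misses: threshold {recall_leaning}")
print(f"costly false alarms: threshold {precision_leaning}")
```

The same model and the same data yield different thresholds once the costs change, which is the sense in which the threshold is a policy decision rather than a property of the model.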
One-Screen Summary
Positive = the target class, not necessarily a good thing.
Recall = TP / (TP + FN)
= Of actual positives, how many did we catch?
Precision = TP / (TP + FP)
= Of predicted positives, how many were right?
Specificity = TN / (TN + FP)
= Of actual negatives, how many did we reject correctly?
Accuracy = (TP + TN) / Total
= Of all cases, how many were correct?
F1 = 2 * Precision * Recall / (Precision + Recall)
= One score balancing precision and recall.
Best mental model:
Recall: catch the target
Precision: trust the alarm
Specificity: avoid false alarms among non-targets
Accuracy: total correctness
F1: precision-recall balance
If you remember only one rule, remember this: pick the metric by naming the mistake you cannot afford.