Classifier Threshold Lab
Drag a decision threshold across score distributions and watch precision, recall, and ROC curves update in real time.
Interactive threshold tuning
Select Scenario
Score Distribution
Confusion Matrix
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | 0 | 0 |
| Actual − | 0 | 0 |
ROC Curve
AUC = 0.000
Precision-Recall Curve
AUC = 0.000
Controls
Fraction of positive samples (1–50%)
Amount of distribution overlap
Optimize for
How Threshold Tuning Works
Decision Threshold
A classifier outputs a score between 0 and 1. The threshold sets the cutoff: scores above it are labeled positive, and scores below it are labeled negative. Raising the threshold removes false positives but adds false negatives; lowering it does the reverse.
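The cutoff described above can be sketched in a few lines of Python. The function name `confusion_counts` and the sample scores are made up for illustration, and treating a score exactly equal to the threshold as positive is an assumption; the lab may handle ties differently.

```python
def confusion_counts(scores, labels, threshold):
    """Count (TP, FP, FN, TN) for one threshold.

    Assumption: a score equal to the threshold counts as positive.
    """
    tp = fp = fn = tn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up data: the first three samples are real positives.
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
labels = [True, True, True, False, False, False]

lenient = confusion_counts(scores, labels, 0.35)  # one false positive, no misses
strict = confusion_counts(scores, labels, 0.65)   # no false positives, one miss
```

Moving the cutoff from 0.35 to 0.65 removes the false positive at 0.6 but turns the positive at 0.4 into a false negative.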
Precision vs Recall
Precision measures "of everything flagged, how much is real?" Recall measures "of everything real, how much did we catch?" You can rarely maximize both; raising one usually lowers the other.
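Both questions reduce to simple ratios of confusion-matrix counts. A minimal sketch follows; the counts are hypothetical, and returning 0.0 when the denominator is empty is a convention chosen here, not something the lab specifies.

```python
def precision(tp, fp):
    """Of everything flagged, how much is real?"""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0  # assumption: 0.0 when nothing is flagged


def recall(tp, fn):
    """Of everything real, how much did we catch?"""
    actual = tp + fn
    return tp / actual if actual else 0.0  # assumption: 0.0 when there are no positives


# Hypothetical counts for the same 100 real positives at two thresholds.
lenient_precision = precision(tp=90, fp=60)  # 90 of 150 flagged are real
lenient_recall = recall(tp=90, fn=10)        # 90 of 100 positives caught
strict_precision = precision(tp=60, fp=5)    # 60 of 65 flagged are real
strict_recall = recall(tp=60, fn=40)         # 60 of 100 positives caught
```

In this made-up example the stricter threshold lifts precision from 0.60 to about 0.92 while recall falls from 0.90 to 0.60.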
Base Rate Effect
When positives are rare (low prevalence), even a good classifier generates many false positives relative to true positives. This is why the insider threat scenario is hard: 1% prevalence means the base rate dominates the confusion matrix.
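To see the base rate effect in numbers, the expected precision can be computed from prevalence, recall, and false positive rate. This is a sketch under stated assumptions: the function name is hypothetical, and the recall and FPR values are illustrative rather than read from the lab.

```python
def expected_precision(prevalence, recall, fp_rate):
    """Expected precision for a given prevalence, recall, and false positive rate."""
    true_pos = prevalence * recall            # fraction of all samples that are TP
    false_pos = (1 - prevalence) * fp_rate    # fraction of all samples that are FP
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


# Same classifier quality (recall 0.90, FPR 0.05), different prevalence.
rare = expected_precision(0.01, 0.90, 0.05)      # about 0.15
balanced = expected_precision(0.50, 0.90, 0.05)  # about 0.95
```

With identical recall and FPR, dropping prevalence from 50% to 1% cuts precision from about 0.95 to about 0.15, because the negatives outnumber the positives 99 to 1.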
What is a ROC Curve?
A ROC (Receiver Operating Characteristic) curve plots True Positive Rate (recall) on the y-axis against False Positive Rate on the x-axis as the threshold sweeps from 1 to 0. A perfect classifier hugs the top-left corner; the diagonal represents random guessing. The closer the curve bows toward the top-left, the better the classifier separates the two classes at every threshold.
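The threshold sweep can be sketched as follows. This is an illustrative implementation, not the lab's code: `roc_points` is a hypothetical name, it assumes both classes are present, and it treats a score equal to the threshold as positive.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points as the threshold sweeps from high to low."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up data: the first three samples are real positives.
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
labels = [True, True, True, False, False, False]
points = roc_points(scores, labels)
```

The curve starts at (0, 0), where nothing is flagged, and ends at (1, 1), where everything is flagged; each lower threshold can only move the point up or to the right.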
AUC (Area Under the Curve)
AUC summarizes overall classifier quality in a single number from 0 to 1. For the ROC curve, an AUC of 0.5 means no better than chance and 1.0 means perfect separation. ROC AUC equals the probability that the classifier ranks a random positive sample higher than a random negative one, independent of any particular threshold choice. The chance baseline for the precision-recall AUC is different: it is roughly the prevalence, not 0.5.
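The ranking interpretation can be checked directly by comparing every positive score against every negative score. This is a brute-force sketch for small inputs; the name `pairwise_auc` is hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up data: the first three samples are real positives.
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
labels = [True, True, True, False, False, False]
auc = pairwise_auc(scores, labels)  # 8 of the 9 pairs are ranked correctly
```

Only one pair is misordered (the positive at 0.4 against the negative at 0.6), so the result is 8/9, with no threshold involved at any point.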
ROC vs PR Curves
ROC curves can look optimistic when classes are imbalanced because the large negative class keeps FPR low. Precision-recall curves expose poor performance on rare-event detection that ROC may hide. Always check both, especially at low prevalence.
The Confusion Matrix
The four cells of the matrix describe every possible outcome. True Positive (TP): correctly flagged as malicious. False Positive (FP): benign traffic wrongly flagged. False Negative (FN): a real threat missed. True Negative (TN): correctly passed as benign. All metrics (precision, recall, F1) are derived from these four counts.
F1 Score
F1 is the harmonic mean of precision and recall: 2 × (P × R) / (P + R). Unlike the arithmetic mean, it penalizes a large gap between the two; if either precision or recall is very low, F1 drops sharply. This makes it a useful single metric when you need both values to be reasonable.
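A short comparison shows how the harmonic mean behaves differently from the arithmetic mean. The function names and the sample precision and recall values are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def arithmetic_mean(precision, recall):
    return (precision + recall) / 2


lopsided_f1 = f1_score(0.9, 0.1)           # about 0.18
lopsided_mean = arithmetic_mean(0.9, 0.1)  # 0.5
balanced_f1 = f1_score(0.5, 0.5)           # 0.5
```

Both pairs have the same arithmetic mean, but F1 drops to about 0.18 when recall is only 0.1, which is the penalty described above.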
Challenge 1: The SOC is drowning
Scenario: Your SOC team processes 500 alerts/day. They can investigate at most 50. Set the phishing detection threshold so that no more than 5% of all samples are flagged positive.
Hint: Switch to the Phishing Detection scenario. Watch the FP Rate stat as you raise the threshold. When (prevalence × recall) + ((1 − prevalence) × FP Rate) drops below 0.05, you've hit the budget.
This mirrors real SOC capacity planning where alert volume must match analyst bandwidth.
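The budget check from the hint can be written out directly. This is a minimal sketch: the function name is hypothetical, and the prevalence, recall, and FP Rate values below are made up for illustration rather than read from the Phishing Detection scenario.

```python
def flagged_fraction(prevalence, recall, fp_rate):
    """Share of all samples flagged positive: true positives plus false positives."""
    return prevalence * recall + (1 - prevalence) * fp_rate


BUDGET = 0.05  # at most 5% of all samples may be flagged

# Hypothetical readings at two thresholds, assuming 10% prevalence.
high_threshold = flagged_fraction(0.10, 0.40, 0.005)  # about 0.0445
low_threshold = flagged_fraction(0.10, 0.80, 0.03)    # about 0.107

high_within_budget = high_threshold <= BUDGET
low_within_budget = low_threshold <= BUDGET
```

In this made-up case the lower threshold blows the budget on true positives alone (0.10 × 0.80 = 0.08), so meeting the alert budget forces recall down regardless of the FP Rate.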
Challenge 2: The needle in the haystack
Scenario: You're hunting insider threats (1% prevalence). The CISO demands recall ≥ 0.90. You cannot miss a real insider. Find the highest threshold that still meets this requirement.
Hint: Use the Insider Threat scenario. Watch the recall stat. Notice how many false positives you generate at 1% prevalence when recall is high. Now try the "Optimize for Recall" button and compare.
This is why base rate matters: at 1% prevalence, even a 5% FPR yields roughly five false positives for every true positive.
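Because recall can only fall as the threshold rises, the binding choice for a recall floor is the highest threshold that still reaches it. A sketch of that search, with a hypothetical function name, made-up scores, and the assumption that a score equal to the threshold counts as positive:

```python
import math


def highest_threshold_for_recall(scores, labels, target_recall):
    """Return the largest score that, used as the threshold, keeps recall >= target."""
    positive_scores = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positive_scores:
        raise ValueError("need at least one positive sample")
    # Number of positives that must be caught; the small epsilon guards
    # against float rounding such as 0.9 * 100 landing just above 90.
    needed = max(1, math.ceil(target_recall * len(positive_scores) - 1e-9))
    return positive_scores[needed - 1]


# Made-up scores: ten positives and ten negatives.
positives = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20]
negatives = [0.58, 0.50, 0.45, 0.40, 0.30, 0.25, 0.15, 0.10, 0.05, 0.02]
scores = positives + negatives
labels = [True] * len(positives) + [False] * len(negatives)

threshold = highest_threshold_for_recall(scores, labels, 0.90)
caught = sum(1 for s in positives if s >= threshold)
false_alarms = sum(1 for s in negatives if s >= threshold)
```

Here the threshold lands on the ninth-highest positive score, so nine of ten positives are caught. The toy data is balanced for readability; at 1% prevalence the same threshold would face far more negatives.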
Challenge 3: The regulator is watching
Scenario: Your malware classifier feeds a blocking rule. Every false positive blocks a legitimate file and generates a customer complaint. The compliance team requires precision ≥ 0.95.
Hint: Use the Malware Family scenario. Push the threshold up until precision hits 0.95. Note how many real malware samples you miss (low recall). Now try adjusting the noise slider to see how a better feature set (less overlap) improves the trade-off.
High-precision requirements force you to accept lower recall, or invest in better features.
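The "push the threshold up until precision hits the target" procedure can be sketched as a scan over candidate thresholds from low to high. The function name and the scores are made up, and ties at the threshold are assumed to count as positive. Precision does not always rise smoothly with the threshold, which is why each candidate is checked in turn.

```python
def lowest_threshold_for_precision(scores, labels, target_precision):
    """Return (threshold, recall) for the lowest threshold meeting the precision target."""
    total_positives = sum(1 for y in labels if y)
    for threshold in sorted(set(scores)):  # ascending: the first hit keeps the most recall
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        if tp + fp > 0 and tp / (tp + fp) >= target_precision:
            return threshold, tp / total_positives
    return None  # no threshold reaches the target


# Made-up scores: five malware samples and five benign files.
malware = [0.95, 0.90, 0.85, 0.80, 0.40]
benign = [0.82, 0.50, 0.30, 0.20, 0.10]
scores = malware + benign
labels = [True] * len(malware) + [False] * len(benign)

result = lowest_threshold_for_precision(scores, labels, 0.95)
```

In this toy data the benign file scoring 0.82 forces the threshold up to 0.85, which leaves only three of the five malware samples caught (recall 0.6).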
Security model
Everything runs in your browser. No data is sent to any server. Score distributions are generated client-side using a seeded pseudo-random number generator.
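To illustrate why seeding matters, here is a hypothetical generator in Python. The lab's actual distributions, parameter names, and generator are not documented here, so everything in this sketch is an assumption: two Gaussians clipped to the 0 to 1 range, a fixed spread of 0.15, and an `overlap` value between 0 and 1 that pulls the class means together.

```python
import random


def generate_scores(n, prevalence, overlap, seed):
    """Hypothetical score generator: same seed and inputs give the same data."""
    rng = random.Random(seed)                # private, seeded generator
    separation = 0.5 * (1.0 - overlap)       # assumption: more overlap, closer means
    scores, labels = [], []
    for _ in range(n):
        is_positive = rng.random() < prevalence
        mean = 0.5 + separation / 2 if is_positive else 0.5 - separation / 2
        score = min(1.0, max(0.0, rng.gauss(mean, 0.15)))  # clip into [0, 1]
        scores.append(score)
        labels.append(is_positive)
    return scores, labels


first = generate_scores(1000, prevalence=0.01, overlap=0.3, seed=42)
second = generate_scores(1000, prevalence=0.01, overlap=0.3, seed=42)
same_data = first == second  # identical because the seed is identical
```

A fixed seed means the same scenario settings reproduce the same data set on every run, with no server involved.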