Classifier Threshold Lab

Drag a decision threshold across score distributions and watch precision, recall, and ROC curves update in real time.

Interactive threshold tuning

Pick a security scenario, drag the decision threshold, and watch how precision, recall, and the confusion matrix change in real time.

Lab panels: scenario selector; score distribution with a threshold slider (shown at 0.500); confusion matrix (Actual +/− by Predicted +/−); live stats for accuracy, precision, recall, F1 score, and FP rate; ROC curve with its AUC; precision-recall curve with its AUC.

Controls

Prevalence: 20%

Fraction of positive samples (1–50%)

Noise: 50%

Amount of distribution overlap

Optimize for

Balanced (max F1), Precision (≥ 0.95), Recall (≥ 0.95)

How Threshold Tuning Works

Decision Threshold

A classifier outputs a score between 0 and 1. The threshold sets the cutoff: scores above it are labeled positive, and scores below it are labeled negative. Raising the threshold reduces false positives at the cost of more false negatives; lowering it does the reverse.
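As a minimal sketch of the cutoff rule described above (the function name and sample scores are illustrative, and treating a score exactly equal to the threshold as negative is an assumption, since the lab's tie-breaking rule is not stated):

```python
def classify(scores, threshold):
    """Label each score True (positive) if it lies strictly above the threshold."""
    # Assumption: a score exactly equal to the threshold counts as negative.
    return [s > threshold for s in scores]


scores = [0.10, 0.35, 0.50, 0.62, 0.90]  # made-up classifier outputs

print(classify(scores, 0.5))  # stricter cutoff: fewer samples flagged
print(classify(scores, 0.3))  # looser cutoff: more samples flagged
```

Lowering the cutoff from 0.5 to 0.3 flags two additional samples here; whether that helps depends on which of them are truly positive.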

Precision vs Recall

Precision measures "of everything flagged, how much is real?" Recall measures "of everything real, how much did we catch?" You can rarely maximize both; raising one usually lowers the other.
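The two questions translate directly into ratios of confusion-matrix counts. The sketch below uses made-up counts, and returning 0.0 for an empty denominator is a convention chosen here, not necessarily what the lab does:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive, false-negative counts."""
    # Assumed convention: report 0.0 when nothing was flagged or nothing was real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts for the same 100 real positives at two thresholds.
strict = precision_recall(tp=40, fp=5, fn=60)    # high threshold
loose = precision_recall(tp=90, fp=120, fn=10)   # low threshold

print("strict threshold:", strict)
print("loose threshold: ", loose)
```

The strict setting wins on precision and the loose setting wins on recall, which is the trade-off the paragraph describes.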

Base Rate Effect

When positives are rare (low prevalence), even a good classifier generates many false positives relative to true positives. This is why the insider threat scenario is hard: 1% prevalence means the base rate dominates the confusion matrix.
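To see the base rate at work, precision can be written in terms of prevalence, recall (TPR), and false positive rate. The sketch below holds a hypothetical classifier fixed at recall 0.90 and FPR 0.05 and varies only the prevalence; the function name and the chosen rates are illustrative:

```python
def precision_from_rates(prevalence, tpr, fpr):
    """Precision implied by prevalence, true positive rate, and false positive rate."""
    flagged_true = prevalence * tpr            # fraction of all samples that are TPs
    flagged_false = (1 - prevalence) * fpr     # fraction of all samples that are FPs
    total_flagged = flagged_true + flagged_false
    return flagged_true / total_flagged if total_flagged else 0.0


# Same classifier quality, three different base rates.
for prevalence in (0.20, 0.05, 0.01):
    p = precision_from_rates(prevalence, tpr=0.90, fpr=0.05)
    print(f"prevalence={prevalence:.2f}  precision={p:.3f}")
```

At 1% prevalence the precision falls to roughly 0.15, even though recall and FPR have not changed.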

What is a ROC Curve?

A ROC (Receiver Operating Characteristic) curve plots True Positive Rate (recall) on the y-axis against False Positive Rate on the x-axis as the threshold sweeps from 1 to 0. A perfect classifier hugs the top-left corner; the diagonal represents random guessing. The closer the curve bows toward the top-left, the better the classifier separates the two classes at every threshold.
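The threshold sweep can be sketched by sorting samples from highest to lowest score and admitting them one at a time. This is an illustrative implementation, not the lab's own code, and it assumes no tied scores and at least one sample of each class:

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs as the threshold sweeps from high to low."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    # Highest score first: each step lowers the threshold past one more sample.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # made-up scores
labels = [1, 1, 0, 1, 0, 0]              # 1 = positive, 0 = negative

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1); what distinguishes classifiers is how quickly TPR climbs before FPR does.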

AUC (Area Under the Curve)

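The area under the ROC curve (AUC) summarizes overall classifier quality in a single number from 0 to 1. An AUC of 0.5 means no better than chance; 1.0 means perfect separation. It equals the probability that the classifier ranks a random positive sample higher than a random negative one, independent of any particular threshold choice. This interpretation does not carry over to the area under the precision-recall curve, whose chance-level baseline is the prevalence rather than 0.5.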
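The ranking interpretation can be checked directly by comparing every positive score against every negative score. The sketch below is a brute-force illustration with made-up data (counting ties as half a win is a common convention assumed here); it is quadratic in sample count and is not meant as an efficient implementation:

```python
def auc_by_ranking(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUC = {auc_by_ranking(scores, labels):.3f}")
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.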

ROC vs PR Curves

ROC curves can look optimistic when classes are imbalanced because the large negative class keeps FPR low. Precision-recall curves expose poor performance on rare-event detection that ROC may hide. Always check both, especially at low prevalence.

The Confusion Matrix

The four cells of the matrix cover every possible outcome. True Positive (TP): correctly flagged as malicious. False Positive (FP): benign traffic wrongly flagged. False Negative (FN): a real threat that was missed. True Negative (TN): correctly passed as benign. All of the lab's metrics (accuracy, precision, recall, F1, and FP rate) are derived from these four counts.
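A small sketch of how the four cells might be tallied from ground truth and predictions, with accuracy and FP rate derived from them. The data and names are illustrative:

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN from parallel lists of booleans."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1   # real threat, flagged
        elif p:
            counts["FP"] += 1   # benign, flagged
        elif a:
            counts["FN"] += 1   # real threat, missed
        else:
            counts["TN"] += 1   # benign, passed
    return counts


actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / len(actual)
fp_rate = c["FP"] / (c["FP"] + c["TN"])
print(c, f"accuracy={accuracy:.3f}", f"fp_rate={fp_rate:.3f}")
```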

F1 Score

F1 is the harmonic mean of precision and recall: 2 × (P × R) / (P + R). Unlike the arithmetic mean, it penalizes extreme imbalance; if either precision or recall is very low, F1 drops sharply. This makes it a useful single metric when you need both values to be reasonable.
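The difference between the harmonic and arithmetic means is easiest to see with a lopsided pair. The values below are illustrative, and returning 0.0 when both inputs are zero is an assumed convention:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * precision * recall / (precision + recall)


for p, r in [(0.90, 0.90), (0.95, 0.10)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.2f} R={r:.2f}  arithmetic={arithmetic:.3f}  F1={f1_score(p, r):.3f}")
```

For the lopsided pair the arithmetic mean stays above 0.5 while F1 falls below 0.2, which is why F1 is harder to inflate with one strong metric.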

Challenge 1: The SOC is drowning

Scenario: Your SOC team processes 500 alerts/day. They can investigate at most 50. Set the phishing detection threshold so that no more than 5% of all samples are flagged positive.

Hint: Switch to the Phishing Detection scenario. Watch the recall and FP Rate stats as you raise the threshold. The fraction of samples flagged is (prevalence × recall) + ((1 − prevalence) × FP Rate); when it drops to 0.05 or below, you've hit the budget.

This mirrors real SOC capacity planning where alert volume must match analyst bandwidth.
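The budget check from the hint can be written as a two-line calculation. The prevalence, recall, and FP rate values below are made up for illustration and are not the Phishing Detection scenario's actual settings:

```python
def flagged_fraction(prevalence, recall, fp_rate):
    """Fraction of all samples labeled positive: true positives plus false positives."""
    return prevalence * recall + (1 - prevalence) * fp_rate


def within_budget(prevalence, recall, fp_rate, budget=0.05):
    """True if the flagged fraction does not exceed the alert budget."""
    return flagged_fraction(prevalence, recall, fp_rate) <= budget


# Hypothetical operating points at 10% prevalence.
for recall, fp_rate in [(0.80, 0.02), (0.40, 0.01)]:
    frac = flagged_fraction(0.10, recall, fp_rate)
    print(f"recall={recall:.2f} fpr={fp_rate:.2f} flagged={frac:.3f} ok={within_budget(0.10, recall, fp_rate)}")
```

Note that when prevalence × recall alone exceeds the budget, no FP rate can rescue it: meeting the cap then requires giving up recall.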

Challenge 2: The needle in the haystack

Scenario: You're hunting insider threats (1% prevalence). The CISO demands recall ≥ 0.90: you cannot afford to miss a real insider. Find the highest threshold that still meets this requirement, since any lower threshold also meets it but adds false positives.

Hint: Use the Insider Threat scenario. Watch the recall stat. Notice how many false positives you generate at 1% prevalence when recall is high. Now try the "Optimize for Recall" button and compare.

This is why base rate matters: at 1% prevalence, even a 5% FPR swamps your true positives. Per 10,000 samples, that is 495 false positives against at most 100 true positives.
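Because recall can only fall as the threshold rises, a recall floor is most useful as a search for the highest threshold that still satisfies it. The sketch below scans a grid from 1.000 downward; the grid of 1,000 steps, the strict "above the threshold" rule, and the sample data are all assumptions made for illustration:

```python
def recall_at(scores, labels, threshold):
    """Recall when every score strictly above the threshold is flagged."""
    positives = [s for s, y in zip(scores, labels) if y]
    caught = sum(1 for s in positives if s > threshold)
    return caught / len(positives)


def highest_threshold_for_recall(scores, labels, target, steps=1000):
    """Scan thresholds from 1.0 down to 0.0 and return the first that meets the target."""
    for i in range(steps, -1, -1):
        t = i / steps
        if recall_at(scores, labels, t) >= target:
            return t
    return None  # target unreachable on this grid


scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.5, 0.4, 0.2, 0.1, 0.05]  # made-up scores
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

print(highest_threshold_for_recall(scores, labels, target=0.8))
print(highest_threshold_for_recall(scores, labels, target=1.0))
```

Demanding perfect recall drags the threshold below the lowest-scoring positive, which lets in every negative that scores above it.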

Challenge 3: The regulator is watching

Scenario: Your malware classifier feeds a blocking rule. Every false positive blocks a legitimate file and generates a customer complaint. The compliance team requires precision ≥ 0.95.

Hint: Use the Malware Family scenario. Push the threshold up until precision hits 0.95. Note how many real malware samples you now miss (low recall). Then lower the noise slider to see how a better feature set (less overlap between the distributions) improves the trade-off.

High-precision requirements force you to accept lower recall, or invest in better features.
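A precision floor can be searched in the opposite direction: scan upward and stop at the first threshold that meets it, which keeps as much recall as possible. Precision is not guaranteed to rise monotonically with the threshold, so the sketch checks every grid point rather than bisecting. The grid size, helper names, and data are illustrative assumptions:

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores strictly above the threshold are flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def lowest_threshold_for_precision(scores, labels, target, steps=1000):
    """Return (threshold, precision, recall) for the first grid point meeting the target."""
    for i in range(steps + 1):
        t = i / steps
        precision, recall = precision_recall_at(scores, labels, t)
        if precision >= target:
            return t, precision, recall
    return None  # no threshold on the grid reaches the target


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # made-up scores
labels = [1, 1, 1, 0, 1, 0, 0, 0]

print(lowest_threshold_for_precision(scores, labels, target=0.95))
```

In this toy data the benign sample at 0.6 forces the threshold up to 0.6, and the real positive at 0.5 is lost as a result: recall drops to 0.75.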

Security model

Everything runs in your browser. No data is sent to any server. Score distributions are generated client-side using a seeded pseudo-random number generator.
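The lab's actual generator is not described, so the following is only a hypothetical sketch of how seeded, reproducible score distributions could be produced. The Gaussian shape, the way noise narrows the gap between class means, the spread of 0.12, and all names are assumptions:

```python
import random


def generate_scores(n, prevalence, noise, seed):
    """Return a list of (score, is_positive) pairs, reproducible for a given seed."""
    rng = random.Random(seed)          # seeded PRNG: same seed, same samples
    # Hypothetical model: more noise pulls the two class means together.
    gap = 0.5 * (1.0 - noise)
    samples = []
    for _ in range(n):
        is_positive = rng.random() < prevalence
        mean = 0.5 + gap / 2 if is_positive else 0.5 - gap / 2
        score = min(1.0, max(0.0, rng.gauss(mean, 0.12)))  # clamp to [0, 1]
        samples.append((score, is_positive))
    return samples


a = generate_scores(n=1000, prevalence=0.20, noise=0.50, seed=42)
b = generate_scores(n=1000, prevalence=0.20, noise=0.50, seed=42)
print("reproducible:", a == b)
print("positives:", sum(1 for _, y in a if y))
```

Seeding keeps a scenario identical across page loads, so a threshold that solves a challenge once will solve it again.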