The anomaly detection tutorial used Isolation Forest to flag suspicious Linux sessions. Isolation Forest is effective and interpretable, but each tree split is axis-aligned on a single feature at a time. When the signal lies in the relationships between features (for example, the combination of packet sizes, timing, and protocol flags that distinguishes an SSH brute-force attempt from normal SSH traffic), a model that learns those relationships from data can outperform a purely tree-based baseline.
Autoencoders learn to compress and reconstruct their input. When trained only on normal traffic, they learn the patterns of “normal.” Anomalous traffic (intrusions, scans, exfiltration) doesn’t fit those patterns and tends to produce high reconstruction error, which becomes the anomaly score.
This tutorial trains an autoencoder on the NSL-KDD dataset (a cleaned version of the classic KDD Cup 99 network intrusion dataset), evaluates it against known attack types, and compares performance with Isolation Forest.
Autoencoder anomaly detection:
Normal traffic:                        Attack traffic:
┌───────┐    ┌──────┐    ┌───────┐     ┌───────┐    ┌──────┐    ┌───────┐
│ Input │───→│ Enc  │───→│Decoded│     │ Input │───→│ Enc  │───→│Decoded│
│ flow  │    │ ode  │    │ flow  │     │ flow  │    │ ode  │    │ flow  │
└───┬───┘    └──────┘    └───┬───┘     └───┬───┘    └──────┘    └───┬───┘
    │                        │             │                        │
    └──── compare ───────────┘             └──── compare ───────────┘
         low error ✓                            HIGH ERROR ✗
      (learned pattern)                      (anomaly detected)
What is an autoencoder?
An autoencoder is a neural network trained to output its own input. That sounds useless, but the trick is the bottleneck. The network has an encoder that compresses the input to a lower-dimensional representation, and a decoder that reconstructs the input from that compressed form. The network must learn the essential structure of the data to pass it through the bottleneck.
NSL-KDD starts with 41 raw features, but after one-hot encoding the categorical columns the input grows to roughly 120 dimensions. The bottleneck still forces a large compression ratio, which is what makes reconstruction error useful as an anomaly signal.
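To make the compression concrete, the sketch below counts trainable parameters and the compression ratio for the layer sizes used in this tutorial. The input width of 120 is an assumption standing in for "roughly 120"; the real value is whatever the one-hot encoding produces. The helper names are illustrative, not part of any library.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def autoencoder_summary(input_dim, hidden=(64, 32), bottleneck=8):
    """Parameter count and compression ratio for a symmetric autoencoder."""
    encoder_sizes = [input_dim, *hidden, bottleneck]
    decoder_sizes = encoder_sizes[::-1]
    layer_pairs = list(zip(encoder_sizes, encoder_sizes[1:]))
    layer_pairs += list(zip(decoder_sizes, decoder_sizes[1:]))
    total = sum(linear_params(a, b) for a, b in layer_pairs)
    return {"parameters": total, "compression_ratio": input_dim / bottleneck}


# 120 is an assumed input width; substitute the real encoded dimension.
summary = autoencoder_summary(input_dim=120)
print(summary)
```

Under that assumption the network has about 20,000 parameters and squeezes each record by a factor of 15 at the bottleneck. At 4 bytes per float32 weight, that is roughly 80 KB of model state.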
Architecture:
Input (~120 features after one-hot encoding)
↓
Encoder: input_dim → 64 → 32 → 8 (bottleneck)
↓
Decoder: 8 → 32 → 64 → input_dim
↓
Reconstructed output (~120 features)
Loss = MSE(input, output)
For anomaly detection, we train only on normal data. The autoencoder learns to reconstruct normal patterns with low error. When we feed it attack traffic, the reconstruction error tends to be higher because the model has never seen those patterns.
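The idea can be illustrated without a neural network. The sketch below uses a fixed linear projection as a stand-in for a trained encoder and decoder: it assumes "normal" 2-D points lie near the line y = x and compresses each point to a single number along that line. The direction `U` is hard-coded here as an assumption; a real model would learn it from normal data.

```python
import math

# Assumed 1-D "bottleneck" direction (the line y = x).
# A trained model would learn this from normal data instead.
U = (1 / math.sqrt(2), 1 / math.sqrt(2))


def encode(x):
    """Compress a 2-D point to one number: its coordinate along U."""
    return x[0] * U[0] + x[1] * U[1]


def decode(z):
    """Map the 1-D code back to a 2-D point on the line."""
    return (z * U[0], z * U[1])


def reconstruction_error(x):
    """Mean squared error between a point and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


normal_like = (2.0, 2.1)   # close to the line the "model" knows
anomalous = (2.0, -2.0)    # far from it
print(reconstruction_error(normal_like), reconstruction_error(anomalous))
```

The point near the line reconstructs almost perfectly, while the point far from it loses most of its information in the bottleneck. The autoencoder in this tutorial applies the same principle with a nonlinear mapping and many more dimensions.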
Setting up the environment
python -m venv venv && source venv/bin/activate
pip install torch numpy pandas scikit-learn matplotlib
Downloading NSL-KDD
mkdir -p ids-autoencoder/data
cd ids-autoencoder
# Download NSL-KDD dataset
wget -P data/ https://raw.githubusercontent.com/defcom17/NSL_KDD/master/KDDTrain+.txt
wget -P data/ https://raw.githubusercontent.com/defcom17/NSL_KDD/master/KDDTest+.txt
Understanding the dataset
NSL-KDD contains network connection records with 41 features and a label (normal or one of 39 attack types). The attacks are grouped into four categories:
| Category | Examples | Description |
|---|---|---|
| DoS | neptune, smurf, back | Denial of service |
| Probe | portsweep, nmap, satan | Surveillance/scanning |
| R2L | ftp_write, spy, warezclient | Remote to local (unauthorized access) |
| U2R | buffer_overflow, rootkit, perl | User to root (privilege escalation) |
Loading and preprocessing
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Column names for NSL-KDD
COLUMNS = [
'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
'num_compromised', 'root_shell', 'su_attempted', 'num_root',
'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
'dst_host_srv_rerror_rate', 'label', 'difficulty_level',
]
# Attack type → category mapping
ATTACK_CATEGORIES = {
'normal': 'normal',
'back': 'DoS', 'land': 'DoS', 'neptune': 'DoS', 'pod': 'DoS',
'smurf': 'DoS', 'teardrop': 'DoS', 'mailbomb': 'DoS', 'apache2': 'DoS',
'processtable': 'DoS', 'udpstorm': 'DoS',
'ipsweep': 'Probe', 'nmap': 'Probe', 'portsweep': 'Probe', 'satan': 'Probe',
'mscan': 'Probe', 'saint': 'Probe',
'ftp_write': 'R2L', 'guess_passwd': 'R2L', 'imap': 'R2L', 'multihop': 'R2L',
'phf': 'R2L', 'spy': 'R2L', 'warezclient': 'R2L', 'warezmaster': 'R2L',
'sendmail': 'R2L', 'named': 'R2L', 'snmpgetattack': 'R2L', 'snmpguess': 'R2L',
'xlock': 'R2L', 'xsnoop': 'R2L', 'worm': 'R2L',
'buffer_overflow': 'U2R', 'loadmodule': 'U2R', 'perl': 'U2R', 'rootkit': 'U2R',
'httptunnel': 'U2R', 'ps': 'U2R', 'sqlattack': 'U2R', 'xterm': 'U2R',
}
def load_nslkdd(path):
    df = pd.read_csv(path, names=COLUMNS, header=None)
    df['category'] = df['label'].map(ATTACK_CATEGORIES).fillna('unknown')
    df['is_attack'] = (df['category'] != 'normal').astype(int)
    return df

train_df = load_nslkdd('data/KDDTrain+.txt')
test_df = load_nslkdd('data/KDDTest+.txt')
print(f'Training set: {len(train_df)} records')
print(f' Normal: {sum(train_df.is_attack == 0)}')
print(f' Attack: {sum(train_df.is_attack == 1)}')
print(f'\nTest set: {len(test_df)} records')
print(f' Normal: {sum(test_df.is_attack == 0)}')
print(f' Attack: {sum(test_df.is_attack == 1)}')

Training set: 125973 records
 Normal: 67343
 Attack: 58630

Test set: 22544 records
 Normal: 9711
 Attack: 12833
Encoding categorical features
Three features are categorical: protocol_type (tcp, udp, icmp), service (http, ftp, smtp, …), and flag (SF, REJ, RSTO, …). One-hot encode them, but fit the feature space on the training set only. That avoids leaking test-set categories into training.
def preprocess(train_df, test_df):
    """Encode categoricals without leaking test-set structure into training."""
    categoricals = ['protocol_type', 'service', 'flag']
    train_encoded = pd.get_dummies(train_df, columns=categoricals, dtype=float)
    test_encoded = pd.get_dummies(test_df, columns=categoricals, dtype=float)
    drop_cols = ['label', 'difficulty_level', 'category', 'is_attack']
    feature_cols = [c for c in train_encoded.columns if c not in drop_cols]
    # Align test columns to the training feature space.
    # Unseen categorical values in the test set become all-zero indicator groups.
    X_train = train_encoded[feature_cols].to_numpy(dtype=np.float32)
    X_test = test_encoded.reindex(columns=feature_cols, fill_value=0).to_numpy(dtype=np.float32)
    y_train = train_df['is_attack'].values
    y_test = test_df['is_attack'].values
    categories_test = test_df['category'].values
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, y_train, X_test, y_test, categories_test, feature_cols, scaler

X_train, y_train, X_test, y_test, categories_test, feature_cols, scaler = preprocess(train_df, test_df)
print(f'Feature dimensions after encoding: {X_train.shape[1]}')
Training on normal data only
The key distinction from supervised classification: we train the autoencoder only on normal traffic. It learns what “normal” looks like, and anything different triggers a high reconstruction error.
We also hold out a slice of normal traffic for threshold calibration. That validation split gives us a clean estimate of the upper tail of normal reconstruction error without peeking at the test set.
from sklearn.model_selection import train_test_split
# Filter training data to normal connections only
X_train_normal_all = X_train[y_train == 0]
X_train_normal, X_val_normal = train_test_split(
X_train_normal_all,
test_size=0.2,
random_state=42,
)
print(f'Training autoencoder on {len(X_train_normal)} normal samples')
print(f'Calibrating threshold on {len(X_val_normal)} held-out normal samples')
Building the autoencoder
import torch
import torch.nn as nn
class NetworkAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim),
            # No activation: reconstruction should match scaled input (which can be negative)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

    def reconstruction_error(self, x):
        """Per-sample MSE reconstruction error."""
        reconstructed = self.forward(x)
        return torch.mean((x - reconstructed) ** 2, dim=1)
Why this architecture works for anomaly detection
The bottleneck (8 dimensions) forces the autoencoder to learn a compressed representation of normal traffic. Network flows have regularities: HTTP traffic has characteristic byte counts, durations, and flag patterns; SSH has different but equally consistent patterns. The autoencoder captures these regularities across the one-hot-expanded feature space.
Attack traffic violates these regularities. A SYN flood has unusual flag patterns and zero response bytes. A port scan has many connections to different services with short durations. The autoencoder can’t reconstruct these patterns because it never learned them, resulting in high error.
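Because the anomaly score is a mean of per-feature squared errors, it can be decomposed to see which features the model failed to reconstruct. The sketch below is a minimal, dependency-free illustration; the function name and the sample values are hypothetical. With the tutorial's model, `x_hat` would come from the model's forward pass and the names from `feature_cols`.

```python
def top_error_features(x, x_hat, feature_names, k=3):
    """Return the k features with the largest squared reconstruction error."""
    contributions = [
        (name, (a - b) ** 2) for name, a, b in zip(feature_names, x, x_hat)
    ]
    contributions.sort(key=lambda item: item[1], reverse=True)
    return contributions[:k]


# Hypothetical scaled values for one flagged flow and its reconstruction.
names = ["duration", "src_bytes", "dst_bytes", "serror_rate", "count"]
x = [0.1, -0.2, -0.3, 3.5, 2.8]
x_hat = [0.0, -0.1, -0.2, 0.4, 1.0]

top = top_error_features(x, x_hat, names, k=2)
for name, err in top:
    print(f"{name}: squared error {err:.2f}")
```

In this made-up example the error is dominated by `serror_rate` and `count`, the kind of pattern a SYN flood might produce. This is only a rough attribution: a large error on one feature can be caused by unusual values in other features.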
Training
from torch.utils.data import DataLoader, TensorDataset
def train_autoencoder(X_normal, input_dim, epochs=50, batch_size=256, lr=1e-3):
    torch.manual_seed(42)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Training on device: {device}')
    X_tensor = torch.tensor(X_normal, dtype=torch.float32)
    loader = DataLoader(TensorDataset(X_tensor), batch_size=batch_size, shuffle=True)
    model = NetworkAutoencoder(input_dim).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for (batch,) in loader:
            batch = batch.to(device)
            reconstructed = model(batch)
            loss = criterion(reconstructed, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * len(batch)
        avg_loss = total_loss / len(X_tensor)
        if (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch+1:3d}/{epochs} loss={avg_loss:.6f}')
    return model

model = train_autoencoder(X_train_normal, input_dim=X_train.shape[1])
Exact numbers will vary across runs.
Epoch  10/50 loss=0.102622
Epoch  20/50 loss=0.052589
Epoch  30/50 loss=0.049837
Epoch  40/50 loss=0.036351
Epoch  50/50 loss=0.031179
Setting the anomaly threshold
The threshold determines what reconstruction error counts as “anomalous.” Set it using the held-out normal validation data, not the test set. A percentile threshold captures the tail of the normal error distribution while keeping evaluation honest.
def compute_threshold(model, X_normal, percentile=95):
    """Set threshold at the Nth percentile of reconstruction error on validation-normal data."""
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        X_tensor = torch.tensor(X_normal, dtype=torch.float32, device=device)
        errors = model.reconstruction_error(X_tensor).cpu().numpy()
    threshold = np.percentile(errors, percentile)
    print(f'Reconstruction error stats (validation-normal data):')
    print(f' Mean: {errors.mean():.6f}')
    print(f' Std: {errors.std():.6f}')
    print(f' P95: {np.percentile(errors, 95):.6f}')
    print(f' P99: {np.percentile(errors, 99):.6f}')
    print(f' Max: {errors.max():.6f}')
    print(f'\nThreshold ({percentile}th percentile): {threshold:.6f}')
    return threshold

threshold = compute_threshold(model, X_val_normal, percentile=99)
Using the 99th percentile means about 1% of the held-out validation-normal split sits above the threshold. It does not guarantee a 1% false-positive rate on the test set or in production. Distribution shift, feature drift, and imperfect generalization usually move the real false-positive rate.
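The gap between the calibrated rate and the real one can be shown with synthetic numbers. The sketch below is an illustration under stated assumptions, not a measurement: the "reconstruction errors" are drawn from an exponential distribution, and drift is modeled crudely by doubling the errors of a second normal sample. The `percentile` helper uses linear interpolation, the same convention as NumPy's default.

```python
import random


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def false_positive_rate(errors, threshold):
    """Fraction of normal samples whose error exceeds the threshold."""
    return sum(e > threshold for e in errors) / len(errors)


rng = random.Random(0)
# Synthetic stand-ins for reconstruction errors on held-out normal traffic.
val_errors = [rng.expovariate(1.0) for _ in range(1000)]
# Assumed drift: normal traffic that the model reconstructs twice as badly.
drifted_errors = [2.0 * rng.expovariate(1.0) for _ in range(1000)]

toy_threshold = percentile(val_errors, 99)
val_fpr = false_positive_rate(val_errors, toy_threshold)
drifted_fpr = false_positive_rate(drifted_errors, toy_threshold)
print(f"validation FPR={val_fpr:.3f}  drifted FPR={drifted_fpr:.3f}")
```

The validation rate is 1% by construction, while the drifted sample exceeds the same threshold far more often. In practice, monitoring the alert rate on traffic known to be benign is one way to notice when the threshold needs recalibrating.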
Evaluation
from sklearn.metrics import classification_report, roc_auc_score
def evaluate(model, X_test, y_test, categories, threshold):
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        X_tensor = torch.tensor(X_test, dtype=torch.float32, device=device)
        errors = model.reconstruction_error(X_tensor).cpu().numpy()
    predictions = (errors > threshold).astype(int)
    print('--- Overall Performance ---')
    print(classification_report(y_test, predictions, target_names=['normal', 'attack']))
    print(f'ROC AUC: {roc_auc_score(y_test, errors):.4f}')
    # Per-category detection rates
    print('\n--- Detection Rate by Attack Category ---')
    for cat in ['DoS', 'Probe', 'R2L', 'U2R']:
        mask = categories == cat
        if mask.sum() == 0:
            continue
        detected = predictions[mask].sum()
        total = mask.sum()
        rate = detected / total
        avg_error = errors[mask].mean()
        print(f' {cat:6s}: {detected:5d}/{total:5d} detected ({rate:.1%}) avg_error={avg_error:.4f}')
    # Normal (false positive rate)
    normal_mask = categories == 'normal'
    fp = predictions[normal_mask].sum()
    print(f' Normal: {fp:5d}/{normal_mask.sum():5d} false positives ({fp/normal_mask.sum():.1%})')
    return errors, predictions

errors, predictions = evaluate(model, X_test, y_test, categories_test, threshold)
Exact numbers will vary across runs due to random weight initialization and training order.
--- Overall Performance ---
              precision    recall  f1-score   support

      normal       0.58      0.99      0.73      9711
      attack       0.98      0.47      0.63     12833

    accuracy                           0.69     22544
   macro avg       0.78      0.73      0.68     22544
weighted avg       0.81      0.69      0.68     22544

ROC AUC: 0.9482

--- Detection Rate by Attack Category ---
 DoS   :  3787/ 7458 detected (50.8%) avg_error=0.8628
 Probe :  1900/ 2421 detected (78.5%) avg_error=2.7032
 R2L   :   196/ 2754 detected (7.1%) avg_error=0.4897
 U2R   :   126/  200 detected (63.0%) avg_error=4.1821
 Normal:   143/ 9711 false positives (1.5%)
Interpreting the results
Probe attacks are detected reasonably well (~78%) because they have distinctive network characteristics: unusual flag combinations, many short connections to different services. DoS detection varies more across runs; some DoS sub-types (like smurf) have extreme byte counts that stand out, while others overlap with normal traffic patterns once standardized.
R2L attacks are poorly detected (~7%) because they mimic normal connection patterns. A remote-to-local attack might look like a normal FTP session with slightly unusual commands. The network flow features don’t capture the payload-level differences that distinguish these attacks.
U2R detection is moderate (~63%). Privilege escalation attacks sometimes produce unusual feature combinations, but the signal is inconsistent. Like R2L, the real distinguishing behavior happens at the application layer, not in the flow metadata.
This is a fundamental limitation of network flow-based detection; it captures volumetric and behavioral anomalies but misses application-layer attacks that hide in normal-looking connections.
Comparing with Isolation Forest
from sklearn.ensemble import IsolationForest
# Train Isolation Forest on the same normal training split
iso_forest = IsolationForest(
n_estimators=200,
random_state=42,
n_jobs=-1,
)
iso_forest.fit(X_train_normal)
# Calibrate the threshold on the same held-out normal split used for the autoencoder
iso_val_scores = -iso_forest.decision_function(X_val_normal)
iso_threshold = np.percentile(iso_val_scores, 99)
# Score the test set
iso_scores = -iso_forest.decision_function(X_test) # higher = more anomalous
iso_preds = (iso_scores > iso_threshold).astype(int)
iso_auc = roc_auc_score(y_test, iso_scores)
print(f'Isolation Forest AUC: {iso_auc:.4f}')
print(f'Isolation Forest threshold (99th percentile of validation-normal scores): {iso_threshold:.6f}')
print(classification_report(y_test, iso_preds, target_names=['normal', 'attack']))
This is the fairest comparison setup for this tutorial: both models train on the same normal-only split, both calibrate thresholds on the same held-out normal validation split, and both are evaluated on the same test set.
Representative calibrated comparison:
| Model | AUC | Typical pattern |
|---|---|---|
| Autoencoder | ~0.95 | Usually stronger on DoS and Probe, where correlated feature patterns matter |
| Isolation Forest | ~0.93-0.94 | Often competitive overall, but weaker on attacks that differ through combinations of features rather than single extreme values |
The autoencoder often has a modest edge once both models are calibrated the same way, particularly on DoS and Probe categories. But the gap is dataset-dependent and much smaller than a naive threshold comparison might suggest.
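The reason calibration matters is that AUC and recall measure different things. AUC depends only on how the scores rank attacks against normal flows, while recall depends on where the threshold sits. The sketch below computes AUC from its pairwise definition (the probability that a random attack outscores a random normal flow, with ties counted as half) alongside recall at two thresholds. The labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: P(attack score > normal score), ties count as half.

    Quadratic in the number of samples, so only suitable for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(labels, scores, threshold):
    """Fraction of attacks whose score exceeds the threshold."""
    flagged = [s > threshold for y, s in zip(labels, scores) if y == 1]
    return sum(flagged) / len(flagged)


labels = [0, 0, 1, 1]            # 1 = attack
scores = [0.1, 0.4, 0.35, 0.8]   # made-up anomaly scores

auc = roc_auc(labels, scores)
print(f"AUC={auc:.2f}")
print(f"recall at 0.5: {recall_at(labels, scores, 0.5):.2f}")
print(f"recall at 0.3: {recall_at(labels, scores, 0.3):.2f}")
```

The same scores give one AUC but very different recall depending on the threshold. Two models with similar AUC can therefore look far apart if their thresholds are chosen by different rules.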
When to use which
| Factor | Autoencoder | Isolation Forest |
|---|---|---|
| Feature interactions | Learns them directly | Captured only indirectly through many tree splits |
| Training time | Slower (GPU helps) | Fast |
| Inference speed | Fast (single forward pass) | Fast |
| Interpretability | Harder (which features?) | Easier (path lengths) |
| Threshold tuning | Calibrate on validation-normal data | Calibrate on validation-normal data |
| Online learning | Possible (continue training) | Must retrain |
| Memory | Model weights (~100 KB) | All trees (~10 MB) |
For a SOC deployment, consider using both: Isolation Forest for fast triage, autoencoder for deeper analysis of flagged connections.
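One way to combine the two models is a cascade, sketched below under loose assumptions. The function name, the dictionary-based flows, and the two scoring callables are all hypothetical stand-ins; in practice the scorers would wrap the Isolation Forest decision function and the autoencoder's reconstruction error.

```python
def two_stage_triage(flows, fast_score, deep_score, fast_threshold, deep_threshold):
    """Score every flow with the cheap model; rescore only the flagged ones."""
    alerts = []
    for flow in flows:
        if fast_score(flow) <= fast_threshold:
            continue  # first stage considers it normal; skip the second stage
        score = deep_score(flow)
        if score > deep_threshold:
            alerts.append((flow, score))
    return alerts


# Hypothetical flows carrying precomputed scores from each model.
flows = [
    {"id": 1, "iso": 0.1, "ae": 0.2},
    {"id": 2, "iso": 0.7, "ae": 0.3},
    {"id": 3, "iso": 0.9, "ae": 2.5},
]

alerts = two_stage_triage(
    flows,
    fast_score=lambda f: f["iso"],
    deep_score=lambda f: f["ae"],
    fast_threshold=0.5,
    deep_threshold=1.0,
)
print([flow["id"] for flow, _ in alerts])
```

A cascade can only lose recall relative to running the second stage on everything, because flows the first stage passes are never rescored. A loose first-stage threshold keeps that loss small.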
Visualizing the latent space
The autoencoder’s 8-dimensional encoding captures the essential structure of network traffic. Visualize it to understand what the model learned.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
def visualize_latent_space(model, X_test, categories, n_samples=5000):
    model.eval()
    device = next(model.parameters()).device
    # Subsample for visualization speed
    indices = np.random.choice(len(X_test), n_samples, replace=False)
    X_sub = torch.tensor(X_test[indices], dtype=torch.float32, device=device)
    cats_sub = categories[indices]
    with torch.no_grad():
        encodings = model.encoder(X_sub).cpu().numpy()
    tsne = TSNE(n_components=2, random_state=42, perplexity=50)
    coords = tsne.fit_transform(encodings)
    category_colors = {'normal': 'blue', 'DoS': 'red', 'Probe': 'orange', 'R2L': 'green', 'U2R': 'purple'}
    plt.figure(figsize=(12, 8))
    for cat, color in category_colors.items():
        mask = cats_sub == cat
        if mask.sum() > 0:
            plt.scatter(coords[mask, 0], coords[mask, 1], c=color, label=cat, s=5, alpha=0.5)
    plt.legend()
    plt.title('Autoencoder Latent Space (t-SNE)')
    plt.savefig('latent_space.png', dpi=150)
    print('Saved latent_space.png')

visualize_latent_space(model, X_test, categories_test)
Normal traffic typically forms a dense cluster. DoS and Probe attacks appear as separate clusters or outliers. R2L and U2R attacks overlap with normal, visually confirming why they’re hard to detect.
Limitations
Feature engineering matters more than model choice. The NSL-KDD features were carefully engineered by domain experts. On raw packet bytes or minimally processed flows, the autoencoder would need to be significantly larger and deeper. Feature quality determines the ceiling for any model.
Single-point detection. Each connection is scored independently. Real intrusions often span multiple connections (a scan followed by exploitation followed by exfiltration). Sequence-aware models (LSTMs, transformers) can capture these multi-step patterns, but require sequential data and are significantly more complex.
Dataset age. NSL-KDD reflects attack patterns from the late 1990s. Modern attacks look different: encrypted tunnels, living-off-the-land techniques, DNS exfiltration. Use CIC-IDS2017 or UNSW-NB15 for more realistic evaluation, though the methodology in this tutorial applies to any flow dataset.
Threshold sensitivity. The anomaly threshold is a single global value. Different attack types produce different reconstruction error ranges. An adaptive or per-service threshold would improve detection rates but adds complexity.
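A per-service threshold could be sketched as follows. This is an illustration under assumptions, not part of the tutorial's pipeline: the function names, the `min_samples` cutoff of 50, and the synthetic error values are all hypothetical. With real data, the group labels would come from the `service` column before one-hot encoding, and the errors from the held-out normal split.

```python
from collections import defaultdict


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def per_group_thresholds(errors, groups, q=99, min_samples=50):
    """Percentile threshold per group; small groups fall back to the global one."""
    global_threshold = percentile(errors, q)
    by_group = defaultdict(list)
    for err, grp in zip(errors, groups):
        by_group[grp].append(err)
    thresholds = {
        grp: percentile(vals, q) if len(vals) >= min_samples else global_threshold
        for grp, vals in by_group.items()
    }
    return thresholds, global_threshold


def is_anomalous(error, group, thresholds, global_threshold):
    """Compare against the group's threshold, or the global one if unseen."""
    return error > thresholds.get(group, global_threshold)


# Synthetic validation errors: http reconstructs well, ftp less so, irc is rare.
http_errors = [0.01 * i for i in range(1, 101)]
ftp_errors = [0.05 * i for i in range(1, 101)]
irc_errors = [0.2, 0.3]
errors = http_errors + ftp_errors + irc_errors
groups = ["http"] * 100 + ["ftp"] * 100 + ["irc"] * 2

thresholds, global_thr = per_group_thresholds(errors, groups)
print(is_anomalous(2.0, "http", thresholds, global_thr))  # judged against http's own tail
print(is_anomalous(2.0, "ftp", thresholds, global_thr))   # judged against ftp's wider tail
```

The added complexity shows up in the fallback rule: services with few normal samples cannot support a reliable percentile estimate, so they inherit the global threshold.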
Next steps
This tutorial showed that autoencoders can learn what normal network traffic looks like and flag deviations as anomalies, with a modest edge over Isolation Forest on attacks that manifest through correlated feature patterns. The main gap is application-layer attacks that hide in normal-looking flows.
The next tutorial shifts from network flows to URL strings, using fine-tuned transformers to detect phishing URLs, a domain where the signal is in the text itself rather than in flow metadata.