Tutorial

Network Intrusion Detection with Autoencoders

Build a PyTorch autoencoder trained only on normal network flows to detect intrusions as high-reconstruction-error outliers, and compare with Isolation Forest.

Prerequisites

  • Completion of the anomaly detection tutorial (Isolation Forest)
  • Basic Python and PyTorch knowledge
  • Familiarity with network flow concepts (source/dest IP, ports, protocols)
  • A machine with at least 8 GB of RAM

Part 4 of 10 in ML for Security

The anomaly detection tutorial used Isolation Forest to flag suspicious Linux sessions. Isolation Forest is effective and interpretable, but each tree split is axis-aligned on a single feature at a time. When the signal lies in the relationships between features (the combination of packet sizes, timing, and protocol flags that distinguishes an SSH brute-force attempt from normal SSH traffic), a model that learns those relationships from data can outperform a purely tree-based baseline.

Autoencoders learn to compress and reconstruct their input. When trained only on normal traffic, they learn the patterns of “normal.” Anomalous traffic (intrusions, scans, exfiltration) doesn’t fit those patterns and tends to produce high reconstruction error, which becomes the anomaly score.

This tutorial trains an autoencoder on the NSL-KDD dataset (a cleaned version of the classic KDD Cup 99 network intrusion dataset), evaluates it against known attack types, and compares performance with Isolation Forest.

Autoencoder anomaly detection:

  Normal traffic:                     Attack traffic:
  ┌───────┐    ┌──────┐    ┌───────┐  ┌───────┐    ┌──────┐    ┌───────┐
  │ Input │───→│ Enc  │───→│Decoded│  │ Input │───→│ Enc  │───→│Decoded│
  │ flow  │    │ ode  │    │ flow  │  │ flow  │    │ ode  │    │ flow  │
  └───┬───┘    └──────┘    └───┬───┘  └───┬───┘    └──────┘    └───┬───┘
      │                        │          │                        │
      └──── compare ───────────┘          └──── compare ───────────┘
            low error ✓                         HIGH ERROR ✗
            (learned pattern)                   (anomaly detected)

What is an autoencoder?

An autoencoder is a neural network trained to output its own input. That sounds useless, but the trick is the bottleneck. The network has an encoder that compresses the input to a lower-dimensional representation, and a decoder that reconstructs the input from that compressed form. The network must learn the essential structure of the data to pass it through the bottleneck.

NSL-KDD starts with 41 raw features, but after one-hot encoding the categorical columns the input grows to roughly 120 dimensions. The bottleneck still forces a large compression ratio, which is what makes reconstruction error useful as an anomaly signal.

Architecture:

  Input (~120 features after one-hot encoding)

  Encoder: input_dim → 64 → 32 → 8 (bottleneck)

  Decoder: 8 → 32 → 64 → input_dim

  Reconstructed output (~120 features)

  Loss = MSE(input, output)

For anomaly detection, we train only on normal data. The autoencoder learns to reconstruct normal patterns with low error. When we feed it attack traffic, the reconstruction error tends to be higher because the model was never trained on those patterns. How much higher depends on how far the attack departs from normal traffic in feature space.
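To make the reconstruction-error idea concrete, here is a minimal NumPy sketch that uses a linear projection (PCA via SVD) as a stand-in for the neural autoencoder. The synthetic two-feature data and the helper names `fit_linear_bottleneck` and `reconstruction_error` are assumptions for illustration; they are not part of the tutorial's pipeline.

```python
import numpy as np

def fit_linear_bottleneck(X_normal, k):
    """Fit a k-dimensional linear 'bottleneck' (PCA) on normal data only."""
    mean = X_normal.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_normal - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(X, mean, components):
    """Per-sample MSE between each row and its compress-then-reconstruct copy."""
    codes = (X - mean) @ components.T      # encode to k dimensions
    recon = codes @ components + mean      # decode back to the input space
    return np.mean((X - recon) ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic "normal" data: the second feature closely tracks the first.
a = rng.normal(size=1000)
X_normal = np.column_stack([a, a + 0.05 * rng.normal(size=1000)])

mean, components = fit_linear_bottleneck(X_normal, k=1)

typical = np.array([[1.0, 1.0]])      # follows the learned relationship
odd_combo = np.array([[1.0, -1.0]])   # each value is ordinary, the pair is not

err_typical = reconstruction_error(typical, mean, components)[0]
err_odd = reconstruction_error(odd_combo, mean, components)[0]
print(f'typical: {err_typical:.4f}  odd combination: {err_odd:.4f}')
```

Both test points have feature values well inside the normal range; only the second breaks the relationship between the features, and that is what the reconstruction error picks up. The neural autoencoder plays the same role with a nonlinear encoder and decoder.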

Setting up the environment

python -m venv venv && source venv/bin/activate
pip install torch numpy pandas scikit-learn matplotlib

Downloading NSL-KDD

mkdir -p ids-autoencoder/data
cd ids-autoencoder

# Download NSL-KDD dataset
wget -P data/ https://raw.githubusercontent.com/defcom17/NSL_KDD/master/KDDTrain+.txt
wget -P data/ https://raw.githubusercontent.com/defcom17/NSL_KDD/master/KDDTest+.txt

Understanding the dataset

NSL-KDD contains network connection records with 41 features and a label (normal or one of 39 attack types). The attacks are grouped into four categories:

Category  Examples                        Description
DoS       neptune, smurf, back            Denial of service
Probe     portsweep, nmap, satan          Surveillance/scanning
R2L       ftp_write, spy, warezclient     Remote to local (unauthorized access)
U2R       buffer_overflow, rootkit, perl  User to root (privilege escalation)

Loading and preprocessing

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column names for NSL-KDD
COLUMNS = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
    'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate', 'label', 'difficulty_level',
]

# Attack type → category mapping
ATTACK_CATEGORIES = {
    'normal': 'normal',
    'back': 'DoS', 'land': 'DoS', 'neptune': 'DoS', 'pod': 'DoS',
    'smurf': 'DoS', 'teardrop': 'DoS', 'mailbomb': 'DoS', 'apache2': 'DoS',
    'processtable': 'DoS', 'udpstorm': 'DoS',
    'ipsweep': 'Probe', 'nmap': 'Probe', 'portsweep': 'Probe', 'satan': 'Probe',
    'mscan': 'Probe', 'saint': 'Probe',
    'ftp_write': 'R2L', 'guess_passwd': 'R2L', 'imap': 'R2L', 'multihop': 'R2L',
    'phf': 'R2L', 'spy': 'R2L', 'warezclient': 'R2L', 'warezmaster': 'R2L',
    'sendmail': 'R2L', 'named': 'R2L', 'snmpgetattack': 'R2L', 'snmpguess': 'R2L',
    'xlock': 'R2L', 'xsnoop': 'R2L', 'worm': 'R2L',
    'buffer_overflow': 'U2R', 'loadmodule': 'U2R', 'perl': 'U2R', 'rootkit': 'U2R',
    'httptunnel': 'U2R', 'ps': 'U2R', 'sqlattack': 'U2R', 'xterm': 'U2R',
}

def load_nslkdd(path):
    df = pd.read_csv(path, names=COLUMNS, header=None)
    df['category'] = df['label'].map(ATTACK_CATEGORIES).fillna('unknown')
    df['is_attack'] = (df['category'] != 'normal').astype(int)
    return df

train_df = load_nslkdd('data/KDDTrain+.txt')
test_df = load_nslkdd('data/KDDTest+.txt')

print(f'Training set: {len(train_df)} records')
print(f'  Normal:  {sum(train_df.is_attack == 0)}')
print(f'  Attack:  {sum(train_df.is_attack == 1)}')
print(f'\nTest set: {len(test_df)} records')
print(f'  Normal:  {sum(test_df.is_attack == 0)}')
print(f'  Attack:  {sum(test_df.is_attack == 1)}')

Training set: 125973 records
  Normal:  67343
  Attack:  58630

Test set: 22544 records
  Normal:  9711
  Attack:  12833

Encoding categorical features

Three features are categorical: protocol_type (tcp, udp, icmp), service (http, ftp, smtp, …), and flag (SF, REJ, RSTO, …). One-hot encode them, but fit the feature space on the training set only. That avoids leaking test-set categories into training.

def preprocess(train_df, test_df):
    """Encode categoricals without leaking test-set structure into training."""
    categoricals = ['protocol_type', 'service', 'flag']
    train_encoded = pd.get_dummies(train_df, columns=categoricals, dtype=float)
    test_encoded = pd.get_dummies(test_df, columns=categoricals, dtype=float)

    drop_cols = ['label', 'difficulty_level', 'category', 'is_attack']
    feature_cols = [c for c in train_encoded.columns if c not in drop_cols]

    # Align test columns to the training feature space.
    # Unseen categorical values in the test set become all-zero indicator groups.
    X_train = train_encoded[feature_cols].to_numpy(dtype=np.float32)
    X_test = test_encoded.reindex(columns=feature_cols, fill_value=0).to_numpy(dtype=np.float32)
    y_train = train_df['is_attack'].values
    y_test = test_df['is_attack'].values
    categories_test = test_df['category'].values

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    return X_train, y_train, X_test, y_test, categories_test, feature_cols, scaler

X_train, y_train, X_test, y_test, categories_test, feature_cols, scaler = preprocess(train_df, test_df)
print(f'Feature dimensions after encoding: {X_train.shape[1]}')
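The column alignment step is easier to see on a toy example. The sketch below uses two small, made-up DataFrames in which `icmp` appears only in the test data; the values are hypothetical, but the `get_dummies` and `reindex` calls mirror the ones in `preprocess`.

```python
import pandas as pd

# Toy frames: 'icmp' appears only in the test data (hypothetical values).
train = pd.DataFrame({'duration': [0, 2, 5], 'protocol_type': ['tcp', 'udp', 'tcp']})
test = pd.DataFrame({'duration': [1, 3], 'protocol_type': ['udp', 'icmp']})

train_enc = pd.get_dummies(train, columns=['protocol_type'], dtype=float)
test_enc = pd.get_dummies(test, columns=['protocol_type'], dtype=float)

# The training columns define the feature space.
feature_cols = list(train_enc.columns)

# Columns missing from the test frame are filled with 0;
# columns that exist only in the test frame are dropped.
test_aligned = test_enc.reindex(columns=feature_cols, fill_value=0)

print(feature_cols)
print(test_aligned)
```

The `icmp` row ends up with zeros in every `protocol_type_*` column, so the model sees it as "none of the protocols known at training time" rather than gaining a new input dimension.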

Training on normal data only

The key distinction from supervised classification: we train the autoencoder only on normal traffic. It learns what “normal” looks like, and anything different triggers a high reconstruction error.

We also hold out a slice of normal traffic for threshold calibration. That validation split gives us a clean estimate of the upper tail of normal reconstruction error without peeking at the test set.

from sklearn.model_selection import train_test_split

# Filter training data to normal connections only
X_train_normal_all = X_train[y_train == 0]
X_train_normal, X_val_normal = train_test_split(
    X_train_normal_all,
    test_size=0.2,
    random_state=42,
)

print(f'Training autoencoder on {len(X_train_normal)} normal samples')
print(f'Calibrating threshold on {len(X_val_normal)} held-out normal samples')

Building the autoencoder

import torch
import torch.nn as nn

class NetworkAutoencoder(nn.Module):
    def __init__(self, input_dim, encoding_dim=8):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, encoding_dim),
            nn.ReLU(),
        )

        self.decoder = nn.Sequential(
            nn.Linear(encoding_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim),
            # No activation — reconstruction should match scaled input (which can be negative)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

    def reconstruction_error(self, x):
        """Per-sample MSE reconstruction error."""
        reconstructed = self.forward(x)
        return torch.mean((x - reconstructed) ** 2, dim=1)

Why this architecture works for anomaly detection

The bottleneck (8 dimensions) forces the autoencoder to learn a compressed representation of normal traffic. Network flows have regularities: HTTP traffic has characteristic byte counts, durations, and flag patterns; SSH has different but equally consistent patterns. The autoencoder captures these regularities across the one-hot-expanded feature space.

Attack traffic violates these regularities. A SYN flood has unusual flag patterns and zero response bytes. A port scan has many connections to different services with short durations. The autoencoder reconstructs these patterns poorly because it never saw them during training, and that shows up as high error.
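As a rough sanity check on how small this network is, the following sketch counts the parameters of the fully connected layers and the compression ratio at the bottleneck. The input width of 122 is an assumption (the tutorial only states "roughly 120"), and `count_linear_params` is a hypothetical helper.

```python
def count_linear_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Assumption: 122 input features after one-hot encoding.
input_dim = 122
encoder_sizes = [input_dim, 64, 32, 8]
decoder_sizes = [8, 32, 64, input_dim]

n_params = count_linear_params(encoder_sizes) + count_linear_params(decoder_sizes)
size_kb = n_params * 4 / 1024  # float32 uses 4 bytes per parameter

print(f'compression ratio at the bottleneck: {input_dim / 8:.1f}x')
print(f'parameters: {n_params}, about {size_kb:.0f} KB as float32')
```

Under that assumption the model has about twenty thousand parameters, so it trains quickly on a CPU and the stored weights stay under 100 KB.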

Training

from torch.utils.data import DataLoader, TensorDataset

def train_autoencoder(X_normal, input_dim, epochs=50, batch_size=256, lr=1e-3):
    torch.manual_seed(42)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Training on device: {device}')

    X_tensor = torch.tensor(X_normal, dtype=torch.float32)
    loader = DataLoader(TensorDataset(X_tensor), batch_size=batch_size, shuffle=True)

    model = NetworkAutoencoder(input_dim).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for (batch,) in loader:
            batch = batch.to(device)
            reconstructed = model(batch)
            loss = criterion(reconstructed, batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * len(batch)

        avg_loss = total_loss / len(X_tensor)
        if (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch+1:3d}/{epochs}  loss={avg_loss:.6f}')

    return model

model = train_autoencoder(X_train_normal, input_dim=X_train.shape[1])

Exact numbers will vary across runs.

Epoch  10/50  loss=0.102622
Epoch  20/50  loss=0.052589
Epoch  30/50  loss=0.049837
Epoch  40/50  loss=0.036351
Epoch  50/50  loss=0.031179

Setting the anomaly threshold

The threshold determines what reconstruction error counts as “anomalous.” Set it using the held-out normal validation data, not the test set. A percentile threshold captures the tail of the normal error distribution while keeping evaluation honest.

def compute_threshold(model, X_normal, percentile=95):
    """Set threshold at the Nth percentile of reconstruction error on validation-normal data."""
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        X_tensor = torch.tensor(X_normal, dtype=torch.float32, device=device)
        errors = model.reconstruction_error(X_tensor).cpu().numpy()

    threshold = np.percentile(errors, percentile)
    print(f'Reconstruction error stats (validation-normal data):')
    print(f'  Mean:   {errors.mean():.6f}')
    print(f'  Std:    {errors.std():.6f}')
    print(f'  P95:    {np.percentile(errors, 95):.6f}')
    print(f'  P99:    {np.percentile(errors, 99):.6f}')
    print(f'  Max:    {errors.max():.6f}')
    print(f'\nThreshold ({percentile}th percentile): {threshold:.6f}')

    return threshold

threshold = compute_threshold(model, X_val_normal, percentile=99)

Using the 99th percentile means about 1% of the held-out validation-normal split sits above the threshold. It does not guarantee a 1% false-positive rate on the test set or in production. Distribution shift, feature drift, and imperfect generalization usually move the real false-positive rate.
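The effect of drift on a percentile threshold can be illustrated with synthetic numbers. In this sketch the reconstruction errors are drawn from lognormal distributions chosen purely for illustration; they are not measurements from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reconstruction errors on held-out normal traffic.
val_errors = rng.lognormal(mean=-3.0, sigma=0.5, size=20_000)
# Hypothetical normal traffic after mild drift: errors are slightly larger overall.
drifted_errors = rng.lognormal(mean=-2.8, sigma=0.5, size=20_000)

# Calibrate once, on the validation split only.
threshold = np.percentile(val_errors, 99)

fpr_val = np.mean(val_errors > threshold)
fpr_drifted = np.mean(drifted_errors > threshold)
print(f'validation: {fpr_val:.2%}  drifted: {fpr_drifted:.2%}')
```

By construction, about 1% of the validation sample sits above the threshold. A small shift in the error distribution is enough to push the false-positive rate on the drifted sample well above that figure, which is why the threshold should be re-checked whenever the traffic mix changes.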

Evaluation

from sklearn.metrics import classification_report, roc_auc_score

def evaluate(model, X_test, y_test, categories, threshold):
    model.eval()
    device = next(model.parameters()).device
    with torch.no_grad():
        X_tensor = torch.tensor(X_test, dtype=torch.float32, device=device)
        errors = model.reconstruction_error(X_tensor).cpu().numpy()

    predictions = (errors > threshold).astype(int)

    print('--- Overall Performance ---')
    print(classification_report(y_test, predictions, target_names=['normal', 'attack']))
    print(f'ROC AUC: {roc_auc_score(y_test, errors):.4f}')

    # Per-category detection rates
    print('\n--- Detection Rate by Attack Category ---')
    for cat in ['DoS', 'Probe', 'R2L', 'U2R']:
        mask = categories == cat
        if mask.sum() == 0:
            continue
        detected = predictions[mask].sum()
        total = mask.sum()
        rate = detected / total
        avg_error = errors[mask].mean()
        print(f'  {cat:6s}: {detected:5d}/{total:5d} detected ({rate:.1%})  avg_error={avg_error:.4f}')

    # Normal (false positive rate)
    normal_mask = categories == 'normal'
    fp = predictions[normal_mask].sum()
    print(f'  Normal: {fp:5d}/{normal_mask.sum():5d} false positives ({fp/normal_mask.sum():.1%})')

    return errors, predictions

errors, predictions = evaluate(model, X_test, y_test, categories_test, threshold)

Exact numbers will vary across runs due to random weight initialization and training order.

--- Overall Performance ---
              precision    recall  f1-score   support

      normal       0.58      0.99      0.73      9711
      attack       0.98      0.47      0.63     12833

    accuracy                           0.69     22544
   macro avg       0.78      0.73      0.68     22544
weighted avg       0.81      0.69      0.68     22544

ROC AUC: 0.9482

--- Detection Rate by Attack Category ---
  DoS   :  3787/ 7458 detected (50.8%)  avg_error=0.8628
  Probe :  1900/ 2421 detected (78.5%)  avg_error=2.7032
  R2L   :   196/ 2754 detected (7.1%)  avg_error=0.4897
  U2R   :   126/  200 detected (63.0%)  avg_error=4.1821
  Normal:   143/ 9711 false positives (1.5%)
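The per-category counts and the classification report describe the same predictions, and it is worth checking that they agree. The sketch below rebuilds the confusion-matrix cells from the counts printed in the sample run above and recomputes the headline metrics; only the variable names are new.

```python
# Counts copied from the sample run above.
detected = {'DoS': 3787, 'Probe': 1900, 'R2L': 196, 'U2R': 126}
totals = {'DoS': 7458, 'Probe': 2421, 'R2L': 2754, 'U2R': 200}
false_positives, n_normal = 143, 9711

tp = sum(detected.values())        # attacks flagged as attacks
fn = sum(totals.values()) - tp     # attacks that slipped through
fp = false_positives               # normal flows flagged as attacks
tn = n_normal - fp                 # normal flows left alone

attack_recall = tp / (tp + fn)
attack_precision = tp / (tp + fp)
normal_precision = tn / (tn + fn)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f'attack recall:    {attack_recall:.2f}')
print(f'attack precision: {attack_precision:.2f}')
print(f'normal precision: {normal_precision:.2f}')
print(f'accuracy:         {accuracy:.2f}')
```

The low precision on the normal class comes from the missed attacks: every undetected attack is counted as a predicted-normal flow, so the 6,824 false negatives drag normal precision down even though few normal flows are misflagged.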

Interpreting the results

Probe attacks are detected reasonably well (~78%) because they have distinctive network characteristics: unusual flag combinations, many short connections to different services. DoS detection varies more across runs; some DoS sub-types (like smurf) have extreme byte counts that stand out, while others overlap with normal traffic patterns once standardized.

R2L attacks are poorly detected (~7%) because they mimic normal connection patterns. A remote-to-local attack might look like a normal FTP session with slightly unusual commands. The network flow features don’t capture the payload-level differences that distinguish these attacks.

U2R detection is moderate (~63%). Privilege escalation attacks sometimes produce unusual feature combinations, but the signal is inconsistent. Like R2L, the real distinguishing behavior happens at the application layer, not in the flow metadata.

This is a fundamental limitation of network flow-based detection: it captures volumetric and behavioral anomalies but misses application-layer attacks that hide in normal-looking connections.

Comparing with Isolation Forest

from sklearn.ensemble import IsolationForest

# Train Isolation Forest on the same normal training split
iso_forest = IsolationForest(
    n_estimators=200,
    random_state=42,
    n_jobs=-1,
)
iso_forest.fit(X_train_normal)

# Calibrate the threshold on the same held-out normal split used for the autoencoder
iso_val_scores = -iso_forest.decision_function(X_val_normal)
iso_threshold = np.percentile(iso_val_scores, 99)

# Score the test set
iso_scores = -iso_forest.decision_function(X_test)  # higher = more anomalous
iso_preds = (iso_scores > iso_threshold).astype(int)

iso_auc = roc_auc_score(y_test, iso_scores)
print(f'Isolation Forest AUC: {iso_auc:.4f}')
print(f'Isolation Forest threshold (99th percentile of validation-normal scores): {iso_threshold:.6f}')
print(classification_report(y_test, iso_preds, target_names=['normal', 'attack']))

This is the fairest comparison setup for this tutorial: both models train on the same normal-only split, both calibrate thresholds on the same held-out normal validation split, and both are evaluated on the same test set.

Representative calibrated comparison:

Model             AUC         Typical pattern
Autoencoder       ~0.95       Usually stronger on DoS and Probe, where correlated feature patterns matter
Isolation Forest  ~0.93-0.94  Often competitive overall, but weaker on attacks that differ through combinations of features rather than single extreme values

The autoencoder often has a modest edge once both models are calibrated the same way, particularly on DoS and Probe categories. But the gap is dataset-dependent and much smaller than a naive threshold comparison might suggest.
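ROC AUC measures ranking quality across all thresholds, which is why two models can look close on AUC yet differ noticeably at one specific operating point. A minimal sketch of that distinction, using made-up scores and a hypothetical helper `pairwise_auc` that computes AUC as the fraction of (attack, normal) pairs ranked correctly:

```python
import numpy as np

def pairwise_auc(scores, labels):
    """Probability that a random attack outscores a random normal sample.

    Ties count as half. This pairwise form is O(n_pos * n_neg), so it is
    only meant for small illustrative arrays.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Hypothetical anomaly scores: most attacks rank above most normals,
# but one normal sample has an unusually high score.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.9, 0.4, 0.5, 0.6, 2.0])

auc = pairwise_auc(scores, labels)

# A strict threshold placed at the top of the normal range flags few attacks.
strict_threshold = 0.9
recall_at_threshold = float(np.mean(scores[labels == 1] > strict_threshold))

print(f'AUC: {auc:.4f}  recall at threshold {strict_threshold}: {recall_at_threshold:.2f}')
```

The ranking is mostly right, so AUC is high, but a conservative threshold still misses most of the attacks. The same effect explains how the autoencoder above can reach an AUC near 0.95 while recalling under half of the attacks at the 99th-percentile threshold.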

When to use which

Factor                Autoencoder                          Isolation Forest
Feature interactions  Learns them directly                 Captured only indirectly through many tree splits
Training time         Slower (GPU helps)                   Fast
Inference speed       Fast (single forward pass)           Fast
Interpretability      Harder (which features?)             Easier (path lengths)
Threshold tuning      Calibrate on validation-normal data  Calibrate on validation-normal data
Online learning       Possible (continue training)         Must retrain
Memory                Model weights (~100 KB)              All trees (~10 MB)

For a SOC deployment, consider using both: Isolation Forest for fast triage, autoencoder for deeper analysis of flagged connections.
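One possible reading of that suggestion is a two-stage cascade in which the autoencoder only scores rows that the Isolation Forest has already flagged. The sketch below is an assumption about how such a cascade could be wired, not a tested deployment design; `two_stage_flags` and the stand-in scoring functions are hypothetical.

```python
import numpy as np

def two_stage_flags(X, iso_score_fn, ae_error_fn, iso_threshold, ae_threshold):
    """Stage 1 triages every row; stage 2 scores only the triaged rows."""
    X = np.asarray(X)
    triaged = np.asarray(iso_score_fn(X)) > iso_threshold
    confirmed = np.zeros(len(X), dtype=bool)
    if triaged.any():
        # The second model is only evaluated on rows that passed triage.
        confirmed[triaged] = np.asarray(ae_error_fn(X[triaged])) > ae_threshold
    return triaged, confirmed

# Stand-in scorers for illustration: column 0 plays the Isolation Forest
# score and column 1 plays the autoencoder reconstruction error.
iso_score_fn = lambda X: X[:, 0]
ae_error_fn = lambda X: X[:, 1]

X_demo = np.array([
    [0.1, 5.0],   # low triage score, high reconstruction error
    [0.9, 0.2],   # triaged, but reconstructs well
    [0.8, 3.0],   # triaged and reconstructs poorly
])

triaged, confirmed = two_stage_flags(
    X_demo, iso_score_fn, ae_error_fn, iso_threshold=0.5, ae_threshold=1.0
)
print('triaged:  ', triaged)
print('confirmed:', confirmed)
```

With the models from this tutorial, `iso_score_fn` could wrap `-iso_forest.decision_function(X)` and `ae_error_fn` could wrap `model.reconstruction_error`. Note the trade-off visible in the first row: a cascade can only confirm what the first stage lets through, so anything the Isolation Forest misses is never seen by the autoencoder.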

Visualizing the latent space

The autoencoder’s 8-dimensional encoding captures the essential structure of network traffic. Visualize it to understand what the model learned.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_latent_space(model, X_test, categories, n_samples=5000):
    model.eval()
    device = next(model.parameters()).device

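    # Subsample for visualization speed (seeded so the plot is reproducible,
    # and capped so it also works on datasets smaller than n_samples)
    rng = np.random.default_rng(42)
    n_samples = min(n_samples, len(X_test))
    indices = rng.choice(len(X_test), n_samples, replace=False)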
    X_sub = torch.tensor(X_test[indices], dtype=torch.float32, device=device)
    cats_sub = categories[indices]

    with torch.no_grad():
        encodings = model.encoder(X_sub).cpu().numpy()

    tsne = TSNE(n_components=2, random_state=42, perplexity=50)
    coords = tsne.fit_transform(encodings)

    category_colors = {'normal': 'blue', 'DoS': 'red', 'Probe': 'orange', 'R2L': 'green', 'U2R': 'purple'}
    plt.figure(figsize=(12, 8))
    for cat, color in category_colors.items():
        mask = cats_sub == cat
        if mask.sum() > 0:
            plt.scatter(coords[mask, 0], coords[mask, 1], c=color, label=cat, s=5, alpha=0.5)
    plt.legend()
    plt.title('Autoencoder Latent Space (t-SNE)')
    plt.savefig('latent_space.png', dpi=150)
    print('Saved latent_space.png')

visualize_latent_space(model, X_test, categories_test)

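Normal traffic typically forms a dense cluster. DoS and Probe attacks often appear as separate clusters or outliers. R2L attacks, and to a lesser extent U2R attacks, tend to overlap with normal traffic, which is consistent with their low detection rates. Keep in mind that t-SNE distorts distances, so treat the plot as a qualitative aid rather than evidence on its own.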

Limitations

Feature engineering matters more than model choice. The NSL-KDD features were carefully engineered by domain experts. On raw packet bytes or minimally processed flows, the autoencoder would need to be significantly larger and deeper. Feature quality determines the ceiling for any model.

Single-point detection. Each connection is scored independently. Real intrusions often span multiple connections (a scan followed by exploitation followed by exfiltration). Sequence-aware models (LSTMs, transformers) can capture these multi-step patterns, but require sequential data and are significantly more complex.

Dataset age. NSL-KDD reflects attack patterns from the late 1990s. Modern attacks look different: encrypted tunnels, living-off-the-land techniques, DNS exfiltration. Use CIC-IDS2017 or UNSW-NB15 for more realistic evaluation, though the methodology in this tutorial applies to any flow dataset.

Threshold sensitivity. The anomaly threshold is a single global value. Different attack types produce different reconstruction error ranges. An adaptive or per-service threshold would improve detection rates but adds complexity.
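A per-service threshold could be calibrated along these lines. This is a sketch under stated assumptions: the error values are synthetic, the helper names are hypothetical, and the `min_samples` fallback to the global threshold is a design choice added here to avoid calibrating on services with very few validation samples.

```python
import numpy as np

def per_group_thresholds(errors, groups, percentile=99, min_samples=50):
    """Percentile threshold per group, falling back to the global one for small groups."""
    errors = np.asarray(errors, dtype=float)
    groups = np.asarray(groups)
    global_threshold = float(np.percentile(errors, percentile))
    thresholds = {}
    for g in np.unique(groups):
        group_errors = errors[groups == g]
        if len(group_errors) >= min_samples:
            thresholds[str(g)] = float(np.percentile(group_errors, percentile))
        else:
            thresholds[str(g)] = global_threshold
    return thresholds, global_threshold

def flag_anomalies(errors, groups, thresholds, global_threshold):
    """Compare each error with its group's threshold (global for unseen groups)."""
    limits = np.array([thresholds.get(str(g), global_threshold) for g in groups])
    return np.asarray(errors, dtype=float) > limits

rng = np.random.default_rng(0)
# Hypothetical validation-normal errors for three services.
errors = np.concatenate([
    rng.lognormal(-3.0, 0.3, size=5000),  # 'http': small errors
    rng.lognormal(-1.0, 0.3, size=1000),  # 'ftp': larger errors, still normal
    rng.lognormal(-2.0, 0.3, size=5),     # 'rare': too few samples to calibrate
])
groups = np.array(['http'] * 5000 + ['ftp'] * 1000 + ['rare'] * 5)

thresholds, global_threshold = per_group_thresholds(errors, groups)
flags = flag_anomalies(errors, groups, thresholds, global_threshold)

http_rate = flags[groups == 'http'].mean()
global_http_rate = (errors[groups == 'http'] > global_threshold).mean()
print(f'http flagged with per-service threshold: {http_rate:.2%}')
print(f'http flagged with global threshold:      {global_http_rate:.2%}')
```

In this synthetic setup the global threshold is dominated by the noisier service, so it is too loose for the quieter one. Per-service thresholds restore a comparable flag rate for each service, at the cost of more thresholds to maintain and recalibrate.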

Next steps

This tutorial showed that autoencoders can learn what normal network traffic looks like and flag deviations as anomalies, with a modest edge over Isolation Forest on attacks that manifest through correlated feature patterns. The main gap is application-layer attacks that hide in normal-looking flows.

The next tutorial shifts from network flows to URL strings, using fine-tuned transformers to detect phishing URLs, a domain where the signal is in the text itself rather than in flow metadata.