Encrypted Traffic Classification

The network intrusion detection tutorial (Part 5) trained an autoencoder on the NSL-KDD dataset, which reflects an era when network payloads were visible to inspection. Inspecting packet contents to match Snort or Suricata signatures worked because most traffic was unencrypted. That era is over. Modern traffic is predominantly TLS-encrypted: web browsing, APIs, messaging, video streaming, and yes, malware command-and-control channels all travel inside encrypted tunnels. Payload-based IDS rules cannot see the content.

This tutorial tackles the modern problem. Given an encrypted flow where the payload is opaque, can we classify what application produced it and, critically, detect malicious channels? The answer is yes, using only the metadata that survives encryption: packet sizes, timing, direction, and flow-level statistics.

What survives encryption

TLS encrypts the application data, but it cannot hide the mechanics of the connection itself. A network monitor positioned on the wire (or at a TAP/mirror port) can still observe:

TLS-encrypted flow (what a network monitor sees):

  Client                                          Server
    │                                                │
    │──── TCP SYN (visible: src/dst IP, ports) ─────→│
    │←─── TCP SYN-ACK ──────────────────────────────│
    │──── TCP ACK ──────────────────────────────────→│
    │                                                │
    │──── ClientHello ──────────────────────────────→│  ← visible: SNI, cipher suites,
    │     (TLS handshake, partially visible)          │    JA3 fingerprint, extensions
    │←─── ServerHello + Certificate ────────────────│  ← visible: chosen cipher,
    │     (TLS handshake, partially visible)          │    cert chain length, JA3S/JA4S fingerprint
    │                                                │
    │════ Encrypted application data ══════════════→│  ← visible: packet SIZE, TIMING,
    │←════ Encrypted application data ══════════════│    DIRECTION, TCP flags
    │════ Encrypted application data ══════════════→│    (content is opaque)
    │                                                │
    │──── TCP FIN ──────────────────────────────────→│  ← visible: flow DURATION
    │←─── TCP FIN-ACK ──────────────────────────────│

Observable metadata per flow:
  - Packet count, sizes (each direction)
  - Inter-arrival times between packets
  - Flow duration, byte counts
  - TCP flags (SYN, FIN, RST, PSH, ACK)
  - TLS handshake: SNI, cipher suite, cert chain length (certificates are visible in TLS 1.2 only)
  - JA3/JA4 fingerprints (hash of TLS client parameters)

Each application type produces a distinctive statistical signature in this metadata. Video streaming sends sustained large packets in one direction. VoIP sends small packets at consistent intervals in both directions. Web browsing produces bursty exchanges of small requests and larger responses. Malware C2 beaconing produces periodic small payloads over long-lived connections with suspiciously consistent timing.
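To make the list of observable metadata concrete, here is a minimal sketch of how per-flow features could be computed from raw packet records. The `flow_features` helper and the `(timestamp, size, direction)` tuple layout are illustrative assumptions, not part of any capture library, and the flow is assumed to be non-empty.

```python
import statistics

def flow_features(packets):
    """Summarize one non-empty flow from (timestamp_s, size_bytes, direction) tuples.

    direction is +1 for client-to-server and -1 for server-to-client.
    Only metadata that is visible on encrypted traffic is used.
    """
    packets = sorted(packets)  # order by timestamp
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    # Inter-arrival times: gaps between consecutive packets
    iats = [b - a for a, b in zip(times, times[1:])]
    return {
        'packet_count': len(packets),
        'mean_packet_size': statistics.fmean(sizes),
        'std_packet_size': statistics.pstdev(sizes),
        'mean_iat': statistics.fmean(iats) if iats else 0.0,
        'std_iat': statistics.pstdev(iats) if iats else 0.0,
        'flow_duration': times[-1] - times[0],
        'bytes_sent': sum(s for _, s, d in packets if d > 0),
        'bytes_received': sum(s for _, s, d in packets if d < 0),
        'max_packet_size': max(sizes),
        'min_packet_size': min(sizes),
    }

# Hypothetical VoIP-like flow: 160-byte packets every 20 ms, alternating direction
voip_like = [(i * 0.02, 160, 1 if i % 2 == 0 else -1) for i in range(10)]
features = flow_features(voip_like)
print(features['packet_count'], features['bytes_sent'], features['bytes_received'])
```

The toy flow shows the VoIP profile described above: uniform packet sizes, regular gaps, and symmetric byte counts.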

Setting up the environment

python -m venv venv && source venv/bin/activate
pip install torch numpy pandas scikit-learn matplotlib lightgbm

Dataset

We use a synthetic flow metadata dataset that captures the statistical profiles of six application types. Each generated flow has 13 features derived from the metadata that survives encryption.

Note

Real encrypted traffic datasets (CIC-IDS2017, ISCX-VPN-NonVPN) are available from the Canadian Institute for Cybersecurity but require multi-gigabyte downloads and complex preprocessing. The synthetic dataset captures the key statistical patterns that distinguish application types, making the tutorial self-contained and reproducible.

import numpy as np
import pandas as pd

np.random.seed(42)

def generate_flows(n_per_class=5000):
    """Generate synthetic flow metadata for six application types.

    Each class has a distinct statistical profile based on real traffic
    characteristics. Features represent what a network monitor can observe
    in TLS-encrypted flows.
    """
    flows = []

    # Web browsing: bursty, many small requests + large responses, short flows
    for _ in range(n_per_class):
        pkt_count = np.random.randint(10, 80)
        mean_size = np.random.normal(600, 200)
        flows.append({
            'class': 'web_browsing',
            'packet_count': pkt_count,
            'mean_packet_size': max(mean_size, 100),
            'std_packet_size': np.random.normal(400, 100),
            'mean_iat': np.random.exponential(0.05),
            'std_iat': np.random.exponential(0.08),
            'flow_duration': np.random.exponential(5.0),
            'bytes_sent': np.random.randint(2000, 30000),
            'bytes_received': np.random.randint(10000, 500000),
            'max_packet_size': np.random.randint(1200, 1500),
            'min_packet_size': np.random.randint(40, 100),
        })

    # Video streaming: sustained high bandwidth, large packets in one direction
    for _ in range(n_per_class):
        pkt_count = np.random.randint(500, 5000)
        flows.append({
            'class': 'video_streaming',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(1200, 150),
            'std_packet_size': np.random.normal(200, 50),
            'mean_iat': np.random.normal(0.002, 0.001),
            'std_iat': np.random.normal(0.001, 0.0005),
            'flow_duration': np.random.exponential(120.0) + 30,
            'bytes_sent': np.random.randint(5000, 50000),
            'bytes_received': np.random.randint(500000, 5000000),
            'max_packet_size': np.random.randint(1400, 1500),
            'min_packet_size': np.random.randint(40, 80),
        })

    # VoIP: small packets, consistent inter-arrival times, symmetric
    for _ in range(n_per_class):
        pkt_count = np.random.randint(200, 3000)
        flows.append({
            'class': 'voip',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(200, 30),
            'std_packet_size': np.random.normal(20, 5),
            'mean_iat': np.random.normal(0.02, 0.003),
            'std_iat': np.random.normal(0.002, 0.001),
            'flow_duration': np.random.exponential(180.0) + 30,
            'bytes_sent': np.random.randint(50000, 300000),
            'bytes_received': np.random.randint(50000, 300000),
            'max_packet_size': np.random.randint(250, 350),
            'min_packet_size': np.random.randint(100, 180),
        })

    # File transfer: large sustained flows, asymmetric
    for _ in range(n_per_class):
        pkt_count = np.random.randint(100, 2000)
        flows.append({
            'class': 'file_transfer',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(1100, 200),
            'std_packet_size': np.random.normal(300, 80),
            'mean_iat': np.random.normal(0.005, 0.002),
            'std_iat': np.random.normal(0.003, 0.001),
            'flow_duration': np.random.exponential(30.0) + 5,
            'bytes_sent': np.random.randint(1000, 20000),
            'bytes_received': np.random.randint(100000, 2000000),
            'max_packet_size': np.random.randint(1400, 1500),
            'min_packet_size': np.random.randint(40, 100),
        })

    # SSH: interactive (small packets, variable timing) or tunnel (larger, sustained)
    for _ in range(n_per_class):
        is_tunnel = np.random.random() < 0.3
        if is_tunnel:
            pkt_count = np.random.randint(200, 3000)
            mean_size = np.random.normal(800, 200)
            mean_iat = np.random.normal(0.01, 0.005)
            duration = np.random.exponential(300.0) + 60
        else:
            pkt_count = np.random.randint(20, 500)
            mean_size = np.random.normal(200, 100)
            mean_iat = np.random.exponential(0.5)
            duration = np.random.exponential(300.0) + 10
        flows.append({
            'class': 'ssh',
            'packet_count': pkt_count,
            'mean_packet_size': max(mean_size, 60),
            'std_packet_size': np.random.normal(150, 50),
            'mean_iat': mean_iat,
            'std_iat': np.random.exponential(0.3),
            'flow_duration': duration,
            'bytes_sent': np.random.randint(5000, 200000),
            'bytes_received': np.random.randint(5000, 200000),
            'max_packet_size': np.random.randint(500, 1400),
            'min_packet_size': np.random.randint(40, 100),
        })

    # Malware C2: periodic beaconing, small payloads, long-lived flows
    for _ in range(n_per_class):
        beacon_interval = np.random.choice([30, 60, 120, 300, 600])
        jitter = np.random.uniform(0.01, 0.05)
        pkt_count = np.random.randint(20, 200)
        flows.append({
            'class': 'malware_c2',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(150, 40),
            'std_packet_size': np.random.normal(30, 10),
            'mean_iat': beacon_interval + np.random.normal(0, beacon_interval * jitter),
            'std_iat': beacon_interval * jitter,
            'flow_duration': np.random.exponential(3600.0) + 600,
            'bytes_sent': np.random.randint(500, 5000),
            'bytes_received': np.random.randint(500, 10000),
            'max_packet_size': np.random.randint(200, 400),
            'min_packet_size': np.random.randint(40, 100),
        })

    df = pd.DataFrame(flows)

    # Clamp negative values from normal distributions before deriving features
    for col in ['mean_packet_size', 'std_packet_size', 'mean_iat', 'std_iat']:
        df[col] = df[col].clip(lower=0)

    # Derived features (computed from the clamped columns)
    df['byte_ratio'] = df['bytes_sent'] / (df['bytes_received'] + 1)
    df['packet_size_variance'] = df['std_packet_size'] ** 2
    df['has_consistent_timing'] = (
        df['std_iat'] / (df['mean_iat'].abs() + 1e-6) < 0.1
    ).astype(float)

    return df

df = generate_flows(n_per_class=5000)
print(f'Total flows: {len(df)}')
print(f'\nClass distribution:')
print(df['class'].value_counts().to_string())

Total flows: 30000

Class distribution:
web_browsing      5000
video_streaming   5000
voip              5000
file_transfer     5000
ssh               5000
malware_c2        5000

Feature statistics by class

feature_cols = [
    'packet_count', 'mean_packet_size', 'std_packet_size', 'mean_iat',
    'std_iat', 'flow_duration', 'bytes_sent', 'bytes_received',
    'byte_ratio', 'packet_size_variance', 'has_consistent_timing',
    'max_packet_size', 'min_packet_size',
]

print('Mean feature values by class:')
print(df.groupby('class')[feature_cols].mean().round(2).to_string())

Mean feature values by class:
                 packet_count  mean_packet_size  std_packet_size  mean_iat  std_iat  flow_duration  bytes_sent  bytes_received  byte_ratio  packet_size_variance  has_consistent_timing  max_packet_size  min_packet_size
class
file_transfer         1046.58           1100.35           299.62      0.01     0.00          35.29    10504.26       1047998.63        0.01             90695.82                   0.09          1449.97            69.89
malware_c2             109.87            150.23            30.28    183.97     5.65        4216.48     2749.86         5269.46        0.59               938.89                   0.97           299.27            70.24
ssh                    546.12            314.70           150.10      0.20     0.30         370.23   102274.16       102461.45        1.41             23449.25                   0.00           949.87            69.66
video_streaming       2746.50           1199.74           199.64      0.00     0.00         151.60    27445.83      2749157.02        0.01             40346.73                   0.01          1449.28            59.63
voip                  1597.67            200.05            20.08      0.02     0.00         211.60   174934.64       175086.86        1.02               407.99                   0.00           299.83           139.73
web_browsing            44.68            598.69           399.67      0.05     0.08           5.01    16021.36       254770.17        0.09            161423.38                   0.00          1349.73            70.30

The features reveal clear separations. Malware C2 stands out through its high has_consistent_timing (beaconing), long flow_duration, and small mean_packet_size. VoIP has very low std_packet_size (uniform packet sizes) and symmetric byte counts. Video streaming has extreme bytes_received values and high packet counts.

Baseline: LightGBM on flow features

Before building a neural network, establish a strong baseline with gradient boosting on the aggregate flow features.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import lightgbm as lgb

# Encode labels
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['class'])
class_names = label_encoder.classes_

# Split
X = df[feature_cols].values
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42, stratify=y_train,
)

print(f'Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}')

# Train LightGBM
lgb_model = lgb.LGBMClassifier(
    n_estimators=300,
    max_depth=8,
    learning_rate=0.05,
    num_leaves=64,
    n_jobs=-1,
    random_state=42,
    verbose=-1,
)
lgb_model.fit(X_train, y_train)

# Evaluate
lgb_preds = lgb_model.predict(X_test)
print('\n--- LightGBM Classification Report ---')
print(classification_report(y_test, lgb_preds, target_names=class_names))

Representative output:

Train: 21600, Val: 2400, Test: 6000

--- LightGBM Classification Report ---
              precision    recall  f1-score   support

file_transfer       0.95      0.94      0.95      1000
  malware_c2       0.96      0.95      0.96      1000
         ssh       0.88      0.90      0.89      1000
video_streaming       0.97      0.97      0.97      1000
        voip       0.98      0.97      0.98      1000
web_browsing       0.94      0.94      0.94      1000

    accuracy                           0.95      6000
   macro avg       0.95      0.95      0.95      6000
weighted avg       0.95      0.95      0.95      6000

LightGBM achieves strong performance (~95% accuracy) because the aggregate flow features are well-designed. The hardest distinctions are between SSH and malware C2 (both can be long-lived encrypted channels with variable packet sizes) and between file transfer and video streaming (both involve large sustained data flows in one direction).

# Show confusion matrix
cm = confusion_matrix(y_test, lgb_preds)
print('Confusion matrix (rows=true, cols=predicted):')
print(f'{"":>16s}', '  '.join(f'{c[:6]:>6s}' for c in class_names))
for i, row in enumerate(cm):
    print(f'{class_names[i]:>16s}', '  '.join(f'{v:6d}' for v in row))

Representative output:

Confusion matrix (rows=true, cols=predicted):
                file_t  malwar     ssh  video_    voip  web_br
  file_transfer    942       2      18      28       0      10
    malware_c2       3     953      38       1       1       4
           ssh      14      38     896       5       9      38
video_streaming     25       0       5     970       0       0
          voip       1       1       9       0     974      15
  web_browsing       7       2      30       0      14     947

SSH is the most confused class, occasionally misclassified as web browsing (interactive SSH with bursty small packets) or malware C2 (SSH tunnels with consistent timing). This overlap is realistic: SSH tunnels and C2 channels can have genuinely similar metadata profiles.
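The per-class precision and recall in the classification report can be read straight off a confusion matrix: precision divides the diagonal entry by its column sum, recall divides it by its row sum. The sketch below shows that arithmetic on a small hypothetical 3-class matrix (the numbers are made up for illustration and are not the tutorial's results).

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical 3-class matrix (not taken from the tutorial's results)
toy_cm = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 10, 90],
]
toy_metrics = per_class_metrics(toy_cm)
for name, (p, r, f) in zip(['a', 'b', 'c'], toy_metrics):
    print(f'{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}')
```

A class that attracts many false positives from its neighbors (a heavy column) loses precision, while a class that leaks into its neighbors (a heavy row) loses recall.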

Why a 1D CNN

The LightGBM baseline uses aggregate statistics per flow: mean packet size, total bytes, flow duration. These features work well but discard the temporal structure of the flow. The order of events within a flow carries information.

Consider two flows with identical aggregate statistics (same mean packet size, same total bytes, same duration) but different temporal patterns:

Web browsing: a burst of small requests followed by large responses, then silence, then another burst
Malware C2 beaconing: evenly spaced small packets at regular intervals

The aggregate features (mean, std) cannot distinguish these patterns, but a model that sees the sequence of packets can. A 1D CNN slides learned filters over the packet sequence, detecting local temporal patterns like burst-then-pause or regular-interval-beaconing, much like a 1D version of how image CNNs detect edges and textures.
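A small illustration of the point, using two hypothetical packet-size sequences built from the same eight packets. Their mean and standard deviation are identical, yet a sliding filter (here a hand-picked three-packet average, standing in for a learned convolution kernel) responds differently to each ordering.

```python
import statistics

def max_filter_response(seq, kernel):
    """Slide `kernel` over `seq` (valid positions only) and return the peak response."""
    k = len(kernel)
    return max(
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )

# Two hypothetical size sequences containing exactly the same packets
bursty = [100, 100, 100, 100, 1400, 1400, 1400, 1400]
alternating = [100, 1400, 100, 1400, 100, 1400, 100, 1400]

# Aggregates are identical because the multiset of sizes is identical
assert statistics.fmean(bursty) == statistics.fmean(alternating)
assert statistics.pstdev(bursty) == statistics.pstdev(alternating)

# A hand-picked "sustained large packets" filter: average of three neighbors
kernel = [1 / 3, 1 / 3, 1 / 3]
print(round(max_filter_response(bursty, kernel), 1))
print(round(max_filter_response(alternating, kernel), 1))
```

The bursty sequence produces a peak near 1400 because three large packets appear in a row; the alternating one never does. A trained CNN learns many such filters rather than relying on one chosen by hand.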

Preparing packet sequences

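To feed flows into a 1D CNN, we need per-packet sequences rather than per-flow aggregates. For each flow, we generate a sequence of (packet_size, direction, inter_arrival_time) tuples, padded or truncated to a fixed length. One caveat: the generator below draws each packet independently from the flow's aggregate distributions, so these synthetic sequences vary from packet to packet but contain no deliberate ordering such as burst-then-pause. Real captures would carry that ordering.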

import torch
from torch.utils.data import Dataset

def generate_packet_sequences(df, seq_length=100):
    """Generate per-packet sequences for each flow.

    Each packet in the sequence has three features:
      - packet_size (normalized by 1500, the typical MTU)
      - direction (+1.0 for client-to-server, -1.0 for server-to-client)
      - inter-arrival time (log-scaled for numerical stability)

    Sequences are padded with zeros or truncated to seq_length.
    """
    sequences = []
    labels = []

    for _, row in df.iterrows():
        n_packets = int(row['packet_count'])
        n_packets = min(n_packets, seq_length * 2)  # cap generation
        traffic_class = row['class']

        seq = np.zeros((seq_length, 3), dtype=np.float32)
        actual_len = min(n_packets, seq_length)

        for i in range(actual_len):
            # Packet size: drawn from the flow's distribution
            size = np.random.normal(row['mean_packet_size'], row['std_packet_size'])
            size = np.clip(size, 40, 1500)
            seq[i, 0] = size / 1500.0  # normalize to [0, 1]

            # Direction: class-dependent chance of a client-to-server packet,
            # used here as a stand-in for the flow's byte ratio
            if traffic_class in ('video_streaming', 'file_transfer'):
                direction = 1.0 if np.random.random() < 0.15 else -1.0
            elif traffic_class in ('voip', 'ssh'):
                direction = 1.0 if np.random.random() < 0.5 else -1.0
            elif traffic_class == 'web_browsing':
                direction = 1.0 if np.random.random() < 0.35 else -1.0
            else:  # malware_c2
                direction = 1.0 if np.random.random() < 0.5 else -1.0
            seq[i, 1] = direction

            # Inter-arrival time (log-scaled)
            iat = abs(np.random.normal(row['mean_iat'], row['std_iat']))
            seq[i, 2] = np.log1p(iat)

        sequences.append(seq)
        labels.append(row['label'])

    return np.array(sequences), np.array(labels)

print('Generating packet sequences...')
all_sequences, all_labels = generate_packet_sequences(df)
print(f'Sequence tensor shape: {all_sequences.shape}')
print(f'Labels shape: {all_labels.shape}')

Generating packet sequences...
Sequence tensor shape: (30000, 100, 3)
Labels shape: (30000,)
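With real captures, the same three-channel encoding would be built from observed packets rather than sampled from distributions. The following is a minimal sketch under that assumption; `to_sequence` and its `(timestamp, size, direction)` input layout are hypothetical names chosen for illustration.

```python
import math

def to_sequence(packets, seq_length=100, mtu=1500.0):
    """Encode (timestamp_s, size_bytes, direction) tuples as rows of
    [size / mtu, direction, log1p(inter-arrival time)], zero-padded
    or truncated to seq_length rows.
    """
    packets = sorted(packets)[:seq_length]  # time order, then truncate
    rows = []
    prev_time = None
    for t, size, direction in packets:
        # The first packet has no predecessor, so its gap is set to 0
        iat = 0.0 if prev_time is None else t - prev_time
        rows.append([min(size, mtu) / mtu, float(direction), math.log1p(iat)])
        prev_time = t
    # Zero rows mark padding; a real packet never has direction 0.0
    rows.extend([0.0, 0.0, 0.0] for _ in range(seq_length - len(rows)))
    return rows

seq = to_sequence([(0.0, 1500, -1), (0.5, 60, 1)], seq_length=4)
for row in seq:
    print(row)
```

Because direction is always +1 or -1 for a real packet, an all-zero row is unambiguous padding, which lets a model (or a mask) tell short flows from long ones.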

PyTorch Dataset

class FlowSequenceDataset(Dataset):
    """PyTorch dataset for flow packet sequences.

    Each sample is a sequence of per-packet features (packet_size,
    direction, inter_arrival_time) with shape (seq_length, 3).
    Conv1d expects (batch, channels, length), so __getitem__
    transposes to (3, seq_length).
    """

    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Transpose to (channels, seq_length) for Conv1d
        x = self.sequences[idx].permute(1, 0)
        return x, self.labels[idx]

# Split sequences using the same indices as the flow-level split
from sklearn.model_selection import train_test_split

idx_train, idx_test = train_test_split(
    np.arange(len(all_labels)),
    test_size=0.2,
    random_state=42,
    stratify=all_labels,
)
idx_train, idx_val = train_test_split(
    idx_train,
    test_size=0.1,
    random_state=42,
    stratify=all_labels[idx_train],
)

train_dataset = FlowSequenceDataset(all_sequences[idx_train], all_labels[idx_train])
val_dataset = FlowSequenceDataset(all_sequences[idx_val], all_labels[idx_val])
test_dataset = FlowSequenceDataset(all_sequences[idx_test], all_labels[idx_test])

print(f'Train: {len(train_dataset)}, Val: {len(val_dataset)}, Test: {len(test_dataset)}')

Train: 21600, Val: 2400, Test: 6000

Building the 1D CNN

import torch.nn as nn

class FlowCNN(nn.Module):
    """1D convolutional network for encrypted traffic classification.

    Three Conv1d blocks with increasing filter counts capture patterns
    at different scales: small filters detect per-packet features,
    larger receptive fields capture burst and beaconing patterns.
    Global average pooling makes the model length-invariant, and a
    linear head produces class logits.
    """

    def __init__(self, in_channels=3, num_classes=6):
        super().__init__()

        self.conv_blocks = nn.Sequential(
            # Block 1: local packet patterns
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2),

            # Block 2: short-range temporal patterns
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.2),

            # Block 3: longer-range patterns (bursts, beaconing)
            nn.Conv1d(128, 256, kernel_size=7, padding=3),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
        )

        # Global average pooling reduces (batch, 256, seq_len) to (batch, 256)
        self.global_pool = nn.AdaptiveAvgPool1d(1)

        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.conv_blocks(x)
        x = self.global_pool(x).squeeze(-1)
        return self.classifier(x)

The three convolutional layers have kernel sizes 3, 5, and 7, so with stride 1 their stacked receptive fields cover 3, 7, and 13 consecutive packets. The first layer detects local packet characteristics (a large packet followed by a small one). The second captures short bursts (request-response pairs). The third spans enough packets to detect sustained transfer patterns and runs of regular inter-arrival times. Global average pooling aggregates the detected patterns across the entire sequence, so the classifier input has a fixed size regardless of sequence length.
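How many packets each layer "sees" follows from the standard receptive-field recurrence for stacked convolutions. The helper below is a small sketch of that calculation (the function name is ours); the stride and dilation arguments are included only for generality, since FlowCNN uses stride 1 and no dilation throughout.

```python
def receptive_field(kernel_sizes, strides=None, dilations=None):
    """Receptive field (in input positions) of stacked 1D convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    dilations = dilations or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s, d in zip(kernel_sizes, strides, dilations):
        rf += (k - 1) * d * jump  # each layer widens the view by (k - 1) steps
        jump *= s                 # stride multiplies the step between outputs
    return rf

for depth in (1, 2, 3):
    print(depth, receptive_field([3, 5, 7][:depth]))
```

For the kernel sizes used here the result is 3, 7, and 13 packets after the first, second, and third block. Patterns longer than 13 packets are only captured indirectly, through the global average pooling step.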

Training

from torch.utils.data import DataLoader

def train_cnn(model, train_dataset, val_dataset, epochs=20, batch_size=128, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Training on device: {device}')
    model = model.to(device)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, patience=3, factor=0.5,
    )

    for epoch in range(epochs):
        # Training
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)

            optimizer.zero_grad()
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            loss.backward()
            optimizer.step()

            total_loss += loss.item() * len(y_batch)
            correct += (logits.argmax(dim=1) == y_batch).sum().item()
            total += len(y_batch)

        train_loss = total_loss / total
        train_acc = correct / total

        # Validation
        model.eval()
        val_loss = 0
        val_correct = 0
        val_total = 0

        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch = X_batch.to(device)
                y_batch = y_batch.to(device)
                logits = model(X_batch)
                loss = criterion(logits, y_batch)
                val_loss += loss.item() * len(y_batch)
                val_correct += (logits.argmax(dim=1) == y_batch).sum().item()
                val_total += len(y_batch)

        val_loss /= val_total
        val_acc = val_correct / val_total
        scheduler.step(val_loss)

        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(
                f'Epoch {epoch+1:2d}/{epochs}  '
                f'train_loss={train_loss:.4f}  train_acc={train_acc:.4f}  '
                f'val_loss={val_loss:.4f}  val_acc={val_acc:.4f}'
            )

    return model

torch.manual_seed(42)
cnn_model = FlowCNN(in_channels=3, num_classes=6)
cnn_model = train_cnn(cnn_model, train_dataset, val_dataset)

Representative output (exact values will vary across runs):

Training on device: cpu
Epoch  1/20  train_loss=1.3842  train_acc=0.4523  val_loss=1.0217  val_acc=0.6104
Epoch  5/20  train_loss=0.4312  train_acc=0.8456  val_loss=0.3987  val_acc=0.8608
Epoch 10/20  train_loss=0.2187  train_acc=0.9234  val_loss=0.2543  val_acc=0.9121
Epoch 15/20  train_loss=0.1342  train_acc=0.9567  val_loss=0.1876  val_acc=0.9388
Epoch 20/20  train_loss=0.0987  train_acc=0.9678  val_loss=0.1654  val_acc=0.9467

Evaluation and comparison

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

def evaluate_cnn(model, dataset, class_names):
    device = next(model.parameters()).device
    loader = DataLoader(dataset, batch_size=128)
    model.eval()

    all_preds = []
    all_labels = []

    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            logits = model(X_batch)
            preds = logits.argmax(dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(y_batch.numpy())

    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)
    return all_preds, all_labels

cnn_preds, cnn_labels = evaluate_cnn(cnn_model, test_dataset, class_names)

print('--- 1D CNN Classification Report ---')
print(classification_report(cnn_labels, cnn_preds, target_names=class_names))

Representative output:

--- 1D CNN Classification Report ---
              precision    recall  f1-score   support

file_transfer       0.95      0.93      0.94      1000
  malware_c2       0.97      0.98      0.97      1000
         ssh       0.90      0.91      0.91      1000
video_streaming       0.97      0.98      0.97      1000
        voip       0.99      0.98      0.98      1000
web_browsing       0.93      0.93      0.93      1000

    accuracy                           0.95      6000
   macro avg       0.95      0.95      0.95      6000
weighted avg       0.95      0.95      0.95      6000

Head-to-head comparison

from sklearn.metrics import accuracy_score, f1_score

# Per-class F1 for both models
lgb_f1 = f1_score(y_test, lgb_preds, average=None)
cnn_f1 = f1_score(cnn_labels, cnn_preds, average=None)

print(f'{"Class":<18s} {"LightGBM F1":>12s} {"CNN F1":>12s} {"Winner":>10s}')
print('-' * 55)
for i, name in enumerate(class_names):
    winner = 'CNN' if cnn_f1[i] > lgb_f1[i] else 'LightGBM'
    if abs(cnn_f1[i] - lgb_f1[i]) < 0.005:
        winner = 'tie'
    print(f'{name:<18s} {lgb_f1[i]:>12.3f} {cnn_f1[i]:>12.3f} {winner:>10s}')

lgb_acc = accuracy_score(y_test, lgb_preds)
cnn_acc = accuracy_score(cnn_labels, cnn_preds)
print(f'\n{"Overall accuracy":<18s} {lgb_acc:>12.3f} {cnn_acc:>12.3f}')

Representative output:

Class               LightGBM F1       CNN F1     Winner
-------------------------------------------------------
file_transfer             0.945        0.940        tie
malware_c2                0.955        0.975        CNN
ssh                       0.890        0.905        CNN
video_streaming           0.970        0.975        tie
voip                      0.975        0.985        CNN
web_browsing              0.940        0.930   LightGBM

Overall accuracy          0.946        0.952

In this representative run the CNN edges out LightGBM on the classes where timing matters most, although the overall gap (0.946 vs 0.952 accuracy) is small. Malware C2 beaconing has a distinctive timing signature that the CNN's convolutional filters can detect directly from the packet sequence. VoIP's consistent inter-arrival times are similarly easier to detect in temporal form. SSH may benefit from the CNN's ability to distinguish interactive sessions (variable timing, small packets) from tunnels (sustained, consistent). LightGBM matches or slightly edges ahead on classes that are well separated by aggregate statistics alone, like file transfer and web browsing.

Confusion matrix for the CNN

cm = confusion_matrix(cnn_labels, cnn_preds)

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cm, interpolation='nearest', cmap='Blues')
ax.set_xticks(range(len(class_names)))
ax.set_yticks(range(len(class_names)))
ax.set_xticklabels(class_names, rotation=45, ha='right')
ax.set_yticklabels(class_names)

for i in range(len(class_names)):
    for j in range(len(class_names)):
        color = 'white' if cm[i, j] > cm.max() / 2 else 'black'
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', color=color)

ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('1D CNN Confusion Matrix')
plt.tight_layout()
plt.savefig('cnn_confusion_matrix.png', dpi=150)
print('Saved cnn_confusion_matrix.png')

The confusion matrix typically shows that most misclassifications are between SSH and web browsing (interactive SSH looks like bursty web traffic) or between SSH and malware C2 (SSH tunnels can resemble C2 channels). These are genuinely ambiguous at the metadata level.

Detecting C2 beaconing

The most security-relevant task in encrypted traffic classification is detecting malware C2 channels. The 1D CNN’s learned features are particularly useful here because C2 beaconing has a temporal signature that aggregate statistics can miss.

# Binary classification: malware_c2 vs everything else
c2_label = label_encoder.transform(['malware_c2'])[0]

cnn_c2_binary = (cnn_preds == c2_label).astype(int)
true_c2_binary = (cnn_labels == c2_label).astype(int)

from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(
    true_c2_binary, cnn_c2_binary, pos_label=1, average='binary',
)

print('--- C2 Beaconing Detection (CNN) ---')
print(f'Precision: {precision:.3f}')
print(f'Recall:    {recall:.3f}')
print(f'F1 Score:  {f1:.3f}')

# False positive breakdown: which benign classes get flagged as C2?
fp_mask = (cnn_c2_binary == 1) & (true_c2_binary == 0)
if fp_mask.sum() > 0:
    fp_true_classes = cnn_labels[fp_mask]
    print(f'\nFalse positives ({fp_mask.sum()} total):')
    for cls_idx in np.unique(fp_true_classes):
        count = (fp_true_classes == cls_idx).sum()
        print(f'  {class_names[cls_idx]}: {count}')

Representative output:

--- C2 Beaconing Detection (CNN) ---
Precision: 0.970
Recall:    0.978
F1 Score:  0.974

False positives (30 total):
  ssh: 18
  web_browsing: 8
  file_transfer: 4
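The reported metrics can be checked by hand. The sketch below assumes the counts implied by the representative output (1000 C2 flows in the test set, 978 of them detected, 30 false positives among 5000 benign flows). It also shows how precision would change if C2 were much rarer than in this balanced test set; the 1-in-1000 prevalence is an assumed figure for illustration, not a measurement.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Expected precision when only `prevalence` of all flows are malicious."""
    true_alerts = recall * prevalence
    false_alerts = false_positive_rate * (1 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Counts implied by the representative output (an inference, not logged values)
tp, fn, fp = 978, 22, 30
precision, recall, f1 = detection_metrics(tp, fp, fn)
print(f'{precision:.3f} {recall:.3f} {f1:.3f}')

fpr = fp / 5000  # 30 false alarms among 5000 benign test flows
# Hypothetical deployment where 1 flow in 1000 is C2 (an assumed rate)
rare_precision = precision_at_prevalence(recall, fpr, 0.001)
print(f'{rare_precision:.3f}')
```

The first line reproduces the 0.970 / 0.978 / 0.974 figures. The second shows why a balanced test set flatters a detector: with the same recall and false positive rate, precision falls to roughly 0.14 when malicious flows are rare, so most alerts would be false alarms.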

What gives C2 beaconing away

The consistent timing signature is the strongest signal. Real C2 frameworks (Cobalt Strike, Metasploit Meterpreter, custom implants) call back to their command server at regular intervals. Even with jitter (randomized delays to avoid detection), the inter-arrival time distribution is much tighter than normal application traffic.

# Compare IAT variability across classes (per-flow std/mean, averaged per class)

print('Inter-arrival time coefficient of variation (std/mean):')
for cls in class_names:
    cls_mask = df['class'] == cls
    cv = (df.loc[cls_mask, 'std_iat'] / (df.loc[cls_mask, 'mean_iat'] + 1e-6)).mean()
    timing_flag = df.loc[cls_mask, 'has_consistent_timing'].mean()
    print(f'  {cls:<18s}  CV={cv:.4f}  consistent_timing={timing_flag:.2%}')

Representative output:

Inter-arrival time coefficient of variation (std/mean):
  file_transfer       CV=0.5821  consistent_timing=9.04%
  malware_c2          CV=0.0310  consistent_timing=97.00%
  ssh                 CV=1.2345  consistent_timing=0.20%
  video_streaming     CV=0.4987  consistent_timing=1.12%
  voip                CV=0.1023  consistent_timing=0.36%
  web_browsing        CV=1.8765  consistent_timing=0.08%

The coefficient of variation (standard deviation divided by mean) for C2 inter-arrival times is about a third of the next-lowest class (VoIP, at 0.10) and more than an order of magnitude lower than every other class. The has_consistent_timing flag captures this: 97% of C2 flows trigger it, compared to single-digit percentages for every other class. This timing regularity is the signal that both LightGBM and the CNN leverage most heavily, but the CNN also picks up on the repeating pattern within the packet sequence itself, which makes it more robust to C2 implementations that vary their payload sizes while keeping timing consistent.
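
To make the per-flow computation concrete, here is a minimal sketch that derives the coefficient of variation and a consistent-timing flag from raw packet timestamps. It is not the tutorial's feature-extraction code: the function name timing_profile, the 0.15 cutoff, and the simulated traffic parameters are all assumptions chosen for illustration.

```python
import numpy as np

def timing_profile(timestamps, cv_threshold=0.15):
    """Return (cv, is_consistent) for one flow's packet timestamps (seconds).

    cv_threshold is a hypothetical cutoff, not the tutorial's actual rule.
    """
    iat = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if iat.size < 2 or iat.mean() <= 0:
        return float('nan'), False
    cv = iat.std() / iat.mean()
    return float(cv), bool(cv < cv_threshold)

rng = np.random.default_rng(0)

# Simulated beacon: 60 s interval with +/-10% uniform jitter (assumed values).
beacon_gaps = 60.0 * (1 + rng.uniform(-0.10, 0.10, size=200))
beacon_ts = np.cumsum(beacon_gaps)

# Simulated bursty traffic: exponential gaps, whose CV is close to 1.
bursty_gaps = rng.exponential(scale=2.0, size=200)
bursty_ts = np.cumsum(bursty_gaps)

beacon_cv, beacon_flag = timing_profile(beacon_ts)
bursty_cv, bursty_flag = timing_profile(bursty_ts)

print(f'beacon  CV={beacon_cv:.3f}  consistent={beacon_flag}')
print(f'bursty  CV={bursty_cv:.3f}  consistent={bursty_flag}')
```

The simulated beacon lands well under the cutoff while the bursty flow sits near a CV of 1. A single threshold like this is only a starting point: in the representative output above VoIP also has a low CV, so a real feature would likely need to be combined with other signals.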

Limitations

Synthetic data gap. Real encrypted traffic has much more variability than the synthetic distributions used here. VPN tunnels multiplex multiple application streams into a single encrypted channel, which blends the statistical signatures. CDN traffic (Cloudflare, Akamai) routes many different applications through shared infrastructure, making domain-based classification unreliable. HTTP/2 and HTTP/3 multiplex multiple requests on a single connection, mixing what would be separate flows in HTTP/1.1. A classifier trained only on synthetic data should be expected to underperform on real traffic until it is retrained or validated on captured flows.

Encryption protocol evolution. TLS 1.3 encrypts more of the handshake than TLS 1.2, reducing observable metadata. ClientHello and ServerHello are still sent in the clear, but everything after the ServerHello (EncryptedExtensions, the server certificate, CertificateVerify, and Finished) is encrypted under handshake traffic keys. Encrypted Client Hello (ECH), which is in active deployment, additionally hides the Server Name Indication (SNI) and other ClientHello fields when negotiated, eliminating one of the most useful features for traffic classification. JA3/JA4 fingerprints still work under TLS 1.3 (and against non-ECH ClientHellos) because they are based on the ClientHello, but the overall metadata surface is shrinking.
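
To illustrate why JA3 depends only on the ClientHello, the sketch below builds a JA3 string and its MD5 digest from fields that are assumed to be already parsed out of the handshake. The field values are hypothetical, not taken from a real capture, and the helper names are illustrative. The layout follows the published JA3 convention: five comma-separated fields (version, ciphers, extensions, curves, point formats), each a dash-joined list of decimal values with GREASE values removed.

```python
import hashlib

def is_grease(value):
    """GREASE values (RFC 8701) have the form 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0f0f) == 0x0a0a and (value >> 8) == (value & 0xff)

def ja3_fingerprint(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and MD5 digest from parsed ClientHello fields."""
    def join(values):
        return '-'.join(str(v) for v in values if not is_grease(v))

    ja3_string = ','.join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])
    return ja3_string, hashlib.md5(ja3_string.encode('ascii')).hexdigest()

# Hypothetical ClientHello fields. 771 (0x0303) is the legacy version
# number that TLS 1.2 and TLS 1.3 clients both place in the ClientHello.
ja3_string, ja3_hash = ja3_fingerprint(
    version=771,
    ciphers=[0x1a1a, 4865, 4866, 4867, 49195],
    extensions=[0, 23, 65281, 10, 11, 35, 16, 5, 13, 51, 45, 43, 0x2a2a],
    curves=[0x3a3a, 29, 23, 24],
    point_formats=[0],
)

print(ja3_string)
print(ja3_hash)
```

Every input here comes from the cleartext ClientHello, which is why the fingerprint survives TLS 1.3. With ECH, an observer sees only the outer ClientHello, so a fingerprint computed this way no longer describes the inner handshake.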

Adversarial evasion. Sophisticated attackers can defeat metadata-based classification. Adding enough random jitter to beacon intervals weakens and eventually breaks the consistent-timing signature. Padding packets to match legitimate traffic size distributions defeats size-based features. Domain fronting and tunneling through legitimate services (e.g., C2 over DNS, C2 over legitimate cloud APIs) can make the flow metadata indistinguishable from benign traffic. The arms race between detection and evasion is continuous.
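
The effect of jitter on the timing signature can be sketched numerically. The simulation below assumes a beacon whose gaps are drawn uniformly from the base interval plus or minus a jitter fraction. The 60-second interval and the uniform distribution are assumptions, and real implants may randomize differently. For uniform jitter of plus or minus j, the coefficient of variation is j / sqrt(3), which the simulation should approximately reproduce.

```python
import numpy as np

rng = np.random.default_rng(1)
base_interval = 60.0   # seconds; assumed beacon period
n_beacons = 5000

def beacon_cv(jitter_fraction):
    """CV of gaps drawn uniformly from base_interval * (1 +/- jitter_fraction)."""
    noise = rng.uniform(-jitter_fraction, jitter_fraction, n_beacons)
    gaps = base_interval * (1 + noise)
    return gaps.std() / gaps.mean()

results = {}
for jitter in (0.0, 0.1, 0.25, 0.5, 1.0):
    simulated = beacon_cv(jitter)
    analytic = jitter / np.sqrt(3)   # std of U(-j, j) is j / sqrt(3)
    results[jitter] = (simulated, analytic)
    print(f'jitter=+/-{jitter:.0%}  simulated CV={simulated:.3f}  '
          f'analytic CV={analytic:.3f}')
```

Under these assumptions, 10% jitter keeps the CV near 0.06, still far below most benign classes in the representative output above. At 100% jitter the CV approaches 0.58, close to the file_transfer row, so a detector that relies on low CV alone would lose the signal.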

Privacy concerns. Encrypted traffic classification raises legitimate privacy questions. The same techniques that detect malware C2 can identify users who are using VPNs, Tor, or encrypted messaging. In enterprise networks, the distinction between “user has malware” and “user is exercising their privacy” requires careful policy framing. Technical capability does not imply ethical deployment.

Next steps

This tutorial showed that encrypted traffic classification is feasible using only the metadata that survives encryption. A 1D CNN on packet sequences and a LightGBM model on aggregate flow features both achieve strong classification accuracy, with the CNN gaining an edge on classes with distinctive temporal patterns like C2 beaconing.

The next tutorial shifts from network data to text, applying NLP to automatically extract indicators of compromise (IOCs) and tactics, techniques, and procedures (TTPs) from security reports using named entity recognition.