The network intrusion detection tutorial (Part 5) trained an autoencoder on the NSL-KDD dataset, which reflects an era when network payloads were visible to inspection. Inspecting packet contents to match Snort or Suricata signatures worked because most traffic was unencrypted. That era is over. Modern traffic is predominantly TLS-encrypted: web browsing, APIs, messaging, video streaming, and yes, malware command-and-control channels all travel inside encrypted tunnels. Payload-based IDS rules cannot see the content.
This tutorial tackles the modern problem. Given an encrypted flow where the payload is opaque, can we classify what application produced it and, critically, detect malicious channels? The answer is yes, using only the metadata that survives encryption: packet sizes, timing, direction, and flow-level statistics.
What survives encryption
TLS encrypts the application data, but it cannot hide the mechanics of the connection itself. A network monitor positioned on the wire (or at a TAP/mirror port) can still observe:
TLS-encrypted flow (what a network monitor sees):
Client                                      Server
│                                                │
│──── TCP SYN (visible: src/dst IP, ports) ─────→│
│←─── TCP SYN-ACK ───────────────────────────────│
│──── TCP ACK ──────────────────────────────────→│
│                                                │
│──── ClientHello ──────────────────────────────→│ ← visible: SNI, cipher suites,
│     (TLS handshake, partially visible)         │   JA3/JA4 fingerprint, extensions
│←─── ServerHello + Certificate ─────────────────│ ← visible: chosen cipher,
│     (TLS handshake, partially visible)         │   cert chain length (TLS 1.2), JA3S/JA4S fingerprint
│                                                │
│════ Encrypted application data ═══════════════→│ ← visible: packet SIZE, TIMING,
│←═══ Encrypted application data ════════════════│   DIRECTION, TCP flags
│════ Encrypted application data ═══════════════→│   (content is opaque)
│                                                │
│──── TCP FIN ──────────────────────────────────→│ ← visible: flow DURATION
│←─── TCP FIN-ACK ───────────────────────────────│

Observable metadata per flow:
- Packet count, sizes (each direction)
- Inter-arrival times between packets
- Flow duration, byte counts
- TCP flags (SYN, FIN, RST, PSH, ACK)
- TLS handshake: SNI, cipher suite, cert chain length
- JA3/JA4 fingerprints (hash of TLS client parameters)

Each application type produces a distinctive statistical signature in this metadata. Video streaming sends sustained large packets in one direction. VoIP sends small packets at consistent intervals in both directions. Web browsing produces bursty exchanges of small requests and larger responses. Malware C2 beaconing produces periodic small payloads over long-lived connections with suspiciously consistent timing.
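To make the list of observable metadata concrete, here is a minimal sketch of how per-flow statistics could be computed from captured packet records. The `flow_features` helper and the `(timestamp, size, direction)` tuple layout are assumptions made for illustration; they are not part of the tutorial's pipeline, which generates these statistics synthetically.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Summarize one flow from (timestamp_s, size_bytes, direction) tuples.

    direction is +1 for client-to-server and -1 for server-to-client.
    (Hypothetical record layout, chosen for this sketch.)
    """
    packets = sorted(packets)  # tuples sort by timestamp first
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    # Inter-arrival times: gaps between consecutive packets
    iats = [b - a for a, b in zip(times, times[1:])]
    return {
        "packet_count": len(packets),
        "mean_packet_size": mean(sizes),
        "std_packet_size": pstdev(sizes),
        "mean_iat": mean(iats) if iats else 0.0,
        "std_iat": pstdev(iats) if iats else 0.0,
        "flow_duration": times[-1] - times[0],
        "bytes_sent": sum(s for _, s, d in packets if d > 0),
        "bytes_received": sum(s for _, s, d in packets if d < 0),
    }


# Hypothetical five-packet flow: small requests out, larger responses back.
example = [
    (0.00, 120, +1),
    (0.05, 1400, -1),
    (0.06, 1400, -1),
    (0.50, 130, +1),
    (0.55, 900, -1),
]
features = flow_features(example)
print(features)
```

None of these values require decrypting anything: they come from packet headers and capture timestamps alone.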
Setting up the environment
python -m venv venv && source venv/bin/activate
pip install torch numpy pandas scikit-learn matplotlib lightgbm

Dataset
We use a synthetic flow metadata dataset that captures the statistical profiles of six application types. Each generated flow has 13 features derived from the metadata that survives encryption.
Note
Real encrypted traffic datasets (CIC-IDS2017, ISCX-VPN-NonVPN) are available from the Canadian Institute for Cybersecurity but require multi-gigabyte downloads and complex preprocessing. The synthetic dataset captures the key statistical patterns that distinguish application types, making the tutorial self-contained and reproducible.
import numpy as np
import pandas as pd

np.random.seed(42)


def generate_flows(n_per_class=5000):
    """Generate synthetic flow metadata for six application types.

    Each class has a distinct statistical profile based on real traffic
    characteristics. Features represent what a network monitor can observe
    in TLS-encrypted flows.
    """
    flows = []

    # Web browsing: bursty, many small requests + large responses, short flows
    for _ in range(n_per_class):
        pkt_count = np.random.randint(10, 80)
        mean_size = np.random.normal(600, 200)
        flows.append({
            'class': 'web_browsing',
            'packet_count': pkt_count,
            'mean_packet_size': max(mean_size, 100),
            'std_packet_size': np.random.normal(400, 100),
            'mean_iat': np.random.exponential(0.05),
            'std_iat': np.random.exponential(0.08),
            'flow_duration': np.random.exponential(5.0),
            'bytes_sent': np.random.randint(2000, 30000),
            'bytes_received': np.random.randint(10000, 500000),
            'max_packet_size': np.random.randint(1200, 1500),
            'min_packet_size': np.random.randint(40, 100),
        })

    # Video streaming: sustained high bandwidth, large packets in one direction
    for _ in range(n_per_class):
        pkt_count = np.random.randint(500, 5000)
        flows.append({
            'class': 'video_streaming',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(1200, 150),
            'std_packet_size': np.random.normal(200, 50),
            'mean_iat': np.random.normal(0.002, 0.001),
            'std_iat': np.random.normal(0.001, 0.0005),
            'flow_duration': np.random.exponential(120.0) + 30,
            'bytes_sent': np.random.randint(5000, 50000),
            'bytes_received': np.random.randint(500000, 5000000),
            'max_packet_size': np.random.randint(1400, 1500),
            'min_packet_size': np.random.randint(40, 80),
        })

    # VoIP: small packets, consistent inter-arrival times, symmetric
    for _ in range(n_per_class):
        pkt_count = np.random.randint(200, 3000)
        flows.append({
            'class': 'voip',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(200, 30),
            'std_packet_size': np.random.normal(20, 5),
            'mean_iat': np.random.normal(0.02, 0.003),
            'std_iat': np.random.normal(0.002, 0.001),
            'flow_duration': np.random.exponential(180.0) + 30,
            'bytes_sent': np.random.randint(50000, 300000),
            'bytes_received': np.random.randint(50000, 300000),
            'max_packet_size': np.random.randint(250, 350),
            'min_packet_size': np.random.randint(100, 180),
        })

    # File transfer: large sustained flows, asymmetric
    for _ in range(n_per_class):
        pkt_count = np.random.randint(100, 2000)
        flows.append({
            'class': 'file_transfer',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(1100, 200),
            'std_packet_size': np.random.normal(300, 80),
            'mean_iat': np.random.normal(0.005, 0.002),
            'std_iat': np.random.normal(0.003, 0.001),
            'flow_duration': np.random.exponential(30.0) + 5,
            'bytes_sent': np.random.randint(1000, 20000),
            'bytes_received': np.random.randint(100000, 2000000),
            'max_packet_size': np.random.randint(1400, 1500),
            'min_packet_size': np.random.randint(40, 100),
        })

    # SSH: interactive (small packets, variable timing) or tunnel (larger, sustained)
    for _ in range(n_per_class):
        is_tunnel = np.random.random() < 0.3
        if is_tunnel:
            pkt_count = np.random.randint(200, 3000)
            mean_size = np.random.normal(800, 200)
            mean_iat = np.random.normal(0.01, 0.005)
            duration = np.random.exponential(300.0) + 60
        else:
            pkt_count = np.random.randint(20, 500)
            mean_size = np.random.normal(200, 100)
            mean_iat = np.random.exponential(0.5)
            duration = np.random.exponential(300.0) + 10
        flows.append({
            'class': 'ssh',
            'packet_count': pkt_count,
            'mean_packet_size': max(mean_size, 60),
            'std_packet_size': np.random.normal(150, 50),
            'mean_iat': mean_iat,
            'std_iat': np.random.exponential(0.3),
            'flow_duration': duration,
            'bytes_sent': np.random.randint(5000, 200000),
            'bytes_received': np.random.randint(5000, 200000),
            'max_packet_size': np.random.randint(500, 1400),
            'min_packet_size': np.random.randint(40, 100),
        })

    # Malware C2: periodic beaconing, small payloads, long-lived flows
    for _ in range(n_per_class):
        beacon_interval = np.random.choice([30, 60, 120, 300, 600])
        jitter = np.random.uniform(0.01, 0.05)
        pkt_count = np.random.randint(20, 200)
        flows.append({
            'class': 'malware_c2',
            'packet_count': pkt_count,
            'mean_packet_size': np.random.normal(150, 40),
            'std_packet_size': np.random.normal(30, 10),
            'mean_iat': beacon_interval + np.random.normal(0, beacon_interval * jitter),
            'std_iat': beacon_interval * jitter,
            'flow_duration': np.random.exponential(3600.0) + 600,
            'bytes_sent': np.random.randint(500, 5000),
            'bytes_received': np.random.randint(500, 10000),
            'max_packet_size': np.random.randint(200, 400),
            'min_packet_size': np.random.randint(40, 100),
        })

    df = pd.DataFrame(flows)

    # Clamp negative values from normal distributions first, so the derived
    # features below are computed from the same values the model will see
    for col in ['mean_packet_size', 'std_packet_size', 'mean_iat', 'std_iat']:
        df[col] = df[col].clip(lower=0)

    # Derived features
    df['byte_ratio'] = df['bytes_sent'] / (df['bytes_received'] + 1)
    df['packet_size_variance'] = df['std_packet_size'] ** 2
    df['has_consistent_timing'] = (
        df['std_iat'] / (df['mean_iat'].abs() + 1e-6) < 0.1
    ).astype(float)

    return df


df = generate_flows(n_per_class=5000)
print(f'Total flows: {len(df)}')
print('\nClass distribution:')
print(df['class'].value_counts().to_string())

Total flows: 30000

Class distribution:
web_browsing       5000
video_streaming    5000
voip               5000
file_transfer      5000
ssh                5000
malware_c2         5000

Feature statistics by class
feature_cols = [
    'packet_count', 'mean_packet_size', 'std_packet_size', 'mean_iat',
    'std_iat', 'flow_duration', 'bytes_sent', 'bytes_received',
    'byte_ratio', 'packet_size_variance', 'has_consistent_timing',
    'max_packet_size', 'min_packet_size',
]

print('Mean feature values by class:')
print(df.groupby('class')[feature_cols].mean().round(2).to_string())

Mean feature values by class:
packet_count mean_packet_size std_packet_size mean_iat std_iat flow_duration bytes_sent bytes_received byte_ratio packet_size_variance has_consistent_timing max_packet_size min_packet_size
class
file_transfer 1046.58 1100.35 299.62 0.01 0.00 35.29 10504.26 1047998.63 0.01 90695.82 0.09 1449.97 69.89
malware_c2 109.87 150.23 30.28 183.97 5.65 4216.48 2749.86 5269.46 0.59 938.89 0.97 299.27 70.24
ssh 546.12 314.70 150.10 0.20 0.30 370.23 102274.16 102461.45 1.41 23449.25 0.00 949.87 69.66
video_streaming 2746.50 1199.74 199.64 0.00 0.00 151.60 27445.83 2749157.02 0.01 40346.73 0.01 1449.28 59.63
voip 1597.67 200.05 20.08 0.02 0.00 211.60 174934.64 175086.86 1.02 407.99 0.00 299.83 139.73
web_browsing 44.68 598.69 399.67 0.05 0.08 5.01 16021.36 254770.17 0.09 161423.38 0.00 1349.73 70.30

The features reveal clear separations. Malware C2 stands out through its high has_consistent_timing (beaconing), long flow_duration, and small mean_packet_size. VoIP has very low std_packet_size (uniform packet sizes) and symmetric byte counts. Video streaming has extreme bytes_received values and high packet counts.
Baseline: LightGBM on flow features
Before building a neural network, establish a strong baseline with gradient boosting on the aggregate flow features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import lightgbm as lgb

# Encode labels
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['class'])
class_names = label_encoder.classes_

# Split
X = df[feature_cols].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42, stratify=y_train,
)
print(f'Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}')

# Train LightGBM
lgb_model = lgb.LGBMClassifier(
    n_estimators=300,
    max_depth=8,
    learning_rate=0.05,
    num_leaves=64,
    n_jobs=-1,
    random_state=42,
    verbose=-1,
)
lgb_model.fit(X_train, y_train)

# Evaluate
lgb_preds = lgb_model.predict(X_test)
print('\n--- LightGBM Classification Report ---')
print(classification_report(y_test, lgb_preds, target_names=class_names))

Representative output:
Train: 21600, Val: 2400, Test: 6000

--- LightGBM Classification Report ---
                 precision    recall  f1-score   support

  file_transfer       0.95      0.94      0.95      1000
     malware_c2       0.96      0.95      0.96      1000
            ssh       0.88      0.90      0.89      1000
video_streaming       0.97      0.97      0.97      1000
           voip       0.98      0.97      0.98      1000
   web_browsing       0.94      0.94      0.94      1000

       accuracy                           0.95      6000
      macro avg       0.95      0.95      0.95      6000
   weighted avg       0.95      0.95      0.95      6000

LightGBM achieves strong performance (~95% accuracy) because the aggregate flow features are well-designed. The hardest distinctions are between SSH and malware C2 (both can be long-lived encrypted channels with variable packet sizes) and between file transfer and video streaming (both involve large sustained data flows in one direction).
# Show confusion matrix
cm = confusion_matrix(y_test, lgb_preds)
print('Confusion matrix (rows=true, cols=predicted):')
print(f'{"":>16s}', ' '.join(f'{c[:6]:>6s}' for c in class_names))
for i, row in enumerate(cm):
    print(f'{class_names[i]:>16s}', ' '.join(f'{v:6d}' for v in row))

Representative output:
Confusion matrix (rows=true, cols=predicted):
                 file_t malwar    ssh video_   voip web_br
   file_transfer    942      2     18     28      0     10
      malware_c2      3    953     38      1      1      4
             ssh     14     38    896      5      9     38
 video_streaming     26      0      5    970      0      0
            voip      1      1      9      0    974     15
    web_browsing      7      2     30      0     14    947

SSH is the most confused class, occasionally misclassified as web browsing (interactive SSH with bursty small packets) or malware C2 (SSH tunnels with consistent timing). This overlap is realistic: SSH tunnels and C2 channels can have genuinely similar metadata profiles.
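Per-class precision and recall can be read straight off a confusion matrix: recall divides the diagonal entry by its row total, precision by its column total. The sketch below works through that arithmetic on a small three-class matrix whose counts are hypothetical (they are not the tutorial's results); `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(cm, names):
    """Compute precision and recall per class.

    cm[i][j] is the number of class-i flows predicted as class j
    (rows = true, cols = predicted, matching the layout above).
    """
    out = {}
    for i, name in enumerate(names):
        tp = cm[i][i]
        row_total = sum(cm[i])                 # all flows that truly are class i
        col_total = sum(row[i] for row in cm)  # all flows predicted as class i
        out[name] = {
            "recall": tp / row_total if row_total else 0.0,
            "precision": tp / col_total if col_total else 0.0,
        }
    return out


# Hypothetical counts, chosen only to illustrate the arithmetic.
names = ["ssh", "malware_c2", "web_browsing"]
cm = [
    [90, 6, 4],   # true ssh
    [5, 95, 0],   # true malware_c2
    [5, 1, 94],   # true web_browsing
]
metrics = per_class_metrics(cm, names)
for name, m in metrics.items():
    print(f"{name:<14s} precision={m['precision']:.3f} recall={m['recall']:.3f}")
```

Reading the matrix this way shows which direction an error runs: SSH flows labeled as C2 lower SSH recall and C2 precision at the same time.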
Why a 1D CNN
The LightGBM baseline uses aggregate statistics per flow: mean packet size, total bytes, flow duration. These features work well but discard the temporal structure of the flow. The order of events within a flow carries information.
Consider two flows with identical aggregate statistics (same mean packet size, same total bytes, same duration) but different temporal patterns:
- Web browsing: a burst of small requests followed by large responses, then silence, then another burst
- Malware C2 beaconing: evenly spaced small packets at regular intervals
The aggregate features (mean, std) cannot distinguish these patterns, but a model that sees the sequence of packets can. A 1D CNN slides learned filters over the packet sequence, detecting local temporal patterns like burst-then-pause or regular-interval-beaconing, much like a 1D version of how image CNNs detect edges and textures.
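A small sketch can make the point about ordering concrete. The two packet-size sequences below are hypothetical and contain exactly the same values, so their mean and standard deviation are identical, yet a two-tap filter slid across them responds very differently. The `[-1, 1]` kernel is hand-picked for illustration; a trained CNN learns its own filters.

```python
from statistics import mean, pstdev


def conv1d(seq, kernel):
    """Valid-mode 1D cross-correlation, the sliding-window operation Conv1d applies."""
    k = len(kernel)
    return [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]


# Two hypothetical flows with the same packet sizes in a different order.
bursty = [100, 100, 100, 100, 1400, 1400, 1400, 1400]       # requests, then responses
alternating = [100, 1400, 100, 1400, 100, 1400, 100, 1400]  # strict ping-pong

# Aggregate statistics cannot tell the two flows apart.
same_mean = mean(bursty) == mean(alternating)
same_std = pstdev(bursty) == pstdev(alternating)

# A difference filter reacts to size changes between neighboring packets.
edge = [-1, 1]
bursty_response = conv1d(bursty, edge)
alternating_response = conv1d(alternating, edge)

print(same_mean, same_std)
print(bursty_response)
print(alternating_response)
```

The bursty flow triggers the filter once, at the request-to-response transition, while the alternating flow triggers it at every position. That positional response is the information the aggregate features throw away.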
Preparing packet sequences
To feed flows into a 1D CNN, we need per-packet sequences rather than per-flow aggregates. For each flow, we generate a sequence of (packet_size, direction, inter_arrival_time) tuples, padded or truncated to a fixed length.
import torch
from torch.utils.data import Dataset


def generate_packet_sequences(df, seq_length=100):
    """Generate per-packet sequences for each flow.

    Each packet in the sequence has three features:
    - packet_size (normalized by 1500, the typical MTU)
    - direction (+1.0 for client-to-server, -1.0 for server-to-client)
    - inter-arrival time (log-scaled for numerical stability)

    Sequences are padded with zeros or truncated to seq_length.
    """
    sequences = []
    labels = []
    for _, row in df.iterrows():
        n_packets = int(row['packet_count'])
        n_packets = min(n_packets, seq_length * 2)  # cap generation
        traffic_class = row['class']
        seq = np.zeros((seq_length, 3), dtype=np.float32)
        actual_len = min(n_packets, seq_length)
        for i in range(actual_len):
            # Packet size: drawn from the flow's distribution
            size = np.random.normal(row['mean_packet_size'], row['std_packet_size'])
            size = np.clip(size, 40, 1500)
            seq[i, 0] = size / 1500.0  # normalize to [0, 1]

            # Direction: based on byte ratio
            if traffic_class in ('video_streaming', 'file_transfer'):
                direction = 1.0 if np.random.random() < 0.15 else -1.0
            elif traffic_class in ('voip', 'ssh'):
                direction = 1.0 if np.random.random() < 0.5 else -1.0
            elif traffic_class == 'web_browsing':
                direction = 1.0 if np.random.random() < 0.35 else -1.0
            else:  # malware_c2
                direction = 1.0 if np.random.random() < 0.5 else -1.0
            seq[i, 1] = direction

            # Inter-arrival time (log-scaled)
            iat = abs(np.random.normal(row['mean_iat'], row['std_iat']))
            seq[i, 2] = np.log1p(iat)
        sequences.append(seq)
        labels.append(row['label'])
    return np.array(sequences), np.array(labels)


print('Generating packet sequences...')
all_sequences, all_labels = generate_packet_sequences(df)
print(f'Sequence tensor shape: {all_sequences.shape}')
print(f'Labels shape: {all_labels.shape}')

Generating packet sequences...
Sequence tensor shape: (30000, 100, 3)
Labels shape: (30000,)

PyTorch Dataset
class FlowSequenceDataset(Dataset):
    """PyTorch dataset for flow packet sequences.

    Each sample is a sequence of per-packet features (packet_size,
    direction, inter_arrival_time) with shape (seq_length, 3).
    Conv1d expects (batch, channels, length), so __getitem__
    transposes to (3, seq_length).
    """

    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Transpose to (channels, seq_length) for Conv1d
        x = self.sequences[idx].permute(1, 0)
        return x, self.labels[idx]


# Split sequences using the same indices as the flow-level split
from sklearn.model_selection import train_test_split

idx_train, idx_test = train_test_split(
    np.arange(len(all_labels)),
    test_size=0.2,
    random_state=42,
    stratify=all_labels,
)
idx_train, idx_val = train_test_split(
    idx_train,
    test_size=0.1,
    random_state=42,
    stratify=all_labels[idx_train],
)
train_dataset = FlowSequenceDataset(all_sequences[idx_train], all_labels[idx_train])
val_dataset = FlowSequenceDataset(all_sequences[idx_val], all_labels[idx_val])
test_dataset = FlowSequenceDataset(all_sequences[idx_test], all_labels[idx_test])
print(f'Train: {len(train_dataset)}, Val: {len(val_dataset)}, Test: {len(test_dataset)}')

Train: 21600, Val: 2400, Test: 6000

Building the 1D CNN
import torch.nn as nn


class FlowCNN(nn.Module):
    """1D convolutional network for encrypted traffic classification.

    Three Conv1d blocks with increasing filter counts capture patterns
    at different scales: small filters detect per-packet features,
    larger receptive fields capture burst and beaconing patterns.
    Global average pooling makes the model length-invariant, and a
    linear head produces class logits.
    """

    def __init__(self, in_channels=3, num_classes=6):
        super().__init__()
        self.conv_blocks = nn.Sequential(
            # Block 1: local packet patterns
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2),
            # Block 2: short-range temporal patterns
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.2),
            # Block 3: longer-range patterns (bursts, beaconing)
            nn.Conv1d(128, 256, kernel_size=7, padding=3),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        # Global average pooling reduces (batch, 256, seq_len) to (batch, 256)
        self.global_pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.conv_blocks(x)
        x = self.global_pool(x).squeeze(-1)
        return self.classifier(x)

The three convolutional layers have kernel sizes 3, 5, and 7. The first layer detects individual packet characteristics (a large packet followed by a small one). The second captures short bursts (request-response pairs). The third spans enough packets to detect beaconing intervals and sustained transfer patterns. Global average pooling aggregates the detected patterns across the entire sequence, making the output independent of sequence length.
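How many packets each block "sees" can be worked out with the standard receptive-field recurrence for stacked convolutions. The sketch below assumes stride 1 and no dilation, which matches the Conv1d layers described above; `receptive_field` is an illustrative helper, not a PyTorch function.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked 1D convolutions (no dilation).

    Each layer widens the field by (kernel - 1) times the product of
    the strides of all earlier layers.
    """
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


kernels = [3, 5, 7]  # kernel sizes of the three blocks
rf_per_block = [receptive_field(kernels[:n]) for n in (1, 2, 3)]
print(rf_per_block)  # [3, 7, 13]
```

Under these assumptions each output position of the third block covers 13 consecutive packets. Patterns that span more than that window reach the classifier only in aggregated form, through the global average pooling over all 100 positions.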
Training
from torch.utils.data import DataLoader


def train_cnn(model, train_dataset, val_dataset, epochs=20, batch_size=128, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Training on device: {device}')
    model = model.to(device)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, patience=3, factor=0.5,
    )

    for epoch in range(epochs):
        # Training
        model.train()
        total_loss = 0
        correct = 0
        total = 0
        for X_batch, y_batch in train_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            optimizer.zero_grad()
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * len(y_batch)
            correct += (logits.argmax(dim=1) == y_batch).sum().item()
            total += len(y_batch)
        train_loss = total_loss / total
        train_acc = correct / total

        # Validation
        model.eval()
        val_loss = 0
        val_correct = 0
        val_total = 0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch = X_batch.to(device)
                y_batch = y_batch.to(device)
                logits = model(X_batch)
                loss = criterion(logits, y_batch)
                val_loss += loss.item() * len(y_batch)
                val_correct += (logits.argmax(dim=1) == y_batch).sum().item()
                val_total += len(y_batch)
        val_loss /= val_total
        val_acc = val_correct / val_total
        scheduler.step(val_loss)

        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(
                f'Epoch {epoch+1:2d}/{epochs} '
                f'train_loss={train_loss:.4f} train_acc={train_acc:.4f} '
                f'val_loss={val_loss:.4f} val_acc={val_acc:.4f}'
            )
    return model


torch.manual_seed(42)
cnn_model = FlowCNN(in_channels=3, num_classes=6)
cnn_model = train_cnn(cnn_model, train_dataset, val_dataset)

Representative output (exact values will vary across runs):
Training on device: cpu
Epoch 1/20 train_loss=1.3842 train_acc=0.4523 val_loss=1.0217 val_acc=0.6104
Epoch 5/20 train_loss=0.4312 train_acc=0.8456 val_loss=0.3987 val_acc=0.8608
Epoch 10/20 train_loss=0.2187 train_acc=0.9234 val_loss=0.2543 val_acc=0.9121
Epoch 15/20 train_loss=0.1342 train_acc=0.9567 val_loss=0.1876 val_acc=0.9388
Epoch 20/20 train_loss=0.0987 train_acc=0.9678 val_loss=0.1654 val_acc=0.9467

Evaluation and comparison
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt


def evaluate_cnn(model, dataset, class_names):
    device = next(model.parameters()).device
    loader = DataLoader(dataset, batch_size=128)
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            logits = model(X_batch)
            preds = logits.argmax(dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(y_batch.numpy())
    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)
    return all_preds, all_labels


cnn_preds, cnn_labels = evaluate_cnn(cnn_model, test_dataset, class_names)
print('--- 1D CNN Classification Report ---')
print(classification_report(cnn_labels, cnn_preds, target_names=class_names))

Representative output:
--- 1D CNN Classification Report ---
                 precision    recall  f1-score   support

  file_transfer       0.95      0.93      0.94      1000
     malware_c2       0.97      0.98      0.97      1000
            ssh       0.90      0.91      0.91      1000
video_streaming       0.97      0.98      0.97      1000
           voip       0.99      0.98      0.98      1000
   web_browsing       0.93      0.93      0.93      1000

       accuracy                           0.95      6000
      macro avg       0.95      0.95      0.95      6000
   weighted avg       0.95      0.95      0.95      6000

Head-to-head comparison
from sklearn.metrics import accuracy_score, f1_score

# Per-class F1 for both models
lgb_f1 = f1_score(y_test, lgb_preds, average=None)
cnn_f1 = f1_score(cnn_labels, cnn_preds, average=None)

print(f'{"Class":<18s} {"LightGBM F1":>12s} {"CNN F1":>12s} {"Winner":>10s}')
print('-' * 55)
for i, name in enumerate(class_names):
    winner = 'CNN' if cnn_f1[i] > lgb_f1[i] else 'LightGBM'
    if abs(cnn_f1[i] - lgb_f1[i]) < 0.005:
        winner = 'tie'
    print(f'{name:<18s} {lgb_f1[i]:>12.3f} {cnn_f1[i]:>12.3f} {winner:>10s}')

lgb_acc = accuracy_score(y_test, lgb_preds)
cnn_acc = accuracy_score(cnn_labels, cnn_preds)
print(f'\n{"Overall accuracy":<18s} {lgb_acc:>12.3f} {cnn_acc:>12.3f}')

Representative output:
Class               LightGBM F1       CNN F1     Winner
-------------------------------------------------------
file_transfer             0.945        0.940        tie
malware_c2                0.955        0.975        CNN
ssh                       0.890        0.905        CNN
video_streaming           0.970        0.975        tie
voip                      0.975        0.985        CNN
web_browsing              0.940        0.930   LightGBM

Overall accuracy          0.946        0.952

The CNN tends to outperform LightGBM on classes where temporal patterns matter most. Malware C2 beaconing has a distinctive timing signature that the CNN’s convolutional filters can detect directly from the packet sequence. VoIP’s consistent inter-arrival times are similarly easier to detect in temporal form. SSH benefits from the CNN’s ability to distinguish interactive sessions (variable timing, small packets) from tunnels (sustained, consistent). LightGBM matches or slightly edges ahead on classes that are well-separated by aggregate statistics alone, like file transfer.
Confusion matrix for the CNN
cm = confusion_matrix(cnn_labels, cnn_preds)
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cm, interpolation='nearest', cmap='Blues')
ax.set_xticks(range(len(class_names)))
ax.set_yticks(range(len(class_names)))
ax.set_xticklabels(class_names, rotation=45, ha='right')
ax.set_yticklabels(class_names)
for i in range(len(class_names)):
    for j in range(len(class_names)):
        color = 'white' if cm[i, j] > cm.max() / 2 else 'black'
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', color=color)
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('1D CNN Confusion Matrix')
plt.tight_layout()
plt.savefig('cnn_confusion_matrix.png', dpi=150)
print('Saved cnn_confusion_matrix.png')

The confusion matrix typically shows that most misclassifications are between SSH and web browsing (interactive SSH looks like bursty web traffic) or between SSH and malware C2 (SSH tunnels can resemble C2 channels). These are genuinely ambiguous at the metadata level.
Detecting C2 beaconing
The most security-relevant task in encrypted traffic classification is detecting malware C2 channels. The 1D CNN’s learned features are particularly useful here because C2 beaconing has a temporal signature that aggregate statistics can miss.
# Binary classification: malware_c2 vs everything else
c2_label = label_encoder.transform(['malware_c2'])[0]
cnn_c2_binary = (cnn_preds == c2_label).astype(int)
true_c2_binary = (cnn_labels == c2_label).astype(int)

from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(
    true_c2_binary, cnn_c2_binary, pos_label=1, average='binary',
)
print('--- C2 Beaconing Detection (CNN) ---')
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1 Score: {f1:.3f}')

# False positive breakdown: which benign classes get flagged as C2?
fp_mask = (cnn_c2_binary == 1) & (true_c2_binary == 0)
if fp_mask.sum() > 0:
    fp_true_classes = cnn_labels[fp_mask]
    print(f'\nFalse positives ({fp_mask.sum()} total):')
    for cls_idx in np.unique(fp_true_classes):
        count = (fp_true_classes == cls_idx).sum()
        print(f'  {class_names[cls_idx]}: {count}')

Representative output:
--- C2 Beaconing Detection (CNN) ---
Precision: 0.970
Recall: 0.978
F1 Score: 0.974

False positives (30 total):
  ssh: 18
  web_browsing: 8
  file_transfer: 4

What gives C2 beaconing away
The consistent timing signature is the strongest signal. Real C2 frameworks (Cobalt Strike, Metasploit Meterpreter, custom implants) call back to their command server at regular intervals. Even with jitter (randomized delays to avoid detection), the inter-arrival time distribution is much tighter than normal application traffic.
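The tightness of beacon timing is easy to see in a small simulation. The sketch below compares the coefficient of variation (standard deviation divided by mean) of inter-arrival times for a hypothetical implant that sleeps 60 seconds with ±5% uniform jitter against hypothetical interactive traffic with exponentially distributed gaps. Both traffic models and all parameter values are assumptions chosen for illustration, not measurements of any real framework.

```python
import random
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by mean: a scale-free measure of regularity."""
    return pstdev(values) / mean(values)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical implant: 60 s sleep with +/-5% uniform jitter.
beacon_iats = [60 * (1 + rng.uniform(-0.05, 0.05)) for _ in range(200)]

# Hypothetical interactive traffic: exponentially distributed gaps, mean 0.5 s.
bursty_iats = [rng.expovariate(1 / 0.5) for _ in range(200)]

beacon_cv = coefficient_of_variation(beacon_iats)
bursty_cv = coefficient_of_variation(bursty_iats)
print(f"beacon CV = {beacon_cv:.3f}")
print(f"bursty CV = {bursty_cv:.3f}")
```

Because the CV is scale-free, a 60-second beacon and a 600-second beacon with the same jitter fraction look equally regular, which is why a ratio rather than the raw standard deviation is the useful feature.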
# Compare IAT distributions: C2 vs other classes
print('Inter-arrival time coefficient of variation (std/mean):')
for cls in class_names:
    cls_mask = df['class'] == cls
    cv = (df.loc[cls_mask, 'std_iat'] / (df.loc[cls_mask, 'mean_iat'] + 1e-6)).mean()
    timing_flag = df.loc[cls_mask, 'has_consistent_timing'].mean()
    print(f'  {cls:<18s} CV={cv:.4f} consistent_timing={timing_flag:.2%}')

Representative output:
Inter-arrival time coefficient of variation (std/mean):
  file_transfer      CV=0.5821 consistent_timing=9.04%
  malware_c2         CV=0.0310 consistent_timing=97.00%
  ssh                CV=1.2345 consistent_timing=0.20%
  video_streaming    CV=0.4987 consistent_timing=1.12%
  voip               CV=0.1023 consistent_timing=0.36%
  web_browsing       CV=1.8765 consistent_timing=0.08%

The coefficient of variation (standard deviation divided by mean) for C2 inter-arrival times is roughly a third of VoIP's, the next most regular class, and more than an order of magnitude lower than every other class. The has_consistent_timing flag captures this: 97% of C2 flows trigger it, compared to single digits for everything else. This is the feature that both LightGBM and the CNN leverage most heavily, but the CNN also picks up on the repeating pattern within the packet sequence itself, making it more robust to C2 implementations that vary their payload sizes while keeping timing consistent.
Limitations
Synthetic data gap. Real encrypted traffic has much more variability than the synthetic distributions used here. VPN tunnels multiplex multiple application streams into a single encrypted channel, which blends the statistical signatures. CDN traffic (Cloudflare, Akamai) routes many different applications through shared infrastructure, making domain-based classification unreliable. HTTP/2 and HTTP/3 multiplex multiple requests on a single connection, mixing what would be separate flows in HTTP/1.1. Production classifiers trained on synthetic data will underperform on real traffic.
Encryption protocol evolution. TLS 1.3 encrypts more of the handshake than TLS 1.2, reducing observable metadata. ClientHello and ServerHello are still sent in the clear, but everything after the ServerHello (EncryptedExtensions, the server certificate, CertificateVerify, and Finished) is encrypted under handshake traffic keys, so features such as certificate chain length are no longer directly visible. Encrypted Client Hello (ECH), which is in active deployment, additionally hides the Server Name Indication (SNI) and other ClientHello fields when negotiated, eliminating one of the most useful features for traffic classification. JA3/JA4 fingerprints still work under TLS 1.3 (and against non-ECH ClientHellos) because they are based on the ClientHello, but the overall metadata surface is shrinking.
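To show why JA3 depends only on the ClientHello, here is a minimal sketch of the JA3 construction: five ClientHello fields are rendered as decimal values, joined with dashes within a field and commas between fields, and the resulting string is MD5-hashed. GREASE placeholder values are dropped first. The input values below are hypothetical, and parsing a real ClientHello off the wire is outside the scope of this sketch.

```python
import hashlib

# GREASE placeholder values (0x0A0A, 0x1A1A, ..., 0xFAFA) are excluded from JA3.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string and its MD5 digest from parsed ClientHello fields."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    ja3_string = ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical ClientHello values, chosen only for illustration.
ja3_string, ja3_hash = ja3(
    version=771,                          # 0x0303, the legacy version field
    ciphers=[0x1A1A, 4865, 4866, 4867],   # leading GREASE value is ignored
    extensions=[0, 10, 11, 13, 43, 51],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(ja3_string)
print(ja3_hash)
```

Every input comes from the ClientHello, which is why the fingerprint survives TLS 1.3 but not ECH, where the real ClientHello fields are themselves encrypted.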
Adversarial evasion. Sophisticated attackers can defeat metadata-based classification. Adding enough random jitter to beacon intervals breaks the consistent-timing signature. Padding packets to match legitimate traffic size distributions defeats size-based features. Domain fronting and tunneling through legitimate services (e.g., C2 over DNS, C2 over legitimate cloud APIs) make the flow metadata indistinguishable from benign traffic. The arms race between detection and evasion is continuous.
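How much jitter is "enough" can be estimated for the has_consistent_timing rule used earlier, which flags a flow when std/mean of its inter-arrival times is below 0.1. The sketch below assumes the implant draws each sleep uniformly from interval × (1 ± j); for a uniform distribution the standard deviation is the half-width divided by √3, so the coefficient of variation is j/√3. The uniform jitter model is an assumption made for this example, and real implants may use other distributions.

```python
import math

THRESHOLD = 0.1  # same cutoff as the has_consistent_timing feature


def beacon_cv(jitter_fraction):
    """CV of sleeps drawn uniformly from interval * (1 +/- jitter_fraction).

    Uniform(a, b) has std (b - a) / sqrt(12); with half-width j * interval
    that is j * interval / sqrt(3), so the interval cancels out of the CV.
    """
    return jitter_fraction / math.sqrt(3)


# Which jitter settings would still trip the consistent-timing flag?
sweep = {j: beacon_cv(j) < THRESHOLD for j in (0.05, 0.10, 0.20, 0.50)}
max_flagged_jitter = THRESHOLD * math.sqrt(3)

for j, flagged in sweep.items():
    print(f"jitter ±{j:.0%}: CV={beacon_cv(j):.3f} flagged={flagged}")
print(f"flag stops firing above ±{max_flagged_jitter:.1%} jitter")
```

Under this model the flag stops firing once jitter exceeds roughly ±17%, which illustrates how a fixed threshold on one timing feature can be sidestepped by a small configuration change.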
Privacy concerns. Encrypted traffic classification raises legitimate privacy questions. The same techniques that detect malware C2 can identify users who are using VPNs, Tor, or encrypted messaging. In enterprise networks, the distinction between “user has malware” and “user is exercising their privacy” requires careful policy framing. Technical capability does not imply ethical deployment.
Next steps
This tutorial showed that encrypted traffic classification is feasible using only the metadata that survives encryption. A 1D CNN on packet sequences and a LightGBM model on aggregate flow features both achieve strong classification accuracy, with the CNN gaining an edge on classes with distinctive temporal patterns like C2 beaconing.
The next tutorial shifts from network data to text, applying NLP to automatically extract indicators of compromise (IOCs) and tactics, techniques, and procedures (TTPs) from security reports using named entity recognition.