Over the course of this series, we’ve built classifiers for malware detection using neural network embeddings (Part 3), network intrusion detection with autoencoders (Part 4), phishing URL detection with fine-tuned transformers (Part 5), and DNS exfiltration detection with sequence models (Part 6). Each of those tutorials mentioned adversarial robustness as a limitation. This tutorial confronts that limitation directly: what happens when an attacker knows your model exists and actively tries to evade it?
The adversarial threat model
White-box vs. black-box attacks. In a white-box attack, the adversary has full access to the model (architecture, weights, gradients). In a black-box attack, the adversary can only query the model and observe outputs. Real-world scenarios fall between these extremes.
Attacker’s goal. Flip the prediction from malicious to benign (an evasion attack). The attacker has a malicious sample the classifier correctly flags and wants to modify it so the classifier misses it.
Perturbation constraints. The modified binary must still execute the same payload. Some features are easy to perturb (file size, section names) and others are impossible to change without breaking the malware.
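Why security ML is uniquely vulnerable. In image classification, adversarial examples are mostly a worst-case concern, because natural inputs are not optimized to fool the model. In security, the attacker is adversarial by definition and will search for exactly those inputs. Adversarial robustness is therefore a core requirement, not a bonus.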
Setting up the environment
pip install torch numpy pandas scikit-learn matplotlib
Building the target classifier
We’ll generate a synthetic feature dataset rather than requiring the full EMBER dataset from Part 3.
Note
This synthetic dataset is a stand-in for the EMBER-trained model from Part 3. The adversarial techniques apply identically to the full model. The synthetic setup guarantees reproducibility without multi-gigabyte downloads.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

np.random.seed(42)
torch.manual_seed(42)

# 20K samples, 100 features. Malicious samples differ on features 0-29.
n_half = 10000
n_features = 100
benign = np.random.randn(n_half, n_features).astype(np.float32)
malicious = np.random.randn(n_half, n_features).astype(np.float32)
malicious[:, :15] += 1.5
malicious[:, 15:30] -= 1.0

X = np.vstack([benign, malicious])
y = np.array([0] * n_half + [1] * n_half, dtype=np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


class MalwareClassifier(nn.Module):
    """Feedforward classifier for malware detection on tabular features."""

    def __init__(self, input_dim=100, hidden_dims=None):
        super().__init__()
        if hidden_dims is None:
            hidden_dims = [128, 64, 32]
        layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            layers.extend([nn.Linear(prev_dim, h), nn.ReLU(), nn.Dropout(0.2)])
            prev_dim = h
        layers.append(nn.Linear(prev_dim, 1))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        # Return raw logits; sigmoid is applied at evaluation time.
        return self.network(x).squeeze(-1)


def train_model(model, X_train, y_train, epochs=20, lr=1e-3, batch_size=256):
    """Train the classifier."""
    loader = DataLoader(TensorDataset(torch.FloatTensor(X_train),
                                      torch.FloatTensor(y_train)),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(xb)
        if (epoch + 1) % 5 == 0:
            print(f'Epoch {epoch+1:3d}/{epochs} loss={epoch_loss / len(X_train):.4f}')


model = MalwareClassifier()
train_model(model, X_train, y_train)

model.eval()
with torch.no_grad():
    preds = (torch.sigmoid(model(torch.FloatTensor(X_test))).numpy() > 0.5).astype(int)
print('--- Test Set Performance ---')
print(classification_report(y_test, preds, target_names=['benign', 'malicious']))
Representative output:
Epoch   5/20 loss=0.2741
Epoch  10/20 loss=0.1134
Epoch  15/20 loss=0.0682
Epoch  20/20 loss=0.0498
--- Test Set Performance ---
              precision    recall  f1-score   support

      benign       0.96      0.95      0.95      2000
   malicious       0.95      0.96      0.96      2000

    accuracy                           0.95      4000
   macro avg       0.95      0.96      0.95      4000
weighted avg       0.95      0.96      0.95      4000
The model achieves roughly 95% accuracy: high enough to be useful, low enough that perturbations can flip predictions.
White-box attacks: FGSM
The Fast Gradient Sign Method (FGSM) by Goodfellow et al. (2014) computes the gradient of the loss with respect to the input, taking the true label as the target, and steps along that gradient: x_adv = x + epsilon * sign(grad_x(loss)). The + sign matters: we are doing gradient ascent on the loss with respect to the true label, which by definition pushes the prediction away from the true class. epsilon controls perturbation magnitude.
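To build intuition for why a single sign step is so effective on high-dimensional inputs, here is a minimal NumPy sketch against a hypothetical logistic-regression scorer. The weights `w`, bias `b`, and sample `x` are invented for illustration and are not the tutorial's model. For a linear logit, the FGSM step lowers the logit by exactly `epsilon * ||w||_1`, a quantity that grows with the number of features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_linear(x, w, b, epsilon, y_true=1.0):
    """One FGSM step against a logistic-regression scorer p = sigmoid(w.x + b)."""
    p = sigmoid(w @ x + b)
    # Gradient of binary cross-entropy with respect to the input is (p - y) * w.
    grad_x = (p - y_true) * w
    # Ascend the loss: step along the sign of the gradient.
    return x + epsilon * np.sign(grad_x)

# Hypothetical 100-feature model and sample, for illustration only.
rng = np.random.default_rng(0)
w = rng.normal(size=100)
b = 0.0
x = 0.05 * np.sign(w)  # a sample the scorer rates as malicious (logit > 0)

epsilon = 0.1
x_adv = fgsm_linear(x, w, b, epsilon)

logit_before = w @ x + b
logit_after = w @ x_adv + b
print(f"logit before: {logit_before:.3f}, after: {logit_after:.3f}")
print(f"drop: {logit_before - logit_after:.3f}, "
      f"epsilon * ||w||_1: {epsilon * np.abs(w).sum():.3f}")
```

Each feature moves by at most `epsilon`, yet the changes all push the logit in the same direction, so their effect adds up across features. A nonlinear network does not obey this identity exactly, but the same accumulation is what the attack below exploits.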
def fgsm_attack(model, x, epsilon):
    """FGSM: perturb a malicious sample to evade detection."""
    x_tensor = torch.FloatTensor(x.reshape(1, -1)).requires_grad_(True)
    model.eval()
    # Loss is computed against the TRUE label (malicious=1.0); we then ascend
    # this loss to push the prediction away from the true class.
    loss = nn.BCEWithLogitsLoss()(model(x_tensor), torch.FloatTensor([1.0]))
    loss.backward()
    x_adv = x_tensor.data + epsilon * x_tensor.grad.data.sign()
    with torch.no_grad():
        adv_prob = torch.sigmoid(model(x_adv)).item()
    return x_adv.detach().numpy().flatten(), adv_prob < 0.5, adv_prob


def measure_evasion_rate(model, X_test, y_test, epsilon, attack_fn):
    """Fraction of correctly-classified malicious samples that the attack flips."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(torch.FloatTensor(X_test))).numpy()
    indices = np.where((y_test == 1) & (probs > 0.5))[0]
    if len(indices) == 0:
        return 0.0
    return sum(attack_fn(model, X_test[i], epsilon)[1] for i in indices) / len(indices)


# Demo on a single sample
malicious_indices = np.where(y_test == 1)[0]
sample = X_test[malicious_indices[0]]
for eps in [0.1, 0.2, 0.3, 0.5]:
    _, evaded, adv_prob = fgsm_attack(model, sample, eps)
    print(f' epsilon={eps:.1f} adv_prob={adv_prob:.4f} [{"EVADED" if evaded else "detected"}]')

# Sweep epsilon for evasion rates
epsilons = [0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0]
evasion_rates = []
for eps in epsilons:
    rate = measure_evasion_rate(model, X_test, y_test, eps, fgsm_attack)
    evasion_rates.append(rate)
    print(f' epsilon={eps:.2f} evasion_rate={rate:.3f}')
Representative output:
 epsilon=0.1 adv_prob=0.8431 [detected]
 epsilon=0.2 adv_prob=0.4218 [EVADED]
 epsilon=0.3 adv_prob=0.1053 [EVADED]
 epsilon=0.5 adv_prob=0.0089 [EVADED]
 epsilon=0.05 evasion_rate=0.021
 epsilon=0.10 evasion_rate=0.098
 epsilon=0.20 evasion_rate=0.347
 epsilon=0.30 evasion_rate=0.612
 epsilon=0.50 evasion_rate=0.871
 epsilon=0.75 evasion_rate=0.964
 epsilon=1.00 evasion_rate=0.993
At epsilon=0.2, FGSM already flips the single sample, and by epsilon=0.5 its malicious score is below 1%. Across the test set, even modest perturbations (epsilon=0.2) evade about a third of detections. Visualize:
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 5))
plt.plot(epsilons, evasion_rates, 'o-', linewidth=2)
plt.xlabel('Epsilon (perturbation magnitude)')
plt.ylabel('Evasion rate')
plt.title('FGSM Evasion Rate vs. Perturbation Budget')
plt.ylim(0, 1.05)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('fgsm_evasion_rate.png', dpi=150)
White-box attacks: PGD
FGSM takes a single large step. Projected Gradient Descent (PGD), introduced by Madry et al. in 2017, takes multiple smaller steps, projecting back onto the L-infinity ball around x after each one. This finds stronger adversarial examples within the same perturbation budget.
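The projection step is nothing more than an element-wise clip of the accumulated perturbation. A minimal NumPy sketch follows; the helper name `project_linf` and the three-feature vectors are ours, chosen only to make the effect easy to read.

```python
import numpy as np

def project_linf(x_adv, x_orig, epsilon):
    """Project x_adv onto the L-infinity ball of radius epsilon around x_orig."""
    delta = np.clip(x_adv - x_orig, -epsilon, epsilon)
    return x_orig + delta

x_orig = np.array([0.0, 1.0, -2.0])
# Pretend several gradient steps have moved each feature by a different amount:
# +0.5, -0.1, and -0.25 respectively.
x_stepped = np.array([0.5, 0.9, -2.25])

x_proj = project_linf(x_stepped, x_orig, epsilon=0.3)
print(x_proj)
```

Only the first feature exceeded the budget, so it is pulled back to 0.3 while the other two are left alone. With the defaults used in this tutorial (`alpha = epsilon / 4` and 20 steps), the raw steps could travel up to five times the budget, so the projection is what keeps PGD comparable to FGSM at the same epsilon.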
def pgd_attack(model, x, epsilon, alpha=None, num_steps=20):
    """PGD (iterative FGSM) with L-inf projection."""
    if alpha is None:
        alpha = epsilon / 4.0
    x_orig = torch.FloatTensor(x.reshape(1, -1))
    x_adv = x_orig.clone().requires_grad_(True)
    target = torch.FloatTensor([1.0])
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    for step in range(num_steps):
        if x_adv.grad is not None:
            x_adv.grad.zero_()
        loss = criterion(model(x_adv), target)
        loss.backward()
        with torch.no_grad():
            # Take a signed gradient step, then project back onto the epsilon ball.
            delta = torch.clamp(x_adv + alpha * x_adv.grad.sign() - x_orig,
                                -epsilon, epsilon)
        x_adv = (x_orig + delta).clone().detach().requires_grad_(True)
    with torch.no_grad():
        adv_prob = torch.sigmoid(model(x_adv)).item()
    return x_adv.detach().numpy().flatten(), adv_prob < 0.5, adv_prob


# Compare on the same sample
print(f'{"eps":>5} {"FGSM prob":>10} {"PGD prob":>10}')
for eps in [0.1, 0.2, 0.3, 0.5]:
    fgsm_prob = fgsm_attack(model, sample, eps)[2]
    pgd_prob = pgd_attack(model, sample, eps)[2]
    print(f'{eps:>5.1f} {fgsm_prob:>10.4f} {pgd_prob:>10.4f}')
Representative output:
  eps  FGSM prob   PGD prob
  0.1     0.8431     0.6102
  0.2     0.4218     0.1547
  0.3     0.1053     0.0203
  0.5     0.0089     0.0008
PGD consistently achieves lower adversarial probabilities. Now measure evasion rates across the full test set.
pgd_evasion_rates = []
print('PGD evasion rates:')
for eps in epsilons:
    rate = measure_evasion_rate(model, X_test, y_test, eps, pgd_attack)
    pgd_evasion_rates.append(rate)
    print(f' epsilon={eps:.2f} evasion_rate={rate:.3f}')
Representative comparison of both attacks:
| Epsilon | FGSM evasion rate | PGD evasion rate |
|---|---|---|
| 0.05 | 0.021 | 0.058 |
| 0.10 | 0.098 | 0.214 |
| 0.20 | 0.347 | 0.583 |
| 0.30 | 0.612 | 0.812 |
| 0.50 | 0.871 | 0.968 |
| 1.00 | 0.993 | 1.000 |
The gap is largest at moderate epsilon values (0.1 to 0.3), which is exactly the range that matters in practice. PGD is the stronger attack and should be used when evaluating adversarial robustness.
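The size of the gap can be read straight off the table. The short sketch below recomputes it from the representative numbers above (copied verbatim, so the result is only as meaningful as that sample output):

```python
# Evasion rates copied from the comparison table above.
epsilons = [0.05, 0.10, 0.20, 0.30, 0.50, 1.00]
fgsm = [0.021, 0.098, 0.347, 0.612, 0.871, 0.993]
pgd = [0.058, 0.214, 0.583, 0.812, 0.968, 1.000]

# Absolute difference in evasion rate at each perturbation budget.
gaps = [round(p - f, 3) for p, f in zip(pgd, fgsm)]
for eps, gap in zip(epsilons, gaps):
    print(f"epsilon={eps:.2f}  PGD - FGSM = {gap:.3f}")

largest_gap, at_eps = max(zip(gaps, epsilons))
print(f"largest gap {largest_gap:.3f} at epsilon={at_eps:.2f}")
```

On these numbers the gap peaks at epsilon=0.2, at roughly 24 percentage points, and shrinks toward both ends of the sweep: at small budgets neither attack succeeds often, and at large budgets both nearly always do.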
Feature-space constraints
The attacks above perturb feature values continuously, but real malware features include discrete counts, bounded values, and correlated fields. Raw gradient attacks may produce impossible feature vectors.
# Feature constraints: discrete features get rounded, bounded features get clipped
feature_constraints = {
    'discrete_features': list(range(0, 10)),              # counts (imports, sections, etc.)
    'bounded_features': {10: (0.0, 8.0), 11: (0.0, 8.0),  # entropy
                         12: (0.0, 1.0), 13: (0.0, 1.0)}  # ratios
}


def apply_constraints(x_adv, constraints):
    """Round discrete features and clip bounded features to valid ranges."""
    x = x_adv.copy()
    for idx in constraints['discrete_features']:
        x[idx] = np.round(x[idx])
    for idx, (lo, hi) in constraints['bounded_features'].items():
        x[idx] = np.clip(x[idx], lo, hi)
    return x


def constrained_pgd_attack(model, x, epsilon, constraints, alpha=None, num_steps=20):
    """PGD attack with feature-space constraints applied after each step."""
    if alpha is None:
        alpha = epsilon / 4.0
    x_orig = torch.FloatTensor(x.reshape(1, -1))
    x_adv = x_orig.clone()
    target = torch.FloatTensor([1.0])
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    for step in range(num_steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = criterion(model(x_adv), target)
        loss.backward()
        with torch.no_grad():
            delta = torch.clamp(x_adv + alpha * x_adv.grad.sign() - x_orig, -epsilon, epsilon)
            # Enforce validity after the projection, in NumPy space.
            x_np = apply_constraints((x_orig + delta).numpy().flatten(), constraints)
        x_adv = torch.FloatTensor(x_np.reshape(1, -1))
    with torch.no_grad():
        adv_prob = torch.sigmoid(model(x_adv)).item()
    return x_adv.detach().numpy().flatten(), adv_prob < 0.5, adv_prob


# Compare constrained vs unconstrained at epsilon=0.3
eps = 0.3
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(torch.FloatTensor(X_test))).numpy()
indices = np.where((y_test == 1) & (probs > 0.5))[0]
u_evaded = sum(pgd_attack(model, X_test[i], eps)[1] for i in indices)
c_evaded = sum(constrained_pgd_attack(model, X_test[i], eps, feature_constraints)[1] for i in indices)
print(f'PGD evasion rate (epsilon={eps}):')
print(f' Unconstrained: {u_evaded / len(indices):.3f}')
print(f' Constrained: {c_evaded / len(indices):.3f}')
Representative output:
PGD evasion rate (epsilon=0.3):
 Unconstrained: 0.812
 Constrained: 0.674
Constraint enforcement reduces the evasion rate. In a real malware classifier with more constrained features, the gap would be wider.
Warning
In real malware evasion, the attacker modifies the binary itself, not the feature vector. Feature-space attacks are a useful proxy but overestimate real-world evasion rates.
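One caveat about applying constraints after the L-infinity projection: rounding or clipping can move a feature further than epsilon from its original value, particularly on continuous synthetic features like the ones used here. The sketch below makes that visible on a hand-built sample. `linf_distance` is a hypothetical helper, the constraint dictionary mirrors `feature_constraints`, and the sample values are chosen for illustration rather than drawn from the dataset.

```python
import numpy as np

# Mirrors feature_constraints from the tutorial code above.
constraints = {
    'discrete_features': list(range(0, 10)),
    'bounded_features': {10: (0.0, 8.0), 11: (0.0, 8.0),
                         12: (0.0, 1.0), 13: (0.0, 1.0)},
}

def apply_constraints(x_adv, constraints):
    """Round discrete features and clip bounded features (same logic as above)."""
    x = x_adv.copy()
    for idx in constraints['discrete_features']:
        x[idx] = np.round(x[idx])
    for idx, (lo, hi) in constraints['bounded_features'].items():
        x[idx] = np.clip(x[idx], lo, hi)
    return x

def linf_distance(x_orig, x_adv):
    """Largest per-feature change between two vectors (hypothetical helper)."""
    return float(np.max(np.abs(x_adv - x_orig)))

# Hand-built sample: a non-integer "count" and a negative "entropy", both of
# which can occur when features are drawn from a standard normal.
x_orig = np.zeros(100)
x_orig[0] = 0.45
x_orig[10] = -1.0

epsilon = 0.3
x_pert = x_orig + 0.1  # well inside the epsilon ball
x_con = apply_constraints(x_pert, constraints)

print(f"before constraints: {linf_distance(x_orig, x_pert):.2f}")
print(f"after constraints:  {linf_distance(x_orig, x_con):.2f}")
```

Here the perturbed sample starts 0.1 away from the original, but after rounding and clipping it ends up 1.0 away, well outside the 0.3 budget. When comparing constrained and unconstrained evasion rates, it may therefore be worth reporting the post-constraint distance as well, or discarding candidates that exceed the budget.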
Defense: adversarial training
The most direct defense is adversarial training: generate adversarial examples during training and include them in each batch. Madry et al. showed that training against PGD produces models robust to a range of attacks, not just PGD itself.
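Formally, Madry et al. frame this as a saddle-point problem: an inner maximization that finds the worst-case perturbation within the budget, and an outer minimization over the model parameters. In the notation below (the symbols are our choice), $f_\theta$ is the classifier, $L$ the loss, $\mathcal{D}$ the data distribution, and $\epsilon$ the L-infinity budget:

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
\left[ \max_{\|\delta\|_\infty \le \epsilon}
L\big(f_\theta(x + \delta),\, y\big) \right]
```

PGD serves as an approximation to the inner maximization. The implementation that follows is a lighter variant of this objective: it perturbs only a fraction of the malicious samples in each batch and trains on them alongside the clean batch.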
def adversarial_training(model, X_train, y_train, epochs=30, lr=1e-3,
                         batch_size=256, epsilon=0.3, pgd_steps=10, adv_ratio=0.5):
    """Train with PGD adversarial examples mixed into each batch."""
    pgd_alpha = epsilon / 4.0
    loader = DataLoader(TensorDataset(torch.FloatTensor(X_train),
                                      torch.FloatTensor(y_train)),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        model.train()
        epoch_loss, n_samples = 0.0, 0
        for xb, yb in loader:
            # Attack a fraction (adv_ratio) of the malicious samples in this batch.
            n_adv = int((yb == 1).sum().item() * adv_ratio)
            if n_adv > 0:
                mal_idx = torch.where(yb == 1)[0][:n_adv]
                x_mal, y_mal = xb[mal_idx].clone(), yb[mal_idx]
                x_adv = x_mal.clone().requires_grad_(True)
                for _ in range(pgd_steps):
                    if x_adv.grad is not None:
                        x_adv.grad.zero_()
                    criterion(model(x_adv), y_mal).backward()
                    with torch.no_grad():
                        delta = torch.clamp(x_adv + pgd_alpha * x_adv.grad.sign() - x_mal,
                                            -epsilon, epsilon)
                    x_adv = (x_mal + delta).clone().detach().requires_grad_(True)
                # Append the adversarial copies, still labeled malicious.
                xb = torch.cat([xb, x_adv.detach()])
                yb = torch.cat([yb, y_mal])
            # Clears parameter gradients accumulated while generating the attack.
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(xb)
            n_samples += len(xb)
        if (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch+1:3d}/{epochs} loss={epoch_loss / n_samples:.4f}')


robust_model = MalwareClassifier()
adversarial_training(robust_model, X_train, y_train, epochs=30, epsilon=0.3, pgd_steps=7)
Representative output:
Epoch  10/30 loss=0.3127
Epoch  20/30 loss=0.2451
Epoch  30/30 loss=0.2108
Higher training loss is expected, since the model is now also fitting adversarial examples. Now evaluate both models.
def evaluate_model(model, X_test, y_test, label='Model'):
    """Evaluate clean accuracy, FGSM evasion, and PGD evasion at epsilon=0.3."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(torch.FloatTensor(X_test))).numpy()
    clean_acc = ((probs > 0.5).astype(int) == y_test).mean()
    # Only attack malicious samples the model currently detects.
    indices = np.where((y_test == 1) & (probs > 0.5))[0]
    n = max(len(indices), 1)
    fgsm_r = sum(fgsm_attack(model, X_test[i], 0.3)[1] for i in indices) / n
    pgd_r = sum(pgd_attack(model, X_test[i], 0.3)[1] for i in indices) / n
    print(f'{label}: clean={clean_acc:.3f} FGSM={fgsm_r:.3f} PGD={pgd_r:.3f}')
    return clean_acc, fgsm_r, pgd_r


evaluate_model(model, X_test, y_test, 'Standard')
evaluate_model(robust_model, X_test, y_test, 'Adversarial')
| Metric | Standard model | Adversarially trained |
|---|---|---|
| Clean accuracy | 0.955 | 0.928 |
| FGSM evasion (eps=0.3) | 0.612 | 0.187 |
| PGD evasion (eps=0.3) | 0.812 | 0.294 |
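Adversarial training cuts FGSM evasion from 61% to 19% and PGD evasion from 81% to 29%, at the cost of about 2.7 percentage points of clean accuracy (0.955 to 0.928). In a security context, where missed detections under active evasion are usually the dominant risk, this trade-off is almost always worthwhile.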
Black-box attacks
In many real scenarios, the attacker can only query the model and observe the result. This section implements a black-box attack using random search: generate random perturbations, query the model, and keep the perturbation that produces the lowest malicious score.
def black_box_attack(model, x, epsilon, max_queries=1000, n_candidates=50):
    """Black-box evasion using random perturbations within an L-inf ball.

    Queries the model to find a perturbation that flips the prediction.
    Returns the adversarial example, success flag, probability, and query count.
    """
    model.eval()
    best_prob = 1.0
    best_adv = x.copy()
    queries_used = 0
    with torch.no_grad():
        orig_prob = torch.sigmoid(model(torch.FloatTensor(x.reshape(1, -1)))).item()
    if orig_prob < 0.5:
        return x, True, orig_prob, 1
    for batch_start in range(0, max_queries, n_candidates):
        n_batch = min(n_candidates, max_queries - batch_start)
        # Draw independent uniform perturbations inside the epsilon ball.
        perturbations = np.random.uniform(-epsilon, epsilon,
                                          size=(n_batch, len(x))).astype(np.float32)
        candidates = x.reshape(1, -1) + perturbations
        with torch.no_grad():
            probs = torch.sigmoid(model(torch.FloatTensor(candidates))).numpy()
        queries_used += n_batch
        min_idx = np.argmin(probs)
        if probs[min_idx] < best_prob:
            best_prob = probs[min_idx]
            best_adv = candidates[min_idx]
        if best_prob < 0.5:
            return best_adv, True, best_prob, queries_used
    return best_adv, best_prob < 0.5, best_prob, queries_used


# Evaluate black-box attack on a subset
n_eval = min(200, len(indices))
bb_successes = 0
bb_total_queries = 0
for idx in indices[:n_eval]:
    _, success, _, queries = black_box_attack(model, X_test[idx], epsilon=0.5)
    bb_successes += int(success)
    bb_total_queries += queries
print(f'Black-box random search (epsilon=0.5):')
print(f' Evasion rate: {bb_successes / n_eval:.3f}')
print(f' Avg queries per sample: {bb_total_queries / n_eval:.0f}')
print(f' PGD evasion (same eps): ~0.968')
Representative output:
Black-box random search (epsilon=0.5):
 Evasion rate: 0.685
 Avg queries per sample: 387
 PGD evasion (same eps): ~0.968
Black-box attacks are less efficient but still effective. More sophisticated query-based methods (genetic algorithms, boundary attacks) narrow the gap with PGD, but they still need far more model queries than a white-box attack needs gradient steps. Transfer attacks avoid most of those queries at the cost of training a surrogate model. If the defender rate-limits API access or monitors for scanning behavior, query-based attacks become harder at scale.
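One inexpensive refinement of random search is to make it greedy: keep the best perturbation found so far and propose small changes to it, rather than drawing every candidate independently. The sketch below is our illustration and not a method used elsewhere in this series. `score_fn` is a hypothetical stand-in for a single query to the target model (here a fixed logistic scorer), and `step`, the proposal size, is an assumed parameter.

```python
import numpy as np

def greedy_random_search(score_fn, x, epsilon, max_queries=1000, step=0.1, seed=0):
    """Greedy black-box search inside an L-inf ball of radius epsilon around x.

    score_fn(x) returns a malicious probability; each call counts as one query.
    Returns (best_x, evaded, best_prob, queries_used).
    """
    rng = np.random.default_rng(seed)
    delta = np.zeros_like(x)
    best_prob = score_fn(x)
    queries = 1
    while queries < max_queries and best_prob >= 0.5:
        # Propose a small random change to the current best perturbation,
        # then clip so the total perturbation stays within the budget.
        proposal = np.clip(delta + rng.uniform(-step, step, size=x.shape),
                           -epsilon, epsilon)
        prob = score_fn(x + proposal)
        queries += 1
        if prob < best_prob:  # keep only improvements
            delta, best_prob = proposal, prob
    return x + delta, best_prob < 0.5, best_prob, queries

# Hypothetical target: a fixed logistic scorer over 20 features.
w = np.linspace(-1.0, 1.0, 20)

def score_fn(x):
    return float(1.0 / (1.0 + np.exp(-(w @ x))))

x = 0.1 * np.sign(w)  # a sample the scorer rates as malicious
x_adv, evaded, prob, queries = greedy_random_search(score_fn, x, epsilon=0.5)
print(f"evaded={evaded} prob={prob:.3f} queries={queries}")
```

Because each accepted proposal builds on the last, the search tends to make steady progress toward the decision boundary. Whether it needs fewer queries than independent sampling depends on the model and the budget, so it is worth measuring both on the same samples before drawing conclusions.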
Limitations
Feature-space vs. problem-space. Every attack here perturbs the feature vector. In a real deployment, the attacker must modify the binary itself, then re-extract features. Some perturbations are impossible in the problem space (you can’t have a negative import count). Feature-space attacks overestimate real-world evasion rates, but they measure a model’s worst-case robustness.
Adaptive attacks. Defending against one attack does not guarantee robustness against others. Carlini and Wagner’s 2017 paper evaluated ten proposed adversarial-example detection methods and bypassed all of them. Adversarial training on PGD helps broadly, but a sufficiently motivated attacker can design adaptive attacks targeting the specific defense.
Arms race dynamics. Every defense motivates a stronger attack. Adversarial training is the most principled defense available, but it increases the cost of evasion without eliminating it.
Ensemble defenses. Combining classifiers with different architectures (neural network + gradient-boosted trees + static rules) tends to be harder to evade than any single model, because one perturbation has to fool several dissimilar decision boundaries at once. The cost is implementation complexity and latency. In production security systems, ensembles are common precisely because no single model is adversary-proof.
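A minimal sketch of the voting idea, assuming three hypothetical detectors that each look at a different slice of the feature vector. The detector functions, their thresholds, and the helper `ensemble_flags` are invented for illustration and stand in for a neural network, a tree model, and a static rule.

```python
import numpy as np

def ensemble_flags(detectors, x, min_votes=None):
    """Return (flagged, votes); flagged is True when at least min_votes detectors fire."""
    votes = [bool(d(x)) for d in detectors]
    if min_votes is None:
        min_votes = len(detectors) // 2 + 1  # simple majority
    return sum(votes) >= min_votes, votes

# Hypothetical detectors, each keyed to different features.
def nn_like(x):
    return x[:15].mean() > 0.75

def tree_like(x):
    return x[15] < -0.5 and x[16] < -0.5

def static_rule(x):
    return x[20] < -0.5

detectors = [nn_like, tree_like, static_rule]

# A sample shaped like the synthetic malicious class.
x = np.zeros(100)
x[:15] = 1.5
x[15:30] = -1.0

# A perturbation that defeats nn_like but leaves the other features untouched.
x_adv = x.copy()
x_adv[:15] -= 1.0

print(ensemble_flags(detectors, x))
print(ensemble_flags(detectors, x_adv))
```

The perturbed sample silences the first detector, but the majority vote still flags it because the other two fire. The vote threshold is a policy choice: requiring fewer votes makes evasion harder but raises false positives, while requiring all three (`min_votes=3`) would let this perturbation through.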
Next steps
This tutorial showed that standard ML classifiers are vulnerable to gradient-based evasion, and that adversarial training substantially improves robustness at a modest cost to clean accuracy. The fundamental challenge remains: security classifiers face adversarial inputs by definition, and any defense is one step in an ongoing arms race.
The next tutorial addresses a different kind of evasion: attackers hiding their traffic inside encrypted channels. When the payload is encrypted, the classifier can only observe metadata (packet sizes, timing, flow statistics), which requires different feature engineering but faces the same adversarial considerations.