Adversarial Evasion of ML Security Classifiers

Over the course of this series, we’ve built classifiers for malware detection using neural network embeddings (Part 3), network intrusion detection with autoencoders (Part 4), phishing URL detection with fine-tuned transformers (Part 5), and DNS exfiltration detection with sequence models (Part 6). Each of those tutorials mentioned adversarial robustness as a limitation. This tutorial confronts that limitation directly: what happens when an attacker knows your model exists and actively tries to evade it?

The adversarial threat model

White-box vs. black-box attacks. In a white-box attack, the adversary has full access to the model (architecture, weights, gradients). In a black-box attack, the adversary can only query the model and observe outputs. Real-world scenarios fall between these extremes.

Attacker’s goal. Flip the prediction from malicious to benign (an evasion attack). The attacker has a malicious sample the classifier correctly flags and wants to modify it so the classifier misses it.

Perturbation constraints. The modified binary must still execute the same payload. Some features are easy to perturb (file size, section names) and others are impossible to change without breaking the malware.

Why security ML is uniquely vulnerable. In image classification, adversarial examples are often treated as a research curiosity because few inputs come from someone trying to fool the model. In security, the attacker is adversarial by definition, so adversarial robustness is a core requirement, not a bonus.
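The perturbation constraints described above can be approximated in feature space with a mutability mask: the attacker's step is zeroed on any feature that cannot change without breaking the payload. The following is a minimal sketch under stated assumptions. The `masked_step` helper, the four-feature vector, and the choice of which features are frozen are all hypothetical and are not part of the series' code.

```python
import numpy as np


def masked_step(x, direction, epsilon, mutable_mask):
    """Move x by epsilon along sign(direction), only on mutable features.

    mutable_mask is a boolean array: True where the attacker can change the
    feature without breaking the payload (hypothetical assignment).
    """
    step = epsilon * np.sign(direction)
    return x + step * mutable_mask


x = np.array([2.0, 5.0, 0.7, 3.0])
direction = np.array([-1.0, 0.5, -0.2, 4.0])    # e.g. a loss gradient
mutable = np.array([True, True, False, False])  # last two features are frozen

x_adv = masked_step(x, direction, epsilon=0.5, mutable_mask=mutable)
print(x_adv)
```

Only the first two features move; the frozen ones keep their original values regardless of what the gradient suggests.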

Setting up the environment

pip install torch numpy pandas scikit-learn matplotlib

Building the target classifier

We’ll generate a synthetic feature dataset rather than requiring the full EMBER dataset from Part 3.

Note

This synthetic dataset is a stand-in for the EMBER-trained model from Part 3. The same attack code applies to the full model, though the measured numbers will differ. The synthetic setup keeps the tutorial reproducible without multi-gigabyte downloads.

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

np.random.seed(42)
torch.manual_seed(42)

# 20K samples, 100 features. Malicious samples differ on features 0-29.
n_half = 10000
n_features = 100
benign = np.random.randn(n_half, n_features).astype(np.float32)
malicious = np.random.randn(n_half, n_features).astype(np.float32)
malicious[:, :15] += 1.5
malicious[:, 15:30] -= 1.0

X = np.vstack([benign, malicious])
y = np.array([0] * n_half + [1] * n_half, dtype=np.float32)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


class MalwareClassifier(nn.Module):
    """Feedforward classifier for malware detection on tabular features."""

    def __init__(self, input_dim=100, hidden_dims=None):
        super().__init__()
        if hidden_dims is None:
            hidden_dims = [128, 64, 32]
        layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            layers.extend([nn.Linear(prev_dim, h), nn.ReLU(), nn.Dropout(0.2)])
            prev_dim = h
        layers.append(nn.Linear(prev_dim, 1))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x).squeeze(-1)


def train_model(model, X_train, y_train, epochs=20, lr=1e-3, batch_size=256):
    """Train the classifier."""
    loader = DataLoader(TensorDataset(torch.FloatTensor(X_train),
                        torch.FloatTensor(y_train)), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(xb)
        if (epoch + 1) % 5 == 0:
            print(f'Epoch {epoch+1:3d}/{epochs}  loss={epoch_loss / len(X_train):.4f}')

model = MalwareClassifier()
train_model(model, X_train, y_train)

model.eval()
with torch.no_grad():
    preds = (torch.sigmoid(model(torch.FloatTensor(X_test))).numpy() > 0.5).astype(int)
print('\n--- Test Set Performance ---')
print(classification_report(y_test, preds, target_names=['benign', 'malicious']))

Representative output:

Epoch   5/20  loss=0.2741
Epoch  10/20  loss=0.1134
Epoch  15/20  loss=0.0682
Epoch  20/20  loss=0.0498

--- Test Set Performance ---
              precision    recall  f1-score   support

      benign       0.96      0.95      0.95      2000
   malicious       0.95      0.96      0.96      2000

    accuracy                           0.95      4000
   macro avg       0.95      0.96      0.95      4000
weighted avg       0.95      0.96      0.95      4000

The model achieves roughly 95% accuracy: high enough to be useful, low enough that perturbations can flip predictions.

White-box attacks: FGSM

The Fast Gradient Sign Method (FGSM) by Goodfellow et al. (2014) computes the gradient of the loss with respect to the input, using the true label as the target, and takes one step of size epsilon in the direction of that gradient’s sign: x_adv = x + epsilon * sign(grad_x(loss)). The + sign matters: this is gradient ascent on the loss for the true label, which pushes the prediction away from the true class. epsilon bounds the change to each feature, so the perturbation has L-infinity norm at most epsilon.

def fgsm_attack(model, x, epsilon):
    """FGSM: perturb a malicious sample to evade detection."""
    x_tensor = torch.FloatTensor(x.reshape(1, -1)).requires_grad_(True)
    model.eval()
    # Loss is computed against the TRUE label (malicious=1.0); we then ascend
    # this loss to push the prediction away from the true class.
    loss = nn.BCEWithLogitsLoss()(model(x_tensor), torch.FloatTensor([1.0]))
    loss.backward()
    x_adv = x_tensor.data + epsilon * x_tensor.grad.data.sign()
    with torch.no_grad():
        adv_prob = torch.sigmoid(model(x_adv)).item()
    return x_adv.detach().numpy().flatten(), adv_prob < 0.5, adv_prob


def measure_evasion_rate(model, X_test, y_test, epsilon, attack_fn):
    """Fraction of correctly-classified malicious samples that the attack flips."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(torch.FloatTensor(X_test))).numpy()
    indices = np.where((y_test == 1) & (probs > 0.5))[0]
    if len(indices) == 0:
        return 0.0
    return sum(attack_fn(model, X_test[i], epsilon)[1] for i in indices) / len(indices)


# Demo on a single sample
malicious_indices = np.where(y_test == 1)[0]
sample = X_test[malicious_indices[0]]
for eps in [0.1, 0.2, 0.3, 0.5]:
    _, evaded, adv_prob = fgsm_attack(model, sample, eps)
    print(f'  epsilon={eps:.1f}  adv_prob={adv_prob:.4f}  [{"EVADED" if evaded else "detected"}]')

# Sweep epsilon for evasion rates
epsilons = [0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0]
evasion_rates = []
for eps in epsilons:
    rate = measure_evasion_rate(model, X_test, y_test, eps, fgsm_attack)
    evasion_rates.append(rate)
    print(f'  epsilon={eps:.2f}  evasion_rate={rate:.3f}')

Representative output:

  epsilon=0.1  adv_prob=0.8431  [detected]
  epsilon=0.2  adv_prob=0.4218  [EVADED]
  epsilon=0.3  adv_prob=0.1053  [EVADED]
  epsilon=0.5  adv_prob=0.0089  [EVADED]
  epsilon=0.05  evasion_rate=0.021
  epsilon=0.10  evasion_rate=0.098
  epsilon=0.20  evasion_rate=0.347
  epsilon=0.30  evasion_rate=0.612
  epsilon=0.50  evasion_rate=0.871
  epsilon=0.75  evasion_rate=0.964
  epsilon=1.00  evasion_rate=0.993

At epsilon=0.2, FGSM already flips the single sample (its malicious probability falls to 0.42). Across the test set, the same modest perturbation evades about a third of detections. Visualize:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(epsilons, evasion_rates, 'o-', linewidth=2)
plt.xlabel('Epsilon (perturbation magnitude)')
plt.ylabel('Evasion rate')
plt.title('FGSM Evasion Rate vs. Perturbation Budget')
plt.ylim(0, 1.05)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('fgsm_evasion_rate.png', dpi=150)
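To see why a small per-feature step can move the prediction so far, it helps to look at a linear scorer, where the FGSM effect can be computed exactly: stepping every feature by epsilon against the sign of its weight lowers the logit by epsilon times the L1 norm of the weights. The sketch below uses a made-up random weight vector (an assumption for illustration, not the trained network from this tutorial) and checks that identity numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100)   # hypothetical weights of a linear scorer
b = 0.0
x = rng.normal(size=100)   # hypothetical input
epsilon = 0.1


def logit(v):
    """Linear score; sigmoid(logit) would be the malicious probability."""
    return float(w @ v + b)


# For the true label 1, d(loss)/dx = -(1 - sigmoid(z)) * w,
# so sign(grad) = -sign(w) and the FGSM step is x - epsilon * sign(w).
x_adv = x + epsilon * (-np.sign(w))

drop = logit(x) - logit(x_adv)
predicted_drop = epsilon * np.abs(w).sum()
print(f'logit drop = {drop:.4f}   epsilon * ||w||_1 = {predicted_drop:.4f}')
```

Because the L1 norm sums over every feature, the shift grows with the number of features even though each one moves by only epsilon.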

White-box attacks: PGD

FGSM takes a single large step. Projected Gradient Descent (PGD), introduced by Madry et al. in 2017, takes multiple smaller steps, projecting back onto the L-infinity ball around x after each one. This usually finds stronger adversarial examples within the same perturbation budget.

def pgd_attack(model, x, epsilon, alpha=None, num_steps=20):
    """PGD (iterative FGSM) with L-inf projection."""
    if alpha is None:
        alpha = epsilon / 4.0
    x_orig = torch.FloatTensor(x.reshape(1, -1))
    x_adv = x_orig.clone().requires_grad_(True)
    target = torch.FloatTensor([1.0])
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    for step in range(num_steps):
        if x_adv.grad is not None:
            x_adv.grad.zero_()
        loss = criterion(model(x_adv), target)
        loss.backward()
        with torch.no_grad():
            delta = torch.clamp(x_adv + alpha * x_adv.grad.sign() - x_orig,
                                -epsilon, epsilon)
            x_adv = (x_orig + delta).clone().detach().requires_grad_(True)
    with torch.no_grad():
        adv_prob = torch.sigmoid(model(x_adv)).item()
    return x_adv.detach().numpy().flatten(), adv_prob < 0.5, adv_prob


# Compare on the same sample
print(f'{"eps":>5}  {"FGSM prob":>10}  {"PGD prob":>10}')
for eps in [0.1, 0.2, 0.3, 0.5]:
    print(f'{eps:>5.1f}  {fgsm_attack(model, sample, eps)[2]:>10.4f}  {pgd_attack(model, sample, eps)[2]:>10.4f}')

  eps   FGSM prob   PGD prob
  0.1      0.8431     0.6102
  0.2      0.4218     0.1547
  0.3      0.1053     0.0203
  0.5      0.0089     0.0008

PGD consistently achieves lower adversarial probabilities. Now measure evasion rates across the full test set.

pgd_evasion_rates = []
print('PGD evasion rates:')
for eps in epsilons:
    rate = measure_evasion_rate(model, X_test, y_test, eps, pgd_attack)
    pgd_evasion_rates.append(rate)
    print(f'  epsilon={eps:.2f}  evasion_rate={rate:.3f}')

Representative comparison of both attacks:

Epsilon	FGSM evasion rate	PGD evasion rate
0.05	0.021	0.058
0.10	0.098	0.214
0.20	0.347	0.583
0.30	0.612	0.812
0.50	0.871	0.968
1.00	0.993	1.000

The gap is largest at moderate epsilon values (0.1 to 0.3), which is exactly the range that matters in practice. PGD is the stronger attack and should be used when evaluating adversarial robustness.
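Madry et al. also start PGD from a random point inside the epsilon ball, whereas `pgd_attack` above starts from x itself. The sketch below shows the random start and the projection step on a logistic scorer with an analytic gradient, so it runs with NumPy alone. The function name `pgd_random_start`, the weight vector, and the sample are assumptions made for illustration.

```python
import numpy as np


def pgd_random_start(w, b, x, epsilon, alpha, num_steps, rng):
    """PGD with a random start against the scorer sigmoid(w.x + b).

    The true label is 1 (malicious), so the loss gradient w.r.t. x is
    -(1 - p) * w, where p is the predicted malicious probability.
    """
    # Random start: uniform inside the L-inf ball around x.
    x_adv = x + rng.uniform(-epsilon, epsilon, size=x.shape)
    for _ in range(num_steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = -(1.0 - p) * w
        x_adv = x_adv + alpha * np.sign(grad)
        # Project back onto the L-inf ball around the original x.
        x_adv = x + np.clip(x_adv - x, -epsilon, epsilon)
    return x_adv


rng = np.random.default_rng(0)
w = rng.normal(size=20)        # hypothetical linear weights
b = 0.0
x = w / np.abs(w).sum()        # a sample with a positive (malicious) logit
epsilon = 0.3

x_adv = pgd_random_start(w, b, x, epsilon, alpha=epsilon / 4,
                         num_steps=20, rng=rng)
print(f'logit before: {w @ x + b:.3f}   after: {w @ x_adv + b:.3f}')
print(f'max |x_adv - x| = {np.max(np.abs(x_adv - x)):.3f}')
```

On a linear scorer the random start makes no difference to the final result, since every coordinate ends up at the boundary of the ball. On a non-linear network it can help the attack escape poor local optima, which is why evaluations often repeat PGD from several random starts.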

Feature-space constraints

The attacks above perturb feature values continuously, but real malware features include discrete counts, bounded values, and correlated fields. Raw gradient attacks may produce impossible feature vectors.

# Feature constraints: discrete features get rounded, bounded features get clipped
feature_constraints = {
    'discrete_features': list(range(0, 10)),  # counts (imports, sections, etc.)
    'bounded_features': {10: (0.0, 8.0), 11: (0.0, 8.0),  # entropy
                         12: (0.0, 1.0), 13: (0.0, 1.0)}   # ratios
}

def apply_constraints(x_adv, constraints):
    """Round discrete features and clip bounded features to valid ranges."""
    x = x_adv.copy()
    for idx in constraints['discrete_features']:
        x[idx] = np.round(x[idx])
    for idx, (lo, hi) in constraints['bounded_features'].items():
        x[idx] = np.clip(x[idx], lo, hi)
    return x

def constrained_pgd_attack(model, x, epsilon, constraints, alpha=None, num_steps=20):
    """PGD attack with feature-space constraints applied after each step."""
    if alpha is None:
        alpha = epsilon / 4.0
    x_orig = torch.FloatTensor(x.reshape(1, -1))
    x_adv = x_orig.clone()
    target = torch.FloatTensor([1.0])
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    for step in range(num_steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = criterion(model(x_adv), target)
        loss.backward()
        with torch.no_grad():
            delta = torch.clamp(x_adv + alpha * x_adv.grad.sign() - x_orig, -epsilon, epsilon)
            x_np = apply_constraints((x_orig + delta).numpy().flatten(), constraints)
            x_adv = torch.FloatTensor(x_np.reshape(1, -1))
    with torch.no_grad():
        adv_prob = torch.sigmoid(model(x_adv)).item()
    return x_adv.detach().numpy().flatten(), adv_prob < 0.5, adv_prob

# Compare constrained vs unconstrained at epsilon=0.3
eps = 0.3
model.eval()
with torch.no_grad():
    probs = torch.sigmoid(model(torch.FloatTensor(X_test))).numpy()
indices = np.where((y_test == 1) & (probs > 0.5))[0]

u_evaded = sum(pgd_attack(model, X_test[i], eps)[1] for i in indices)
c_evaded = sum(constrained_pgd_attack(model, X_test[i], eps, feature_constraints)[1] for i in indices)

print(f'PGD evasion rate (epsilon={eps}):')
print(f'  Unconstrained: {u_evaded / len(indices):.3f}')
print(f'  Constrained:   {c_evaded / len(indices):.3f}')

PGD evasion rate (epsilon=0.3):
  Unconstrained: 0.812
  Constrained:   0.674

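Constraint enforcement reduces the evasion rate. Because the synthetic features are continuous Gaussian draws, the discrete and bounded assignments above are illustrative; in a real malware classifier with more constrained features, the gap would likely be wider.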
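One detail worth checking: `apply_constraints` runs after the L-infinity projection, so rounding a discrete feature can move it further from the original value than epsilon allows. A small checker along the following lines can report whether a candidate stays within budget and satisfies a constraint dictionary of the same shape. `check_adversarial` and the toy vectors are hypothetical helpers for illustration, not part of the tutorial's code.

```python
import numpy as np


def check_adversarial(x_orig, x_adv, epsilon, constraints, tol=1e-6):
    """Return pass/fail flags for a candidate adversarial feature vector."""
    disc = constraints['discrete_features']
    within_budget = bool(np.max(np.abs(x_adv - x_orig)) <= epsilon + tol)
    discrete_ok = bool(np.allclose(x_adv[disc], np.round(x_adv[disc]), atol=tol))
    bounded_ok = all(lo - tol <= x_adv[i] <= hi + tol
                     for i, (lo, hi) in constraints['bounded_features'].items())
    return {'within_budget': within_budget,
            'discrete_ok': discrete_ok,
            'bounded_ok': bounded_ok}


# Toy constraint set: features 0-1 are counts, feature 2 is a ratio in [0, 1].
toy_constraints = {'discrete_features': [0, 1],
                   'bounded_features': {2: (0.0, 1.0)}}
x_orig = np.array([3.0, 7.0, 0.5, -0.2])
x_ok = np.array([3.0, 7.0, 0.8, 0.1])     # counts intact, ratio in range
x_bad = np.array([3.4, 7.0, 1.2, -0.2])   # fractional count, ratio too large

ok_report = check_adversarial(x_orig, x_ok, 0.3, toy_constraints)
bad_report = check_adversarial(x_orig, x_bad, 0.3, toy_constraints)
print(ok_report)
print(bad_report)
```

Running a check like this on the output of a constrained attack makes it clear when a reported evasion relies on a vector that is outside the stated budget.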

Warning

In real malware evasion, the attacker modifies the binary itself, not the feature vector. Feature-space attacks are a useful proxy but overestimate real-world evasion rates.

Defense: adversarial training

The most direct defense is adversarial training: generate adversarial examples during training and include them in each batch. Madry et al. showed that training against PGD produces models robust to a range of attacks, not just PGD itself.

def adversarial_training(model, X_train, y_train, epochs=30, lr=1e-3,
                         batch_size=256, epsilon=0.3, pgd_steps=10, adv_ratio=0.5):
    """Train with PGD adversarial examples mixed into each batch."""
    pgd_alpha = epsilon / 4.0
    loader = DataLoader(TensorDataset(torch.FloatTensor(X_train),
                        torch.FloatTensor(y_train)), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for epoch in range(epochs):
        model.train()
        epoch_loss, n_samples = 0.0, 0
        for xb, yb in loader:
            n_adv = int((yb == 1).sum().item() * adv_ratio)
            if n_adv > 0:
                mal_idx = torch.where(yb == 1)[0][:n_adv]
                x_mal, y_mal = xb[mal_idx].clone(), yb[mal_idx]
                x_adv = x_mal.clone().requires_grad_(True)
                for _ in range(pgd_steps):
                    if x_adv.grad is not None:
                        x_adv.grad.zero_()
                    criterion(model(x_adv), y_mal).backward()
                    with torch.no_grad():
                        delta = torch.clamp(x_adv + pgd_alpha * x_adv.grad.sign() - x_mal,
                                            -epsilon, epsilon)
                        x_adv = (x_mal + delta).clone().detach().requires_grad_(True)
                xb = torch.cat([xb, x_adv.detach()])
                yb = torch.cat([yb, y_mal])
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * len(xb)
            n_samples += len(xb)
        if (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch+1:3d}/{epochs}  loss={epoch_loss / n_samples:.4f}')

robust_model = MalwareClassifier()
adversarial_training(robust_model, X_train, y_train, epochs=30, epsilon=0.3, pgd_steps=7)

Epoch  10/30  loss=0.3127
Epoch  20/30  loss=0.2451
Epoch  30/30  loss=0.2108

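Higher training loss than the standard model’s (0.21 vs. 0.05 in the representative runs) is expected, because each batch now contains PGD-perturbed malicious samples that are harder to fit. Now evaluate both models.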

def evaluate_model(model, X_test, y_test, label='Model'):
    """Evaluate clean accuracy, FGSM evasion, and PGD evasion at epsilon=0.3."""
    model.eval()
    with torch.no_grad():
        probs = torch.sigmoid(model(torch.FloatTensor(X_test))).numpy()
    clean_acc = ((probs > 0.5).astype(int) == y_test).mean()
    indices = np.where((y_test == 1) & (probs > 0.5))[0]
    n = max(len(indices), 1)
    fgsm_r = sum(fgsm_attack(model, X_test[i], 0.3)[1] for i in indices) / n
    pgd_r = sum(pgd_attack(model, X_test[i], 0.3)[1] for i in indices) / n
    print(f'{label}: clean={clean_acc:.3f}  FGSM={fgsm_r:.3f}  PGD={pgd_r:.3f}')
    return clean_acc, fgsm_r, pgd_r

evaluate_model(model, X_test, y_test, 'Standard')
evaluate_model(robust_model, X_test, y_test, 'Adversarial')

Metric	Standard model	Adversarially trained
Clean accuracy	0.955	0.928
FGSM evasion (eps=0.3)	0.612	0.187
PGD evasion (eps=0.3)	0.812	0.294

Adversarial training cuts FGSM evasion from 61% to 19% and PGD evasion from 81% to 29%, at the cost of a modest drop in clean accuracy. In a security context, this trade-off is almost always worthwhile.

Black-box attacks

In many real scenarios, the attacker can only query the model and observe the result. This section implements a black-box attack using random search: generate random perturbations, query the model, and keep the perturbation that produces the lowest malicious score.

def black_box_attack(model, x, epsilon, max_queries=1000, n_candidates=50):
    """Black-box evasion using random perturbations within an L-inf ball.

    Queries the model to find a perturbation that flips the prediction.
    Returns the adversarial example, success flag, probability, and query count.
    """
    model.eval()
    best_prob = 1.0
    best_adv = x.copy()
    queries_used = 0
    with torch.no_grad():
        orig_prob = torch.sigmoid(model(torch.FloatTensor(x.reshape(1, -1)))).item()
    if orig_prob < 0.5:
        return x, True, orig_prob, 1

    for batch_start in range(0, max_queries, n_candidates):
        n_batch = min(n_candidates, max_queries - batch_start)
        perturbations = np.random.uniform(-epsilon, epsilon,
                                          size=(n_batch, len(x))).astype(np.float32)
        candidates = x.reshape(1, -1) + perturbations
        with torch.no_grad():
            probs = torch.sigmoid(model(torch.FloatTensor(candidates))).numpy()
        queries_used += n_batch
        min_idx = np.argmin(probs)
        if probs[min_idx] < best_prob:
            best_prob = probs[min_idx]
            best_adv = candidates[min_idx]
        if best_prob < 0.5:
            return best_adv, True, best_prob, queries_used

    return best_adv, best_prob < 0.5, best_prob, queries_used


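# Evaluate black-box attack on a subset.
# `indices` (correctly flagged malicious samples) was computed in the feature-constraints section.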
n_eval = min(200, len(indices))
bb_successes = 0
bb_total_queries = 0
for idx in indices[:n_eval]:
    _, success, _, queries = black_box_attack(model, X_test[idx], epsilon=0.5)
    bb_successes += int(success)
    bb_total_queries += queries

print('Black-box random search (epsilon=0.5):')
print(f'  Evasion rate:           {bb_successes / n_eval:.3f}')
print(f'  Avg queries per sample: {bb_total_queries / n_eval:.0f}')
# Reference value copied from the full-test-set PGD table above, not recomputed here.
print('  PGD evasion (same eps): ~0.968')

Representative output:

Black-box random search (epsilon=0.5):
  Evasion rate:           0.685
  Avg queries per sample: 387
  PGD evasion (same eps): ~0.968

Random search is less efficient than PGD but still effective. More sophisticated methods (genetic algorithms, boundary attacks, transfer attacks) narrow the gap with PGD, but they still rely on repeated queries to the target model, or to a surrogate trained from its responses. If the defender rate-limits API access or monitors for scanning behavior, black-box attacks become harder at scale.
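Monitoring for scanning behavior can start from a simple observation: a query-based attack sends many inputs that sit close to one another, while ordinary traffic rarely does. The sketch below counts near-duplicate queries per client in a sliding window. The `QueryMonitor` class, its thresholds, and the simulated clients are assumptions for illustration; a production system would need tuned thresholds and a more scalable similarity search.

```python
from collections import defaultdict, deque

import numpy as np


class QueryMonitor:
    """Flag clients whose recent queries cluster tightly together."""

    def __init__(self, window=100, radius=1.0, max_near=20):
        self.radius = radius        # L-inf distance counted as "near"
        self.max_near = max_near    # near-duplicates tolerated before flagging
        # Each client keeps only its last `window` queries.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, client_id, x):
        """Record a query; return True if the client looks like it is probing."""
        x = np.asarray(x, dtype=float)
        past = self.history[client_id]
        near = sum(1 for p in past if np.max(np.abs(x - p)) <= self.radius)
        past.append(x)
        return near >= self.max_near


rng = np.random.default_rng(0)
monitor = QueryMonitor()

# Simulated attacker: 50 perturbed copies of one sample.
base = rng.normal(size=100)
attacker_flags = [monitor.observe('attacker', base + rng.uniform(-0.4, 0.4, size=100))
                  for _ in range(50)]
# Simulated benign client: 50 unrelated samples.
benign_flags = [monitor.observe('benign', rng.normal(size=100)) for _ in range(50)]

print('attacker first flagged at query', attacker_flags.index(True) + 1)
print('benign client flagged:', any(benign_flags))
```

With these settings the simulated attacker is flagged once 20 earlier near-duplicates have accumulated, well below the average query count that the random-search attack needed above.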

Limitations

Feature-space vs. problem-space. Every attack here perturbs the feature vector. In a real deployment, the attacker must modify the binary itself, then re-extract features. Some perturbations are impossible in the problem space (you can’t have a negative import count). Feature-space attacks overestimate real-world evasion rates, but they measure a model’s worst-case robustness.

Adaptive attacks. Defending against one attack does not guarantee robustness against others. Carlini and Wagner’s 2017 paper evaluated ten proposed methods for detecting adversarial examples and bypassed all of them. Adversarial training on PGD helps broadly, but a sufficiently motivated attacker can design adaptive attacks targeting the specific defense.

Arms race dynamics. Every defense motivates a stronger attack. Adversarial training is the most principled defense available, but it increases the cost of evasion without eliminating it.

Ensemble defenses. Combining classifiers with different architectures (neural network + gradient-boosted trees + static rules) is more robust than any single model. The cost is implementation complexity and latency. In production security systems, ensembles are common precisely because no single model is adversary-proof.
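A minimal way to combine detectors is to flag a sample when the average score crosses a threshold or when any single detector is highly confident. The sketch below assumes each detector exposes a malicious probability per sample. The `ensemble_flag` function, its thresholds, and the score matrix are hypothetical values chosen for illustration.

```python
import numpy as np


def ensemble_flag(scores, threshold=0.5, veto=0.9):
    """Combine per-detector malicious probabilities for a batch of samples.

    scores has shape (n_detectors, n_samples). A sample is flagged if the
    mean score reaches `threshold`, or if any one detector reaches `veto`.
    """
    scores = np.asarray(scores, dtype=float)
    return (scores.mean(axis=0) >= threshold) | (scores.max(axis=0) >= veto)


scores = [
    [0.10, 0.20, 0.95, 0.60],   # neural network
    [0.05, 0.30, 0.10, 0.55],   # gradient-boosted trees
    [0.00, 0.00, 0.00, 1.00],   # static rule (fires or not)
]
flags = ensemble_flag(scores)
print(flags)
```

The third sample is flagged by one confident detector even though the others miss it, so an attacker has to evade every member at once rather than just the weakest one. The price is that each member's false positives can also trigger an alert.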

Next steps

This tutorial showed that standard ML classifiers are vulnerable to gradient-based evasion, and that adversarial training substantially improves robustness at a modest cost to clean accuracy. The fundamental challenge remains: security classifiers face adversarial inputs by definition, and any defense is one step in an ongoing arms race.

The next tutorial addresses a different kind of evasion: attackers hiding their traffic inside encrypted channels. When the payload is encrypted, the classifier can only observe metadata (packet sizes, timing, flow statistics), which requires different feature engineering but faces the same adversarial considerations.