Tutorial

Malware Classification with Neural Network Embeddings

Train a feedforward neural network in PyTorch to classify PE files as malicious or benign using static features, and compare deep learning against tree-based methods for tabular security data.

7 min read · advanced

Prerequisites

  • Completion of the ROP gadget classifier tutorial (XGBoost)
  • Python experience with NumPy and pandas
  • Basic understanding of neural networks (layers, activation functions, backpropagation)
  • Familiarity with the command line

Part 4 of 4 in ML for Security

The first three tutorials in this series applied ML to specific security tasks: a RAG pipeline for advisories, anomaly detection on audit logs, and a ROP gadget classifier. Each used a different technique (embeddings + retrieval, Isolation Forest, XGBoost) matched to the problem. This tutorial adds a new tool to the stack: a feedforward neural network classifier trained on static features extracted from PE (Portable Executable) files.

The target task is malware classification: given a binary file, predict whether it’s malicious or benign without executing it. This is a core problem in endpoint security, and it’s one where ML has genuine production value: static feature classifiers can flag suspicious files in milliseconds, before any sandbox execution.

We’ll use the EMBER dataset (Endgame Malware BEnchmark for Research), a public dataset of features extracted from about 1 million PE files. EMBER includes unlabeled rows as well as labeled malicious and benign samples; after filtering unlabeled entries, the commonly used EMBER 2018 split yields about 600,000 labeled training samples and 200,000 labeled test samples. EMBER provides pre-extracted features, so you don’t need a malware collection to train the model.

Note

This tutorial intentionally uses a feedforward network (not a CNN or transformer) on tabular features. The goal is to understand when neural networks are the right tool for security classification, and when simpler methods like XGBoost (from tutorial 3) are sufficient.

Why neural networks for malware classification?

Short answer: they’re not always better. For tabular data, gradient-boosted trees (XGBoost, LightGBM) often match or beat neural networks with less tuning. A feedforward network like the one in this tutorial is mainly interesting when you want a learned intermediate representation in addition to a classifier.

  ┌──────────────────────────────────┬─────────────────────────┬─────────────────────────────────────────────────────────┐
  │ Characteristic                   │ Tree-based (XGBoost)    │ Neural network                                          │
  ├──────────────────────────────────┼─────────────────────────┼─────────────────────────────────────────────────────────┤
  │ Tabular features                 │ Excellent               │ Good (needs tuning)                                     │
  │ Feature interactions             │ Automatic (tree splits) │ Learned (hidden layers)                                 │
  │ High-dimensional sparse features │ Strong default          │ Can work well, especially if you later reuse embeddings │
  │ Transfer learning                │ Not possible            │ Possible (fine-tuning)                                  │
  │ Online updates                   │ Limited                 │ Natural with SGD-style training                         │
  │ Training speed                   │ Fast                    │ Slower                                                  │
  │ Interpretability                 │ Feature importance      │ Harder (SHAP, gradients)                                │
  └──────────────────────────────────┴─────────────────────────┴─────────────────────────────────────────────────────────┘

Neural networks shine when you want embeddings (vector representations that capture similarity) or when you plan to transfer the model to related tasks. A malware classifier’s hidden layer activations can encode a useful representation of binary structure; you can reuse that embedding for clustering, similarity search, or downstream fine-tuning. If you only need the strongest tabular baseline on EMBER, start with LightGBM.

Setting up the environment

python -m venv venv && source venv/bin/activate
pip install torch numpy pandas scikit-learn matplotlib lightgbm joblib
pip install "lief==0.9.0"
pip install git+https://github.com/elastic/ember.git

The ember package provides the feature extraction and dataset utilities. lief is the binary parsing library used by EMBER’s feature extractor. EMBER version 2 features were originally generated with LIEF 0.9.0, so pinning that version keeps feature extraction behavior consistent with the published benchmark.

Note

PyTorch does not require a GPU for this tutorial. On a modern desktop CPU, a reduced run on a few hundred thousand samples is reasonable; a full labeled EMBER 2018 run is much heavier and benefits from more RAM and/or a GPU. The code below avoids the worst memory traps by using a validation split, batching evaluation, and storing normalized arrays as float32.

Downloading EMBER

mkdir -p malware-classifier/data
mkdir -p malware-classifier/models
cd malware-classifier

# Download the EMBER 2018 archive from the official EMBER README:
# https://github.com/elastic/ember
curl -L https://ember.elastic.co/ember_dataset_2018_2.tar.bz2 -o data/ember2018.tar.bz2
tar -xjf data/ember2018.tar.bz2 -C data

# Vectorize the extracted JSON feature files into the binary arrays used below
python -c "import ember; ember.create_vectorized_features('data/ember2018/')"

The full archive is a multi-gigabyte download covering about 1 million PE files. After filtering unlabeled rows, the labeled subset used below is about 600K train and 200K test. Each sample has 2,381 numerical features.

Understanding the EMBER features

EMBER extracts eight feature groups from each PE file’s static properties (no execution needed):

EMBER feature groups (2,381 total features):

  ┌─────────────────────┬───────┬─────────────────────────────────┐
  │ Feature Group       │ Dims  │ What it captures                │
  ├─────────────────────┼───────┼─────────────────────────────────┤
  │ Byte histogram      │ 256   │ Distribution of byte values     │
  │ Byte-entropy hist   │ 256   │ Entropy of byte windows         │
  │ String info         │ 104   │ Extracted string statistics     │
  │ General file info   │ 10    │ File size, virtual size, etc.   │
  │ PE header info      │ 62    │ Timestamp, subsystem, DLL flags │
  │ Section info        │ 255   │ Per-section size, entropy, perms│
  │ Import info         │ 1,280 │ Imported libraries and functions│
  │ Export info         │ 128   │ Exported function features      │
  │ Data directories    │ 30    │ Resource, debug, TLS info       │
  └─────────────────────┴───────┴─────────────────────────────────┘

The import info is the largest and most discriminative group. Malware tends to import specific Windows API functions (e.g., VirtualAlloc, CreateRemoteThread, WriteProcessMemory) that are uncommon in benign software. The byte-entropy histogram captures whether sections are packed or encrypted (high entropy).
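The 1,280 import dimensions are not one column per API function; the extractor hashes library names and library:function pairs into fixed-size buckets. The sketch below illustrates the general idea with scikit-learn's FeatureHasher, using the bucket split from the table above (256 library bins + 1,024 pair bins); the sample import table is hypothetical, and EMBER's own extractor implements the hashing internally.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical import table parsed from one PE file: library -> functions
imports = {
    "kernel32.dll": ["VirtualAlloc", "WriteProcessMemory"],
    "advapi32.dll": ["RegOpenKeyExA"],
}

libraries = [lib.lower() for lib in imports]
pairs = [f"{lib.lower()}:{fn}" for lib, fns in imports.items() for fn in fns]

# Hash variable-length string sets into fixed-size numeric vectors
lib_vec = FeatureHasher(256, input_type="string").transform([libraries]).toarray()[0]
pair_vec = FeatureHasher(1024, input_type="string").transform([pairs]).toarray()[0]

import_features = np.hstack([lib_vec, pair_vec])
print(import_features.shape)  # (1280,)
```

Hashing is what lets a fixed-width model consume an unbounded vocabulary of API names, at the cost of occasional bucket collisions.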

Loading the data

import ember
import numpy as np

# Load vectorized features
X_train, y_train, X_test, y_test = ember.read_vectorized_features('data/ember2018')

# Filter out unlabeled samples (y == -1 in EMBER means "unlabeled")
train_mask = y_train != -1
test_mask = y_test != -1

X_train = X_train[train_mask]
y_train = y_train[train_mask]
X_test = X_test[test_mask]
y_test = y_test[test_mask]

print(f'Training: {len(X_train)} samples ({sum(y_train == 1)} malicious, {sum(y_train == 0)} benign)')
print(f'Test:     {len(X_test)} samples ({sum(y_test == 1)} malicious, {sum(y_test == 0)} benign)')
print(f'Features: {X_train.shape[1]}')
Training: 600000 samples (300000 malicious, 300000 benign)
Test:     200000 samples (100000 malicious, 100000 benign)
Features: 2381

Practical resource note

The full labeled EMBER split is large enough to stress commodity laptops. The raw feature matrix is manageable on disk, but naive preprocessing can expand it into tens of gigabytes of RAM once you account for float64 scaling buffers, copies, torch tensors, and the test set. If you have 16 GB of RAM, start with a capped run such as 200K training rows, 50K validation rows, and 50K test rows, then scale up only after the pipeline works end to end.
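The float32/float64 gap the note describes is easy to check with back-of-envelope arithmetic on the labeled training split quoted above:

```python
import numpy as np

rows, cols = 600_000, 2_381  # labeled EMBER 2018 training split

for dtype in (np.float32, np.float64):
    nbytes = rows * cols * np.dtype(dtype).itemsize
    print(f"{np.dtype(dtype).name}: {nbytes / 1e9:.1f} GB")
# float32: 5.7 GB
# float64: 11.4 GB
```

Any intermediate float64 copy made during preprocessing doubles that footprint while it exists, which is why the pipeline below casts to float32 as early as possible.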

Feature normalization

Neural networks are sensitive to feature scale. Normalize features to zero mean and unit variance, then cast the arrays to float32 so they match PyTorch’s default tensor dtype and use half the memory of float64:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train_raw, X_val_raw, y_train_raw, y_val_raw = train_test_split(
    X_train,
    y_train,
    test_size=0.1,
    random_state=42,
    stratify=y_train,
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw).astype(np.float32, copy=False)
X_val_scaled = scaler.transform(X_val_raw).astype(np.float32, copy=False)
X_test_scaled = scaler.transform(X_test).astype(np.float32, copy=False)

# Replace NaN/inf values that may result from constant features
X_train_scaled = np.nan_to_num(X_train_scaled, nan=0.0, posinf=0.0, neginf=0.0)
X_val_scaled = np.nan_to_num(X_val_scaled, nan=0.0, posinf=0.0, neginf=0.0)
X_test_scaled = np.nan_to_num(X_test_scaled, nan=0.0, posinf=0.0, neginf=0.0)

Tip

XGBoost doesn’t need feature scaling because tree splits are scale-invariant. Neural networks do: gradient descent is sensitive to the magnitude of input features, and unnormalized inputs lead to exploding gradients and poor convergence.

Building the neural network

Create malware-classifier/model.py.

import torch
import torch.nn as nn

class MalwareClassifier(nn.Module):
    """Feedforward neural network for PE malware classification.

    Architecture:
      Input (2381) → FC(1024) → ReLU → BN → Dropout → FC(512) → ReLU → BN → Dropout
      → FC(256) → ReLU → BN → Dropout → FC(128) → ReLU → FC(1)
    """

    def __init__(self, input_dim=2381, dropout=0.3):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.ReLU(),
            nn.BatchNorm1d(1024),
            nn.Dropout(dropout),

            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Dropout(dropout),

            nn.Linear(512, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(dropout),

            nn.Linear(256, 128),
            nn.ReLU(),

            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.network(x)

    def get_embedding(self, x):
        """Extract the 128-dim embedding from the penultimate layer."""
        # Run through all layers except the last Linear(128, 1)
        for layer in list(self.network.children())[:-1]:
            x = layer(x)
        return x

Architecture choices

Depth: Four hidden layers (1024 → 512 → 256 → 128). Deeper than needed for simple tabular data, but the progressive dimensionality reduction creates a useful embedding bottleneck at the 128-dim layer.

BatchNorm: Normalizes activations within each mini-batch, stabilizing training. Particularly useful here because the 2,381 input features have heterogeneous distributions.

Dropout (0.3): Randomly zeros 30% of activations during training, preventing overfitting. With 600K training samples overfitting is moderate, but the model has ~3.1M parameters, so dropout provides useful regularization.
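The ~3.1M figure can be verified by hand from the layer widths: weights plus biases for each Linear layer, and a learnable gamma/beta pair per BatchNorm feature.

```python
# Parameter count for the architecture above, computed from layer widths
dims = [2381, 1024, 512, 256, 128, 1]

linear = sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))
batchnorm = sum(2 * d for d in (1024, 512, 256))  # gamma and beta per feature

print(f"{linear + batchnorm:,} parameters")  # 3,131,905
```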

No Sigmoid in forward(): We use BCEWithLogitsLoss, which applies the sigmoid internally for numerical stability.
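A small demonstration of why the sigmoid stays out of forward(). For a confidently wrong prediction, sigmoid saturates to exactly 1.0 in float32, so BCELoss on probabilities hits its internal clamp, while BCEWithLogitsLoss recovers the true loss directly from the logit:

```python
import torch
import torch.nn as nn

logit = torch.tensor([[40.0]])   # model is very confident...
target = torch.tensor([[0.0]])   # ...and wrong

# sigmoid(40) rounds to exactly 1.0 in float32, so -log(1 - p) is -log(0);
# BCELoss clamps the result at 100 instead of returning inf
naive = nn.BCELoss()(torch.sigmoid(logit), target)

# Computed stably from the logit via log-sum-exp: log(1 + exp(40)) ≈ 40
stable = nn.BCEWithLogitsLoss()(logit, target)

print(naive.item(), stable.item())  # 100.0 40.0
```

The clamped value is arbitrary; the stable loss is the mathematically correct one, which is what the optimizer should see.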

Training

Create malware-classifier/train.py.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import joblib
from pathlib import Path
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from model import MalwareClassifier

def evaluate(model, X, y, batch_size, device):
    loader = DataLoader(
        TensorDataset(torch.from_numpy(X), torch.from_numpy(y.astype(np.float32)).unsqueeze(1)),
        batch_size=batch_size,
        shuffle=False,
    )

    logits_list = []
    total_loss = 0.0
    criterion = nn.BCEWithLogitsLoss()

    model.eval()
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            logits = model(X_batch)
            loss = criterion(logits, y_batch)
            total_loss += loss.item() * len(X_batch)
            logits_list.append(logits.cpu())

    logits = torch.cat(logits_list).numpy().ravel()
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs > 0.5).astype(int)
    avg_loss = total_loss / len(X)
    auc = roc_auc_score(y, probs)
    return avg_loss, auc, preds, probs

def train(X_train, y_train, X_val, y_val, epochs=20, batch_size=4096, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f'Using device: {device}')

    # Convert to tensors
    X_train_t = torch.from_numpy(X_train)
    y_train_t = torch.from_numpy(y_train.astype(np.float32)).unsqueeze(1)

    train_loader = DataLoader(
        TensorDataset(X_train_t, y_train_t),
        batch_size=batch_size,
        shuffle=True,
    )

    model = MalwareClassifier(input_dim=X_train.shape[1]).to(device)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            optimizer.zero_grad()
            output = model(X_batch)
            loss = criterion(output, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * len(X_batch)

        train_loss = total_loss / len(X_train_t)
        val_loss, val_auc, _, _ = evaluate(model, X_val, y_val, batch_size, device)
        scheduler.step(val_loss)
        current_lr = optimizer.param_groups[0]['lr']

        print(
            f'Epoch {epoch+1:2d}/{epochs}  '
            f'train_loss={train_loss:.4f}  val_loss={val_loss:.4f}  '
            f'val_AUC={val_auc:.4f}  lr={current_lr:.1e}'
        )

    return model

if __name__ == '__main__':
    import ember

    MAX_TRAIN = 200_000   # raise toward 600_000 on machines with more RAM
    MAX_TEST = 50_000     # raise toward 200_000 for the full benchmark

    X_train, y_train, X_test, y_test = ember.read_vectorized_features('data/ember2018')

    mask_tr = y_train != -1
    mask_te = y_test != -1
    X_train, y_train = X_train[mask_tr], y_train[mask_tr]
    X_test, y_test = X_test[mask_te], y_test[mask_te]

    if MAX_TRAIN < len(X_train):
        X_train, _, y_train, _ = train_test_split(
            X_train,
            y_train,
            train_size=MAX_TRAIN,
            random_state=42,
            stratify=y_train,
        )

    if MAX_TEST < len(X_test):
        X_test_eval_raw, _, y_test_eval, _ = train_test_split(
            X_test,
            y_test,
            train_size=MAX_TEST,
            random_state=42,
            stratify=y_test,
        )
    else:
        X_test_eval_raw, y_test_eval = X_test, y_test

    X_train_raw, X_val_raw, y_train_raw, y_val_raw = train_test_split(
        X_train,
        y_train,
        test_size=0.1,
        random_state=42,
        stratify=y_train,
    )

    scaler = StandardScaler()
    X_train_scaled = np.nan_to_num(
        scaler.fit_transform(X_train_raw).astype(np.float32, copy=False)
    )
    X_val_scaled = np.nan_to_num(
        scaler.transform(X_val_raw).astype(np.float32, copy=False)
    )
    X_test_scaled = np.nan_to_num(
        scaler.transform(X_test_eval_raw).astype(np.float32, copy=False)
    )

    model = train(X_train_scaled, y_train_raw, X_val_scaled, y_val_raw)

    test_loss, test_auc, test_preds, _ = evaluate(
        model, X_test_scaled, y_test_eval, batch_size=4096, device=next(model.parameters()).device
    )

    print('\n--- Held-out Test Set Performance ---')
    print(classification_report(y_test_eval, test_preds, target_names=['benign', 'malicious']))
    print(f'Test loss: {test_loss:.4f}')
    print(f'Test ROC AUC: {test_auc:.4f}')

    Path('models').mkdir(exist_ok=True)
    torch.save(model.state_dict(), 'models/malware_classifier.pt')

    joblib.dump(scaler, 'models/scaler.pkl')
    print('Model and scaler saved.')
python train.py
Using device: cpu
Epoch  1/20  train_loss=0.3271  val_loss=0.2329  val_AUC=0.9684  lr=1.0e-03
Epoch  2/20  train_loss=0.2108  val_loss=0.1681  val_AUC=0.9821  lr=1.0e-03
...
Epoch 15/20  train_loss=0.0710  val_loss=0.0594  val_AUC=0.9976  lr=1.0e-04
...
Epoch 20/20  train_loss=0.0578  val_loss=0.0511  val_AUC=0.9981  lr=1.0e-05

--- Held-out Test Set Performance ---
              precision    recall  f1-score   support

      benign       0.99      0.99      0.99     25000
   malicious       0.99      0.99      0.99     25000

    accuracy                           0.99     50000
   macro avg       0.99      0.99      0.99     50000

Test loss: 0.0507
Test ROC AUC: 0.9980

The important methodological detail is that the test set stays untouched until the end. During development, you should watch validation metrics, not test metrics.
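The fixed 20-epoch loop above keeps whatever weights exist at the final epoch, regardless of where validation loss bottomed out. A minimal early-stopping helper (hypothetical, not part of train.py above) that tracks the best validation loss and remembers the best weights might look like:

```python
import copy

class EarlyStopper:
    """Stop when val loss hasn't improved for `patience` epochs; keep best weights."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss, model):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(model.state_dict())
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Inside the epoch loop: `if stopper.step(val_loss, model): break`, then restore with `model.load_state_dict(stopper.best_state)` before the final test-set evaluation.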

Comparing with LightGBM

To be rigorous, train a tree-based baseline on the same train/validation split and compare on the same held-out test set.

import lightgbm as lgb  # EMBER's default benchmark uses LightGBM
from sklearn.metrics import classification_report, roc_auc_score

# LightGBM (EMBER's published baseline). Tree models use the unscaled features.
lgb_model = lgb.LGBMClassifier(
    n_estimators=1000,
    max_depth=10,
    learning_rate=0.05,
    num_leaves=128,
    n_jobs=-1,
)
lgb_model.fit(X_train_raw, y_train_raw)
lgb_probs = lgb_model.predict_proba(X_test_eval_raw)[:, 1]
lgb_preds = (lgb_probs > 0.5).astype(int)

lgb_auc = roc_auc_score(y_test_eval, lgb_probs)
print(f'LightGBM AUC: {lgb_auc:.4f}')
print(classification_report(y_test_eval, lgb_preds, target_names=['benign', 'malicious']))

If you use the reduced-memory capped run from the training script, apply the same caps here so both models see the same rows.

Typical results on the full labeled EMBER benchmark:

  ┌────────────────┬────────┬────────────────┬───────────────┐
  │ Model          │ AUC    │ F1 (malicious) │ Training time │
  ├────────────────┼────────┼────────────────┼───────────────┤
  │ LightGBM       │ 0.9991 │ 0.993          │ ~3 min        │
  │ Neural network │ 0.9987 │ 0.990          │ ~10 min       │
  └────────────────┴────────┴────────────────┴───────────────┘

The tree-based model slightly outperforms the neural network on this task, consistent with the general finding that gradient-boosted trees are hard to beat on tabular data. The neural network’s advantage is elsewhere: embeddings and transfer learning. XGBoost usually lands in the same range, but LightGBM is the cleaner runnable baseline shown here.

Extracting and using embeddings

The 128-dimensional embedding from the penultimate layer encodes a learned representation of each binary. Samples with similar static structure often land near each other in this space, which makes the embedding useful for similarity search and exploratory clustering.

import torch
import numpy as np
from model import MalwareClassifier

def extract_embeddings(model, X, batch_size=4096):
    """Extract 128-dim embeddings from the penultimate layer."""
    model.eval()
    device = next(model.parameters()).device
    embeddings = []

    for i in range(0, len(X), batch_size):
        batch = torch.FloatTensor(X[i:i+batch_size]).to(device)
        with torch.no_grad():
            emb = model.get_embedding(batch)
        embeddings.append(emb.cpu().numpy())

    return np.vstack(embeddings)

# Load trained model
model = MalwareClassifier()
model.load_state_dict(torch.load('models/malware_classifier.pt', map_location='cpu'))

# Extract embeddings for the test set
embeddings = extract_embeddings(model, X_test_scaled)
# Use y_test_eval if you followed the capped evaluation split from train.py;
# otherwise replace it with the labels matching the matrix you embedded.
eval_labels = y_test_eval
print(f'Embedding shape: {embeddings.shape}')  # (num_eval_samples, 128)

Find binaries similar to a known malware sample:

from sklearn.metrics.pairwise import cosine_similarity

def find_similar(query_embedding, all_embeddings, top_k=10):
    """Find the top-k most similar samples by cosine similarity."""
    sims = cosine_similarity(query_embedding.reshape(1, -1), all_embeddings)[0]
    top_indices = np.argsort(sims)[::-1][:top_k]
    return [(idx, sims[idx]) for idx in top_indices]

# Find samples similar to test sample 0
similar = find_similar(embeddings[0], embeddings)
for idx, sim in similar:
    label = 'malicious' if eval_labels[idx] == 1 else 'benign'
    print(f'  Sample {idx}: similarity={sim:.4f} ({label})')

Clustering suspicious samples

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Cluster malware embeddings
malware_mask = eval_labels == 1
malware_embeddings = embeddings[malware_mask]

kmeans = KMeans(n_clusters=20, random_state=42, n_init=10)
clusters = kmeans.fit_predict(malware_embeddings)

# Visualize with t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=50)
coords = tsne.fit_transform(malware_embeddings[:5000])  # subsample for speed

plt.figure(figsize=(12, 8))
scatter = plt.scatter(coords[:, 0], coords[:, 1], c=clusters[:5000], cmap='tab20', s=1, alpha=0.5)
plt.title('Malware Embedding Clusters (t-SNE)')
plt.colorbar(scatter, label='Cluster')
plt.savefig('malware_clusters.png', dpi=150)
print('Saved malware_clusters.png')

These clusters are unsupervised groupings of similar embeddings, not ground-truth malware families. In practice, they can surface structural patterns such as shared packers or repeated feature profiles, but you need external family labels or analyst validation before calling any cluster a family.
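When family labels do exist for a vetted subset (for example from analyst triage), agreement between clusters and families can be quantified with the adjusted Rand index. The labels below are placeholders for illustration:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Placeholder ground-truth families for a small vetted subset; cluster IDs
# are arbitrary labels from KMeans, only the grouping matters to the metric
family_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_ids = np.array([7, 7, 3, 3, 3, 3])  # merges two families into one cluster

ari = adjusted_rand_score(family_labels, cluster_ids)
print(f"ARI: {ari:.2f}")  # 1.0 = perfect agreement, ~0.0 = chance
```

An ARI near zero on a vetted subset is a signal that the clusters reflect something other than family structure, such as shared packers.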

Classifying new binaries

To classify a binary that isn’t in the EMBER dataset, extract features from it using EMBER’s feature extractor:

import ember
import joblib
import numpy as np
import torch
from model import MalwareClassifier

def classify_binary(file_path, model, scaler):
    """Classify a PE file as malicious or benign."""
    # Read the raw bytes
    with open(file_path, 'rb') as f:
        bytez = f.read()

    # Extract EMBER features
    features = np.array(ember.features.PEFeatureExtractor(2).feature_vector(bytez))
    features = features.reshape(1, -1)

    # Scale
    features_scaled = np.nan_to_num(scaler.transform(features))

    # Predict
    model.eval()
    with torch.no_grad():
        logits = model(torch.FloatTensor(features_scaled))
        prob = torch.sigmoid(logits).item()

    verdict = 'malicious' if prob > 0.5 else 'benign'
    confidence = prob if prob > 0.5 else 1 - prob
    return verdict, confidence

# Usage
model = MalwareClassifier()
model.load_state_dict(torch.load('models/malware_classifier.pt', map_location='cpu'))
scaler = joblib.load('models/scaler.pkl')

verdict, confidence = classify_binary('/path/to/sample.exe', model, scaler)
print(f'{verdict} (confidence: {confidence:.4f})')

Warning

Evasion. Static classifiers are vulnerable to adversarial evasion. Attackers can modify PE features (add benign imports, pad sections, change timestamps) to shift the model’s prediction without changing the malware’s behavior. This is an active research area; adversarial robustness for malware classifiers is significantly harder than for image classifiers because the attacker must preserve functional equivalence.

Limitations

No behavioral analysis. Static features miss runtime behavior: network communication, file system changes, process injection. A packed binary with encrypted payload may look benign statically but execute malware when run. Combine static classifiers with dynamic analysis (sandboxing) for defense in depth.

Dataset drift. The EMBER 2018 dataset reflects malware from 2018. Malware evolves: new packers, new evasion techniques, new functionality. A model trained on 2018 data will degrade on 2026 samples. Regular retraining on recent samples is essential.

Label quality. EMBER labels come from VirusTotal consensus. Some labels are wrong (benign files flagged by a few engines, or malware missed by most). The model learns from these noisy labels, which sets a ceiling on accuracy.

Tree-based models are usually sufficient. For tabular feature classification, LightGBM or XGBoost will give you comparable accuracy with faster training, easier tuning, and better interpretability. Use neural networks when you specifically need embeddings, transfer learning, or online updating, not as a default choice.