Tutorial

Training a ROP Gadget Classifier with XGBoost

Build a machine learning classifier that predicts whether a ROP gadget is useful for exploit chains, using features extracted from x64 disassembly and XGBoost.

7 min read · Advanced

Prerequisites

  • Comfortable with x64 assembly and ROP concepts
  • Python experience
  • Familiarity with ROP gadget tools (ropper, ROPgadget)

Part 3 of 3 in ML for Security


If you’ve built ROP chains, you know the tedious part. You dump thousands of gadgets from a binary, then spend the next hour scrolling through them, mentally filtering out the junk — gadgets that touch too many registers, clobber your setup, or end with something other than ret. The thinking part of ROP chaining (what do I need the chain to do?) is usually fast. The searching part takes all the time.

What if a model could score every gadget for you? Not a string match — an actual classifier trained on the features that make a gadget useful: which registers it controls, whether it has side effects, how many instructions it runs, and whether it clobbers state you care about.

This tutorial trains an XGBoost classifier to do exactly that. You’ll extract structural features from gadgets, label a training set, train the model, and use it to rank gadgets by usefulness. The result is a scorer you can plug into your existing pwntools workflow.

What makes a gadget useful

Before writing any code, you need a definition of “useful” that a model can learn. From experience building ROP chains, a good gadget generally has these properties:

  • Controls specific registers — pop rdi; ret sets an argument register with no side effects
  • Minimal instructions — fewer instructions mean fewer side effects and less stack consumption
  • Clean termination — ends with ret, not jmp or call (which complicate chaining)
  • No memory writes — gadgets that write to memory can corrupt state unpredictably
  • No conditional jumps — branches make the gadget unreliable across inputs

And the inverse — a bad gadget — tends to have many instructions, clobbers multiple registers you didn’t ask for, performs memory operations, or doesn’t end cleanly.

Gadget                             Verdict   Why
pop rdi; ret                       Useful    Sets argument register, no side effects
pop rsi; pop r15; ret              Useful    Sets rsi, r15 is an acceptable cost
mov rax, [rbp-0x8]; leave; ret     Marginal  Memory read + leave clobbers rbp and rsp
add [rbp-0x3d], ebx; nop; ret      Bad       Memory write, depends on rbp state
xor eax, eax; mov [rdi], rax; ret  Bad       Memory write to arbitrary address

Usefulness isn’t truly binary in the real world — it depends on context. A leave; ret gadget is terrible for most chains but essential for stack pivots. We’ll handle this by training a general-purpose classifier and adding context-specific features later.

Setting up the environment

python -m venv venv && source venv/bin/activate
pip install xgboost scikit-learn pandas numpy

You’ll also need ropper or ROPgadget to extract raw gadgets from binaries.

pip install ropper

Collect a few binaries to use as gadget sources. Larger binaries yield more gadgets and better training data.

mkdir -p rop-classifier/{data,models}

# Copy some common large binaries
cp /usr/lib/x86_64-linux-gnu/libc.so.6 rop-classifier/data/libc.so.6
cp /usr/bin/python3 rop-classifier/data/python3
cp /usr/lib/x86_64-linux-gnu/libcrypto.so* rop-classifier/data/

Note

The specific binaries don’t matter much. What matters is volume and variety — you want thousands of gadgets with diverse instruction patterns. libc alone typically produces 10,000+ gadgets.

Extracting gadgets

Create rop-classifier/extract.py. This uses ropper to dump gadgets, then parses each one into structured instruction data.

import argparse
import os
import subprocess
import json

def extract_gadgets_ropper(binary_path, max_depth=6):
    """Extract gadgets using ropper and return as list of (address, asm_string)."""
    result = subprocess.run(
        ['ropper', '--file', binary_path, '--nocolor', f'--depth={max_depth}'],
        capture_output=True, text=True, timeout=120,
    )

    gadgets = []
    for line in result.stdout.splitlines():
        line = line.strip()
        if not line or not line.startswith('0x'):
            continue
        # Format: "0x00001234: pop rdi; ret;"
        parts = line.split(':', 1)
        if len(parts) != 2:
            continue
        addr = parts[0].strip()
        asm = parts[1].strip().rstrip(';').strip()
        gadgets.append((addr, asm))

    return gadgets

def disassemble_gadget(asm_string):
    """Parse a ropper asm string into a structured instruction list."""
    instructions = []
    parts = [p.strip() for p in asm_string.split(';') if p.strip()]

    for part in parts:
        tokens = part.split(None, 1)
        mnemonic = tokens[0] if tokens else ''
        operands = tokens[1] if len(tokens) > 1 else ''
        instructions.append({
            'mnemonic': mnemonic,
            'operands': operands,
            'full': part,
        })

    return instructions

def extract_all(binary_path):
    """Extract and disassemble all gadgets from a binary."""
    raw_gadgets = extract_gadgets_ropper(binary_path)
    print(f'  Extracted {len(raw_gadgets)} raw gadgets from {binary_path}')

    results = []
    for addr, asm in raw_gadgets:
        instructions = disassemble_gadget(asm)
        results.append({
            'address': addr,
            'asm': asm,
            'instructions': instructions,
            'source': binary_path,
        })

    return results

def save_gadgets(gadgets, out_path='data/gadgets_raw.json', append=False):
    """Save gadgets to JSON, optionally appending to an existing dataset."""
    all_gadgets = []
    if append and os.path.exists(out_path):
        with open(out_path) as f:
            all_gadgets = json.load(f)
    all_gadgets.extend(gadgets)
    with open(out_path, 'w') as f:
        json.dump(all_gadgets, f, indent=2)
    return len(all_gadgets)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('binary')
    parser.add_argument('--out', default='data/gadgets_raw.json')
    parser.add_argument('--append', action='store_true')
    args = parser.parse_args()

    gadgets = extract_all(args.binary)
    total = save_gadgets(gadgets, out_path=args.out, append=args.append)
    mode = 'appended to' if args.append else 'saved to'
    print(f'{len(gadgets)} gadgets {mode} {args.out} ({total} total)')

cd rop-classifier
python extract.py data/libc.so.6 --out data/gadgets_raw.json
  Extracted 14832 raw gadgets from data/libc.so.6
14832 gadgets saved to data/gadgets_raw.json (14832 total)
python extract.py data/python3 --out data/gadgets_raw.json --append
  Extracted 9621 raw gadgets from data/python3
9621 gadgets appended to data/gadgets_raw.json (24453 total)
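Each record in gadgets_raw.json carries the raw string plus the parsed instruction list. To see what the parser emits for one gadget, here is the split-on-semicolon body of disassemble_gadget repeated inline so the snippet runs on its own:

```python
def parse(asm_string):
    """Same split-on-';' parsing as disassemble_gadget in extract.py."""
    out = []
    for part in (p.strip() for p in asm_string.split(';') if p.strip()):
        tokens = part.split(None, 1)
        out.append({
            'mnemonic': tokens[0] if tokens else '',
            'operands': tokens[1] if len(tokens) > 1 else '',
            'full': part,
        })
    return out

for instr in parse('pop rsi; pop r15; ret'):
    print(instr['mnemonic'], '|', instr['operands'])
# pop | rsi
# pop | r15
# ret |
```

This flat mnemonic/operands split is all the feature extractor in the next section needs.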

Feature engineering

This is where exploit development knowledge becomes ML training signal. Create rop-classifier/features.py.

import numpy as np

# x64 System V argument registers (in calling convention order)
ARG_REGISTERS = {'rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9'}
# Registers commonly needed in ROP chains
CONTROL_REGISTERS = {'rax', 'rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9', 'rsp', 'rbp'}
ALL_GP_REGISTERS = CONTROL_REGISTERS | {'rbx', 'r10', 'r11', 'r12', 'r13', 'r14', 'r15'}

# Mnemonics that write to memory
MEMORY_WRITE_OPS = {'mov', 'add', 'sub', 'xor', 'or', 'and', 'stos', 'push'}
# Mnemonics that read from memory
MEMORY_READ_OPS = {'mov', 'lea', 'lods', 'pop', 'cmp'}

def get_registers_in_operand(operand):
    """Extract register names from an operand string."""
    regs = set()
    text = operand.lower()
    for ch in '[]()+-*,:':
        text = text.replace(ch, ' ')
    tokens = set(text.split())
    for reg in ALL_GP_REGISTERS:
        if reg in tokens:
            regs.add(reg)
        # Also catch 32-bit variants (e.g., edi -> rdi)
        if reg.startswith('r') and f'e{reg[1:]}' in tokens:
            regs.add(reg)
    return regs

def has_memory_operand(operand):
    """Check if operand is a memory reference (contains brackets)."""
    return '[' in operand

def extract_features(gadget):
    """Convert a gadget dict into a feature dict."""
    instructions = gadget['instructions']
    n_instr = len(instructions)
    mnemonics = [i['mnemonic'].lower() for i in instructions]

    # --- Termination features ---
    last_mnemonic = mnemonics[-1] if mnemonics else ''
    ends_with_ret = last_mnemonic == 'ret'
    ends_with_call = last_mnemonic == 'call'
    ends_with_jmp = last_mnemonic in ('jmp', 'je', 'jne', 'jz', 'jnz', 'ja', 'jb', 'jge', 'jle')

    # --- Register analysis ---
    # Registers SET by pop instructions (directly controlled from stack)
    pop_targets = set()
    for i in instructions:
        if i['mnemonic'].lower() == 'pop':
            pop_targets |= get_registers_in_operand(i['operands'])

    # All registers written to (destination of mov, xor, add, pop, lea, etc.)
    written_regs = set()
    read_regs = set()
    for i in instructions:
        ops = i['operands']
        if ',' in ops:
            dest, src = ops.split(',', 1)
            if not has_memory_operand(dest):
                written_regs |= get_registers_in_operand(dest)
            read_regs |= get_registers_in_operand(src)
        elif i['mnemonic'].lower() == 'pop':
            written_regs |= get_registers_in_operand(ops)
        elif i['mnemonic'].lower() == 'push':
            read_regs |= get_registers_in_operand(ops)

    # --- Memory access features ---
    n_mem_writes = 0
    n_mem_reads = 0
    for i in instructions:
        ops = i['operands']
        mn = i['mnemonic'].lower()
        if ',' in ops:
            dest, src = ops.split(',', 1)
            if has_memory_operand(dest) and mn in MEMORY_WRITE_OPS:
                n_mem_writes += 1
            if has_memory_operand(src) and mn in MEMORY_READ_OPS:
                n_mem_reads += 1

    # --- Control flow features ---
    has_conditional_jump = any(m in (
        'je', 'jne', 'jz', 'jnz', 'ja', 'jb', 'jae', 'jbe', 'jge', 'jle', 'jg', 'jl'
    ) for m in mnemonics)
    has_call = 'call' in mnemonics[:-1]  # call in middle (not as terminator)

    # --- Composite features ---
    arg_regs_controlled = len(pop_targets & ARG_REGISTERS)
    total_regs_controlled = len(pop_targets & CONTROL_REGISTERS)
    clobber_count = len(written_regs - pop_targets)  # Registers written but not via pop
    stack_slots = mnemonics.count('pop')  # How many stack values consumed

    # "Purity" score: does the gadget do only pops + ret?
    non_pop_ret = [m for m in mnemonics if m not in ('pop', 'ret', 'nop')]
    is_pure_pop_ret = len(non_pop_ret) == 0 and ends_with_ret

    features = {
        # Structure
        'n_instructions': n_instr,
        'ends_with_ret': int(ends_with_ret),
        'ends_with_call': int(ends_with_call),
        'ends_with_jmp': int(ends_with_jmp),
        'has_conditional_jump': int(has_conditional_jump),
        'has_interior_call': int(has_call),

        # Register control
        'arg_regs_controlled': arg_regs_controlled,
        'total_regs_controlled': total_regs_controlled,
        'pop_count': stack_slots,
        'clobber_count': clobber_count,
        'is_pure_pop_ret': int(is_pure_pop_ret),

        # Memory
        'n_mem_writes': n_mem_writes,
        'n_mem_reads': n_mem_reads,

        # Ratios
        'control_ratio': total_regs_controlled / max(n_instr, 1),
        'clobber_ratio': clobber_count / max(n_instr, 1),
        'nop_ratio': mnemonics.count('nop') / max(n_instr, 1),

        # Specific useful patterns
        'has_xor_self': int(any(
            i['mnemonic'].lower() == 'xor' and
            len(set(i['operands'].replace(' ', '').split(','))) == 1
            for i in instructions
        )),
        'has_leave': int('leave' in mnemonics),
        'has_syscall': int('syscall' in mnemonics),
        'has_int80': int(any(i['full'].strip() == 'int 0x80' for i in instructions)),
    }

    return features

def gadgets_to_matrix(gadgets):
    """Convert gadget list to feature matrix."""
    feature_dicts = [extract_features(g) for g in gadgets]
    if not feature_dicts:
        return np.empty((0, 0)), []
    feature_names = sorted(feature_dicts[0].keys())
    matrix = np.array([[fd[name] for name in feature_names] for fd in feature_dicts])
    return matrix, feature_names
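The trickiest helper above is the operand tokenizer. Isolated here (same logic as get_registers_in_operand, inlined so it runs standalone), it strips addressing punctuation, splits on whitespace, and matches register tokens including 32-bit aliases:

```python
# Standalone sketch of the operand tokenizer from features.py.
ALL_GP = {'rax', 'rbx', 'rcx', 'rdx', 'rsi', 'rdi', 'rbp', 'rsp',
          'r8', 'r9', 'r10', 'r11', 'r12', 'r13', 'r14', 'r15'}

def regs_in(operand):
    text = operand.lower()
    for ch in '[]()+-*,:':          # turn addressing syntax into separators
        text = text.replace(ch, ' ')
    tokens = set(text.split())
    found = {r for r in ALL_GP if r in tokens}
    # 32-bit aliases: edi -> rdi, eax -> rax, etc.
    found |= {r for r in ALL_GP if r.startswith('r') and 'e' + r[1:] in tokens}
    return found

print(sorted(regs_in('[rbp - 0x3d]')))          # ['rbp']
print(sorted(regs_in('edi')))                   # ['rdi']
print(sorted(regs_in('qword ptr [rax + rbx*8]')))  # ['rax', 'rbx']
```

Note that this (like the original) doesn't catch low-byte or r8d-style sub-registers; that's a deliberate simplification, not an oversight of the sketch.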

Feature intuition

The features encode what an experienced exploit developer looks for instinctively.

Feature               High value means                      Useful when high?
arg_regs_controlled   Gadget pops into rdi, rsi, rdx, etc.  Yes — argument setup
is_pure_pop_ret       Only pops and ret, nothing else       Yes — minimal side effects
n_mem_writes          Gadget writes to memory               No — unpredictable corruption
clobber_count         Registers written as side effects     No — destroys chain state
has_conditional_jump  Execution path depends on flags       No — unreliable
has_syscall           Contains syscall instruction          Context-dependent
has_xor_self          Pattern like xor eax, eax             Yes — register zeroing
control_ratio         Pops per instruction                  Yes — efficient control

Tip

Why not just filter on these rules? You could write a rule-based ranker using these features directly, and it would work. The advantage of a trained model is that it learns the interactions between features — for example, a leave; ret gadget with 2 instructions is useful (stack pivot), but leave in the middle of a 6-instruction gadget with memory writes is not. XGBoost captures these nonlinear relationships automatically.
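For comparison, here is what a hand-tuned rule looks like: a linear score over the same feature names, with weights picked by eye (the weights are illustrative, not fitted to anything):

```python
# A hand-rolled linear baseline over the extract_features output.
# Weights are guesses -- this is the kind of ranker XGBoost replaces.
def rule_score(feat):
    return (2.0 * feat['arg_regs_controlled']
            + 1.5 * feat['is_pure_pop_ret']
            + 1.0 * feat['ends_with_ret']
            - 2.0 * feat['n_mem_writes']
            - 1.0 * feat['clobber_count']
            - 1.5 * feat['has_conditional_jump'])

# Hand-built feature dicts for two gadgets (subset of the full feature set)
pop_rdi = {'arg_regs_controlled': 1, 'is_pure_pop_ret': 1, 'ends_with_ret': 1,
           'n_mem_writes': 0, 'clobber_count': 0, 'has_conditional_jump': 0}
mem_write = {'arg_regs_controlled': 0, 'is_pure_pop_ret': 0, 'ends_with_ret': 1,
             'n_mem_writes': 1, 'clobber_count': 1, 'has_conditional_jump': 0}

print(rule_score(pop_rdi), rule_score(mem_write))  # 4.5 -2.0
```

It handles the obvious cases, but every interaction (leave is fine only in short pivot gadgets, r15 writes are fine as pop targets) becomes another hand-written special case — exactly the structure the trained model absorbs from labels instead.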

Labeling the training data

This is the manual part. You need labeled examples of useful and not-useful gadgets. Create rop-classifier/label.py — an interactive labeler that presents gadgets and records your judgment.

import json
import random
import os

def load_gadgets(path='data/gadgets_raw.json'):
    with open(path) as f:
        return json.load(f)

def gadget_id(gadget):
    return f"{gadget.get('source', '')}:{gadget.get('address', '')}:{gadget['asm']}"

def load_labels(path='data/labels.json'):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_labels(labels, path='data/labels.json'):
    with open(path, 'w') as f:
        json.dump(labels, f, indent=2)

def auto_label(gadget):
    """Apply heuristic labels for obvious cases. Returns label or None."""
    instructions = gadget['instructions']
    mnemonics = [i['mnemonic'].lower() for i in instructions]

    # Obvious positives: pure pop-ret gadgets controlling argument registers
    non_pop_ret = [m for m in mnemonics if m not in ('pop', 'ret', 'nop')]
    if not non_pop_ret and mnemonics[-1] == 'ret' and len(mnemonics) <= 4:
        for i in instructions:
            if i['mnemonic'].lower() == 'pop':
                reg = i['operands'].strip().lower()
                if reg in ('rdi', 'rsi', 'rdx', 'rcx', 'rax', 'r8', 'r9'):
                    return 1  # useful

    # Obvious positives: xor reg, reg; ret (register zeroing)
    if len(mnemonics) == 2 and mnemonics[0] == 'xor' and mnemonics[1] == 'ret':
        ops = instructions[0]['operands'].replace(' ', '').split(',')
        if len(ops) == 2 and ops[0] == ops[1]:
            return 1

    # Obvious negatives: ends with conditional jump
    if mnemonics[-1] in ('je', 'jne', 'jz', 'jnz', 'ja', 'jb'):
        return 0

    # Obvious negatives: too many instructions with memory writes
    mem_writes = sum(1 for i in instructions if '[' in i.get('operands', '').split(',')[0])
    if mem_writes >= 2 and len(mnemonics) > 4:
        return 0

    return None  # needs manual review

def label_interactive(gadgets, existing_labels, n=500):
    """Interactively label gadgets, using auto-labels where possible."""
    unlabeled = [g for g in gadgets if gadget_id(g) not in existing_labels]
    random.shuffle(unlabeled)

    labels = dict(existing_labels)
    auto_count = 0
    manual_count = 0

    for gadget in unlabeled:
        if len(labels) - len(existing_labels) >= n:
            break

        auto = auto_label(gadget)
        if auto is not None:
            labels[gadget_id(gadget)] = auto
            auto_count += 1
            continue

        # Manual labeling
        print(f"\n  {gadget['asm']}")
        print(f"  Instructions: {len(gadget['instructions'])}")
        while True:
            choice = input("  [1] useful  [0] not useful  [s] skip  [q] quit: ").strip()
            if choice in ('1', '0', 's', 'q'):
                break
        if choice == 'q':
            break
        if choice == 's':
            continue
        labels[gadget_id(gadget)] = int(choice)
        manual_count += 1

        if manual_count % 25 == 0:
            save_labels(labels)
            print(f'  ... saved ({len(labels)} total, {auto_count} auto, {manual_count} manual)')

    save_labels(labels)
    print(f'\nDone. {auto_count} auto-labeled, {manual_count} manually labeled.')
    print(f'Total labels: {len(labels)}')
    return labels

if __name__ == '__main__':
    gadgets = load_gadgets()
    existing = load_labels()
    label_interactive(gadgets, existing, n=1000)
python label.py
  pop rsi; pop r15; ret
  Instructions: 3
  [1] useful  [0] not useful  [s] skip  [q] quit: 1

  add dword ptr [rbp - 0x3d], ebx; nop; ret
  Instructions: 3
  [1] useful  [0] not useful  [s] skip  [q] quit: 0

  ... saved (250 total, 187 auto, 63 manual)

The auto-labeler handles the clear-cut cases (pure pop rdi; ret = useful, ends with conditional jump = not useful), which dramatically reduces manual effort. You only need to judge the ambiguous gadgets.

Note

How many labels do you need? XGBoost can work with surprisingly few labels. 300-500 labeled gadgets (after auto-labeling) is enough to train a useful model. 1,000+ produces noticeably better results on edge cases. You don’t need to label all 14,000 gadgets.

Training the classifier

Create rop-classifier/train.py.

import json
import pickle
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from features import gadgets_to_matrix

def gadget_id(g):
    return f"{g.get('source', '')}:{g.get('address', '')}:{g['asm']}"

def load_labeled_data():
    with open('data/gadgets_raw.json') as f:
        gadgets = json.load(f)
    with open('data/labels.json') as f:
        labels = json.load(f)

    labeled = []
    y = []
    for g in gadgets:
        gid = gadget_id(g)
        if gid in labels:
            labeled.append(g)
            y.append(labels[gid])

    return labeled, np.array(y)

def train():
    gadgets, y = load_labeled_data()
    print(f'Labeled data: {len(gadgets)} gadgets ({sum(y)} useful, {len(y) - sum(y)} not useful)')

    X, feature_names = gadgets_to_matrix(gadgets)

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y,
    )

    model = XGBClassifier(
        n_estimators=300,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        scale_pos_weight=sum(y == 0) / max(sum(y == 1), 1),  # handle class imbalance
        eval_metric='logloss',
        random_state=42,
    )

    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    print('\n--- Test Set Performance ---')
    print(classification_report(y_test, y_pred, target_names=['not useful', 'useful']))

    cm = confusion_matrix(y_test, y_pred)
    print(f'Confusion matrix:\n{cm}')

    # Cross-validation
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f'\n5-fold CV F1: {scores.mean():.3f} (+/- {scores.std():.3f})')

    # Feature importance
    print('\n--- Top 10 Features ---')
    importance = dict(zip(feature_names, model.feature_importances_))
    for name, score in sorted(importance.items(), key=lambda x: -x[1])[:10]:
        print(f'  {name:30s} {score:.4f}')

    # Save
    artifact = {
        'model': model,
        'feature_names': feature_names,
    }
    with open('models/gadget_classifier.pkl', 'wb') as f:
        pickle.dump(artifact, f)

    print('\nModel saved to models/gadget_classifier.pkl')

if __name__ == '__main__':
    train()
python train.py
Labeled data: 847 gadgets (234 useful, 613 not useful)

--- Test Set Performance ---
              precision    recall  f1-score   support

  not useful       0.94      0.96      0.95       123
      useful       0.89      0.85      0.87        47

    accuracy                           0.93       170
   macro avg       0.92      0.90      0.91       170
weighted avg       0.93      0.93      0.93       170

Confusion matrix:
[[118   5]
 [  7  40]]

5-fold CV F1: 0.871 (+/- 0.034)

--- Top 10 Features ---
  is_pure_pop_ret                0.1823
  arg_regs_controlled            0.1547
  n_mem_writes                   0.1201
  clobber_count                  0.0983
  ends_with_ret                  0.0871
  control_ratio                  0.0764
  n_instructions                 0.0612
  has_conditional_jump           0.0498
  clobber_ratio                  0.0387
  has_xor_self                   0.0291

The feature importance ranking confirms what exploit developers know intuitively: is_pure_pop_ret and arg_regs_controlled dominate. But the model also learns subtler patterns — the interaction between instruction count and memory writes, for example — that would be hard to encode as rules.

Tip

Precision vs recall tradeoff

For a gadget ranker, you generally want high recall (don’t miss useful gadgets) at the cost of some precision (include a few false positives). Adjust scale_pos_weight or use predict_proba with a lower threshold to tune this. To build intuition for how moving a decision threshold reshapes the confusion matrix and ROC curve, try the Classifier Threshold Lab.
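A sketch of the threshold adjustment, using made-up predict_proba outputs (the proba values here are invented for illustration):

```python
import numpy as np

# Hypothetical P(useful) column, as returned by model.predict_proba(X)[:, 1]
proba = np.array([0.92, 0.61, 0.48, 0.35, 0.07])

default = (proba >= 0.5).astype(int)        # standard 0.5 cutoff (what predict() does)
recall_biased = (proba >= 0.3).astype(int)  # lower cutoff: keep more candidates

print(default.sum(), recall_biased.sum())  # 2 4
```

Lowering the cutoff admits the 0.48 and 0.35 gadgets — a couple of extra false positives in exchange for not losing a borderline useful gadget.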

Using the classifier in exploits

The real payoff is using the trained model inside your workflow. Create rop-classifier/rank_gadgets.py — a CLI tool that takes a binary, extracts gadgets, scores them, and prints a ranked list.

import pickle
import sys
import numpy as np
from extract import extract_all
from features import extract_features

def load_model(path='models/gadget_classifier.pkl'):
    with open(path, 'rb') as f:
        return pickle.load(f)

def rank(binary_path, top_n=50):
    artifact = load_model()
    model = artifact['model']
    feature_names = artifact['feature_names']

    gadgets = extract_all(binary_path)
    if not gadgets:
        print('No gadgets found.')
        return

    scored = []
    for g in gadgets:
        feat = extract_features(g)
        vector = np.array([[feat[name] for name in feature_names]])
        proba = model.predict_proba(vector)[0][1]  # P(useful)
        scored.append((proba, g))

    scored.sort(key=lambda x: -x[0])

    print(f'\nTop {top_n} gadgets from {binary_path}:\n')
    print(f'{"Score":>6}  {"Gadget"}')
    print(f'{"-----":>6}  {"------"}')
    for proba, g in scored[:top_n]:
        print(f'{proba:>6.3f}  {g["address"]}: {g["asm"]}')

    return scored

if __name__ == '__main__':
    binary = sys.argv[1] if len(sys.argv) > 1 else 'data/libc.so.6'
    top_n = int(sys.argv[2]) if len(sys.argv) > 2 else 50
    rank(binary, top_n)
python rank_gadgets.py /usr/bin/target_binary
Top 50 gadgets from /usr/bin/target_binary:

 Score  Gadget
 -----  ------
 0.987  0x00401234: pop rdi; ret
 0.983  0x00401238: pop rsi; pop r15; ret
 0.974  0x0040123c: pop rdx; ret
 0.961  0x00401240: xor eax, eax; ret
 0.944  0x00401250: pop rcx; pop rbx; ret
 0.912  0x00401260: pop rdi; pop rsi; ret
 0.887  0x00401270: mov rdi, rax; ret
 0.854  0x00401280: pop rbp; ret
 0.831  0x00401290: xor esi, esi; ret
 ...
 0.102  0x00403100: add [rbp-0x3d], ebx; nop; ret
 0.043  0x00403200: mov [rdi], rax; xor eax, eax; add rsp, 0x18; ret

Integrating with pwntools

Drop the scorer into an exploit script. This example finds the best gadget for setting rdi from a scored list.

from pwn import *
import pickle
import numpy as np
from extract import extract_all
from features import extract_features

def find_best_gadget(binary_path, register='rdi', model_path='models/gadget_classifier.pkl'):
    """Find the highest-scoring gadget that controls a specific register."""
    with open(model_path, 'rb') as f:
        artifact = pickle.load(f)
    model = artifact['model']
    feature_names = artifact['feature_names']

    gadgets = extract_all(binary_path)

    candidates = []
    for g in gadgets:
        # Only consider gadgets that pop the target register
        if f'pop {register}' not in g['asm'].lower():
            continue

        feat = extract_features(g)
        vector = np.array([[feat[name] for name in feature_names]])
        proba = model.predict_proba(vector)[0][1]
        candidates.append((proba, g))

    if not candidates:
        return None

    candidates.sort(key=lambda x: -x[0])
    best = candidates[0]
    return int(best[1]['address'], 16)

# Usage in an exploit
elf = ELF('./vulnerable')
pop_rdi = find_best_gadget('./vulnerable', 'rdi')
pop_rsi = find_best_gadget('./vulnerable', 'rsi')

if pop_rdi is None or pop_rsi is None:
    raise RuntimeError('Required gadgets were not found')
log.info(f'pop rdi @ {hex(pop_rdi)}')
log.info(f'pop rsi @ {hex(pop_rsi)}')
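One practical wrinkle: find_best_gadget shells out to ropper on every call, so looking up several registers re-extracts the same binary each time. A small cache fixes that. Sketched here with a stand-in extractor so the snippet runs on its own — in the real script you would wrap extract_all the same way (returning a tuple keeps the cached value hashable and immutable):

```python
import functools

calls = []

@functools.lru_cache(maxsize=None)
def cached_gadgets(binary_path):
    calls.append(binary_path)            # stand-in for the ropper subprocess
    return ('0x401234: pop rdi; ret',)   # placeholder gadget list

cached_gadgets('./vulnerable')
cached_gadgets('./vulnerable')           # served from cache, no re-extraction
print(len(calls))  # 1
```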

Evaluating and improving the model

Once the basic model works, there are a few directions to take it.

Error analysis

Look at what the model gets wrong. Export misclassified gadgets and inspect them.

import json
import numpy as np
from features import extract_features

def error_analysis(model, feature_names, gadgets, labels):
    def gadget_id(g):
        return f"{g.get('source', '')}:{g.get('address', '')}:{g['asm']}"

    errors = []
    for g in gadgets:
        gid = gadget_id(g)
        if gid not in labels:
            continue
        feat = extract_features(g)
        vector = np.array([[feat[name] for name in feature_names]])
        pred = model.predict(vector)[0]
        actual = labels[gid]
        if pred != actual:
            proba = model.predict_proba(vector)[0][1]
            errors.append({
                'asm': g['asm'],
                'predicted': int(pred),
                'actual': actual,
                'confidence': float(proba),
            })

    # Sort by confidence (most confident mistakes first)
    errors.sort(key=lambda e: abs(e['confidence'] - 0.5), reverse=True)
    for e in errors[:20]:
        label = 'FP' if e['predicted'] == 1 else 'FN'
        print(f"  [{label}] conf={e['confidence']:.3f}  {e['asm']}")

Common patterns in errors:

  • False positives: gadgets that look clean (few instructions, ends with ret) but have hidden side effects like writing to [rsp]
  • False negatives: context-dependent gadgets (stack pivots, syscall setups) that are useful for specific chains but don’t match the general pattern

Adding more features

If error analysis reveals patterns the model misses, add features for them.

# Example: detect stack pivot potential
features['modifies_rsp'] = int(any(
    'rsp' in i['operands'].split(',')[0]
    for i in instructions
    if ',' in i['operands'] and i['mnemonic'].lower() not in ('push', 'pop', 'ret')
))

# Example: does the gadget end with a clean ret (no offset)?
features['clean_ret'] = int(
    instructions[-1]['mnemonic'].lower() == 'ret' and
    instructions[-1]['operands'].strip() == ''
)

Multi-class labels

Instead of binary useful/not-useful, label gadgets by purpose: argument setup, register zeroing, stack pivot, syscall dispatch, memory write primitive. Train a multi-class model and search by category.

# Multi-class label scheme
CATEGORIES = {
    0: 'not_useful',
    1: 'arg_setup',      # pop rdi; ret
    2: 'reg_zero',       # xor eax, eax; ret
    3: 'stack_pivot',    # leave; ret / xchg rax, rsp; ret
    4: 'syscall_dispatch', # syscall or int 0x80 gadgets
    5: 'write_primitive',  # mov [reg], reg; ret
}

This turns the ranker into a categorized search engine — “find me the best stack pivot gadget in this binary.”
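XGBClassifier switches to multi-class training automatically when the label vector contains more than two classes, and predict_proba then returns one column per category. A sketch of category search over hypothetical probabilities (the proba matrix is invented for illustration; CATEGORIES is the scheme above as a list indexed by class id):

```python
import numpy as np

CATEGORIES = ['not_useful', 'arg_setup', 'reg_zero',
              'stack_pivot', 'syscall_dispatch', 'write_primitive']

# Hypothetical model.predict_proba(X) rows for three gadgets
proba = np.array([
    [0.05, 0.80, 0.05, 0.05, 0.03, 0.02],   # pop rdi; ret
    [0.10, 0.05, 0.05, 0.70, 0.05, 0.05],   # leave; ret
    [0.15, 0.05, 0.05, 0.05, 0.05, 0.65],   # mov [rdi], rax; ret
])

# Overall category per gadget
print([CATEGORIES[i] for i in proba.argmax(axis=1)])
# ['arg_setup', 'stack_pivot', 'write_primitive']

# "Best stack pivot in this binary": rank by that category's column
pivot_col = CATEGORIES.index('stack_pivot')
print(int(proba[:, pivot_col].argmax()))  # 1
```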

Limitations

This model has real boundaries. Be aware of them.

Context blindness. The model scores each gadget independently. It doesn’t know what chain you’re building or what registers are already set. A pop r15; ret gadget scores low in general, but it’s exactly what you need after pop rsi; pop r15; ret to keep the stack aligned.

Training bias. The model reflects your labeling. If you mostly label pop X; ret as useful and everything else as not-useful, it won’t recognize less common but valuable patterns like xchg pivots or add rsp, N; ret for stack adjustment.

Binary-specific gadgets. Gadgets from one binary may not generalize perfectly to another — especially if the compiler, optimization level, or architecture differs. Retrain or fine-tune on the specific binary you’re exploiting for best results.

Not a replacement for understanding. The classifier accelerates search. It doesn’t replace knowing why a gadget works or how to chain it. Use it as a filter, not an oracle.

Note

Where this is heading

The natural next step is a system that takes a goal (“set rdi=0x41414141, rsi=0, call execve”) and automatically assembles a chain from ranked gadgets. That’s a constraint-satisfaction problem on top of this classifier — a topic for a future tutorial.