If you’ve built ROP chains, you know the tedious part. You dump thousands of gadgets from a binary, then spend the next hour scrolling through them, mentally filtering out the junk — gadgets that touch too many registers, clobber your setup, or end with something other than ret. The thinking part of ROP chaining (what do I need the chain to do?) is usually fast. The searching part takes all the time.
What if a model could score every gadget for you? Not a string match — an actual classifier trained on the features that make a gadget useful: which registers it controls, whether it has side effects, how many instructions it runs, and whether it clobbers state you care about.
This tutorial trains an XGBoost classifier to do exactly that. You’ll extract structural features from gadgets, label a training set, train the model, and use it to rank gadgets by usefulness. The result is a scorer you can plug into your existing pwntools workflow.
What makes a gadget useful
Before writing any code, you need a definition of “useful” that a model can learn. From experience building ROP chains, a good gadget generally has these properties:
- Controls specific registers — pop rdi; ret sets an argument register with no side effects
- Minimal instructions — fewer instructions mean fewer side effects and less stack consumption
- Clean termination — ends with ret, not jmp or call (which complicate chaining)
- No memory writes — gadgets that write to memory can corrupt state unpredictably
- No conditional jumps — branches make the gadget unreliable across inputs
And the inverse — a bad gadget — tends to have many instructions, clobbers multiple registers you didn’t ask for, performs memory operations, or doesn’t end cleanly.
| Gadget | Verdict | Why |
|---|---|---|
| pop rdi; ret | Useful | Sets argument register, no side effects |
| pop rsi; pop r15; ret | Useful | Sets rsi, r15 is an acceptable cost |
| mov rax, [rbp-0x8]; leave; ret | Marginal | Memory read + leave clobbers rbp and rsp |
| add [rbp-0x3d], ebx; nop; ret | Bad | Memory write, depends on rbp state |
| xor eax, eax; mov [rdi], rax; ret | Bad | Memory write to arbitrary address |
This isn’t a binary classification in the real world — usefulness depends on context. A leave; ret gadget is terrible for most chains but essential for stack pivots. We’ll handle this by training a general-purpose classifier and adding context-specific features later.
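These criteria are mechanical enough to check with plain string rules, which makes a useful mental baseline before training anything. A minimal sketch (illustrative only, not part of the tutorial's pipeline):

```python
def naive_verdict(asm: str) -> str:
    """Crude rule-based verdict mirroring the checklist above."""
    parts = [p.strip() for p in asm.lower().split(';') if p.strip()]
    mnemonics = [p.split()[0] for p in parts]

    def writes_memory(part, mnemonic):
        # Memory write = bracketed destination operand of a mov/ALU instruction
        if mnemonic not in ('mov', 'add', 'sub', 'and', 'or', 'xor'):
            return False
        operands = part.split(None, 1)[1] if ' ' in part else ''
        return '[' in operands.split(',')[0]

    if not mnemonics or mnemonics[-1] != 'ret':
        return 'bad'        # unclean termination
    if any(writes_memory(p, m) for p, m in zip(parts, mnemonics)):
        return 'bad'        # memory write
    if all(m in ('pop', 'ret', 'nop') for m in mnemonics):
        return 'useful'     # pure pop/ret
    return 'marginal'

print(naive_verdict('pop rdi; ret'))                   # useful
print(naive_verdict('add [rbp-0x3d], ebx; nop; ret'))  # bad
print(naive_verdict('mov rax, [rbp-0x8]; leave; ret')) # marginal
```

Rules like these catch the obvious cases; the rest of the tutorial exists because the marginal middle is where rules break down.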
Setting up the environment
python -m venv venv && source venv/bin/activate
pip install xgboost scikit-learn pandas numpy
You’ll also need ropper or ROPgadget to extract raw gadgets from binaries.
pip install ropper
Collect a few binaries to use as gadget sources. Larger binaries yield more gadgets and better training data.
mkdir -p rop-classifier/{data,models}
# Copy some common large binaries
cp /usr/lib/x86_64-linux-gnu/libc.so.6 rop-classifier/data/libc.so.6
cp /usr/bin/python3 rop-classifier/data/python3
cp /usr/lib/x86_64-linux-gnu/libcrypto.so* rop-classifier/data/
Note
The specific binaries don’t matter much. What matters is volume and variety — you want thousands of gadgets with diverse instruction patterns. libc alone typically produces 10,000+ gadgets.
Extracting gadgets
Create rop-classifier/extract.py. This uses ropper to dump gadgets, then parses each one into structured instruction data.
import argparse
import os
import subprocess
import json
def extract_gadgets_ropper(binary_path, max_depth=6):
"""Extract gadgets using ropper and return as list of (address, asm_string)."""
result = subprocess.run(
['ropper', '--file', binary_path, '--nocolor', f'--depth={max_depth}'],
capture_output=True, text=True, timeout=120,
)
gadgets = []
for line in result.stdout.splitlines():
line = line.strip()
if not line or not line.startswith('0x'):
continue
# Format: "0x00001234: pop rdi; ret;"
parts = line.split(':', 1)
if len(parts) != 2:
continue
addr = parts[0].strip()
asm = parts[1].strip().rstrip(';').strip()
gadgets.append((addr, asm))
return gadgets
def disassemble_gadget(asm_string):
"""Parse a ropper asm string into a structured instruction list."""
instructions = []
parts = [p.strip() for p in asm_string.split(';') if p.strip()]
for part in parts:
tokens = part.split(None, 1)
mnemonic = tokens[0] if tokens else ''
operands = tokens[1] if len(tokens) > 1 else ''
instructions.append({
'mnemonic': mnemonic,
'operands': operands,
'full': part,
})
return instructions
def extract_all(binary_path):
"""Extract and disassemble all gadgets from a binary."""
raw_gadgets = extract_gadgets_ropper(binary_path)
print(f' Extracted {len(raw_gadgets)} raw gadgets from {binary_path}')
results = []
for addr, asm in raw_gadgets:
instructions = disassemble_gadget(asm)
results.append({
'address': addr,
'asm': asm,
'instructions': instructions,
'source': binary_path,
})
return results
def save_gadgets(gadgets, out_path='data/gadgets_raw.json', append=False):
"""Save gadgets to JSON, optionally appending to an existing dataset."""
all_gadgets = []
if append and os.path.exists(out_path):
with open(out_path) as f:
all_gadgets = json.load(f)
all_gadgets.extend(gadgets)
with open(out_path, 'w') as f:
json.dump(all_gadgets, f, indent=2)
return len(all_gadgets)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('binary')
parser.add_argument('--out', default='data/gadgets_raw.json')
parser.add_argument('--append', action='store_true')
args = parser.parse_args()
gadgets = extract_all(args.binary)
total = save_gadgets(gadgets, out_path=args.out, append=args.append)
mode = 'appended to' if args.append else 'saved to'
    print(f'{len(gadgets)} gadgets {mode} {args.out} ({total} total)')
cd rop-classifier
python extract.py data/libc.so.6 --out data/gadgets_raw.json
python extract.py data/python3 --out data/gadgets_raw.json --append
 Extracted 14832 raw gadgets from data/libc.so.6
14832 gadgets saved to data/gadgets_raw.json (14832 total)
Extracted 9621 raw gadgets from data/python3
9621 gadgets appended to data/gadgets_raw.json (24453 total)
Feature engineering
This is where exploit development knowledge becomes ML training signal. Create rop-classifier/features.py.
import numpy as np
# x64 System V argument registers (in calling convention order)
ARG_REGISTERS = {'rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9'}
# Registers commonly needed in ROP chains
CONTROL_REGISTERS = {'rax', 'rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9', 'rsp', 'rbp'}
ALL_GP_REGISTERS = CONTROL_REGISTERS | {'rbx', 'r10', 'r11', 'r12', 'r13', 'r14', 'r15'}
# Mnemonics that write to memory
MEMORY_WRITE_OPS = {'mov', 'add', 'sub', 'xor', 'or', 'and', 'stos', 'push'}
# Mnemonics that read from memory
MEMORY_READ_OPS = {'mov', 'lea', 'lods', 'pop', 'cmp'}
def get_registers_in_operand(operand):
"""Extract register names from an operand string."""
regs = set()
text = operand.lower()
for ch in '[]()+-*,:':
text = text.replace(ch, ' ')
tokens = set(text.split())
for reg in ALL_GP_REGISTERS:
if reg in tokens:
regs.add(reg)
        # Also catch 32-bit variants (e.g., edi -> rdi, r8d -> r8)
        if reg.startswith('r') and f'e{reg[1:]}' in tokens:
            regs.add(reg)
        if f'{reg}d' in tokens:
            regs.add(reg)
return regs
def has_memory_operand(operand):
"""Check if operand is a memory reference (contains brackets)."""
return '[' in operand
def extract_features(gadget):
"""Convert a gadget dict into a feature dict."""
instructions = gadget['instructions']
n_instr = len(instructions)
mnemonics = [i['mnemonic'].lower() for i in instructions]
# --- Termination features ---
last_mnemonic = mnemonics[-1] if mnemonics else ''
ends_with_ret = last_mnemonic == 'ret'
ends_with_call = last_mnemonic == 'call'
ends_with_jmp = last_mnemonic in ('jmp', 'je', 'jne', 'jz', 'jnz', 'ja', 'jb', 'jge', 'jle')
# --- Register analysis ---
# Registers SET by pop instructions (directly controlled from stack)
pop_targets = set()
for i in instructions:
if i['mnemonic'].lower() == 'pop':
pop_targets |= get_registers_in_operand(i['operands'])
# All registers written to (destination of mov, xor, add, pop, lea, etc.)
written_regs = set()
read_regs = set()
for i in instructions:
ops = i['operands']
if ',' in ops:
dest, src = ops.split(',', 1)
if not has_memory_operand(dest):
written_regs |= get_registers_in_operand(dest)
read_regs |= get_registers_in_operand(src)
elif i['mnemonic'].lower() == 'pop':
written_regs |= get_registers_in_operand(ops)
elif i['mnemonic'].lower() == 'push':
read_regs |= get_registers_in_operand(ops)
# --- Memory access features ---
n_mem_writes = 0
n_mem_reads = 0
for i in instructions:
ops = i['operands']
mn = i['mnemonic'].lower()
if ',' in ops:
dest, src = ops.split(',', 1)
if has_memory_operand(dest) and mn in MEMORY_WRITE_OPS:
n_mem_writes += 1
if has_memory_operand(src) and mn in MEMORY_READ_OPS:
n_mem_reads += 1
# --- Control flow features ---
has_conditional_jump = any(m in (
'je', 'jne', 'jz', 'jnz', 'ja', 'jb', 'jae', 'jbe', 'jge', 'jle', 'jg', 'jl'
) for m in mnemonics)
has_call = 'call' in mnemonics[:-1] # call in middle (not as terminator)
# --- Composite features ---
arg_regs_controlled = len(pop_targets & ARG_REGISTERS)
total_regs_controlled = len(pop_targets & CONTROL_REGISTERS)
clobber_count = len(written_regs - pop_targets) # Registers written but not via pop
stack_slots = mnemonics.count('pop') # How many stack values consumed
# "Purity" score: does the gadget do only pops + ret?
non_pop_ret = [m for m in mnemonics if m not in ('pop', 'ret', 'nop')]
is_pure_pop_ret = len(non_pop_ret) == 0 and ends_with_ret
features = {
# Structure
'n_instructions': n_instr,
'ends_with_ret': int(ends_with_ret),
'ends_with_call': int(ends_with_call),
'ends_with_jmp': int(ends_with_jmp),
'has_conditional_jump': int(has_conditional_jump),
'has_interior_call': int(has_call),
# Register control
'arg_regs_controlled': arg_regs_controlled,
'total_regs_controlled': total_regs_controlled,
'pop_count': stack_slots,
'clobber_count': clobber_count,
'is_pure_pop_ret': int(is_pure_pop_ret),
# Memory
'n_mem_writes': n_mem_writes,
'n_mem_reads': n_mem_reads,
# Ratios
'control_ratio': total_regs_controlled / max(n_instr, 1),
'clobber_ratio': clobber_count / max(n_instr, 1),
'nop_ratio': mnemonics.count('nop') / max(n_instr, 1),
# Specific useful patterns
'has_xor_self': int(any(
i['mnemonic'].lower() == 'xor' and
len(set(i['operands'].replace(' ', '').split(','))) == 1
for i in instructions
)),
'has_leave': int('leave' in mnemonics),
'has_syscall': int('syscall' in mnemonics),
'has_int80': int(any(i['full'].strip() == 'int 0x80' for i in instructions)),
}
return features
def gadgets_to_matrix(gadgets):
"""Convert gadget list to feature matrix."""
feature_dicts = [extract_features(g) for g in gadgets]
feature_names = sorted(feature_dicts[0].keys())
matrix = np.array([[fd[name] for name in feature_names] for fd in feature_dicts])
return matrix, feature_names
Feature intuition
The features encode what an experienced exploit developer looks for instinctively.
| Feature | High value means | Useful when high? |
|---|---|---|
| arg_regs_controlled | Gadget pops into rdi, rsi, rdx, etc. | Yes — argument setup |
| is_pure_pop_ret | Only pops and ret, nothing else | Yes — minimal side effects |
| n_mem_writes | Gadget writes to memory | No — unpredictable corruption |
| clobber_count | Registers written as side effects | No — destroys chain state |
| has_conditional_jump | Execution path depends on flags | No — unreliable |
| has_syscall | Contains syscall instruction | Context-dependent |
| has_xor_self | Pattern like xor eax, eax | Yes — register zeroing |
| control_ratio | Pops per instruction | Yes — efficient control |
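As a sanity check on this table, here is a condensed re-derivation of three of these features for two gadgets from the earlier verdict table. This is simplified logic for intuition, not the features.py implementation:

```python
def quick_features(asm: str) -> dict:
    """Condensed versions of three features from the table (illustrative only)."""
    parts = [p.strip() for p in asm.lower().split(';') if p.strip()]
    mnemonics = [p.split()[0] for p in parts]
    pop_targets = {p.split(None, 1)[1] for p in parts if p.startswith('pop ')}
    arg_regs = {'rdi', 'rsi', 'rdx', 'rcx', 'r8', 'r9'}
    return {
        'arg_regs_controlled': len(pop_targets & arg_regs),
        'is_pure_pop_ret': int(all(m in ('pop', 'ret', 'nop') for m in mnemonics)
                               and mnemonics[-1] == 'ret'),
        # A write = bracketed destination operand of a mov/ALU instruction
        'n_mem_writes': sum(1 for p, m in zip(parts, mnemonics)
                            if m in ('mov', 'add', 'sub', 'xor', 'or', 'and')
                            and '[' in p.split(None, 1)[-1].split(',')[0]),
    }

print(quick_features('pop rdi; ret'))
# {'arg_regs_controlled': 1, 'is_pure_pop_ret': 1, 'n_mem_writes': 0}
print(quick_features('add [rbp - 0x3d], ebx; nop; ret'))
# {'arg_regs_controlled': 0, 'is_pure_pop_ret': 0, 'n_mem_writes': 1}
```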
Tip
Why not just filter on these rules? You could write a rule-based ranker using these features directly, and it would work. The advantage of a trained model is that it learns the interactions between features — for example, a leave; ret gadget with 2 instructions is useful (stack pivot), but leave in the middle of a 6-instruction gadget with memory writes is not. XGBoost captures these nonlinear relationships automatically.
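The leave; ret case is a genuine feature interaction, which is easy to verify on a toy dataset: neither has_leave nor n_instructions separates the labels on its own, but one nested split does. The four cases below are assumed toy data, not real training examples:

```python
# (has_leave, n_instructions) -> label, following the tip's reasoning
cases = {
    (1, 2): 1,  # leave; ret -- the stack pivot
    (1, 5): 0,  # leave buried mid-gadget clobbers rbp/rsp as a side effect
    (0, 2): 1,  # pop rdi; ret
    (0, 5): 1,  # pop rdi; pop rsi; pop rdx; pop rcx; ret -- long but clean
}

def single_feature_separable(cases, idx):
    """Can any threshold on feature `idx` alone split useful from not useful?"""
    useful = {k[idx] for k, v in cases.items() if v == 1}
    useless = {k[idx] for k, v in cases.items() if v == 0}
    return max(useless) < min(useful) or max(useful) < min(useless)

print(single_feature_separable(cases, 0))  # False: has_leave alone fails
print(single_feature_separable(cases, 1))  # False: n_instructions alone fails

# The interaction rule -- what a depth-2 tree split learns -- gets all four right:
def tree_like(has_leave, n_instr):
    return 0 if (has_leave and n_instr > 2) else 1

print(all(tree_like(*k) == v for k, v in cases.items()))  # True
```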
Labeling the training data
This is the manual part. You need labeled examples of useful and not-useful gadgets. Create rop-classifier/label.py — an interactive labeler that presents gadgets and records your judgment.
import json
import random
import os
def load_gadgets(path='data/gadgets_raw.json'):
with open(path) as f:
return json.load(f)
def gadget_id(gadget):
return f"{gadget.get('source', '')}:{gadget.get('address', '')}:{gadget['asm']}"
def load_labels(path='data/labels.json'):
if os.path.exists(path):
with open(path) as f:
return json.load(f)
return {}
def save_labels(labels, path='data/labels.json'):
with open(path, 'w') as f:
json.dump(labels, f, indent=2)
def auto_label(gadget):
"""Apply heuristic labels for obvious cases. Returns label or None."""
instructions = gadget['instructions']
mnemonics = [i['mnemonic'].lower() for i in instructions]
# Obvious positives: pure pop-ret gadgets controlling argument registers
non_pop_ret = [m for m in mnemonics if m not in ('pop', 'ret', 'nop')]
if not non_pop_ret and mnemonics[-1] == 'ret' and len(mnemonics) <= 4:
for i in instructions:
if i['mnemonic'].lower() == 'pop':
reg = i['operands'].strip().lower()
if reg in ('rdi', 'rsi', 'rdx', 'rcx', 'rax', 'r8', 'r9'):
return 1 # useful
# Obvious positives: xor reg, reg; ret (register zeroing)
if len(mnemonics) == 2 and mnemonics[0] == 'xor' and mnemonics[1] == 'ret':
ops = instructions[0]['operands'].replace(' ', '').split(',')
if len(ops) == 2 and ops[0] == ops[1]:
return 1
# Obvious negatives: ends with conditional jump
if mnemonics[-1] in ('je', 'jne', 'jz', 'jnz', 'ja', 'jb'):
return 0
# Obvious negatives: too many instructions with memory writes
mem_writes = sum(1 for i in instructions if '[' in i.get('operands', '').split(',')[0])
if mem_writes >= 2 and len(mnemonics) > 4:
return 0
return None # needs manual review
def label_interactive(gadgets, existing_labels, n=500):
"""Interactively label gadgets, using auto-labels where possible."""
unlabeled = [g for g in gadgets if gadget_id(g) not in existing_labels]
random.shuffle(unlabeled)
labels = dict(existing_labels)
auto_count = 0
manual_count = 0
for gadget in unlabeled:
if len(labels) - len(existing_labels) >= n:
break
auto = auto_label(gadget)
if auto is not None:
labels[gadget_id(gadget)] = auto
auto_count += 1
continue
# Manual labeling
print(f"\n {gadget['asm']}")
print(f" Instructions: {len(gadget['instructions'])}")
while True:
choice = input(" [1] useful [0] not useful [s] skip [q] quit: ").strip()
if choice in ('1', '0', 's', 'q'):
break
if choice == 'q':
break
if choice == 's':
continue
labels[gadget_id(gadget)] = int(choice)
manual_count += 1
if manual_count % 25 == 0:
save_labels(labels)
print(f' ... saved ({len(labels)} total, {auto_count} auto, {manual_count} manual)')
save_labels(labels)
print(f'\nDone. {auto_count} auto-labeled, {manual_count} manually labeled.')
print(f'Total labels: {len(labels)}')
return labels
if __name__ == '__main__':
gadgets = load_gadgets()
existing = load_labels()
    label_interactive(gadgets, existing, n=1000)
python label.py
 pop rsi; pop r15; ret
Instructions: 3
[1] useful [0] not useful [s] skip [q] quit: 1
add dword ptr [rbp - 0x3d], ebx; nop; ret
Instructions: 3
[1] useful [0] not useful [s] skip [q] quit: 0
 ... saved (250 total, 187 auto, 63 manual)
The auto-labeler handles the clear-cut cases (pure pop rdi; ret = useful, ends with conditional jump = not useful), which dramatically reduces manual effort. You only need to judge the ambiguous gadgets.
Note
How many labels do you need? XGBoost can work with surprisingly few labels. 300-500 labeled gadgets (after auto-labeling) is enough to train a useful model. 1,000+ produces noticeably better results on edge cases. You don’t need to label all 14,000 gadgets.
Training the classifier
Create rop-classifier/train.py.
import json
import pickle
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from features import gadgets_to_matrix
def gadget_id(g):
return f"{g.get('source', '')}:{g.get('address', '')}:{g['asm']}"
def load_labeled_data():
with open('data/gadgets_raw.json') as f:
gadgets = json.load(f)
with open('data/labels.json') as f:
labels = json.load(f)
labeled = []
y = []
for g in gadgets:
gid = gadget_id(g)
if gid in labels:
labeled.append(g)
y.append(labels[gid])
return labeled, np.array(y)
def train():
gadgets, y = load_labeled_data()
print(f'Labeled data: {len(gadgets)} gadgets ({sum(y)} useful, {len(y) - sum(y)} not useful)')
X, feature_names = gadgets_to_matrix(gadgets)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y,
)
model = XGBClassifier(
n_estimators=300,
max_depth=6,
learning_rate=0.1,
subsample=0.8,
colsample_bytree=0.8,
scale_pos_weight=sum(y == 0) / max(sum(y == 1), 1), # handle class imbalance
eval_metric='logloss',
random_state=42,
)
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print('\n--- Test Set Performance ---')
print(classification_report(y_test, y_pred, target_names=['not useful', 'useful']))
cm = confusion_matrix(y_test, y_pred)
print(f'Confusion matrix:\n{cm}')
# Cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f'\n5-fold CV F1: {scores.mean():.3f} (+/- {scores.std():.3f})')
# Feature importance
print('\n--- Top 10 Features ---')
importance = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importance.items(), key=lambda x: -x[1])[:10]:
print(f' {name:30s} {score:.4f}')
# Save
artifact = {
'model': model,
'feature_names': feature_names,
}
with open('models/gadget_classifier.pkl', 'wb') as f:
pickle.dump(artifact, f)
print('\nModel saved to models/gadget_classifier.pkl')
if __name__ == '__main__':
    train()
python train.py
Labeled data: 847 gadgets (234 useful, 613 not useful)
--- Test Set Performance ---
precision recall f1-score support
not useful 0.94 0.96 0.95 123
useful 0.89 0.85 0.87 47
accuracy 0.93 170
macro avg 0.92 0.90 0.91 170
weighted avg 0.93 0.93 0.93 170
Confusion matrix:
[[118 5]
[ 7 40]]
5-fold CV F1: 0.871 (+/- 0.034)
--- Top 10 Features ---
is_pure_pop_ret 0.1823
arg_regs_controlled 0.1547
n_mem_writes 0.1201
clobber_count 0.0983
ends_with_ret 0.0871
control_ratio 0.0764
n_instructions 0.0612
has_conditional_jump 0.0498
clobber_ratio 0.0387
has_xor_self 0.0291
The feature importance ranking confirms what exploit developers know intuitively: is_pure_pop_ret and arg_regs_controlled dominate. But the model also learns subtler patterns — the interaction between instruction count and memory writes, for example — that would be hard to encode as rules.
Tip
Precision vs recall tradeoff For a gadget ranker, you generally want high recall (don’t miss useful gadgets) at the cost of some precision (include a few false positives). Adjust
scale_pos_weight or use predict_proba with a lower threshold to tune this. To build intuition for how moving a decision threshold reshapes the confusion matrix and ROC curve, try the Classifier Threshold Lab.
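To make the threshold knob concrete, here is how lowering the cutoff trades false positives for recall. The probability scores are hypothetical stand-ins for predict_proba(X)[:, 1] output:

```python
import numpy as np

# Hypothetical P(useful) scores and ground-truth labels for five gadgets
probs = np.array([0.92, 0.61, 0.48, 0.35, 0.07])
y_true = np.array([1,    1,    1,    0,    0])

for threshold in (0.5, 0.3):
    preds = (probs >= threshold).astype(int)
    tp = int(((preds == 1) & (y_true == 1)).sum())
    fp = int(((preds == 1) & (y_true == 0)).sum())
    fn = int(((preds == 0) & (y_true == 1)).sum())
    recall = tp / (tp + fn)
    print(f'threshold={threshold}: recall={recall:.2f}, false positives={fp}')
# threshold=0.5: recall=0.67, false positives=0
# threshold=0.3: recall=1.00, false positives=1
```

Dropping the threshold from 0.5 to 0.3 recovers the missed useful gadget at the price of one false positive, which is usually the right trade for a ranker you will eyeball anyway.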
Using the classifier in exploits
The real payoff is using the trained model inside your workflow. Create rop-classifier/rank_gadgets.py — a CLI tool that takes a binary, extracts gadgets, scores them, and prints a ranked list.
import pickle
import sys
import numpy as np
from extract import extract_all
from features import extract_features
def load_model(path='models/gadget_classifier.pkl'):
with open(path, 'rb') as f:
return pickle.load(f)
def rank(binary_path, top_n=50):
artifact = load_model()
model = artifact['model']
feature_names = artifact['feature_names']
gadgets = extract_all(binary_path)
if not gadgets:
print('No gadgets found.')
return
scored = []
for g in gadgets:
feat = extract_features(g)
vector = np.array([[feat[name] for name in feature_names]])
proba = model.predict_proba(vector)[0][1] # P(useful)
scored.append((proba, g))
scored.sort(key=lambda x: -x[0])
print(f'\nTop {top_n} gadgets from {binary_path}:\n')
print(f'{"Score":>6} {"Gadget"}')
print(f'{"-----":>6} {"------"}')
for proba, g in scored[:top_n]:
print(f'{proba:>6.3f} {g["address"]}: {g["asm"]}')
return scored
if __name__ == '__main__':
binary = sys.argv[1] if len(sys.argv) > 1 else 'data/libc.so.6'
top_n = int(sys.argv[2]) if len(sys.argv) > 2 else 50
    rank(binary, top_n)
python rank_gadgets.py /usr/bin/target_binary
Top 50 gadgets from /usr/bin/target_binary:
Score Gadget
----- ------
0.987 0x00401234: pop rdi; ret
0.983 0x00401238: pop rsi; pop r15; ret
0.974 0x0040123c: pop rdx; ret
0.961 0x00401240: xor eax, eax; ret
0.944 0x00401250: pop rcx; pop rbx; ret
0.912 0x00401260: pop rdi; pop rsi; ret
0.887 0x00401270: mov rdi, rax; ret
0.854 0x00401280: pop rbp; ret
0.831 0x00401290: xor esi, esi; ret
...
0.102 0x00403100: add [rbp-0x3d], ebx; nop; ret
0.043 0x00403200: mov [rdi], rax; xor eax, eax; add rsp, 0x18; ret
Integrating with pwntools
Drop the scorer into an exploit script. This example finds the best gadget for setting rdi from a scored list.
from pwn import *
import pickle
import numpy as np
from extract import extract_all
from features import extract_features
def find_best_gadget(binary_path, register='rdi', model_path='models/gadget_classifier.pkl'):
"""Find the highest-scoring gadget that controls a specific register."""
with open(model_path, 'rb') as f:
artifact = pickle.load(f)
model = artifact['model']
feature_names = artifact['feature_names']
gadgets = extract_all(binary_path)
candidates = []
for g in gadgets:
# Only consider gadgets that pop the target register
if f'pop {register}' not in g['asm'].lower():
continue
feat = extract_features(g)
vector = np.array([[feat[name] for name in feature_names]])
proba = model.predict_proba(vector)[0][1]
candidates.append((proba, g))
if not candidates:
return None
candidates.sort(key=lambda x: -x[0])
best = candidates[0]
return int(best[1]['address'], 16)
# Usage in an exploit
elf = ELF('./vulnerable')
pop_rdi = find_best_gadget('./vulnerable', 'rdi')
pop_rsi = find_best_gadget('./vulnerable', 'rsi')
if pop_rdi is None or pop_rsi is None:
raise RuntimeError('Required gadgets were not found')
log.info(f'pop rdi @ {hex(pop_rdi)}')
log.info(f'pop rsi @ {hex(pop_rsi)}')
Evaluating and improving the model
Once the basic model works, there are a few directions to take it.
Error analysis
Look at what the model gets wrong. Export misclassified gadgets and inspect them.
import json
import numpy as np
from features import extract_features
def error_analysis(model, feature_names, gadgets, labels):
def gadget_id(g):
return f"{g.get('source', '')}:{g.get('address', '')}:{g['asm']}"
errors = []
for g in gadgets:
gid = gadget_id(g)
if gid not in labels:
continue
feat = extract_features(g)
vector = np.array([[feat[name] for name in feature_names]])
pred = model.predict(vector)[0]
actual = labels[gid]
if pred != actual:
proba = model.predict_proba(vector)[0][1]
errors.append({
'asm': g['asm'],
'predicted': int(pred),
'actual': actual,
'confidence': float(proba),
})
# Sort by confidence (most confident mistakes first)
errors.sort(key=lambda e: abs(e['confidence'] - 0.5), reverse=True)
for e in errors[:20]:
label = 'FP' if e['predicted'] == 1 else 'FN'
        print(f" [{label}] conf={e['confidence']:.3f} {e['asm']}")
Common patterns in errors:
- False positives: gadgets that look clean (few instructions, ends with ret) but have hidden side effects like writing to [rsp]
- False negatives: context-dependent gadgets (stack pivots, syscall setups) that are useful for specific chains but don’t match the general pattern
Adding more features
If error analysis reveals patterns the model misses, add features for them.
# Example: detect stack pivot potential
features['modifies_rsp'] = int(any(
'rsp' in i['operands'].split(',')[0]
for i in instructions
if ',' in i['operands'] and i['mnemonic'].lower() not in ('push', 'pop', 'ret')
))
# Example: does the gadget end with a clean ret (no offset)?
features['clean_ret'] = int(
instructions[-1]['mnemonic'].lower() == 'ret' and
instructions[-1]['operands'].strip() == ''
)Multi-class labels
Instead of binary useful/not-useful, label gadgets by purpose: argument setup, register zeroing, stack pivot, syscall dispatch, memory write primitive. Train a multi-class model and search by category.
# Multi-class label scheme
CATEGORIES = {
0: 'not_useful',
1: 'arg_setup', # pop rdi; ret
2: 'reg_zero', # xor eax, eax; ret
3: 'stack_pivot', # leave; ret / xchg rax, rsp; ret
4: 'syscall_dispatch', # syscall or int 0x80 gadgets
5: 'write_primitive', # mov [reg], reg; ret
}
This turns the ranker into a categorized search engine — “find me the best stack pivot gadget in this binary.”
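Searching by category then reduces to a column lookup over the model's per-class probabilities. A sketch using hypothetical predict_proba output rows (the real values would come from the trained multi-class model):

```python
import numpy as np

CATEGORIES = ['not_useful', 'arg_setup', 'reg_zero', 'stack_pivot',
              'syscall_dispatch', 'write_primitive']

# Hypothetical predict_proba output: one row of per-category probabilities per gadget
gadget_asms = ['pop rdi; ret', 'leave; ret', 'xor eax, eax; ret']
proba = np.array([
    [0.02, 0.90, 0.03, 0.02, 0.01, 0.02],
    [0.10, 0.05, 0.02, 0.78, 0.02, 0.03],
    [0.03, 0.04, 0.88, 0.02, 0.01, 0.02],
])

def best_in_category(category):
    """Return the highest-probability gadget for one category name."""
    col = CATEGORIES.index(category)
    idx = int(np.argmax(proba[:, col]))
    return gadget_asms[idx], float(proba[idx, col])

print(best_in_category('stack_pivot'))  # ('leave; ret', 0.78)
print(best_in_category('arg_setup'))    # ('pop rdi; ret', 0.9)
```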
Limitations
This model has real boundaries. Be aware of them.
Context blindness. The model scores each gadget independently. It doesn’t know what chain you’re building or which registers are already set. A pop r15; ret gadget scores low in general, but it’s exactly what you need when the chain has to consume one junk stack slot, for example to restore 16-byte stack alignment before a call.
Training bias. The model reflects your labeling. If you mostly label pop X; ret as useful and everything else as not-useful, it won’t recognize less common but valuable patterns like xchg pivots or add rsp, N; ret for stack adjustment.
Binary-specific gadgets. Gadgets from one binary may not generalize perfectly to another — especially if the compiler, optimization level, or architecture differs. Retrain or fine-tune on the specific binary you’re exploiting for best results.
Not a replacement for understanding. The classifier accelerates search. It doesn’t replace knowing why a gadget works or how to chain it. Use it as a filter, not an oracle.
Note
Where this is heading The natural next step is a system that takes a goal (“set rdi=0x41414141, rsi=0, call execve”) and automatically assembles a chain from ranked gadgets. That’s a constraint-satisfaction problem on top of this classifier — a topic for a future tutorial.
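As a hint of the shape of that problem, the innermost step is already expressible as a naive greedy pick over the ranker's output. The scored tuples below are hypothetical, and a real solver would also have to budget junk stack slots for the extra pops and support backtracking:

```python
# Hypothetical (score, address, asm) tuples from the ranker
scored = [
    (0.987, 0x401234, 'pop rdi; ret'),
    (0.983, 0x401238, 'pop rsi; pop r15; ret'),
    (0.974, 0x40123c, 'pop rdx; ret'),
    (0.912, 0x401260, 'pop rdi; pop rsi; ret'),
]

def greedy_pick(scored, registers):
    """For each needed register, take the best-scoring gadget that pops it."""
    plan = {}
    for reg in registers:
        candidates = [(s, addr, asm) for s, addr, asm in scored
                      if f'pop {reg}' in asm]
        if not candidates:
            return None  # a real solver would backtrack or widen the search here
        plan[reg] = max(candidates)  # tuples sort by score first
    return plan

plan = greedy_pick(scored, ['rdi', 'rsi'])
for reg, (score, addr, asm) in plan.items():
    print(f'{reg}: {hex(addr)} {asm} (score {score:.3f})')
```

Greedy selection ignores gadget side effects on already-set registers, which is exactly the constraint-satisfaction part the future tutorial would have to solve.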