Jailbreak Sandbox

Test jailbreak techniques against progressively hardened defenses with simulated model responses and scoring.

Educational security demonstration

This tool simulates jailbreak outcomes without making real LLM API calls. All responses are pre-computed to demonstrate how different defense configurations affect attack success rates. No prompts are sent to any model.

What this is

An interactive sandbox for testing common LLM jailbreak techniques against layered defense configurations. Pre-computed simulated outcomes show how each defense mechanism reduces attack effectiveness.

What you'll learn

Common jailbreak technique categories
How defense layers interact and complement each other
Why defense-in-depth matters for LLM safety

Technique Library

Select a jailbreak technique to test against the current defense configuration.

Defense Configuration

Enable defenses to see how they reduce jailbreak effectiveness.

All defenses off

Instruction Hierarchy

Prioritize system prompt over user input

System prompt priority

Effectiveness: High vs direct attacks

Input Sanitization

Detect and neutralize encoded payloads

Preprocessing filter

Effectiveness: High vs encoding

Output Filtering

Scan responses for policy violations

Post-generation check

Effectiveness: Broad but reactive

Constitutional AI

Self-critique and revision loop

RLAIF + self-critique

Effectiveness: Broad but imperfect
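
The four defenses above sit at different points around the model call. The sketch below is one way to picture that layering, not the sandbox's actual implementation. The `DefenseConfig` class, the `pipeline_stages` function, and the stage ordering are all assumptions made for illustration. In particular, treating Constitutional AI as a runtime stage is a simplification of the "self-critique and revision loop" described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefenseConfig:
    """Hypothetical toggle set mirroring the four defenses listed above."""
    instruction_hierarchy: bool = False
    input_sanitization: bool = False
    output_filtering: bool = False
    constitutional_ai: bool = False


def pipeline_stages(config: DefenseConfig) -> list[str]:
    """Return the ordered stages a prompt would pass through (assumed order)."""
    stages: list[str] = []
    # Pre-generation layers act on the prompt before the model sees it.
    if config.input_sanitization:
        stages.append("input_sanitization (preprocessing filter)")
    if config.instruction_hierarchy:
        stages.append("instruction_hierarchy (system prompt priority)")
    # The sandbox never calls a real model; this stage stands in for one.
    stages.append("simulated_model_response")
    # Post-generation layers act on the draft response.
    if config.constitutional_ai:
        stages.append("constitutional_ai (self-critique and revision)")
    if config.output_filtering:
        stages.append("output_filtering (post-generation check)")
    return stages


if __name__ == "__main__":
    for stage in pipeline_stages(DefenseConfig(input_sanitization=True,
                                               output_filtering=True)):
        print(stage)
```

With all defenses off, only the simulated model stage remains, which corresponds to the "All defenses off" state.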

Test Arena

Run the selected technique against the current defense configuration.

Select a technique above, then click Run Test.

Effectiveness Matrix

Overview of technique success rates across defense configurations. Scores represent jailbreak success likelihood (lower is safer).

Understanding Jailbreak Techniques

Persona Attacks

The attacker asks the model to role-play as an unrestricted character (e.g., "DAN" or a fictional villain). The persona framing recasts the request as fiction or role-play, which can pull the model's behavior away from what its safety training reinforces and make it more likely to comply with restricted requests.

Encoding Bypasses

Harmful instructions are hidden in Base64, ROT13, hex, or other encodings. If the model decodes the payload internally, safety filters that inspect only the raw text may miss the intent. Input sanitization defenses target this by decoding the input before filtering it.
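
A minimal sketch of the decode-before-filter idea, assuming a toy phrase blocklist. `BLOCKED_PHRASES`, `candidate_decodings`, and `is_blocked` are hypothetical names introduced here, and the single blocked phrase is a harmless placeholder. A real sanitizer would need far more than substring matching.

```python
import base64
import binascii
import codecs
import re

# Hypothetical blocklist for illustration only.
BLOCKED_PHRASES = ("ignore previous instructions",)


def candidate_decodings(text: str) -> list[str]:
    """Return the raw text plus any plausible decodings of it."""
    # ROT13 is always applicable to text, so it is always tried.
    candidates = [text, codecs.decode(text, "rot13")]
    compact = "".join(text.split())

    # Base64: attempt only when the input has a valid Base64 shape.
    if re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", compact) and len(compact) % 4 == 0:
        try:
            decoded = base64.b64decode(compact, validate=True)
            candidates.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            pass  # Not Base64-encoded text after all.

    # Hex: attempt only on an even-length run of hex digits.
    if re.fullmatch(r"(?:[0-9a-fA-F]{2})+", compact):
        try:
            candidates.append(bytes.fromhex(compact).decode("utf-8"))
        except UnicodeDecodeError:
            pass  # Hex digits that do not decode to text.

    return candidates


def is_blocked(text: str) -> bool:
    """Apply the blocklist to the raw text and to every decoding of it."""
    return any(
        phrase in candidate.lower()
        for candidate in candidate_decodings(text)
        for phrase in BLOCKED_PHRASES
    )


if __name__ == "__main__":
    payload = "Ignore previous instructions"
    encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    print(is_blocked(encoded))
```

The key design choice is that the filter runs on every decoded candidate, not only on the text as submitted, so a filter that would catch the plain payload also catches its encoded forms.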

Multi-Turn Escalation

Rather than relying on a single prompt, the attacker gradually shifts context across several messages, starting with benign requests and slowly steering the conversation toward restricted territory. Each individual message looks innocuous, making detection harder.
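
The toy example below illustrates why per-message checks can miss gradual escalation. It assumes each message has already been assigned a risk value between 0 and 1 by some upstream classifier. The risk values, both thresholds, and the function names are hypothetical, and the sliding-window sum is just one possible conversation-level check, not something the sandbox is known to use.

```python
from collections.abc import Sequence

PER_MESSAGE_THRESHOLD = 0.6    # hypothetical: flag any single risky message
CONVERSATION_THRESHOLD = 1.5   # hypothetical: flag accumulated risk in a window


def flags_any_single_message(risks: Sequence[float]) -> bool:
    """Per-message check: does any one message cross the threshold?"""
    return any(risk >= PER_MESSAGE_THRESHOLD for risk in risks)


def flags_conversation(risks: Sequence[float], window: int = 4) -> bool:
    """Conversation-level check: does risk summed over a sliding window
    of recent messages cross the conversation threshold?"""
    return any(
        sum(risks[max(0, end - window + 1): end + 1]) >= CONVERSATION_THRESHOLD
        for end in range(len(risks))
    )


if __name__ == "__main__":
    # Each message stays under the per-message threshold, but risk accumulates.
    escalating = [0.1, 0.3, 0.4, 0.5, 0.55]
    print(flags_any_single_message(escalating))
    print(flags_conversation(escalating))
```

In this sketch the escalating conversation passes the per-message check because no single value reaches 0.6, while the windowed sum over the last four messages crosses 1.5.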

Effectiveness Scoring

The score (0–100) estimates how likely the jailbreak is to succeed against the current defense configuration. Each defense reduces the score by a technique-specific amount: encoding bypasses are heavily penalized by input sanitization, while persona attacks are more affected by instruction hierarchy.
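
The scoring rule can be sketched as a base score per technique minus a technique-specific reduction for each enabled defense, clamped to the 0–100 range. Every number below is made up for illustration, as are the names `BASE_SCORE`, `REDUCTION`, and `effectiveness_score`. Only the relative pattern (input sanitization hits encoding hardest, instruction hierarchy hits persona attacks hardest) comes from the description above.

```python
from collections.abc import Iterable

# Hypothetical base success scores with all defenses off.
BASE_SCORE = {"persona": 80, "encoding": 75, "multi_turn": 70}

# Hypothetical per-defense, per-technique reductions.
REDUCTION = {
    "instruction_hierarchy": {"persona": 40, "encoding": 10, "multi_turn": 20},
    "input_sanitization":    {"persona": 5,  "encoding": 50, "multi_turn": 5},
    "output_filtering":      {"persona": 15, "encoding": 15, "multi_turn": 15},
    "constitutional_ai":     {"persona": 15, "encoding": 10, "multi_turn": 20},
}


def effectiveness_score(technique: str, enabled_defenses: Iterable[str]) -> int:
    """Estimate jailbreak success (0-100, lower is safer) for one technique."""
    score = BASE_SCORE[technique]
    # A set prevents the same defense from being counted twice.
    for defense in set(enabled_defenses):
        score -= REDUCTION[defense][technique]
    # Clamp so stacked defenses cannot push the score out of range.
    return max(0, min(100, score))


if __name__ == "__main__":
    print(effectiveness_score("encoding", []))
    print(effectiveness_score("encoding", ["input_sanitization"]))
    print(effectiveness_score("persona", ["input_sanitization"]))
```

With these illustrative numbers, enabling input sanitization drops the encoding score from 75 to 25 but leaves the persona score at 75, which is the asymmetry the scoring description points to.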

Keyboard shortcuts

1-6 Select technique
Enter Run test
R Reset

Security model (30 seconds)

This tool runs entirely in your browser. No prompts are sent to any AI model; all responses are pre-computed and embedded in the page. No data is sent to any server. This is a pure simulation for educational purposes.