Tuxscope Lab 6: Observing Memory with Page Fault Tracing

Trace page faults and OOM kills to understand how Linux implements virtual memory through demand paging.

Prerequisites

  • Completed Tuxscope Labs 1-5
  • Basic understanding of virtual memory
  • A built tuxscope binary (see Lab 1)
  • Linux 5.8+ with root privileges

Part 6 of 7 in Tuxscope: Linux Kernel Observability with eBPF

When a program allocates memory with malloc(), the kernel does not immediately give it physical RAM. Instead, it hands back a virtual address and waits. The first time the program actually touches that address, the CPU raises a page fault, and the kernel steps in to map a physical page. This lazy strategy is called demand paging, and it is one of the most important performance optimizations in the Linux kernel.

In this lab you will attach eBPF programs to the page fault and OOM killer tracepoints. You will see demand paging happen in real time, learn to distinguish minor faults from major faults, and observe what happens when the system runs out of memory entirely.

Note

Prerequisites: This tutorial is part of the Tuxscope series. You need a built tuxscope binary, Linux 5.8+, and root privileges.

Virtual memory and page faults

Every process on Linux operates in its own virtual address space. The kernel maintains a page table that maps virtual addresses to physical page frames. But the mapping is not established when memory is allocated; it is established when memory is first accessed.

The virtual address space

A 64-bit Linux process has a vast address space, but only portions of it are mapped at any time:

Virtual Address Space (per process)
┌─────────────────────────┐  0xFFFFFFFFFFFFFFFF
│                         │
│    Kernel Space         │  (inaccessible from userspace)
│                         │
├─────────────────────────┤  0x00007FFFFFFFFFFF (typical split)
│    Stack                │  grows downward
│         |               │
│         v               │
│                         │
│    (unmapped gap)       │
│                         │
│         ^               │
│         |               │
│    Heap / mmap          │  grows upward
│                         │
├─────────────────────────┤
│    BSS (zero-init data) │
│    Data (initialized)   │
│    Text (code)          │
├─────────────────────────┤
│    (unmapped)           │
└─────────────────────────┘  0x0000000000000000

When your program calls malloc(4096), the C library requests memory from the kernel (via brk() or mmap()). The kernel updates the process’s virtual memory area (VMA) list to record that this range is valid, but it does not allocate a physical page yet. The page table entry for that address remains empty.

What triggers a page fault

The first time the CPU tries to read or write a virtual address that has no page table entry, the hardware raises a page fault exception. The kernel’s page fault handler then:

  1. Checks whether the address belongs to a valid VMA (if not, delivers SIGSEGV)
  2. Allocates a physical page frame
  3. Updates the page table to map virtual address to physical page
  4. Returns to the faulting instruction, which now succeeds

Program               CPU                    Kernel
   |                   |                       |
   |  write to 0x1000  |                       |
   |------------------>|                       |
   |                   |  no mapping!          |
   |                   |  page fault --------->|
   |                   |                       |  check VMA: valid
   |                   |                       |  allocate physical page
   |                   |                       |  update page table
   |                   |<----------------------|
   |                   |  retry write to 0x1000|
   |<------------------|  (now succeeds)       |
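
The four handler steps above can be sketched as a toy model in ordinary Rust. This is an illustration only, not kernel code: the names `Vma`, `AddressSpace`, `FaultResult`, and `access` are hypothetical, the page table is a plain `HashMap`, and physical frames are handed out from a counter.

```rust
use std::collections::HashMap;

const PAGE_SIZE: u64 = 4096;

/// A valid range of virtual addresses, loosely modelled on a kernel VMA.
/// `end` is exclusive.
struct Vma {
    start: u64,
    end: u64,
}

#[derive(Debug, PartialEq)]
enum FaultResult {
    /// The page was already mapped, so no fault occurred.
    AlreadyMapped(u64),
    /// A fault occurred and a new frame was mapped.
    Mapped(u64),
    /// The address is outside every VMA: the process would get SIGSEGV.
    Segv,
}

struct AddressSpace {
    vmas: Vec<Vma>,
    page_table: HashMap<u64, u64>, // virtual page number -> frame number
    next_free_frame: u64,
}

impl AddressSpace {
    fn new(vmas: Vec<Vma>) -> Self {
        AddressSpace { vmas, page_table: HashMap::new(), next_free_frame: 0 }
    }

    fn access(&mut self, addr: u64) -> FaultResult {
        let vpn = addr / PAGE_SIZE;
        if let Some(&frame) = self.page_table.get(&vpn) {
            return FaultResult::AlreadyMapped(frame);
        }
        // Step 1: is the address inside a valid VMA?
        if !self.vmas.iter().any(|v| addr >= v.start && addr < v.end) {
            return FaultResult::Segv;
        }
        // Step 2: allocate a physical frame.
        let frame = self.next_free_frame;
        self.next_free_frame += 1;
        // Step 3: update the page table.
        self.page_table.insert(vpn, frame);
        // Step 4: the caller retries the access, which now succeeds.
        FaultResult::Mapped(frame)
    }
}

fn main() {
    let mut space = AddressSpace::new(vec![Vma { start: 0x1000, end: 0x3000 }]);
    println!("{:?}", space.access(0x1000)); // first touch of the page
    println!("{:?}", space.access(0x1008)); // same page again
    println!("{:?}", space.access(0x9000)); // outside every VMA
}
```

The model ignores permissions, copy-on-write, and swap, but it shows why only the first touch of each page is expensive.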

Minor vs major faults

  • Minor fault: The fault can be resolved without disk I/O. Either the page is already in memory (e.g., a shared library page already loaded by another process), or the kernel only needs to allocate a fresh zero-filled page. Copy-on-write faults after fork() are another common source of minor faults.
  • Major fault: The page must be read from disk (e.g., a memory-mapped file, or a page that was swapped out). These are orders of magnitude slower.

The page_fault_user tracepoint fires for all user-space faults. Most faults on a running system are minor: the kernel allocating pages on demand as programs touch newly allocated memory.
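
Independently of tuxscope, the kernel keeps per-process fault counters that make a useful cross-check. The sketch below reads the minflt and majflt fields (fields 10 and 12) from /proc/self/stat before and after touching a buffer. The helper names are hypothetical, and the example assumes that a large zeroed `Vec` is backed by untouched pages until it is written, which is typical but not guaranteed.

```rust
use std::fs;

const PAGE_SIZE: usize = 4096;

/// Extract (minflt, majflt) from the contents of /proc/<pid>/stat.
fn parse_fault_counts(stat: &str) -> Option<(u64, u64)> {
    // Field 2 (comm) is parenthesised and may contain spaces,
    // so start counting after the last ')'.
    let after_comm = &stat[stat.rfind(')')? + 1..];
    let fields: Vec<&str> = after_comm.split_whitespace().collect();
    // after_comm starts at field 3, so field 10 is index 7 and field 12 is index 9.
    let minflt = fields.get(7)?.parse().ok()?;
    let majflt = fields.get(9)?.parse().ok()?;
    Some((minflt, majflt))
}

fn current_faults() -> Option<(u64, u64)> {
    parse_fault_counts(&fs::read_to_string("/proc/self/stat").ok()?)
}

fn main() {
    let size = 64 * 1024 * 1024;
    let before = current_faults();
    let mut buf = vec![0u8; size];
    let after_alloc = current_faults();
    for i in (0..size).step_by(PAGE_SIZE) {
        // Volatile write so the compiler cannot drop the store.
        unsafe { std::ptr::write_volatile(buf.as_mut_ptr().add(i), 1) };
    }
    let after_touch = current_faults();

    if let (Some(b), Some(a), Some(t)) = (before, after_alloc, after_touch) {
        println!("minor faults during allocation: {}", a.0 - b.0);
        println!("minor faults during touching:   {}", t.0 - a.0);
        println!("major faults overall:           {}", t.1 - b.1);
    } else {
        eprintln!("could not read /proc/self/stat (is this Linux?)");
    }
}
```

On most systems the touching phase should account for nearly all of the minor faults, and the major fault count should stay at or near zero.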

The OOM killer

When the system runs out of memory and swap, the kernel invokes the OOM (Out Of Memory) killer. It selects a victim using the oom_badness() heuristic (exposed as /proc/<pid>/oom_score), sends it SIGKILL, and reclaims its memory. The oom/mark_victim tracepoint fires when a process is selected for termination.
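
To get a feel for which processes the heuristic currently favours as victims, you can rank them by /proc/<pid>/oom_score. The following is a minimal sketch using only the standard library; `parse_score` and `rank_by_oom_score` are hypothetical helper names, and processes that exit mid-scan are silently skipped.

```rust
use std::fs;

/// Parse the single integer stored in /proc/<pid>/oom_score or oom_score_adj.
fn parse_score(text: &str) -> Option<i64> {
    text.trim().parse().ok()
}

/// Return (pid, oom_score) pairs, highest score (most likely victim) first.
fn rank_by_oom_score() -> Vec<(u32, i64)> {
    let mut scores = Vec::new();
    if let Ok(entries) = fs::read_dir("/proc") {
        for entry in entries.flatten() {
            // Only numeric directory names under /proc are processes.
            let name = entry.file_name();
            let pid = match name.to_str().and_then(|s| s.parse::<u32>().ok()) {
                Some(pid) => pid,
                None => continue,
            };
            let path = format!("/proc/{}/oom_score", pid);
            if let Some(score) = fs::read_to_string(path).ok().and_then(|t| parse_score(&t)) {
                scores.push((pid, score));
            }
        }
    }
    scores.sort_by(|a, b| b.1.cmp(&a.1));
    scores
}

fn main() {
    for (pid, score) in rank_by_oom_score().into_iter().take(5) {
        println!("pid={} oom_score={}", pid, score);
    }
}
```

Running this before and after starting a memory-hungry program shows how the ranking shifts as memory consumption grows.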

Warning

The OOM killer is a last resort. By the time it fires, the system is in a degraded state. In production, you should have monitoring that triggers well before OOM conditions. Tuxscope’s OOM tracing is useful for understanding the mechanism, not as a production alerting tool.

The eBPF programs

This lab attaches to two tracepoints:

Tracepoint                    Fires when
exceptions/page_fault_user    A user-space page fault occurs
oom/mark_victim               The OOM killer selects a victim

The event struct

#[repr(C)]
pub struct MemoryEvent {
    pub pid: u32,
    pub event_type: u8, // 0 = page_fault, 1 = oom_kill
    pub _padding: [u8; 3],
    pub address: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

The address field holds the faulting virtual address for page faults, or zero for OOM events. Addresses are displayed in hexadecimal, which makes it easier to see which region of the virtual address space is being faulted in; stack addresses look different from heap addresses.
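
On the userspace side, each ring buffer record arrives as raw bytes that must be decoded back into this layout. The sketch below is one way to do that without unsafe code; `decode_event`, `comm_str`, and `format_event` are hypothetical names, and the text format is only modelled on the sample output shown later in this lab, not taken from the tuxscope source.

```rust
use std::convert::TryInto;

#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MemoryEvent {
    pub pid: u32,
    pub event_type: u8, // 0 = page_fault, 1 = oom_kill
    pub _padding: [u8; 3],
    pub address: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

// 4 + 1 + 3 + 8 + 8 + 16 bytes, with no implicit padding.
const EVENT_SIZE: usize = 40;

/// Decode one record, reading each field at its #[repr(C)] offset.
fn decode_event(buf: &[u8]) -> Option<MemoryEvent> {
    if buf.len() < EVENT_SIZE {
        return None;
    }
    Some(MemoryEvent {
        pid: u32::from_ne_bytes(buf[0..4].try_into().ok()?),
        event_type: buf[4],
        _padding: [0; 3],
        address: u64::from_ne_bytes(buf[8..16].try_into().ok()?),
        timestamp_ns: u64::from_ne_bytes(buf[16..24].try_into().ok()?),
        comm: buf[24..40].try_into().ok()?,
    })
}

/// Convert the NUL-padded comm array into a printable string.
fn comm_str(comm: &[u8; 16]) -> String {
    let end = comm.iter().position(|&b| b == 0).unwrap_or(comm.len());
    String::from_utf8_lossy(&comm[..end]).into_owned()
}

fn format_event(ev: &MemoryEvent) -> String {
    match ev.event_type {
        0 => format!("[PAGE_FAULT]  pid={} address={:#x} comm={}",
                     ev.pid, ev.address, comm_str(&ev.comm)),
        1 => format!("[OOM_KILL]  pid={} comm={}", ev.pid, comm_str(&ev.comm)),
        other => format!("[UNKNOWN type={}]  pid={}", other, ev.pid),
    }
}

fn main() {
    // Build a fake record the way the kernel side would lay it out.
    let mut raw = [0u8; EVENT_SIZE];
    raw[0..4].copy_from_slice(&5200u32.to_ne_bytes());
    raw[8..16].copy_from_slice(&0x7f3a4c000000u64.to_ne_bytes());
    raw[24..38].copy_from_slice(b"pagefault_demo");
    let ev = decode_event(&raw).expect("buffer holds a full record");
    println!("{}", format_event(&ev));
}
```

The explicit `_padding` field is what keeps the offsets predictable: without it the compiler would still insert three bytes before `address`, but their contents would be unspecified.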

The map declaration matches the pattern established in Lab 4: a RingBuf shared between the page-fault and OOM handlers.

#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);

The page fault handler

#[tracepoint]
pub fn handle_page_fault(ctx: TracePointContext) -> u32 {
    match try_page_fault(&ctx) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

fn try_page_fault(ctx: &TracePointContext) -> Result<(), i64> {
    let tgid = (bpf_get_current_pid_tgid() >> 32) as u32;

    // Read the faulting virtual address from the tracepoint context.
    // Offset 8 is where the address field lives in the
    // exceptions/page_fault_user format on x86_64.
    let address: u64 = unsafe { ctx.read_at(8)? };

    let event = MemoryEvent {
        pid: tgid,
        event_type: 0, // page_fault
        _padding: [0; 3],
        address,
        timestamp_ns: unsafe { bpf_ktime_get_ns() },
        comm: bpf_get_current_comm().map_err(|e| e as i64)?,
    };
    EVENTS.output(&event, 0).map_err(|_| 1i64)?;
    Ok(())
}

The offset 8 comes from the tracepoint format file. Inspect it the same way Lab 5 inspected sched_process_fork:

sudo cat /sys/kernel/debug/tracing/events/exceptions/page_fault_user/format

You should see something close to this on x86_64:

field:unsigned short common_type;     offset:0; size:2; signed:0;
field:unsigned char  common_flags;    offset:2; size:1; signed:0;
field:unsigned char  common_preempt_count; offset:3; size:1; signed:0;
field:int            common_pid;      offset:4; size:4; signed:1;

field:unsigned long  address;         offset:8;  size:8; signed:0;
field:unsigned long  ip;              offset:16; size:8; signed:0;
field:unsigned long  error_code;      offset:24; size:8; signed:0;

The address field at offset:8 is what the BPF code reads. If your kernel reports different offsets, update the literal accordingly.
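
Rather than checking the offset by eye, a loader could parse the format file at startup and warn when it disagrees with the hardcoded literal. The following sketch shows the parsing half under that assumption; `parse_field_line` and `field_offset` are hypothetical helpers, not part of tuxscope or aya.

```rust
/// Parse one "field:<decl>; offset:N; size:N; signed:N;" line
/// into (field name, byte offset).
fn parse_field_line(line: &str) -> Option<(&str, usize)> {
    let rest = line.trim().strip_prefix("field:")?;
    let mut parts = rest.split(';').map(str::trim);
    // The first part is a C declaration such as "unsigned long  address"
    // or "char comm[16]"; the name is its last token, minus any array suffix.
    let decl = parts.next()?;
    let name = decl.split_whitespace().last()?.split('[').next()?;
    let offset = parts
        .find_map(|p| p.strip_prefix("offset:"))?
        .trim()
        .parse()
        .ok()?;
    Some((name, offset))
}

/// Look up the byte offset of `field` in the text of a tracepoint format file.
fn field_offset(format_text: &str, field: &str) -> Option<usize> {
    format_text
        .lines()
        .filter_map(parse_field_line)
        .find(|(name, _)| *name == field)
        .map(|(_, offset)| offset)
}

fn main() {
    let path = "/sys/kernel/debug/tracing/events/exceptions/page_fault_user/format";
    match std::fs::read_to_string(path) {
        Ok(text) => match field_offset(&text, "address") {
            Some(8) => println!("address is at offset 8, matching the BPF code"),
            Some(other) => println!("address is at offset {}: update the literal", other),
            None => println!("no address field found in {}", path),
        },
        Err(e) => eprintln!("could not read {}: {} (root and debugfs required)", path, e),
    }
}
```

The same function works for the oom/mark_victim format file by passing "pid" as the field name.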

The OOM handler

#[tracepoint]
pub fn handle_oom(ctx: TracePointContext) -> u32 {
    match try_oom(&ctx) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

fn try_oom(ctx: &TracePointContext) -> Result<(), i64> {
    // The victim's PID is at offset 8 in the oom/mark_victim format
    let victim_pid: i32 = unsafe { ctx.read_at(8)? };

    let event = MemoryEvent {
        pid: victim_pid as u32,
        event_type: 1, // oom_kill
        _padding: [0; 3],
        address: 0,
        timestamp_ns: unsafe { bpf_ktime_get_ns() },
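        // bpf_get_current_comm() names the task that was running when the
        // tracepoint fired, which is not necessarily the victim in `pid`.
        comm: bpf_get_current_comm().map_err(|e| e as i64)?,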
    };
    EVENTS.output(&event, 0).map_err(|_| 1i64)?;
    Ok(())
}

Verify the offset before trusting the PID:

sudo cat /sys/kernel/debug/tracing/events/oom/mark_victim/format

The output should include fields similar to these:

field:unsigned short common_type;     offset:0; size:2; signed:0;
field:unsigned char  common_flags;    offset:2; size:1; signed:0;
field:unsigned char  common_preempt_count; offset:3; size:1; signed:0;
field:int            common_pid;      offset:4; size:4; signed:1;

field:int            pid;             offset:8; size:4; signed:1;

Note that for the OOM handler, the pid field is the victim’s PID (read from the tracepoint context), not the currently running task. The OOM killer runs in kernel context and the “current” task may be any process that happened to trigger the memory allocation that tipped the system over.

Warning

Validate before trusting the output. Tracepoint format offsets can shift between kernel versions and between architectures (x86_64 vs arm64 in particular). If the BPF output doesn’t match what /sys/kernel/debug/tracing/trace_pipe reports for the same events, re-read the format file before tuning anything else.

Wiring the userspace loader

Following the cumulative pattern from Lab 5, userspace loads each program and attaches it to its tracepoint:

let program_pf = bpf.program_mut("handle_page_fault").unwrap();
program_pf.load()?;
program_pf.attach("exceptions", "page_fault_user")?;

let program_oom = bpf.program_mut("handle_oom").unwrap();
program_oom.load()?;
program_oom.attach("oom", "mark_victim")?;

Running it

Warning

The page_fault_user tracepoint fires extremely frequently on a busy system. Every new page touched by every process generates an event. Without PID filtering, you will be overwhelmed with output. Always use --pid when tracing page faults.

Tracing page faults for a specific process

First, find the PID of a process you want to trace:

sleep 3600 &
echo $!
# e.g., 5100

Start tuxscope with a PID filter:

sudo tuxscope mem --pid 5100

A sleeping process touches almost no new memory, so expect little or no output from it. To see a steady stream of faults, trace a program that allocates a buffer and writes to every page:

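// pagefault_demo.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    // Give yourself time to attach tuxscope after starting the process.
    sleep(2);

    // Allocate 100 MB. malloc() only reserves virtual address space here.
    size_t size = 100 * 1024 * 1024;
    // volatile keeps an optimizing compiler from discarding the writes below.
    volatile char *buf = malloc(size);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }

    // Touch every page (4096 bytes apart) to trigger page faults
    for (size_t i = 0; i < size; i += 4096) {
        buf[i] = 'A';
    }

    free((void *)buf);
    return 0;
}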

Compile it, start it in the background, then attach tuxscope to its PID. The demo sleeps briefly before allocating so you have time to attach:

gcc -o pagefault_demo pagefault_demo.c
./pagefault_demo &
DEMO_PID=$!
sudo tuxscope mem --pid "$DEMO_PID"

Increase the sleep duration if the demo still finishes before tuxscope attaches.

You will see a stream of page fault events:

[PAGE_FAULT]  pid=5200 address=0x7f3a4c000000 comm=pagefault_demo
[PAGE_FAULT]  pid=5200 address=0x7f3a4c001000 comm=pagefault_demo
[PAGE_FAULT]  pid=5200 address=0x7f3a4c002000 comm=pagefault_demo
[PAGE_FAULT]  pid=5200 address=0x7f3a4c003000 comm=pagefault_demo
...

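Notice that the addresses increment by 0x1000 (4096 bytes), which is the page size. Each write to a new page triggers exactly one fault. If transparent huge pages are enabled for the mapping, a single fault can back a whole 2 MB region, so you may instead see far fewer events with addresses spaced further apart.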
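
The arithmetic behind that observation is small enough to check in code. This sketch assumes 4096-byte pages; `expected_faults` and `is_page_stride` are hypothetical helpers for sanity-checking a captured trace.

```rust
const PAGE_SIZE: u64 = 4096;

/// Number of pages (and therefore first-touch faults) needed to cover `bytes`.
fn expected_faults(bytes: u64) -> u64 {
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE
}

/// True when every address is exactly one page above the previous one.
fn is_page_stride(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[1].wrapping_sub(w[0]) == PAGE_SIZE)
}

fn main() {
    for mb in [1u64, 10, 100] {
        println!("{} MB -> {} faults expected", mb, expected_faults(mb * 1024 * 1024));
    }

    // Addresses copied from the sample output above.
    let addrs = [
        0x7f3a4c000000u64,
        0x7f3a4c001000,
        0x7f3a4c002000,
        0x7f3a4c003000,
    ];
    println!("sequential page stride: {}", is_page_stride(&addrs));
}
```

A real trace of the demo will usually contain a few extra faults from program startup and the C library, so treat the computed figure as a lower bound for the allocation itself.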

Understanding the addresses

The 0x7f... range is the mmap region where large malloc() allocations land. Compare this with stack faults, which appear near 0x7fff...:

[PAGE_FAULT]  pid=5200 address=0x7f3a4c000000   <-- heap/mmap region
[PAGE_FAULT]  pid=5200 address=0x7ffd8a123000   <-- stack region

The address tells you what kind of memory access triggered the fault, which is invaluable for diagnosing performance issues caused by excessive page faulting.
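
Guessing the region from the leading hex digits works as a rule of thumb, but /proc/<pid>/maps gives a definitive answer. The sketch below looks up which mapping contains a given address; `parse_maps_line` and `region_for` are hypothetical helpers, and path names containing spaces are truncated to their first word for simplicity.

```rust
/// Parse one line of /proc/<pid>/maps into (start, end, label).
fn parse_maps_line(line: &str) -> Option<(u64, u64, String)> {
    let mut cols = line.split_whitespace();
    let (start, end) = cols.next()?.split_once('-')?;
    let start = u64::from_str_radix(start, 16).ok()?;
    let end = u64::from_str_radix(end, 16).ok()?;
    // The next four columns are perms, offset, dev, and inode.
    // The optional fifth is a path or a tag such as [heap] or [stack].
    let label = cols.nth(4).unwrap_or("[anonymous]").to_string();
    Some((start, end, label))
}

/// Return the label of the mapping that contains `addr`, if any.
fn region_for(maps_text: &str, addr: u64) -> Option<String> {
    maps_text
        .lines()
        .filter_map(parse_maps_line)
        .find(|(start, end, _)| *start <= addr && addr < *end)
        .map(|(_, _, label)| label)
}

fn main() {
    let on_stack = 0u8;
    let on_heap = Box::new(0u8);
    match std::fs::read_to_string("/proc/self/maps") {
        Ok(maps) => {
            let stack_addr = &on_stack as *const u8 as u64;
            let heap_addr = &*on_heap as *const u8 as u64;
            println!("{:#x} -> {:?}", stack_addr, region_for(&maps, stack_addr));
            println!("{:#x} -> {:?}", heap_addr, region_for(&maps, heap_addr));
        }
        Err(e) => eprintln!("could not read /proc/self/maps: {}", e),
    }
}
```

To classify addresses from a tuxscope trace, read /proc/<pid>/maps of the traced process while it is still running and pass each faulting address to `region_for`.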

JSON output

sudo tuxscope mem --pid 5200 --format json

{"pid":5200,"event_type":"page_fault","address":"0x7f3a4c000000","timestamp_ns":230481567890,"comm":"pagefault_demo"}
{"pid":5200,"event_type":"page_fault","address":"0x7f3a4c001000","timestamp_ns":230481568120,"comm":"pagefault_demo"}
{"pid":5200,"event_type":"page_fault","address":"0x7f3a4c002000","timestamp_ns":230481568340,"comm":"pagefault_demo"}

Observing the OOM killer

Warning

The following exercise will intentionally exhaust memory on your system. Run it only on a test machine or VM. Active processes may be killed.

To trigger the OOM killer, you can use a stress tool:

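# In one terminal, trace without --pid so the victim is reported whichever process it is.
# Page-fault events from other processes may scroll by too; watch for [OOM_KILL] lines.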
sudo tuxscope mem

# In another terminal, exhaust memory
stress-ng --vm 4 --vm-bytes 90% --vm-keep -t 60

If the system enters OOM, you will see:

[OOM_KILL]  pid=6100 comm=stress-ng

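This tells you which process was selected as the victim: pid is the victim's PID as reported by the tracepoint. The comm value comes from bpf_get_current_comm(), so it names the task that was running when the tracepoint fired, which is often, but not always, the victim. The kernel chooses the victim based on memory consumption, adjusted by oom_score_adj.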

Exercises

  1. Count faults per allocation size. Write a program that allocates 1 MB, 10 MB, and 100 MB in sequence (touching every page each time). Pipe the JSON output through jq and count the page faults for each phase. Verify that the count matches size / 4096.

  2. Compare malloc vs mmap. Allocate memory using mmap(MAP_ANONYMOUS | MAP_PRIVATE) directly instead of malloc(). Do you see any difference in the page fault pattern? Try adding MAP_POPULATE and observe what changes.

  3. Stack depth faults. Write a deeply recursive function and trace page faults. You should see faults near the top of user address space (typically 0x7fff...) as the stack grows. How many pages does the kernel allocate before the stack overflows?

  4. OOM score investigation. Before triggering OOM, check the oom_score of various processes (cat /proc/<pid>/oom_score). After the OOM killer fires, verify that it chose the process with the highest score. Try adjusting oom_score_adj and observe the change.

What’s next

In Lab 7: Disk I/O Profiling, you will trace the Linux block layer to measure I/O latency. That lab introduces a key new eBPF technique: using a HashMap for stateful tracking across two tracepoints, allowing you to compute the time between an I/O request being issued and completed.