When a program allocates memory with malloc(), the kernel does not immediately give it physical RAM. Instead, it hands back a virtual address and waits. The first time the program actually touches that address, the CPU raises a page fault, and the kernel steps in to map a physical page. This lazy strategy is called demand paging, and it is one of the most important performance optimizations in the Linux kernel.
In this lab you will attach eBPF programs to the page fault and OOM killer tracepoints. You will see demand paging happen in real time, learn to distinguish minor faults from major faults, and observe what happens when the system runs out of memory entirely.
Note
Prerequisites: This tutorial is part of the Tuxscope series. You need a built tuxscope binary, Linux 5.8+, and root privileges.
Virtual memory and page faults
Every process on Linux operates in its own virtual address space. The kernel maintains a page table that maps virtual addresses to physical page frames. But the mapping is not established when memory is allocated; it is established when memory is first accessed.
The virtual address space
A 64-bit Linux process has a vast address space, but only portions of it are mapped at any time:
Virtual Address Space (per process)
┌─────────────────────────┐ 0xFFFFFFFFFFFFFFFF
│                         │
│      Kernel Space       │ (inaccessible from userspace)
│                         │
├─────────────────────────┤ 0x00007FFFFFFFFFFF (typical split)
│          Stack          │ grows downward
│            |            │
│            v            │
│                         │
│     (unmapped gap)      │
│                         │
│            ^            │
│            |            │
│       Heap / mmap       │ grows upward
│                         │
├─────────────────────────┤
│  BSS (zero-init data)   │
│  Data (initialized)     │
│  Text (code)            │
├─────────────────────────┤
│       (unmapped)        │
└─────────────────────────┘ 0x0000000000000000
When your program calls malloc(4096), the C library requests memory from the kernel (via brk() or mmap()). The kernel updates the process’s virtual memory area (VMA) list to record that this range is valid, but it does not allocate a physical page yet. The page table entry for that address remains empty.
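The gap between "valid VMA" and "physical page" can be observed from userspace. The following is a minimal sketch, assuming Linux and glibc; the helper names `resident_pages` and `reserve_anonymous` are made up for this illustration and are not part of Tuxscope. It uses mincore() to ask the kernel which pages of a mapping are currently resident.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages of [addr, addr + len) are resident in RAM.
 * addr must be page-aligned. Returns -1 on error.
 * (Hypothetical helper for this illustration.) */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t n = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(n);
    if (vec == NULL)
        return -1;
    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < n; i++)
            count += vec[i] & 1; /* bit 0 set means the page is resident */
    }
    free(vec);
    return count;
}

/* Reserve len bytes of anonymous memory: a valid VMA with no
 * physical pages behind it yet. Returns NULL on failure.
 * (Hypothetical helper for this illustration.) */
void *reserve_anonymous(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_NOHUGEPAGE
    /* Best effort: ask for base pages so one touch maps one page. */
    (void)madvise(p, len, MADV_NOHUGEPAGE);
#endif
    return p;
}
```

Right after `reserve_anonymous()` returns, `resident_pages()` should report zero for the new range. Writing a byte to two different pages should raise the count to two, one page per first touch.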
What triggers a page fault
The first time the CPU tries to read or write a virtual address that has no page table entry, the hardware raises a page fault exception. The kernel’s page fault handler then:
- Checks whether the address belongs to a valid VMA (if not, delivers SIGSEGV)
- Allocates a physical page frame
- Updates the page table to map virtual address to physical page
- Returns to the faulting instruction, which now succeeds
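The four steps above can be sketched as a toy model. This is not kernel code: the `toy_` types and functions are invented for illustration, the address space is only 16 pages, there is a single VMA, and "allocating a frame" just hands out the next integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NUM_PAGES  16   /* toy address space: 16 pages of 4 KiB */
#define TOY_NO_FRAME   (-1)

struct toy_vma { uintptr_t start, end; }; /* valid range [start, end) */

struct toy_mm {
    struct toy_vma vma;             /* one VMA keeps the sketch short */
    int page_table[TOY_NUM_PAGES];  /* virtual page number -> frame number */
    int next_free_frame;
};

enum toy_fault { TOY_MAPPED, TOY_SEGV };

void toy_mm_init(struct toy_mm *mm, uintptr_t start, uintptr_t end)
{
    assert(start < end);
    mm->vma.start = start;
    mm->vma.end = end;
    mm->next_free_frame = 0;
    for (size_t i = 0; i < TOY_NUM_PAGES; i++)
        mm->page_table[i] = TOY_NO_FRAME;
}

/* What the MMU checks: an access succeeds only if the page is mapped. */
bool toy_is_mapped(const struct toy_mm *mm, uintptr_t addr)
{
    size_t vpn = addr >> TOY_PAGE_SHIFT;
    return vpn < TOY_NUM_PAGES && mm->page_table[vpn] != TOY_NO_FRAME;
}

/* What the fault handler does, following the four steps. */
enum toy_fault toy_handle_fault(struct toy_mm *mm, uintptr_t addr)
{
    size_t vpn = addr >> TOY_PAGE_SHIFT;
    /* Step 1: the address must fall inside a valid VMA, else SIGSEGV. */
    if (vpn >= TOY_NUM_PAGES || addr < mm->vma.start || addr >= mm->vma.end)
        return TOY_SEGV;
    /* Step 2: allocate a physical frame (here: the next unused number). */
    int frame = mm->next_free_frame++;
    /* Step 3: record the mapping in the page table. */
    mm->page_table[vpn] = frame;
    /* Step 4: the caller retries the access, which now succeeds. */
    return TOY_MAPPED;
}
```

With a VMA covering 0x1000 to 0x3000, a fault at 0x1000 maps that page only; 0x2000 stays unmapped until it is touched, and a fault at 0x5000 lies outside the VMA and yields `TOY_SEGV`.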
Program               CPU                     Kernel
   |                   |                       |
   | write to 0x1000   |                       |
   |------------------>|                       |
   |                   | no mapping!           |
   |                   | page fault ---------->|
   |                   |                       | check VMA: valid
   |                   |                       | allocate physical page
   |                   |                       | update page table
   |                   |<----------------------|
   |                   | retry write to 0x1000 |
   |<------------------| (now succeeds)        |
Minor vs major faults
- Minor fault: The page exists somewhere in memory (e.g., a shared library already loaded by another process, or a zero page that just needs allocation). No disk I/O required. Copy-on-write faults after fork() are another common source of minor faults.
- Major fault: The page must be read from disk (e.g., a memory-mapped file, or a page that was swapped out). These are orders of magnitude slower.
The page_fault_user tracepoint fires for all user-space faults. Most faults on a running system are minor: the kernel allocating pages on demand as programs touch newly allocated memory.
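The kernel also keeps per-process minor and major fault counters, which give a quick cross-check for what the tracepoint reports. A minimal sketch, assuming Linux and glibc, reads them with getrusage(); `read_fault_counts` and `touch_fresh_pages` are hypothetical helper names for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

struct fault_counts {
    long minor; /* faults served without disk I/O */
    long major; /* faults that needed disk I/O */
};

/* Snapshot this process's fault counters. */
struct fault_counts read_fault_counts(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    struct fault_counts c = { ru.ru_minflt, ru.ru_majflt };
    return c;
}

/* Map `pages` fresh anonymous pages, write one byte to each, unmap.
 * Returns 0 on success, -1 on failure. */
int touch_fresh_pages(size_t pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    for (size_t i = 0; i < len; i += page)
        p[i] = 'A'; /* first write to each page faults it in */
    return munmap((void *)p, len);
}
```

Taking a snapshot before and after `touch_fresh_pages(64)` should show the minor counter rising while the major counter stays flat, since fresh anonymous memory needs no disk read.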
The OOM killer
When the system runs out of memory and swap, the kernel invokes the OOM (Out Of Memory) killer. It selects a victim using the oom_badness() heuristic (exposed as /proc/<pid>/oom_score), sends it SIGKILL, and reclaims its memory. The oom/mark_victim tracepoint fires when a process is selected for termination.
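To see the inputs to that decision for a given process, you can read its entries under /proc. The sketch below assumes a Linux /proc filesystem; `read_proc_value` is a hypothetical helper name for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a single integer from /proc/<pid>/<name>, for example
 * "oom_score" or "oom_score_adj". Returns 0 on success, -1 on failure. */
int read_proc_value(pid_t pid, const char *name, long *out)
{
    assert(name != NULL && out != NULL);
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fscanf(f, "%ld", out) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, `read_proc_value(getpid(), "oom_score", &score)` fills in the current process's score, and the same call with `"oom_score_adj"` returns the user-set adjustment, which ranges from -1000 to 1000.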
Warning
The OOM killer is a last resort. By the time it fires, the system is in a degraded state. In production, you should have monitoring that triggers well before OOM conditions. Tuxscope’s OOM tracing is useful for understanding the mechanism, not as a production alerting tool.
The eBPF programs
This lab attaches to two tracepoints:
| Tracepoint | Fires when |
|---|---|
| exceptions/page_fault_user | A user-space page fault occurs |
| oom/mark_victim | The OOM killer selects a victim |
The event struct
#[repr(C)]
pub struct MemoryEvent {
    pub pid: u32,
    pub event_type: u8, // 0 = page_fault, 1 = oom_kill
    pub _padding: [u8; 3],
    pub address: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}
The address field holds the faulting virtual address for page faults, or zero for OOM events. Addresses are displayed in hexadecimal, which makes it easier to see which region of the virtual address space is being faulted in; stack addresses look different from heap addresses.
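Because the struct is #[repr(C)], its byte layout is fixed: the explicit padding puts address at offset 8, and the whole record is 40 bytes. As a sketch of what that layout looks like to a non-Rust consumer, here is a hypothetical C mirror of the struct with compile-time layout checks. The names `memory_event` and `decode_memory_event` are assumptions for this illustration; Tuxscope's own userspace side is written in Rust.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C mirror of the Rust MemoryEvent (hypothetical, for illustration). */
struct memory_event {
    uint32_t pid;
    uint8_t  event_type;  /* 0 = page_fault, 1 = oom_kill */
    uint8_t  padding[3];  /* keeps `address` 8-byte aligned */
    uint64_t address;
    uint64_t timestamp_ns;
    char     comm[16];
};

/* Fail the build if the layout ever drifts from the Rust side. */
static_assert(sizeof(struct memory_event) == 40, "unexpected event size");
static_assert(offsetof(struct memory_event, address) == 8, "unexpected offset");
static_assert(offsetof(struct memory_event, comm) == 24, "unexpected offset");

/* Copy one record out of a raw sample in native byte order.
 * Returns 0 on success, -1 if the sample is too short. */
int decode_memory_event(const void *data, size_t len, struct memory_event *out)
{
    if (len < sizeof *out)
        return -1;
    memcpy(out, data, sizeof *out);
    out->comm[sizeof out->comm - 1] = '\0'; /* make comm safe to print */
    return 0;
}
```

Copying with memcpy() rather than casting the sample pointer avoids alignment assumptions about the buffer the sample arrives in.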
The map declaration matches the pattern established in Lab 4: a RingBuf shared between the page-fault and OOM handlers.
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);
The page fault handler
#[tracepoint]
pub fn handle_page_fault(ctx: TracePointContext) -> u32 {
    match try_page_fault(&ctx) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}
fn try_page_fault(ctx: &TracePointContext) -> Result<(), i64> {
    let tgid = (bpf_get_current_pid_tgid() >> 32) as u32;
    // Read the faulting virtual address from the tracepoint context.
    // Offset 8 is where the address field lives in the
    // exceptions/page_fault_user format on x86_64.
    let address: u64 = unsafe { ctx.read_at(8)? };
    let event = MemoryEvent {
        pid: tgid,
        event_type: 0, // page_fault
        _padding: [0; 3],
        address,
        timestamp_ns: unsafe { bpf_ktime_get_ns() },
        comm: bpf_get_current_comm().map_err(|e| e as i64)?,
    };
    EVENTS.output(&event, 0).map_err(|_| 1i64)?;
    Ok(())
}
The offset 8 comes from the tracepoint format file. Inspect it the same way Lab 5 inspected sched_process_fork:
sudo cat /sys/kernel/debug/tracing/events/exceptions/page_fault_user/format
You should see something close to this on x86_64:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:unsigned long address; offset:8; size:8; signed:0;
field:unsigned long ip; offset:16; size:8; signed:0;
field:unsigned long error_code; offset:24; size:8; signed:0;
The address field at offset:8 is what the BPF code reads. If your kernel reports different offsets, update the literal accordingly.
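If you would rather check the offset programmatically than by eye, the format text is simple to scan. The following is a rough sketch, assuming the `field:<type> <name>; offset:<n>;` line shape shown above; `field_offset` is a hypothetical helper, not part of Tuxscope or the kernel tooling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the byte offset of `field_name` in tracepoint format text,
 * or -1 if no such field is found. */
int field_offset(const char *format_text, const char *field_name)
{
    assert(format_text != NULL && field_name != NULL);
    size_t want = strlen(field_name);
    const char *p = format_text;
    while ((p = strstr(p, "field:")) != NULL) {
        const char *decl_end = strchr(p, ';');
        if (decl_end == NULL)
            break;
        /* The field name is the last word of the declaration:
         * walk back from ';' to the preceding separator. */
        const char *name = decl_end;
        while (name > p && name[-1] != ' ' && name[-1] != '\t' && name[-1] != ':')
            name--;
        size_t len = (size_t)(decl_end - name);
        /* Ignore an array suffix such as comm[16]. */
        const char *bracket = memchr(name, '[', len);
        if (bracket != NULL)
            len = (size_t)(bracket - name);
        if (len == want && strncmp(name, field_name, len) == 0) {
            const char *off = strstr(decl_end, "offset:");
            if (off == NULL)
                return -1;
            return (int)strtol(off + strlen("offset:"), NULL, 10);
        }
        p = decl_end;
    }
    return -1;
}
```

Fed the x86_64 listing above, `field_offset(text, "address")` gives 8 and `field_offset(text, "error_code")` gives 24. The name comparison is exact, so asking for `"pid"` does not match `common_pid`.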
The OOM handler
#[tracepoint]
pub fn handle_oom(ctx: TracePointContext) -> u32 {
    match try_oom(&ctx) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}
fn try_oom(ctx: &TracePointContext) -> Result<(), i64> {
    // The victim's PID is at offset 8 in the oom/mark_victim format
    let victim_pid: i32 = unsafe { ctx.read_at(8)? };
    let event = MemoryEvent {
        pid: victim_pid as u32,
        event_type: 1, // oom_kill
        _padding: [0; 3],
        address: 0,
        timestamp_ns: unsafe { bpf_ktime_get_ns() },
        // comm is taken from the current task, which is the task that
        // triggered the OOM path and not necessarily the victim.
        comm: bpf_get_current_comm().map_err(|e| e as i64)?,
    };
    EVENTS.output(&event, 0).map_err(|_| 1i64)?;
    Ok(())
}
Verify the offset before trusting the PID:
sudo cat /sys/kernel/debug/tracing/events/oom/mark_victim/format
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int pid; offset:8; size:4; signed:1;
Note that for the OOM handler, the pid field is the victim’s PID (read from the tracepoint context), not the PID of the currently running task. The OOM killer runs in the context of whichever task triggered the memory allocation that tipped the system over, so the “current” task may be any process. For the same reason, the comm field, which comes from bpf_get_current_comm(), names that triggering task and may differ from the victim’s command name.
Warning
Validate before trusting the output. Tracepoint format offsets can shift between kernel versions and between architectures (x86_64 vs arm64 in particular). If the BPF output doesn’t match what /sys/kernel/debug/tracing/trace_pipe reports for the same events, re-read the format file before tuning anything else.
Wiring the userspace loader
Following the cumulative pattern from Lab 5, userspace loads each program and attaches it to its tracepoint:
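// Requires `use aya::programs::TracePoint;`. program_mut() returns the
// generic Program enum, so convert it to a TracePoint before loading.
let program_pf: &mut TracePoint = bpf.program_mut("handle_page_fault").unwrap().try_into()?;
program_pf.load()?;
program_pf.attach("exceptions", "page_fault_user")?;
let program_oom: &mut TracePoint = bpf.program_mut("handle_oom").unwrap().try_into()?;
program_oom.load()?;
program_oom.attach("oom", "mark_victim")?;
Running it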
Warning
The page_fault_user tracepoint fires extremely frequently on a busy system. Every new page touched by every process generates an event. Without PID filtering, you will be overwhelmed with output. Always use --pid when tracing page faults.
Tracing page faults for a specific process
First, find the PID of a process you want to trace:
sleep 3600 &
echo $!
# e.g., 5100
Start tuxscope with a PID filter:
sudo tuxscope mem --pid 5100
An idle sleep touches almost no new memory once it is running, so expect few events from it. A quick way to generate a steady stream of page faults is to trace a program that allocates a buffer and writes to every page:
// pagefault_demo.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void) {
    // Give yourself time to attach tuxscope after starting the process.
    sleep(2);
    // Allocate 100 MB
    size_t size = 100 * 1024 * 1024;
    char *buf = malloc(size);
    if (buf == NULL) {
        return 1; // allocation failed; nothing to touch
    }
    // Touch every page (4096 bytes apart) to trigger page faults
    for (size_t i = 0; i < size; i += 4096) {
        buf[i] = 'A';
    }
    free(buf);
    return 0;
}
Compile it, start it in the background, then attach tuxscope to its PID. The demo sleeps briefly before allocating so you have time to attach:
gcc -o pagefault_demo pagefault_demo.c
./pagefault_demo &
DEMO_PID=$!
sudo tuxscope mem --pid "$DEMO_PID"
Increase the sleep duration if the demo still finishes before tuxscope attaches.
You will see a stream of page fault events:
[PAGE_FAULT] pid=5200 address=0x7f3a4c000000 comm=pagefault_demo
[PAGE_FAULT] pid=5200 address=0x7f3a4c001000 comm=pagefault_demo
[PAGE_FAULT] pid=5200 address=0x7f3a4c002000 comm=pagefault_demo
[PAGE_FAULT] pid=5200 address=0x7f3a4c003000 comm=pagefault_demo
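...
Notice the addresses incrementing by 0x1000 (4096 bytes), which is the page size. Each write to a new page triggers exactly one fault. If transparent huge pages are enabled for the mapping, the kernel may instead back it with 2 MB pages on x86_64, and you will see far fewer faults.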
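The number of events to expect follows from the buffer size and the page size. A small sketch of the arithmetic, assuming 4096-byte pages as in the demo (`pages_spanned` is a hypothetical helper name):

```c
#include <assert.h>
#include <stddef.h>

/* Number of pages needed to cover `bytes` bytes, rounding up. */
size_t pages_spanned(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

For the 100 MB demo buffer, `pages_spanned((size_t)100 * 1024 * 1024, 4096)` is 25600. The traced total will usually be somewhat higher, because the process also faults in its stack, libraries, and allocator metadata during startup.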
Understanding the addresses
The 0x7f... range is the mmap region where large malloc() allocations land. Compare this with stack faults, which appear near 0x7fff...:
[PAGE_FAULT] pid=5200 address=0x7f3a4c000000 <-- heap/mmap region
[PAGE_FAULT] pid=5200 address=0x7ffd8a123000 <-- stack region
The address tells you what kind of memory access triggered the fault, which is invaluable for diagnosing performance issues caused by excessive page faulting.
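Rather than judging the region by eye, you can look the address up in a snapshot of /proc/<pid>/maps. The sketch below assumes a 64-bit system and the usual maps line shape (`start-end perms offset dev inode [name]`); `classify_address` is a hypothetical helper for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the mapping in /proc/<pid>/maps-style text that contains `addr`
 * and copy its name into `label` ("[anon]" for unnamed mappings).
 * Returns 0 if a mapping contains the address, -1 otherwise. */
int classify_address(const char *maps_text, unsigned long addr,
                     char *label, size_t label_size)
{
    assert(maps_text != NULL && label != NULL && label_size > 0);
    const char *line = maps_text;
    while (*line != '\0') {
        /* Copy one line into a buffer so sscanf cannot run past it. */
        const char *eol = strchr(line, '\n');
        size_t line_len = eol ? (size_t)(eol - line) : strlen(line);
        char buf[512];
        size_t copy = line_len < sizeof buf - 1 ? line_len : sizeof buf - 1;
        memcpy(buf, line, copy);
        buf[copy] = '\0';

        unsigned long start, end;
        char name[256] = "";
        /* Skip perms, offset, device, and inode; the name is optional. */
        int fields = sscanf(buf, "%lx-%lx %*s %*s %*s %*s %255[^\n]",
                            &start, &end, name);
        if (fields >= 2 && addr >= start && addr < end) {
            snprintf(label, label_size, "%s", name[0] ? name : "[anon]");
            return 0;
        }
        line += line_len;
        if (*line == '\n')
            line++;
    }
    return -1;
}
```

Large malloc() buffers show up as unnamed anonymous mappings, while the main thread's stack is labeled `[stack]`. Take the snapshot while the process is still alive, and note that a fault which grows the stack may hit an address just below the `[stack]` range recorded before the fault.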
JSON output
sudo tuxscope mem --pid 5200 --format json
{"pid":5200,"event_type":"page_fault","address":"0x7f3a4c000000","timestamp_ns":230481567890,"comm":"pagefault_demo"}
{"pid":5200,"event_type":"page_fault","address":"0x7f3a4c001000","timestamp_ns":230481568120,"comm":"pagefault_demo"}
{"pid":5200,"event_type":"page_fault","address":"0x7f3a4c002000","timestamp_ns":230481568340,"comm":"pagefault_demo"}
Observing the OOM killer
Warning
The following exercise will intentionally exhaust memory on your system. Run it only on a test machine or VM. Active processes may be killed.
To trigger the OOM killer, you can use a stress tool:
# In one terminal, trace OOM events (no --pid filter needed; OOM is rare)
sudo tuxscope mem
# In another terminal, exhaust memory
stress-ng --vm 4 --vm-bytes 90% --vm-keep -t 60
If the system enters OOM, you will see:
[OOM_KILL] pid=6100 comm=stress-ng
This tells you which process was selected as the victim. The kernel chose it based on oom_score_adj and memory consumption.
Exercises
- Count faults per allocation size. Write a program that allocates 1 MB, 10 MB, and 100 MB in sequence (touching every page each time). Pipe the JSON output through jq and count the page faults for each phase. Verify that the count matches size / 4096.
- Compare malloc vs mmap. Allocate memory using mmap(MAP_ANONYMOUS | MAP_PRIVATE) directly instead of malloc(). Do you see any difference in the page fault pattern? Try adding MAP_POPULATE and observe what changes.
- Stack depth faults. Write a deeply recursive function and trace page faults. You should see faults near the top of user address space (typically 0x7fff...) as the stack grows. How many pages does the kernel allocate before the stack overflows?
- OOM score investigation. Before triggering OOM, check the oom_score of various processes (cat /proc/<pid>/oom_score). After the OOM killer fires, verify that it chose the process with the highest score. Try adjusting oom_score_adj and observe the change.
What’s next
In Lab 7: Disk I/O Profiling, you will trace the Linux block layer to measure I/O latency. That lab introduces a key new eBPF technique: using a HashMap for stateful tracking across two tracepoints, allowing you to compute the time between an I/O request being issued and completed.