Tuxscope Lab 2: Syscall Tracing

Trace all system calls in real time, capture syscall IDs from raw_syscalls/sys_enter, resolve them to names, and switch from PerfEventArray to RingBuf.

Prerequisites

  • Completed Lab 1 (Hello eBPF)
  • Basic understanding of userspace vs kernel

Part 2 of 7 in Tuxscope: Linux Kernel Observability with eBPF

In Lab 1 you attached a probe to a single syscall, write(), and streamed events from kernel to userspace. That was enough to prove the pipeline works, but a real observability tool needs to see everything a process does. In this lab you will trace all system calls by attaching to the raw_syscalls/sys_enter tracepoint, capture the syscall number, and resolve it to a human-readable name in userspace.

Along the way you will replace PerfEventArray with RingBuf, the modern buffer type that solves several problems with the per-CPU approach used in Lab 1.

The complete source code is at gitlab.com/sfoerster/tuxscope.

Note

Prerequisites. You need a Linux system running kernel 5.8 or later, root access (or sudo), and the tuxscope binary built from source. You should have completed Lab 1 and be comfortable with the tuxscope workspace structure.

What are system calls?

A system call (syscall) is the interface between a user-space process and the Linux kernel. When a program needs to do anything that requires kernel privileges (open a file, allocate memory, send a network packet, fork a new process), it makes a syscall.

The process works like this:

  1. The program places the syscall number in a register (rax on x86_64) and arguments in other registers.
  2. It executes the syscall instruction, which traps into the kernel.
  3. The kernel looks up the syscall number in the syscall table, calls the corresponding kernel function, and returns the result.
 User Space                      Kernel Space
┌──────────────┐                ┌──────────────────────────┐
│              │                │  Syscall Table           │
│  open()  ────┼── rax = 2 ───→│  [0] read    → sys_read  │
│              │   syscall      │  [1] write   → sys_write │
│              │   instruction  │  [2] open    → sys_open  │
│              │                │  [3] close   → sys_close │
│  Result ←────┼── rax = fd ───│  ...                     │
│              │                │  [300+] ...              │
└──────────────┘                └──────────────────────────┘
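
The three steps above can be sketched in a few lines of Rust. This is a minimal illustration, not part of tuxscope: the helper name raw_getpid is hypothetical, the inline assembly only targets x86_64 Linux (other targets get a stub that returns None), and it uses getpid (syscall 39 on x86_64) because that call takes no arguments.

```rust
/// Issue getpid (syscall 39 on x86_64) directly, bypassing libc.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn raw_getpid() -> Option<i64> {
    const SYS_GETPID: i64 = 39;
    let ret: i64;
    // SAFETY: getpid takes no arguments and has no side effects. The
    // `syscall` instruction overwrites rcx and r11, so both are declared
    // as clobbered.
    unsafe {
        std::arch::asm!(
            "syscall",
            // Step 1: syscall number goes in rax; step 3: result comes back in rax.
            inlateout("rax") SYS_GETPID => ret,
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack)
        );
    }
    Some(ret)
}

/// Stub for every other target, so the sketch still compiles there.
#[cfg(not(all(target_os = "linux", target_arch = "x86_64")))]
fn raw_getpid() -> Option<i64> {
    None
}

fn main() {
    match raw_getpid() {
        Some(pid) => println!(
            "raw getpid = {pid}, std::process::id() = {}",
            std::process::id()
        ),
        None => println!("this sketch only targets x86_64 Linux"),
    }
}
```

On x86_64 Linux both numbers should match, because the standard library ends up making the same syscall through libc.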

Linux x86_64 has over 300 syscalls. On a given architecture each has a stable number that never changes (the kernel maintains backward compatibility). You can see the full table in the kernel source at arch/x86/entry/syscalls/syscall_64.tbl, or more conveniently by running ausyscall --dump if you have the audit tools installed.

The most common syscalls you will see in trace output:

Number  Name        Purpose
0       read        Read from a file descriptor
1       write       Write to a file descriptor
2       open        Open a file
3       close       Close a file descriptor
9       mmap        Map memory
56      clone       Create a new process/thread
59      execve      Execute a program
231     exit_group  Terminate all threads
257     openat      Open file relative to directory
262     newfstatat  Get file status

Note

openat vs open. Modern programs almost never call open() directly. glibc’s open() wrapper uses openat() (syscall 257) internally, with AT_FDCWD as the directory file descriptor. If you trace a program that opens files and see no open syscalls, look for openat instead.

From PerfEventArray to RingBuf

Lab 1 used PerfEventArray, which creates a separate ring buffer per CPU. This works but has drawbacks:

  • No global ordering. Events from different CPUs arrive in separate buffers. Userspace must merge and sort them to get a chronological view.
  • Per-CPU polling. Userspace needs one reader per CPU, adding complexity.
  • Fixed allocation. Each CPU gets its own buffer, so total memory usage scales with CPU count.

RingBuf (available since Linux 5.8) is a single shared ring buffer across all CPUs:

 PerfEventArray (Lab 1)              RingBuf (Lab 2+)
┌──────────────────────┐            ┌──────────────────────┐
│ CPU 0: [e1] [e4] [e7]│            │                      │
│ CPU 1: [e2] [e5]     │            │ [e1] [e2] [e3] [e4] │
│ CPU 2: [e3] [e6]     │            │ [e5] [e6] [e7]      │
│                      │            │                      │
│ 3 buffers to poll    │            │ 1 buffer to poll     │
│ Events unordered     │            │ Events ordered       │
└──────────────────────┘            └──────────────────────┘

RingBuf provides global ordering (events from all CPUs are delivered in the order their space was reserved), simpler userspace code (one buffer to poll), and better memory efficiency. The tradeoff is slightly more contention under very high event rates on many-core systems, but for observability workloads this is rarely a problem.
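
To make the ordering cost of the per-CPU approach concrete, here is a small userspace sketch of the merge step a PerfEventArray reader needs when it wants a chronological view. It is a plain k-way merge over in-memory vectors, not tuxscope code: the function name merge_per_cpu and the (timestamp, label) event shape are assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge per-CPU event queues (each already sorted by timestamp) into one
/// chronological stream. Events are (timestamp_ns, label) pairs.
fn merge_per_cpu(buffers: &[Vec<(u64, &'static str)>]) -> Vec<(u64, &'static str)> {
    // Min-heap of (timestamp, cpu index, position within that CPU's queue).
    let mut heap = BinaryHeap::new();
    for (cpu, buf) in buffers.iter().enumerate() {
        if let Some(&(ts, _)) = buf.first() {
            heap.push(Reverse((ts, cpu, 0usize)));
        }
    }

    let mut merged = Vec::new();
    while let Some(Reverse((_, cpu, idx))) = heap.pop() {
        merged.push(buffers[cpu][idx]);
        // Queue the next event from the same CPU, if there is one.
        if let Some(&(ts, _)) = buffers[cpu].get(idx + 1) {
            heap.push(Reverse((ts, cpu, idx + 1)));
        }
    }
    merged
}

fn main() {
    // Same events as the PerfEventArray side of the diagram.
    let per_cpu: Vec<Vec<(u64, &'static str)>> = vec![
        vec![(1, "e1"), (4, "e4"), (7, "e7")], // CPU 0
        vec![(2, "e2"), (5, "e5")],            // CPU 1
        vec![(3, "e3"), (6, "e6")],            // CPU 2
    ];
    let labels: Vec<&str> = merge_per_cpu(&per_cpu).iter().map(|e| e.1).collect();
    println!("{labels:?}");
}
```

The merged list comes out as e1 through e7 in order. With RingBuf this step disappears, because the kernel already hands userspace a single ordered stream.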

All remaining tuxscope labs use RingBuf.

The event struct

The syscall event captures the syscall number along with the usual process metadata:

// tuxscope-common/src/lib.rs

#[repr(C)]
#[derive(Clone, Copy)]
pub struct SyscallEvent {
    pub pid: u32,
    pub syscall_id: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

The syscall_id field holds the raw syscall number (e.g., 1 for write, 2 for open). Userspace resolves this to a name. Doing the lookup in userspace rather than in the eBPF program is deliberate: string lookups are expensive in BPF context and would waste cycles in the kernel.
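
Because the same bytes are written in the kernel and read in userspace, it can help to check the struct's layout once. The sketch below repeats the struct definition so it compiles on its own and prints its size, alignment, and padding using only the standard library.

```rust
use std::mem::{align_of, size_of};

// Copy of the struct from tuxscope-common, repeated so this sketch is
// self-contained.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SyscallEvent {
    pub pid: u32,
    pub syscall_id: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

fn main() {
    // Sum of the field sizes, ignoring any padding the compiler inserts.
    let fields = size_of::<u32>() + 2 * size_of::<u64>() + size_of::<[u8; 16]>();
    println!("size      = {}", size_of::<SyscallEvent>());
    println!("alignment = {}", align_of::<SyscallEvent>());
    println!("padding   = {}", size_of::<SyscallEvent>() - fields);
}
```

On x86_64 this reports a size of 40 bytes with 8-byte alignment, which means 4 bytes of padding sit between pid and syscall_id. The #[repr(C)] attribute is what keeps that layout fixed for both sides of the buffer.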

The eBPF program

This program attaches to raw_syscalls/sys_enter, which fires on every syscall entry regardless of type:

// tuxscope-ebpf/src/syscall.rs

use aya_ebpf::{
    macros::{map, tracepoint},
    maps::RingBuf,
    programs::TracePointContext,
    helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns, bpf_get_current_comm},
};
use tuxscope_common::SyscallEvent;

#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);

#[tracepoint]
pub fn syscall_trace(ctx: TracePointContext) -> u32 {
    match try_syscall_trace(&ctx) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

fn try_syscall_trace(ctx: &TracePointContext) -> Result<(), i64> {
    let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
    let timestamp_ns = unsafe { bpf_ktime_get_ns() };
    let comm = bpf_get_current_comm().map_err(|e| e as i64)?;

    // The syscall number is at offset 8 in the raw_syscalls/sys_enter args
    let syscall_id: u64 = unsafe { ctx.read_at(8)? };

    let event = SyscallEvent {
        pid,
        syscall_id,
        timestamp_ns,
        comm,
    };

    if let Some(mut entry) = EVENTS.reserve::<SyscallEvent>(0) {
        entry.write(event);
        entry.submit(0);
    }

    Ok(())
}

Key differences from Lab 1:

  1. RingBuf instead of PerfEventArray. The map is declared with a fixed byte size (256 KB). The reserve / write / submit pattern is how you push events into a RingBuf; you reserve space, write the data, then submit it atomically.
  2. ctx.read_at(8) reads the syscall number from the tracepoint context. The raw_syscalls/sys_enter format has the syscall number at byte offset 8 (after an 8-byte common header). This offset comes from the tracepoint format definition at /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format.
  3. No helper call for the syscall number. Unlike PID and comm, the syscall ID is not available through a BPF helper. It is part of the tracepoint arguments, so you read it directly from the context.

Warning

Tracepoint argument offsets are architecture-specific. The offset 8 for the syscall number is correct for raw_syscalls/sys_enter on x86_64, where the field follows an 8-byte common header. Other 64-bit architectures such as arm64 generally use the same layout for this tracepoint, but their syscall number table is different, so the x86_64 name mapping does not carry over. On 32-bit architectures (i386, arm32) a long is 4 bytes, so the id field is narrower and the fields after it shift by 4 bytes. Always re-read the format file for the architecture you are running on:

cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format
# Look for the line:  field:long id;  offset:8;  size:8;

The offsets are stable for a given (tracepoint, kernel arch) pair, but differ between tracepoints and between architectures. For portable code, use aya’s typed tracepoint bindings or BTF / CO-RE relocations rather than hardcoding numeric offsets. We cover that pattern in the BTF lab later in the series; this lab uses raw offsets to keep the moving parts visible.
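
As a lighter-weight check than full BTF support, the offset can be read out of the format file's text. The sketch below is one possible way to do that; the helper name field_offset is hypothetical, it only handles scalar fields (not arrays such as args[6]), and the embedded sample mirrors a typical x86_64 format file with an illustrative ID value.

```rust
/// Find the byte offset of `field_name` in the text of a tracepoint format file.
fn field_offset(format_text: &str, field_name: &str) -> Option<usize> {
    for line in format_text.lines() {
        let mut parts = line.trim().split(';').map(str::trim);
        // The first part looks like "field:long id"; the name is its last word.
        let Some(decl) = parts.next().and_then(|p| p.strip_prefix("field:")) else {
            continue;
        };
        if decl.split_whitespace().last() != Some(field_name) {
            continue;
        }
        // The remaining parts look like "offset:8", "size:8", "signed:1".
        return parts
            .find_map(|p| p.strip_prefix("offset:"))
            .and_then(|v| v.parse().ok());
    }
    None
}

// Sample text in the shape of raw_syscalls/sys_enter/format on x86_64.
const SAMPLE: &str = "\
name: sys_enter
ID: 348
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;
\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:long id;\toffset:8;\tsize:8;\tsigned:1;
\tfield:unsigned long args[6];\toffset:16;\tsize:48;\tsigned:0;
";

fn main() {
    println!("id offset = {:?}", field_offset(SAMPLE, "id"));
    println!("common_pid offset = {:?}", field_offset(SAMPLE, "common_pid"));
}
```

A loader could run a check like this at startup against the real file and refuse to attach if the offset is not the one the eBPF program was compiled with.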

The userspace handler

Userspace resolves syscall numbers to names using a lookup table. A complete table could be generated from the kernel headers at build time, but for simplicity tuxscope includes a static mapping of the most common x86_64 syscalls:

// tuxscope/src/syscall_names.rs (excerpt)

pub fn syscall_name(id: u64) -> &'static str {
    match id {
        0 => "read",
        1 => "write",
        2 => "open",
        3 => "close",
        4 => "stat",
        5 => "fstat",
        8 => "lseek",
        9 => "mmap",
        10 => "mprotect",
        11 => "munmap",
        12 => "brk",
        56 => "clone",
        57 => "fork",
        59 => "execve",
        60 => "exit",
        231 => "exit_group",
        257 => "openat",
        262 => "newfstatat",
        _ => "unknown",
    }
}
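
If you want more than the static excerpt, one option is to parse the kernel's syscall_64.tbl format into a map. The sketch below is not how tuxscope does it; parse_syscall_table is a hypothetical helper, and the sample holds only a few lines written in the table's column layout (number, ABI, name, entry point).

```rust
use std::collections::HashMap;

/// Build a number-to-name map from text in the syscall_64.tbl format:
/// `<number> <abi> <name> [<entry point>]`, with `#` starting a comment line.
fn parse_syscall_table(tbl: &str) -> HashMap<u64, String> {
    let mut names = HashMap::new();
    for line in tbl.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut cols = line.split_whitespace();
        let (Some(nr), Some(abi), Some(name)) = (cols.next(), cols.next(), cols.next()) else {
            continue;
        };
        // Keep the "common" and "64" ABIs; skip x32-only entries.
        if abi == "x32" {
            continue;
        }
        if let Ok(nr) = nr.parse::<u64>() {
            names.insert(nr, name.to_string());
        }
    }
    names
}

// A few lines in the table's format, for illustration only.
const SAMPLE: &str = "\
# 64-bit system call numbers and entry vectors
0    common  read    sys_read
1    common  write   sys_write
2    common  open    sys_open
59   64      execve  sys_execve
257  common  openat  sys_openat
520  x32     execve  compat_sys_execve
";

fn main() {
    let names = parse_syscall_table(SAMPLE);
    for id in [0u64, 59, 257, 520] {
        let name = names.get(&id).map(String::as_str).unwrap_or("unknown");
        println!("{id:<4} {name}");
    }
}
```

Running this kind of parser from a build script would keep the name table in sync with the kernel source, at the cost of needing that source at build time.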

The event reader for RingBuf is simpler than the one for PerfEventArray: one buffer, one reader.

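// Reading from RingBuf (simplified)
use std::time::Duration;

use aya::maps::RingBuf;
use tuxscope_common::SyscallEvent;

let mut ring_buf = RingBuf::try_from(bpf.take_map("EVENTS").unwrap())?;

loop {
    // Drain every record that is currently available
    while let Some(item) = ring_buf.next() {
        // Skip records too short to hold a SyscallEvent
        if item.len() < core::mem::size_of::<SyscallEvent>() {
            continue;
        }
        // SAFETY: the eBPF side wrote a plain #[repr(C)] SyscallEvent;
        // read_unaligned makes no assumption about the buffer's alignment
        let event = unsafe { (item.as_ptr() as *const SyscallEvent).read_unaligned() };
        let comm = core::str::from_utf8(&event.comm)
            .unwrap_or("<invalid>")
            .trim_end_matches('\0');
        let name = syscall_name(event.syscall_id);
        println!("{:<8} {:<16} {:<4} {}", event.pid, comm, event.syscall_id, name);
    }
    // Brief sleep to avoid busy-spinning
    tokio::time::sleep(Duration::from_millis(10)).await;
}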
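
The comm decoding in the reader validates all 16 bytes as UTF-8 before trimming, so stray bytes after the terminating NUL would turn the whole name into "<invalid>". A slightly more defensive variant, sketched below with a hypothetical helper named comm_to_str, cuts at the first NUL and validates only that prefix.

```rust
/// Decode a kernel `comm` buffer: take the bytes up to the first NUL, then
/// validate only that prefix as UTF-8.
fn comm_to_str(comm: &[u8; 16]) -> &str {
    let len = comm.iter().position(|&b| b == 0).unwrap_or(comm.len());
    std::str::from_utf8(&comm[..len]).unwrap_or("<invalid>")
}

fn main() {
    let mut comm = [0u8; 16];
    comm[..4].copy_from_slice(b"bash");
    // Simulate leftover, non-UTF-8 bytes after the terminator.
    comm[8] = 0xff;
    println!("{:?}", comm_to_str(&comm));
}
```

The helper borrows from the input buffer, so it allocates nothing and fits in the per-event hot path.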

Running it

Trace all syscalls system-wide:

sudo tuxscope syscall
PID      COMM             ID   NAME
1842     bash             1    write
1842     bash             0    read
3217     sshd             0    read
3217     sshd             1    write
4501     Xwayland         7    poll
1        systemd          232  epoll_wait
1842     bash             12   brk
2910     journald         0    read
4501     Xwayland         1    write
1842     bash             257  openat
1842     bash             5    fstat
1842     bash             3    close

The output is extremely noisy system-wide. Filter to a single process to see what it is actually doing:

# In one terminal, find the PID of a shell
echo $$
# 1842

# In another terminal, trace that shell
sudo tuxscope syscall --pid 1842

Now type ls in the traced shell and watch the syscalls:

PID      COMM             ID   NAME
1842     bash             0    read
1842     bash             59   execve
1842     ls               12   brk
1842     ls               257  openat
1842     ls               262  newfstatat
1842     ls               3    close
1842     ls               5    fstat
1842     ls               9    mmap
1842     ls               9    mmap
1842     ls               257  openat
1842     ls               0    read
1842     ls               3    close
1842     ls               1    write
1842     ls               3    close
1842     ls               231  exit_group

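You can read the story: bash reads your input (read), forks and execs ls (execve), ls sets up its memory (brk, mmap), opens and stats files (openat, newfstatat), writes the result to stdout (write), and exits (exit_group). On a live system the fork itself appears as a clone call, and the child that runs ls has its own PID, so the ls lines normally carry a different PID than the bash lines. Every user-visible action is a sequence of syscalls.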

JSON output:

sudo tuxscope syscall --pid 1842 --format json
{"pid":1842,"syscall_id":0,"syscall_name":"read","comm":"bash","timestamp_ns":9823851029384}
{"pid":1842,"syscall_id":59,"syscall_name":"execve","comm":"bash","timestamp_ns":9823851031205}
{"pid":1842,"syscall_id":12,"syscall_name":"brk","comm":"ls","timestamp_ns":9823851045891}
{"pid":1842,"syscall_id":257,"syscall_name":"openat","comm":"ls","timestamp_ns":9823851098234}

Note

comm changes mid-trace. Notice how the comm field changes from bash to ls after the execve syscall. The bpf_get_current_comm() helper returns the current process name at the time the BPF program runs. After execve replaces the process image, the name changes, but the PID stays the same.

Understanding the output

A few patterns to look for when reading syscall traces:

Process startup always begins with execve, followed by a burst of brk, mmap, openat (loading shared libraries), mprotect (setting memory permissions), and close.

File operations show up as openat (get a file descriptor), read/write (use it), close (release it). If you see many openat/close pairs with no reads or writes, the program is likely stat-checking files.

Network activity appears as socket, connect, sendto, recvfrom. You cannot see the IP addresses from the syscall trace alone; that requires probing deeper, which is what Lab 4 does.

Idle processes will show epoll_wait, poll, or select, which are all ways of saying “wake me up when something happens.”
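
These patterns can be turned into a rough first-pass filter. The sketch below groups syscall names into the four categories described above; the Activity enum and classify function are hypothetical, and the grouping is a heuristic (openat and close, for example, occur during startup as well as during ordinary file I/O).

```rust
#[derive(Debug, PartialEq, Eq)]
enum Activity {
    Startup,
    FileIo,
    Network,
    Idle,
    Other,
}

/// Map a syscall name to the broad pattern it most often belongs to.
fn classify(name: &str) -> Activity {
    match name {
        "execve" | "brk" | "mmap" | "mprotect" => Activity::Startup,
        "open" | "openat" | "read" | "write" | "close" => Activity::FileIo,
        "socket" | "connect" | "sendto" | "recvfrom" => Activity::Network,
        "epoll_wait" | "poll" | "select" => Activity::Idle,
        _ => Activity::Other,
    }
}

fn main() {
    // Names taken from the sample traces in this lab.
    for name in ["execve", "openat", "sendto", "epoll_wait", "newfstatat"] {
        println!("{:<12} {:?}", name, classify(name));
    }
}
```

Applied to a stream of events, a tally per category gives a quick sense of whether a process is starting up, doing I/O, talking to the network, or mostly waiting.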

Exercises

  1. Count syscalls by type. Pipe the JSON output through jq to count how many times each syscall occurs for a given process. Which syscalls does your shell make most often while idle? What about during a find / -name "*.conf" run?

  2. Add syscall duration. Attach a second program to raw_syscalls/sys_exit and compute the time between entry and exit for each syscall. Store the entry timestamp in a BPF HashMap keyed by (pid, tid). This is how strace-like tools compute syscall latency.

  3. Build a syscall allow-list. Record all syscalls a program makes during normal operation, then write a BPF program that logs any syscall not on the allow-list. This is the foundation of seccomp profile generation.

  4. Explore the tracepoint format file. Read /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format and verify that the syscall number is at offset 8. Then look at /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format. What additional fields does it expose?

What’s next

Syscall tracing tells you what a process asks the kernel to do, but not the details. In Lab 3: File I/O Observation, you will probe deeper by attaching kprobes to the VFS layer (the kernel’s virtual file system) to see exactly how much data each process reads and writes.