In Lab 1 you attached a probe to a single syscall, write(), and streamed events from kernel to userspace. That was enough to prove the pipeline works, but a real observability tool needs to see everything a process does. In this lab you will trace all system calls by attaching to the raw_syscalls/sys_enter tracepoint, capture the syscall number, and resolve it to a human-readable name in userspace.
Along the way you will replace PerfEventArray with RingBuf, the modern buffer type that solves several problems with the per-CPU approach used in Lab 1.
The complete source code is at gitlab.com/sfoerster/tuxscope.
Note
Prerequisites
You need a Linux system running kernel 5.8 or later, root access (or sudo), and the tuxscope binary built from source. You should have completed Lab 1 and be comfortable with the tuxscope workspace structure.
What are system calls?
A system call (syscall) is the interface between a user-space process and the Linux kernel. When a program needs to do anything that requires kernel privileges (open a file, allocate memory, send a network packet, fork a new process), it makes a syscall.
The process works like this:
- The program places the syscall number in a register (rax on x86_64) and arguments in other registers.
- It executes the syscall instruction, which traps into the kernel.
- The kernel looks up the syscall number in the syscall table, calls the corresponding kernel function, and returns the result.
User Space Kernel Space
┌──────────────┐ ┌──────────────────────────┐
│ │ │ Syscall Table │
│ open() ────┼── rax = 2 ───→│ [0] read → sys_read │
│ │ syscall │ [1] write → sys_write │
│ │ instruction │ [2] open → sys_open │
│ │ │ [3] close → sys_close │
│ Result ←────┼── rax = fd ───│ ... │
│ │ │ [300+] ... │
└──────────────┘ └──────────────────────────┘
Linux x86_64 has over 300 syscalls. Each has a stable number that never changes (the kernel maintains backward compatibility). You can see the full table in the kernel source at arch/x86/entry/syscalls/syscall_64.tbl, or more conveniently by running ausyscall --dump if you have the audit tools installed.
The most common syscalls you will see in trace output:
| Number | Name | Purpose |
|---|---|---|
| 0 | read | Read from a file descriptor |
| 1 | write | Write to a file descriptor |
| 2 | open | Open a file |
| 3 | close | Close a file descriptor |
| 9 | mmap | Map memory |
| 56 | clone | Create a new process/thread |
| 59 | execve | Execute a program |
| 231 | exit_group | Terminate all threads |
| 257 | openat | Open file relative to directory |
| 262 | newfstatat | Get file status |
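The numbers in this table come from syscall_64.tbl, whose lines have the shape `<number> <abi> <name> <entry point>`. As a rough sketch of how a complete number-to-name map could be derived from that file, the following parses a small hand-copied excerpt. The function name `parse_syscall_table` and the `SAMPLE_TBL` constant are illustrative assumptions, not part of tuxscope.

```rust
use std::collections::HashMap;

// Hand-copied excerpt in the syscall_64.tbl layout, for illustration only.
const SAMPLE_TBL: &str = "\
# <number> <abi> <name> <entry point>
0\tcommon\tread\t\t\tsys_read
1\tcommon\twrite\t\t\tsys_write
2\tcommon\topen\t\t\tsys_open
59\t64\texecve\t\t\tsys_execve
257\tcommon\topenat\t\t\tsys_openat
520\tx32\texecve\t\t\tcompat_sys_execve
";

/// Build a number -> name map from text in the syscall_64.tbl layout.
/// Only the `common` and `64` ABIs are kept; `x32` entries are skipped.
fn parse_syscall_table(tbl: &str) -> HashMap<u64, String> {
    let mut names = HashMap::new();
    for line in tbl.lines() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut cols = line.split_whitespace();
        let (nr, abi, name) = match (cols.next(), cols.next(), cols.next()) {
            (Some(nr), Some(abi), Some(name)) => (nr, abi, name),
            _ => continue, // malformed line
        };
        if abi != "common" && abi != "64" {
            continue;
        }
        if let Ok(nr) = nr.parse::<u64>() {
            names.insert(nr, name.to_string());
        }
    }
    names
}

fn main() {
    let names = parse_syscall_table(SAMPLE_TBL);
    let mut numbers: Vec<u64> = names.keys().copied().collect();
    numbers.sort_unstable();
    for nr in numbers {
        println!("{:>4} {}", nr, names[&nr]);
    }
}
```

A real build would read the file from a kernel source tree instead of an embedded string; where that file lives on a given machine is left open here.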
Note
openat vs open
Modern programs almost never call open() directly. glibc’s open() wrapper uses openat() (syscall 257) internally, with AT_FDCWD as the directory file descriptor. If you trace a program that opens files and see no open syscalls, look for openat instead.
From PerfEventArray to RingBuf
Lab 1 used PerfEventArray, which creates a separate ring buffer per CPU. This works but has drawbacks:
- No global ordering. Events from different CPUs arrive in separate buffers. Userspace must merge and sort them to get a chronological view.
- Per-CPU polling. Userspace needs one reader per CPU, adding complexity.
- Fixed allocation. Each CPU gets its own buffer, so total memory usage scales with CPU count.
RingBuf (available since Linux 5.8) is a single shared ring buffer across all CPUs:
PerfEventArray (Lab 1) RingBuf (Lab 2+)
┌──────────────────────┐ ┌──────────────────────┐
│ CPU 0: [e1] [e4] [e7]│ │ │
│ CPU 1: [e2] [e5] │ │ [e1] [e2] [e3] [e4] │
│ CPU 2: [e3] [e6] │ │ [e5] [e6] [e7] │
│ │ │ │
│ 3 buffers to poll │ │ 1 buffer to poll │
│ Events unordered │ │ Events ordered │
└──────────────────────┘ └──────────────────────┘
RingBuf provides global ordering (events appear in submission order), simpler userspace code (one buffer to poll), and better memory efficiency. The tradeoff is slightly more contention under very high event rates on many-core systems, but for observability workloads this is rarely a problem.
All remaining tuxscope labs use RingBuf.
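To make the "merge and sort" cost of the per-CPU approach concrete, here is a plain-Rust sketch of the work userspace has to do to recover a chronological view from per-CPU buffers. It is a simulation with no BPF involved; the `Event` type, `merge_per_cpu`, and `sample_buffers` are hypothetical names, and the sample data mirrors the diagram (e1/e4/e7 on CPU 0, and so on).

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Event {
    timestamp_ns: u64,
    id: u32,
}

/// K-way merge of per-CPU buffers, each already ordered by timestamp.
fn merge_per_cpu(buffers: &[Vec<Event>]) -> Vec<Event> {
    // Heap entries are (timestamp, cpu index, position within that CPU's buffer).
    // `Reverse` turns the max-heap into a min-heap, so the oldest event pops first.
    let mut heap = BinaryHeap::new();
    for (cpu, buf) in buffers.iter().enumerate() {
        if let Some(first) = buf.first() {
            heap.push(Reverse((first.timestamp_ns, cpu, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((_, cpu, pos))) = heap.pop() {
        merged.push(buffers[cpu][pos]);
        if let Some(next) = buffers[cpu].get(pos + 1) {
            heap.push(Reverse((next.timestamp_ns, cpu, pos + 1)));
        }
    }
    merged
}

/// Same distribution as the diagram: three CPUs, seven events.
fn sample_buffers() -> Vec<Vec<Event>> {
    let ev = |id: u32| Event { timestamp_ns: u64::from(id) * 100, id };
    vec![
        vec![ev(1), ev(4), ev(7)],
        vec![ev(2), ev(5)],
        vec![ev(3), ev(6)],
    ]
}

fn main() {
    let merged = merge_per_cpu(&sample_buffers());
    let ids: Vec<u32> = merged.iter().map(|e| e.id).collect();
    println!("{:?}", ids); // prints [1, 2, 3, 4, 5, 6, 7]
}
```

With a single shared RingBuf this merge step disappears, because the consumer reads one already-ordered stream.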
The event struct
The syscall event captures the syscall number along with the usual process metadata:
// tuxscope-common/src/lib.rs
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SyscallEvent {
pub pid: u32,
pub syscall_id: u64,
pub timestamp_ns: u64,
pub comm: [u8; 16],
}
The syscall_id field holds the raw syscall number (e.g., 1 for write, 2 for open). Userspace resolves this to a name. Doing the lookup in userspace rather than in the eBPF program is deliberate: string lookups are expensive in BPF context and would waste cycles in the kernel.
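One detail of this struct is easy to miss: with #[repr(C)], the compiler inserts 4 bytes of padding after pid so that syscall_id is 8-byte aligned, which makes the struct 40 bytes on x86_64 rather than 36. The sketch below checks that layout and decodes an event from raw bytes the way a userspace reader might. The helper names (`syscall_id_offset`, `decode`, `comm_str`, `sample_bytes`) are hypothetical, and the hardcoded byte ranges assume a 64-bit target.

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct SyscallEvent {
    pub pid: u32,
    pub syscall_id: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

/// Byte offset of `syscall_id`, measured on a live value.
fn syscall_id_offset() -> usize {
    let e = SyscallEvent { pid: 0, syscall_id: 0, timestamp_ns: 0, comm: [0; 16] };
    let base = &e as *const SyscallEvent as usize;
    let field = &e.syscall_id as *const u64 as usize;
    field - base
}

/// Copy an event out of a byte slice, refusing slices that are too short.
fn decode(bytes: &[u8]) -> Option<SyscallEvent> {
    if bytes.len() < size_of::<SyscallEvent>() {
        return None;
    }
    // read_unaligned makes no assumption about the alignment of the source.
    Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const SyscallEvent) })
}

/// Interpret `comm` as text up to the first NUL byte.
fn comm_str(comm: &[u8; 16]) -> &str {
    let end = comm.iter().position(|&b| b == 0).unwrap_or(comm.len());
    std::str::from_utf8(&comm[..end]).unwrap_or("<invalid>")
}

/// Hand-built record using the 64-bit layout (assumption: bytes 4..8 are padding).
fn sample_bytes() -> Vec<u8> {
    let mut bytes = vec![0u8; size_of::<SyscallEvent>()];
    bytes[0..4].copy_from_slice(&1842u32.to_ne_bytes()); // pid
    bytes[8..16].copy_from_slice(&257u64.to_ne_bytes()); // syscall_id
    bytes[16..24].copy_from_slice(&9_823_851_098_234u64.to_ne_bytes()); // timestamp_ns
    bytes[24..26].copy_from_slice(b"ls"); // comm, NUL-padded
    bytes
}

fn main() {
    println!(
        "size={} align={} syscall_id offset={}",
        size_of::<SyscallEvent>(),
        align_of::<SyscallEvent>(),
        syscall_id_offset()
    );
    let event = decode(&sample_bytes()).expect("buffer holds a full event");
    println!("{} {} {}", event.pid, comm_str(&event.comm), event.syscall_id);
}
```

Because the kernel and userspace sides share the same #[repr(C)] definition through tuxscope-common, both agree on this padding without either side spelling it out.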
The eBPF program
This program attaches to raw_syscalls/sys_enter, which fires on every syscall entry regardless of type:
// tuxscope-ebpf/src/syscall.rs
use aya_ebpf::{
macros::{map, tracepoint},
maps::RingBuf,
programs::TracePointContext,
helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns, bpf_get_current_comm},
};
use tuxscope_common::SyscallEvent;
#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);
#[tracepoint]
pub fn syscall_trace(ctx: TracePointContext) -> u32 {
match try_syscall_trace(&ctx) {
Ok(()) => 0,
Err(_) => 1,
}
}
fn try_syscall_trace(ctx: &TracePointContext) -> Result<(), i64> {
let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
let timestamp_ns = unsafe { bpf_ktime_get_ns() };
let comm = bpf_get_current_comm().map_err(|e| e as i64)?;
// The syscall number is at offset 8 in the raw_syscalls/sys_enter args
let syscall_id: u64 = unsafe { ctx.read_at(8)? };
let event = SyscallEvent {
pid,
syscall_id,
timestamp_ns,
comm,
};
if let Some(mut entry) = EVENTS.reserve::<SyscallEvent>(0) {
entry.write(event);
entry.submit(0);
}
Ok(())
}
Key differences from Lab 1:
- RingBuf instead of PerfEventArray. The map is declared with a fixed byte size (256 KB). The reserve/write/submit pattern is how you push events into a RingBuf: you reserve space, write the data, then submit it atomically.
- ctx.read_at(8) reads the syscall number from the tracepoint context. The raw_syscalls/sys_enter format has the syscall number at byte offset 8 (after an 8-byte common header). This offset comes from the tracepoint format definition at /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format.
- No helper call for the syscall number. Unlike PID and comm, the syscall ID is not available through a BPF helper. It is part of the tracepoint arguments, so you read it directly from the context.
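Note what the `if let Some(..) = EVENTS.reserve(..)` guard implies: when the buffer has no room, reserve returns None and the event is silently skipped. The following is a toy userspace model of that behavior, not the kernel's ring buffer (it ignores per-record headers and wrap-around). `ToyRing` and its methods are hypothetical names used only to show why a slow consumer leads to dropped events.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a fixed-size ring buffer. Not the kernel implementation.
struct ToyRing {
    capacity_bytes: usize,
    used_bytes: usize,
    queue: VecDeque<Vec<u8>>,
    dropped: u64,
}

impl ToyRing {
    fn new(capacity_bytes: usize) -> Self {
        ToyRing { capacity_bytes, used_bytes: 0, queue: VecDeque::new(), dropped: 0 }
    }

    /// Models reserve + write + submit as one step.
    /// Returns false (and counts a drop) when the record does not fit.
    fn try_push(&mut self, record: &[u8]) -> bool {
        if self.used_bytes + record.len() > self.capacity_bytes {
            self.dropped += 1;
            return false;
        }
        self.used_bytes += record.len();
        self.queue.push_back(record.to_vec());
        true
    }

    /// Consumer side: take the oldest record and free its space.
    fn pop(&mut self) -> Option<Vec<u8>> {
        let record = self.queue.pop_front()?;
        self.used_bytes -= record.len();
        Some(record)
    }
}

fn main() {
    // Room for exactly two 40-byte events.
    let mut ring = ToyRing::new(80);
    for i in 0..3 {
        let accepted = ring.try_push(&[0u8; 40]);
        println!("event {} accepted: {}", i, accepted);
    }
    println!("dropped so far: {}", ring.dropped);
    ring.pop(); // the consumer catches up by one record
    println!("after pop, accepted: {}", ring.try_push(&[0u8; 40]));
}
```

In the lab's eBPF program a drop leaves no trace; counting failed reservations in a separate map would be one way to surface them, though that is beyond what this lab implements.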
Warning
Tracepoint argument offsets are architecture-specific
The offset 8 for the syscall number is correct for raw_syscalls/sys_enter on x86_64, where the field follows an 8-byte common header. On other architectures the details can differ: arm64 uses a different syscall number table, and on 32-bit targets such as i386 / arm32 a long is 4 bytes wide, so field sizes and the offsets of later fields change. Always re-read the format file for the architecture you are running on:
cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format
# Look for the line: field:long id; offset:8; size:8;
The offsets are stable for a given (tracepoint, kernel arch) pair, but differ between tracepoints and between architectures. For portable code, use aya’s typed tracepoint bindings or BTF / CO-RE relocations rather than hardcoding numeric offsets. We cover that pattern in the BTF lab later in the series; this lab uses raw offsets to keep the moving parts visible.
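Checking the format file by eye works, but the same check can be scripted. Here is a minimal sketch that extracts a field's offset and size from format-file text. The embedded `SAMPLE_FORMAT` is an abridged, hand-written copy of what the x86_64 file typically looks like, and `find_field` / `FieldLayout` are hypothetical names; on a real system you would read the text from the path shown above.

```rust
/// Location and width of one tracepoint field, as listed in a format file.
#[derive(Debug, PartialEq, Eq)]
struct FieldLayout {
    offset: usize,
    size: usize,
}

// Abridged sample in the tracefs format layout (x86_64 values), for illustration.
const SAMPLE_FORMAT: &str = "\
name: sys_enter
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;
\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:long id;\toffset:8;\tsize:8;\tsigned:1;
\tfield:unsigned long args[6];\toffset:16;\tsize:48;\tsigned:0;
";

/// Find the line declaring `name` and return its offset and size.
fn find_field(format_text: &str, name: &str) -> Option<FieldLayout> {
    for line in format_text.lines() {
        let mut decl: Option<&str> = None;
        let mut offset: Option<usize> = None;
        let mut size: Option<usize> = None;
        // Each field line is a ';'-separated list of `key:value` parts.
        for part in line.split(';') {
            let part = part.trim();
            if let Some(d) = part.strip_prefix("field:") {
                decl = Some(d);
            } else if let Some(v) = part.strip_prefix("offset:") {
                offset = v.parse().ok();
            } else if let Some(v) = part.strip_prefix("size:") {
                size = v.parse().ok();
            }
        }
        // The field name is the last word of the declaration, e.g. "long id".
        let is_match = decl.and_then(|d| d.split_whitespace().last()) == Some(name);
        if is_match {
            if let (Some(offset), Some(size)) = (offset, size) {
                return Some(FieldLayout { offset, size });
            }
        }
    }
    None
}

fn main() {
    match find_field(SAMPLE_FORMAT, "id") {
        Some(layout) => println!("id: offset={} size={}", layout.offset, layout.size),
        None => println!("field not found"),
    }
}
```

The sketch matches plain field names only; array fields such as args[6] would need the bracket suffix stripped first.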
The userspace handler
Userspace resolves syscall numbers to names using a lookup table. A complete table could be generated from the kernel headers at compile time, but for simplicity tuxscope includes a static mapping of the most common x86_64 syscalls:
// tuxscope/src/syscall_names.rs (excerpt)
pub fn syscall_name(id: u64) -> &'static str {
match id {
0 => "read",
1 => "write",
2 => "open",
3 => "close",
4 => "stat",
5 => "fstat",
8 => "lseek",
9 => "mmap",
10 => "mprotect",
11 => "munmap",
12 => "brk",
56 => "clone",
57 => "fork",
59 => "execve",
60 => "exit",
231 => "exit_group",
257 => "openat",
262 => "newfstatat",
_ => "unknown",
}
}
The event reader for RingBuf is simpler than the PerfEventArray one, with a single buffer and a single reader:
// Reading from RingBuf (simplified)
use std::time::Duration;
use aya::maps::RingBuf;
use tuxscope_common::SyscallEvent;
let mut ring_buf = RingBuf::try_from(bpf.take_map("EVENTS").unwrap())?;
loop {
while let Some(item) = ring_buf.next() {
let event: SyscallEvent = unsafe { *(item.as_ptr() as *const SyscallEvent) };
let comm = core::str::from_utf8(&event.comm)
.unwrap_or("<invalid>")
.trim_end_matches('\0');
let name = syscall_name(event.syscall_id);
println!("{:<8} {:<16} {:<4} {}", event.pid, comm, event.syscall_id, name);
}
// Brief sleep to avoid busy-spinning
tokio::time::sleep(Duration::from_millis(10)).await;
}
Running it
Trace all syscalls system-wide:
sudo tuxscope syscall
PID COMM ID NAME
1842 bash 1 write
1842 bash 0 read
3217 sshd 0 read
3217 sshd 1 write
4501 Xwayland 7 poll
1 systemd 232 epoll_wait
1842 bash 12 brk
2910 journald 0 read
4501 Xwayland 1 write
1842 bash 257 openat
1842 bash 5 fstat
1842 bash 3 close
The output is extremely noisy system-wide. Filter to a single process to see what it is actually doing:
# In one terminal, find the PID of a shell
echo $$
# 1842
# In another terminal, trace that shell
sudo tuxscope syscall --pid 1842
Now type ls in the traced shell and watch the syscalls:
PID COMM ID NAME
1842 bash 0 read
1842 bash 59 execve
1842 ls 12 brk
1842 ls 257 openat
1842 ls 262 newfstatat
1842 ls 3 close
1842 ls 5 fstat
1842 ls 9 mmap
1842 ls 9 mmap
1842 ls 257 openat
1842 ls 0 read
1842 ls 3 close
1842 ls 1 write
1842 ls 3 close
1842 ls 231 exit_group
You can read the story: bash reads your input (read), forks and execs ls (execve), ls sets up its memory (brk, mmap), opens and stats files (openat, newfstatat), writes the result to stdout (write), and exits (exit_group). Every user-visible action is a sequence of syscalls.
JSON output:
sudo tuxscope syscall --pid 1842 --format json
{"pid":1842,"syscall_id":0,"syscall_name":"read","comm":"bash","timestamp_ns":9823851029384}
{"pid":1842,"syscall_id":59,"syscall_name":"execve","comm":"bash","timestamp_ns":9823851031205}
{"pid":1842,"syscall_id":12,"syscall_name":"brk","comm":"ls","timestamp_ns":9823851045891}
{"pid":1842,"syscall_id":257,"syscall_name":"openat","comm":"ls","timestamp_ns":9823851098234}
Note
comm changes mid-trace
Notice how the comm field changes from bash to ls after the execve syscall. The bpf_get_current_comm() helper returns the current process name at the time the BPF program runs. After execve replaces the process image, the name changes, but the PID stays the same.
Understanding the output
A few patterns to look for when reading syscall traces:
Process startup always begins with execve, followed by a burst of brk, mmap, openat (loading shared libraries), mprotect (setting memory permissions), and close.
File operations show up as openat (get a file descriptor), read/write (use it), close (release it). If you see many openat/close pairs with no reads or writes, the program is likely stat-checking files.
Network activity appears as socket, connect, sendto, recvfrom. You cannot see the IP addresses from the syscall trace alone; that requires probing deeper, which is what Lab 4 does.
Idle processes will show epoll_wait, poll, or select, all of which are ways of saying “wake me up when something happens.”
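These patterns can be turned into a rough first-pass summary of a trace. The sketch below groups syscall names into the categories described in this section and counts them. The `Category` enum, `classify`, `summarize`, and the grouping itself are illustrative assumptions rather than a tuxscope feature; `LS_TRACE` is the NAME column of the ls trace shown earlier.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Category {
    Startup,
    FileIo,
    Network,
    Memory,
    Idle,
    Other,
}

/// Map a syscall name to a coarse category (a deliberately small, hand-picked set).
fn classify(name: &str) -> Category {
    match name {
        "execve" => Category::Startup,
        "open" | "openat" | "read" | "write" | "close" | "fstat" | "newfstatat" => {
            Category::FileIo
        }
        "socket" | "connect" | "sendto" | "recvfrom" => Category::Network,
        "brk" | "mmap" | "mprotect" | "munmap" => Category::Memory,
        "epoll_wait" | "poll" | "select" => Category::Idle,
        _ => Category::Other,
    }
}

/// Count how many syscalls in a trace fall into each category.
fn summarize(names: &[&str]) -> BTreeMap<Category, usize> {
    let mut counts = BTreeMap::new();
    for &name in names {
        *counts.entry(classify(name)).or_insert(0) += 1;
    }
    counts
}

// The NAME column of the `ls` trace from the "Running it" section.
const LS_TRACE: [&str; 15] = [
    "read", "execve", "brk", "openat", "newfstatat", "close", "fstat", "mmap",
    "mmap", "openat", "read", "close", "write", "close", "exit_group",
];

fn main() {
    for (category, count) in summarize(&LS_TRACE) {
        println!("{:?}: {}", category, count);
    }
}
```

A summary like this loses ordering, so it complements reading the raw trace rather than replacing it.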
Exercises
- Count syscalls by type. Pipe the JSON output through jq to count how many times each syscall occurs for a given process. Which syscalls does your shell make most often while idle? What about during a find / -name "*.conf" run?
- Add syscall duration. Attach a second program to raw_syscalls/sys_exit and compute the time between entry and exit for each syscall. Store the entry timestamp in a BPF HashMap keyed by (pid, tid). This is how strace-like tools compute syscall latency.
- Build a syscall allow-list. Record all syscalls a program makes during normal operation, then write a BPF program that logs any syscall not on the allow-list. This is the foundation of seccomp profile generation.
- Explore the tracepoint format file. Read /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/format and verify that the syscall number is at offset 8. Then look at /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format. What additional fields does it expose?
What’s next
Syscall tracing tells you what a process asks the kernel to do, but not the details. In Lab 3: File I/O Observation, you will probe deeper by attaching kprobes to the VFS layer (the kernel’s virtual file system) to see exactly how much data each process reads and writes.