In Lab 2 you traced every syscall a process makes. You could see that a process called read and write, but not how much data it moved. In this lab you go one level deeper: instead of tracing the syscall interface, you probe the kernel’s Virtual File System layer to capture the byte count for every read and write operation.
This is also the first lab where you use kprobes instead of tracepoints: a different attachment mechanism that lets you hook into any kernel function, not just the predefined tracepoint locations.
The complete source code is at gitlab.com/sfoerster/tuxscope.
Note
Prerequisites: You need a Linux system running kernel 5.8 or later with root access and the tuxscope binary built from source. You should have completed Lab 1 and Lab 2.
The Virtual File System (VFS)
The VFS is one of the most important abstractions in the Linux kernel. It provides a uniform interface for file operations regardless of the underlying storage: ext4, XFS, NFS, procfs, tmpfs, and even sockets all look the same from the VFS perspective.
When a process calls read(), the call path looks like this:
User Space Kernel Space
┌──────────┐ ┌───────────────────────────────────────────┐
│ │ │ │
│ read() ─┼───────→│ sys_read() │
│ │ │ │ │
│ │ │ v │
│ │ │ vfs_read() ← we probe here │
│ │ │ │ │
│ │ │ v │
│ │ │ file->f_op->read() (filesystem driver) │
│ │ │ │ │
│ │ │ v │
│ │ │ ext4_file_read() or nfs_file_read() │
│ │ │ or tmpfs_read() or ... │
│ │ │ │
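└──────────┘ └───────────────────────────────────────────┘
vfs_read() and vfs_write() are choke points: every read() and write() system call passes through them, regardless of filesystem type. By probing these two functions, you capture that I/O for every process on the system. Calls that take other kernel paths, such as readv(), sendfile(), and memory-mapped I/O, do not pass through these two functions and are not captured.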
The vfs_read function signature is:
ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos);
The third parameter, count, is the number of bytes the process requested to read. This is what we capture. The same pattern applies to vfs_write:
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos);
Note
Requested vs actual bytes: We capture the count parameter (bytes requested), not the return value (bytes actually transferred). To get the actual byte count you would need a kretprobe, which captures the return value. For observability purposes, the requested count is usually close enough and avoids the overhead of a second probe per operation.
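The gap between requested and actual bytes is easy to see from userspace. The following is a minimal sketch using only the Rust standard library, not part of tuxscope; the scratch file name and the read_once helper are hypothetical. It issues one read with a 4096-byte buffer against a 10-byte file: the buffer length is what a kprobe on vfs_read would record as count, and the return value is what a kretprobe would record.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Issue a single read with a buffer of `requested` bytes and return
/// (requested, actual). `requested` corresponds to the `count` argument a
/// kprobe sees; `actual` corresponds to the return value a kretprobe sees.
fn read_once(path: &Path, requested: usize) -> io::Result<(usize, usize)> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; requested];
    let actual = file.read(&mut buf)?;
    Ok((requested, actual))
}

fn main() -> io::Result<()> {
    // Hypothetical scratch file, 10 bytes long.
    let path = std::env::temp_dir().join("tuxscope_requested_vs_actual.txt");
    fs::write(&path, b"0123456789")?;

    let (requested, actual) = read_once(&path, 4096)?;
    println!("requested {} bytes, got {}", requested, actual);

    fs::remove_file(&path)?;
    Ok(())
}
```

For a small regular file like this one, the request is 4096 bytes but only 10 come back, so a tool that reports count overstates the transfer for short files and for the final read of any file.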
Kprobes vs tracepoints
Labs 1 and 2 used tracepoints. Tracepoints are stable, well-documented hooks placed at specific locations in the kernel source code by kernel developers. They have a defined argument format and are considered part of the kernel’s ABI; they rarely change between versions.
Kprobes are different. A kprobe can attach to almost any kernel function; you just name the function and the kprobe infrastructure inserts a breakpoint at its entry point. When the function is called, your BPF program runs before the function body executes.
| Feature | Tracepoint | Kprobe |
|---|---|---|
| Attachment | Predefined locations | Any kernel function |
| Stability | Stable ABI, rarely changes | Function may be renamed/removed |
| Argument access | Structured format with offsets | Raw function parameters by index |
| Performance | Slightly lower overhead | Slightly higher (breakpoint-based) |
| Availability | Limited set | Thousands of functions |
The tradeoff: tracepoints are stable but limited; kprobes are flexible but can break if the kernel renames or removes a function. vfs_read and vfs_write have been stable for many kernel versions, so this is a reasonable place to use kprobes.
Warning
Kprobe stability: Unlike tracepoints, kprobe targets are not part of the kernel’s stable ABI. The vfs_read and vfs_write functions have been present and stable for years, but there is no guarantee they will not be refactored in a future kernel release. If your kprobe fails to attach, check whether the function still exists with cat /proc/kallsyms | grep vfs_read.
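The same check can be done programmatically before attaching. This is a rough sketch, not tuxscope code: the has_symbol helper is hypothetical, and it assumes the usual kallsyms line layout of address, type, symbol name, and an optional module name. Unlike a plain grep, it compares the whole symbol name, so vfs_readv does not count as a match for vfs_read.

```rust
use std::fs;

/// Return true if `name` appears as an exact symbol name in kallsyms-formatted
/// text. Each line is expected to look like "<address> <type> <symbol> [module]".
fn has_symbol(kallsyms: &str, name: &str) -> bool {
    kallsyms
        .lines()
        .filter_map(|line| line.split_whitespace().nth(2))
        .any(|symbol| symbol == name)
}

fn main() {
    match fs::read_to_string("/proc/kallsyms") {
        Ok(text) => {
            for target in ["vfs_read", "vfs_write"] {
                let status = if has_symbol(&text, target) { "present" } else { "missing" };
                println!("{}: {}", target, status);
            }
        }
        // Not on Linux, or the file is not readable in this environment.
        Err(err) => eprintln!("cannot read /proc/kallsyms: {}", err),
    }
}
```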
Accessing function arguments in kprobes
In a kprobe, function arguments are accessed through the ProbeContext by index. On x86_64, the first six arguments are passed in registers rdi, rsi, rdx, rcx, r8, r9 (following the System V AMD64 calling convention). The BPF kprobe infrastructure maps these to argument indices 0 through 5.
For vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos):
| Index | Register | Parameter |
|---|---|---|
| 0 | rdi | struct file * |
| 1 | rsi | char __user * |
| 2 | rdx | size_t count |
| 3 | rcx | loff_t *pos |
We want index 2, the byte count.
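To make the index-to-register mapping concrete, here is a small illustrative lookup in plain Rust. It only models the table above for x86_64; the arg_register function is a hypothetical helper, not something the BPF runtime or aya provides.

```rust
/// Registers that carry the first six integer or pointer arguments under the
/// System V AMD64 calling convention, in argument order.
const X86_64_ARG_REGISTERS: [&str; 6] = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"];

/// Map a zero-based argument index to its x86_64 register, if it has one.
fn arg_register(index: usize) -> Option<&'static str> {
    X86_64_ARG_REGISTERS.get(index).copied()
}

fn main() {
    let params = [
        "struct file *file",
        "char __user *buf",
        "size_t count",
        "loff_t *pos",
    ];
    for (index, param) in params.iter().enumerate() {
        let register = arg_register(index).unwrap_or("?");
        println!("arg({}) -> {:<3} -> {}", index, register, param);
    }
}
```

Other architectures use different registers, which is one reason to read arguments by index rather than by register name.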
The event struct
// tuxscope-common/src/lib.rs
#[repr(C)]
#[derive(Clone, Copy)]
pub struct FileIoEvent {
    pub pid: u32,
    pub op: u8, // 0 = read, 1 = write
    pub bytes: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}
The op field distinguishes reads from writes. Using a single event type for both operations keeps the userspace code simple: one struct, one buffer, one reader.
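Because the struct is #[repr(C)], its layout is predictable, which matters when the kernel side and the userspace side must agree on the bytes. The sketch below copies the struct and measures its layout with the standard library; the field_offsets helper is hypothetical, and the numbers it prints assume a typical 64-bit target where u64 is 8-byte aligned.

```rust
use std::mem::{align_of, size_of};
use std::ptr::addr_of;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct FileIoEvent {
    pub pid: u32,
    pub op: u8, // 0 = read, 1 = write
    pub bytes: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

/// Byte offsets of (pid, op, bytes, timestamp_ns, comm) from the struct start.
fn field_offsets() -> [usize; 5] {
    let event = FileIoEvent { pid: 0, op: 0, bytes: 0, timestamp_ns: 0, comm: [0; 16] };
    let base = addr_of!(event) as usize;
    [
        addr_of!(event.pid) as usize - base,
        addr_of!(event.op) as usize - base,
        addr_of!(event.bytes) as usize - base,
        addr_of!(event.timestamp_ns) as usize - base,
        addr_of!(event.comm) as usize - base,
    ]
}

fn main() {
    println!("size  = {}", size_of::<FileIoEvent>());
    println!("align = {}", align_of::<FileIoEvent>());
    println!("offsets (pid, op, bytes, timestamp_ns, comm) = {:?}", field_offsets());
}
```

On such a target there are three padding bytes after op so that bytes starts on an 8-byte boundary. The padding is harmless here, but it is worth knowing about if you ever parse the record by hand.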
The eBPF programs
Two kprobes, one for each VFS function, share the same RingBuf and event format:
// tuxscope-ebpf/src/fileio.rs
use aya_ebpf::{
    macros::{kprobe, map},
    maps::RingBuf,
    programs::ProbeContext,
    helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns, bpf_get_current_comm},
};
use tuxscope_common::FileIoEvent;

#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);

#[kprobe]
pub fn trace_vfs_read(ctx: ProbeContext) -> u32 {
    match try_file_io(&ctx, 0) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

#[kprobe]
pub fn trace_vfs_write(ctx: ProbeContext) -> u32 {
    match try_file_io(&ctx, 1) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

fn try_file_io(ctx: &ProbeContext, op: u8) -> Result<(), i64> {
    // The upper 32 bits hold the thread group ID, which userspace calls the PID
    let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
    let timestamp_ns = unsafe { bpf_ktime_get_ns() };
    let comm = bpf_get_current_comm().map_err(|e| e as i64)?;

    // Argument index 2 is the `count` parameter (size_t)
    let bytes: u64 = ctx.arg(2).ok_or(1i64)?;

    let event = FileIoEvent {
        pid,
        op,
        bytes,
        timestamp_ns,
        comm,
    };

    // If the ring buffer is full, reserve returns None and the event is dropped
    if let Some(mut entry) = EVENTS.reserve::<FileIoEvent>(0) {
        entry.write(event);
        entry.submit(0);
    }
    Ok(())
}
Key points:
- Two #[kprobe] functions call the same inner function with a different op value. This avoids duplicating the event-building logic.
- ctx.arg(2) reads the third function argument (zero-indexed). On x86_64, this is the rdx register: the count parameter.
- Same RingBuf pattern as Lab 2: reserve, write, submit.
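One detail in try_file_io deserves a closer look: the shift by 32 on the value from bpf_get_current_pid_tgid. The helper packs two IDs into one 64-bit value, with the thread group ID (what userspace tools show as the PID) in the upper half and the thread ID in the lower half. The sketch below shows the split in plain Rust; split_pid_tgid is a hypothetical helper and the sample IDs are made up.

```rust
/// Split the 64-bit value returned by bpf_get_current_pid_tgid() into
/// (tgid, tid). The upper half is the thread group ID, which userspace
/// displays as the PID; the lower half is the ID of the calling thread.
fn split_pid_tgid(pid_tgid: u64) -> (u32, u32) {
    let tgid = (pid_tgid >> 32) as u32;
    let tid = pid_tgid as u32; // truncation keeps the low 32 bits
    (tgid, tid)
}

fn main() {
    // Made-up value: process 5102 with a worker thread 5107.
    let raw: u64 = (5102u64 << 32) | 5107;
    let (tgid, tid) = split_pid_tgid(raw);
    println!("tgid (PID) = {}, tid = {}", tgid, tid);
}
```

Keeping only the upper half means all threads of a process report the same pid in the event, which matches how the output is filtered and displayed.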
The userspace side attaches both kprobes:
// Attaching kprobes (simplified)
use aya::programs::KProbe;
let vfs_read: &mut KProbe = bpf.program_mut("trace_vfs_read").unwrap().try_into()?;
vfs_read.load()?;
vfs_read.attach("vfs_read", 0)?;
let vfs_write: &mut KProbe = bpf.program_mut("trace_vfs_write").unwrap().try_into()?;
vfs_write.load()?;
vfs_write.attach("vfs_write", 0)?;
The attach call takes the kernel function name as a string and an offset into that function; 0 means the function entry. If the function does not exist in the running kernel, this returns an error.
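Once the probes are attached, userspace has to turn each ring buffer record back into a FileIoEvent. The sketch below shows one way to do that with the standard library alone, assuming the reader hands each record over as a byte slice. The helpers decode_event, comm_to_string, format_row, and sample_record are hypothetical names, and sample_record builds its bytes using the field offsets of a typical 64-bit target.

```rust
use std::mem::size_of;
use std::ptr;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct FileIoEvent {
    pub pid: u32,
    pub op: u8, // 0 = read, 1 = write
    pub bytes: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

/// Copy one event out of a raw record. Returns None if the record is too
/// short to hold a FileIoEvent.
fn decode_event(record: &[u8]) -> Option<FileIoEvent> {
    if record.len() < size_of::<FileIoEvent>() {
        return None;
    }
    // SAFETY: the length was checked above, every bit pattern is valid for
    // these integer fields, and read_unaligned does not require alignment.
    Some(unsafe { ptr::read_unaligned(record.as_ptr() as *const FileIoEvent) })
}

/// Turn the NUL-padded comm field into a printable string.
fn comm_to_string(comm: &[u8; 16]) -> String {
    let len = comm.iter().position(|&b| b == 0).unwrap_or(comm.len());
    String::from_utf8_lossy(&comm[..len]).into_owned()
}

/// Format one event as a space-separated row: PID COMM OP BYTES TIMESTAMP_NS.
fn format_row(event: &FileIoEvent) -> String {
    let op = if event.op == 0 { "read" } else { "write" };
    format!(
        "{} {} {} {} {}",
        event.pid,
        comm_to_string(&event.comm),
        op,
        event.bytes,
        event.timestamp_ns
    )
}

/// Hand-built stand-in for a record from the ring buffer. The offsets
/// (0, 4, 8, 16, 24) assume the #[repr(C)] layout on a 64-bit target.
fn sample_record() -> Vec<u8> {
    let mut record = vec![0u8; size_of::<FileIoEvent>()];
    record[0..4].copy_from_slice(&5102u32.to_ne_bytes());
    record[4] = 0; // read
    record[8..16].copy_from_slice(&131072u64.to_ne_bytes());
    record[16..24].copy_from_slice(&9824200293847u64.to_ne_bytes());
    record[24..26].copy_from_slice(b"cp");
    record
}

fn main() {
    match decode_event(&sample_record()) {
        Some(event) => println!("{}", format_row(&event)),
        None => eprintln!("short record"),
    }
}
```

The length check matters because a reader should not trust that every record is a whole event, and read_unaligned avoids assuming anything about how the slice is aligned in memory.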
Running it
sudo tuxscope fileio
PID COMM OP BYTES TIMESTAMP_NS
1842 bash read 1 9824100293847
3217 sshd read 16384 9824100312054
3217 sshd write 128 9824100345891
2910 journald read 8192 9824100398234
4501 Xwayland write 32768 9824100402847
1842 bash write 4 9824100456023
891 postgres read 8192 9824100498412
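891 postgres write 8192 9824100512331
Filter to a specific process and watch it work:
# Copy a file while tracing. cp has no PID until it starts, so the wrapper
# shell passes its own PID ($$) to the tracer, then execs cp, which keeps that PID.
sudo -v
sh -c 'sudo tuxscope fileio --pid $$ & sleep 1; exec cp /usr/share/dict/words /tmp/words-copy'
PID COMM OP BYTES TIMESTAMP_NS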
5102 cp read 131072 9824200293847
5102 cp write 131072 9824200312054
5102 cp read 131072 9824200345891
5102 cp write 131072 9824200378234
5102 cp read 131072 9824200402847
5102 cp write 131072 9824200435691
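5102 cp read 131072 9824200468234
The cp command reads and writes in 128 KB chunks (131072 bytes). The final read requests another 131072 bytes and the kernel returns 0, which is how cp knows it reached the end of the file. Because the probe records the requested count rather than the return value, that last read appears as 131072 rather than 0. This chunked read-write pattern is how most file copy operations work.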
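The loop behind this pattern can be sketched in a few lines of Rust. This is an illustration of the general technique, not the actual cp source; chunked_copy and CopyStats are hypothetical names, and the sketch copies between in-memory buffers so it runs anywhere.

```rust
use std::io::{self, Cursor, Read, Write};

/// Buffer size matching the 128 KB chunks seen in the trace.
const CHUNK: usize = 128 * 1024;

#[derive(Debug, Default, PartialEq)]
struct CopyStats {
    read_calls: usize,
    write_calls: usize,
    bytes_copied: usize,
}

/// Copy `src` to `dst` one chunk at a time, stopping when a read returns 0.
fn chunked_copy<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<CopyStats> {
    let mut buf = vec![0u8; CHUNK];
    let mut stats = CopyStats::default();
    loop {
        // Every read requests a full chunk, including the last one.
        let n = src.read(&mut buf)?;
        stats.read_calls += 1;
        if n == 0 {
            break; // end of input
        }
        // Counted as one write; write_all may issue more on a real file.
        dst.write_all(&buf[..n])?;
        stats.write_calls += 1;
        stats.bytes_copied += n;
    }
    Ok(stats)
}

fn main() -> io::Result<()> {
    let source = Cursor::new(vec![7u8; 300_000]);
    let mut sink = Vec::new();
    let stats = chunked_copy(source, &mut sink)?;
    println!("{:?}", stats);
    Ok(())
}
```

For a 300,000-byte input the loop performs four reads (two full chunks, one partial chunk, and the terminating read that returns 0) and three writes, which is the same one-extra-read shape as the trace.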
JSON output:
sudo tuxscope fileio --pid 5102 --format json
{"pid":5102,"op":"read","bytes":131072,"comm":"cp","timestamp_ns":9824200293847}
{"pid":5102,"op":"write","bytes":131072,"comm":"cp","timestamp_ns":9824200312054}
{"pid":5102,"op":"read","bytes":131072,"comm":"cp","timestamp_ns":9824200345891}
{"pid":5102,"op":"write","bytes":131072,"comm":"cp","timestamp_ns":9824200378234}
Interpreting I/O patterns
Different workloads have recognizable I/O signatures:
Sequential copy (like cp): alternating read-write pairs of equal size, typically 128 KB or 256 KB. High throughput, low syscall count per byte.
Log writing (like journald): frequent small writes (tens to hundreds of bytes). High syscall count per byte. If you see a process doing many tiny writes, it may benefit from buffering.
Database workload (like postgres): mixed reads and writes at page-aligned sizes (typically 8 KB for PostgreSQL). Reads and writes are interleaved as the database processes queries.
Interactive shell: single-byte reads (one keystroke at a time) and small writes (prompt output, command output).
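These signatures become easier to spot once events are summarized by process and operation. The sketch below is a hypothetical userspace post-processing step, not a tuxscope feature: it groups (pid, op, bytes) tuples and reports the call count and mean request size, so that many tiny writes stand out against a few large ones. The Op, Summary, and summarize names and the sample events are invented for illustration.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Op {
    Read,
    Write,
}

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct Summary {
    calls: u64,
    bytes: u64,
}

impl Summary {
    /// Mean requested bytes per call; 0.0 when there were no calls.
    fn mean_bytes(&self) -> f64 {
        if self.calls == 0 {
            0.0
        } else {
            self.bytes as f64 / self.calls as f64
        }
    }
}

/// Group events by (pid, op) and total their call counts and byte counts.
fn summarize(events: &[(u32, Op, u64)]) -> HashMap<(u32, Op), Summary> {
    let mut out: HashMap<(u32, Op), Summary> = HashMap::new();
    for &(pid, op, bytes) in events {
        let entry = out.entry((pid, op)).or_default();
        entry.calls += 1;
        entry.bytes += bytes;
    }
    out
}

fn main() {
    // Invented events: a copy-like process and a logger-like process.
    let events: [(u32, Op, u64); 5] = [
        (5102, Op::Read, 131072),
        (5102, Op::Write, 131072),
        (2910, Op::Write, 64),
        (2910, Op::Write, 192),
        (2910, Op::Write, 128),
    ];
    let mut rows: Vec<_> = summarize(&events).into_iter().collect();
    rows.sort_by_key(|&((pid, _), _)| pid);
    for ((pid, op), summary) in rows {
        println!(
            "pid {} {:?}: {} calls, mean {:.0} bytes",
            pid, op, summary.calls, summary.mean_bytes()
        );
    }
}
```

A low mean with a high call count points at the log-writing pattern; a mean near 128 KB with few calls points at a sequential copy.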
Warning
Noise from kernel-internal I/O: Not all VFS operations come from user-initiated syscalls. The kernel generates VFS reads and writes for page cache management, swap, and internal bookkeeping. If you see unexpected I/O from processes like kworker or kswapd, that is normal kernel background activity.
Exercises
- Add the filename. The first argument to vfs_read and vfs_write is a struct file *. From this pointer you can reach file->f_path.dentry->d_name.name to get the filename. Use bpf_probe_read_kernel to safely dereference each pointer in the chain. This is tricky because the BPF verifier is strict about pointer dereferences, but it makes the output dramatically more useful.
- Aggregate bytes per process. Instead of streaming every event, use a BPF HashMap to accumulate total bytes read and written per PID. In userspace, periodically dump the map to show a top-like view of I/O by process. This is the same aggregate-in-kernel approach that biotop uses for block I/O.
- Capture actual bytes with a kretprobe. Attach a kretprobe to vfs_read to capture the return value (actual bytes read, or a negative error code). Compare the requested count from the kprobe entry with the actual bytes from the return. How often do reads return fewer bytes than requested?
- Compare with iotop. Run sudo iotop -o alongside sudo tuxscope fileio and compare the data. Where do the numbers agree? Where do they diverge? The answer reveals the difference between VFS-level I/O and block-level I/O: reads served from the page cache never reach the block layer.
What’s next
In Lab 4: Network Monitoring, you will probe the TCP stack to observe network connections in real time. You will use kretprobes for the first time to capture return values, read socket structures with bpf_probe_read_kernel, and extract IP addresses and port numbers from kernel data structures.