Tutorial

Tuxscope Lab 3: File I/O Observation

Probe the VFS layer with kprobes on vfs_read and vfs_write to observe per-process file I/O volume in real time.

6 min read intermediate

Prerequisites

  • Completed Labs 1 and 2 (Hello eBPF, Syscall Tracing)

Part 3 of 7 in Tuxscope: Linux Kernel Observability with eBPF

In Lab 2 you traced every syscall a process makes. You could see that a process called read and write, but not how much data it moved. In this lab you go one level deeper: instead of tracing the syscall interface, you probe the kernel’s Virtual File System layer to capture the byte count for every read and write operation.

This is also the first lab where you use kprobes instead of tracepoints: a different attachment mechanism that lets you hook into any kernel function, not just the predefined tracepoint locations.

The complete source code is at gitlab.com/sfoerster/tuxscope.

Note

Prerequisites: You need a Linux system running kernel 5.8 or later with root access and the tuxscope binary built from source. You should have completed Lab 1 and Lab 2.

The Virtual File System (VFS)

The VFS is one of the most important abstractions in the Linux kernel. It provides a uniform interface for file operations regardless of the underlying storage: ext4, XFS, NFS, procfs, tmpfs, and even sockets all look the same from the VFS perspective.

When a process calls read(), the call path looks like this:

 User Space          Kernel Space
┌──────────┐        ┌───────────────────────────────────────────┐
│           │        │                                           │
│  read()  ─┼───────→│  sys_read()                               │
│           │        │      │                                    │
│           │        │      v                                    │
│           │        │  vfs_read()           ← we probe here    │
│           │        │      │                                    │
│           │        │      v                                    │
│           │        │  file->f_op->read()   (filesystem driver) │
│           │        │      │                                    │
│           │        │      v                                    │
│           │        │  ext4_file_read()  or  nfs_file_read()    │
│           │        │      or tmpfs_read()  or ...              │
│           │        │                                           │
└──────────┘        └───────────────────────────────────────────┘

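vfs_read() and vfs_write() are chokepoint functions: the read() and write() syscalls pass through them regardless of filesystem type. By probing these two functions, you capture most conventional file I/O on the system. Some paths bypass them, for example memory-mapped I/O and vectored calls such as readv(), which goes through vfs_readv() instead.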

The vfs_read function signature is:

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos);

The third parameter, count, is the number of bytes the process requested to read. This is what we capture. The same pattern applies to vfs_write:

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos);

Note

Requested vs actual bytes: We capture the count parameter (bytes requested), not the return value (bytes actually transferred). To get the actual byte count you would need a kretprobe, which captures the return value. For observability purposes, the requested count is usually close enough and avoids the overhead of a second probe per operation.
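The gap between requested and actual bytes can be seen from plain userspace Rust, without any eBPF. The sketch below is illustrative only and is not part of tuxscope: read_once is a hypothetical helper, and an in-memory Cursor stands in for a short file. The first value in each pair is what the kprobe on vfs_read would record; the second is what a kretprobe would see.

```rust
use std::io::{Cursor, Read};

/// Issue one read with a buffer of `requested` bytes and report
/// (requested, actual). `requested` corresponds to the `count` argument,
/// `actual` to the return value.
fn read_once<R: Read>(src: &mut R, requested: usize) -> std::io::Result<(usize, usize)> {
    let mut buf = vec![0u8; requested];
    let actual = src.read(&mut buf)?;
    Ok((requested, actual))
}

fn main() -> std::io::Result<()> {
    // A 10-byte in-memory "file" stands in for a short file on disk.
    let mut src = Cursor::new(b"0123456789".to_vec());

    // First read: asks for 4096 bytes, gets the 10 that exist.
    let first = read_once(&mut src, 4096)?;
    // Second read: asks for 4096 bytes again, gets 0 (end of data).
    let second = read_once(&mut src, 4096)?;

    println!("requested={} actual={}", first.0, first.1);
    println!("requested={} actual={}", second.0, second.1);
    Ok(())
}
```

Both calls request 4096 bytes, so a count-only probe would report them identically even though one transferred 10 bytes and the other none.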

Kprobes vs tracepoints

Labs 1 and 2 used tracepoints. Tracepoints are stable, well-documented hooks placed at specific locations in the kernel source code by kernel developers. They have a defined argument format and are considered part of the kernel’s ABI; they rarely change between versions.

Kprobes are different. A kprobe can attach to almost any kernel function; you just name the function and the kprobe infrastructure inserts a breakpoint at its entry point. When the function is called, your BPF program runs before the function body executes.

Feature          Tracepoint                      Kprobe
Attachment       Predefined locations            Any kernel function
Stability        Stable ABI, rarely changes      Function may be renamed/removed
Argument access  Structured format with offsets  Raw function parameters by index
Performance      Slightly lower overhead         Slightly higher (breakpoint-based)
Availability     Limited set                     Thousands of functions

The tradeoff: tracepoints are stable but limited; kprobes are flexible but can break if the kernel renames or removes a function. vfs_read and vfs_write have been stable for many kernel versions, so this is a reasonable place to use kprobes.

Warning

Kprobe stability: Unlike tracepoints, kprobe targets are not part of the kernel’s stable ABI. The vfs_read and vfs_write functions have been present and stable for years, but there is no guarantee they will not be refactored in a future kernel release. If your kprobe fails to attach, check whether the function still exists with grep -w vfs_read /proc/kallsyms (the -w flag keeps similarly named symbols such as vfs_readv out of the results).
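The same check can be done programmatically before attaching. The following is a minimal sketch, not tuxscope code: has_symbol is a hypothetical helper, and it assumes the usual /proc/kallsyms line format of address, type, symbol name, and an optional module in brackets. It compares the symbol column exactly, so vfs_readv does not count as a match for vfs_read.

```rust
/// Return true if `name` appears as an exact symbol name in
/// kallsyms-formatted text. Each line looks like:
/// "<address> <type> <symbol> [module]".
fn has_symbol(kallsyms: &str, name: &str) -> bool {
    kallsyms
        .lines()
        // The symbol name is the third whitespace-separated column.
        .filter_map(|line| line.split_whitespace().nth(2))
        .any(|symbol| symbol == name)
}

fn main() {
    match std::fs::read_to_string("/proc/kallsyms") {
        Ok(text) => {
            for target in ["vfs_read", "vfs_write"].iter() {
                println!("{}: present = {}", target, has_symbol(&text, target));
            }
        }
        Err(err) => eprintln!("could not read /proc/kallsyms: {}", err),
    }
}
```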

Accessing function arguments in kprobes

In a kprobe, function arguments are accessed through the ProbeContext by index. On x86_64, the first six arguments are passed in registers rdi, rsi, rdx, rcx, r8, r9 (following the System V AMD64 calling convention). The BPF kprobe infrastructure maps these to argument indices 0 through 5.

For vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos):

Index  Register  Parameter
0      rdi       struct file *
1      rsi       char __user *
2      rdx       size_t count
3      rcx       loff_t *pos

We want index 2, the byte count.
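As a quick reference, the index-to-register mapping can be written down as a lookup. This is only an illustration of the x86_64 calling convention described above; arg_register and SYSV_ARG_REGISTERS are hypothetical names, and nothing like this is needed in the probe itself, where ctx.arg(n) does the work.

```rust
/// System V AMD64 integer argument registers, in argument order.
const SYSV_ARG_REGISTERS: [&str; 6] = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"];

/// Map a zero-based argument index to its x86_64 register name.
/// Returns None for index 6 and above, which are passed on the stack.
fn arg_register(index: usize) -> Option<&'static str> {
    SYSV_ARG_REGISTERS.get(index).copied()
}

fn main() {
    let params = [
        "struct file *file",
        "char __user *buf",
        "size_t count",
        "loff_t *pos",
    ];
    // Print one line per vfs_read parameter with its register.
    for (index, param) in params.iter().enumerate() {
        let register = arg_register(index).unwrap_or("stack");
        println!("arg({}) -> {:<4} {}", index, register, param);
    }
}
```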

The event struct

// tuxscope-common/src/lib.rs

#[repr(C)]
#[derive(Clone, Copy)]
pub struct FileIoEvent {
    pub pid: u32,
    pub op: u8,        // 0 = read, 1 = write
    pub bytes: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

The op field distinguishes reads from writes. Using a single event type for both operations keeps the userspace code simple: one struct, one buffer, one reader.
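Because the struct is shared between the kernel and userspace sides as raw bytes, its exact layout matters. The sketch below copies the struct definition and prints its size, alignment, and field offsets, assuming a 64-bit target where u64 is 8-byte aligned. field_offsets is a hypothetical helper written for this illustration.

```rust
use std::mem::{align_of, size_of};
use std::ptr::addr_of;

#[repr(C)]
#[derive(Clone, Copy)]
pub struct FileIoEvent {
    pub pid: u32,
    pub op: u8, // 0 = read, 1 = write
    pub bytes: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

/// Byte offsets of (op, bytes, timestamp_ns, comm) from the start of the struct.
fn field_offsets() -> (usize, usize, usize, usize) {
    let e = FileIoEvent { pid: 0, op: 0, bytes: 0, timestamp_ns: 0, comm: [0; 16] };
    let base = addr_of!(e) as usize;
    (
        addr_of!(e.op) as usize - base,
        addr_of!(e.bytes) as usize - base,
        addr_of!(e.timestamp_ns) as usize - base,
        addr_of!(e.comm) as usize - base,
    )
}

fn main() {
    println!(
        "size = {}, align = {}",
        size_of::<FileIoEvent>(),
        align_of::<FileIoEvent>()
    );
    println!("offsets (op, bytes, timestamp_ns, comm) = {:?}", field_offsets());
}
```

With #[repr(C)], op sits at offset 4 and bytes at offset 8, so the compiler inserts three padding bytes after op. The struct occupies 40 bytes even though its fields add up to 37.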

The eBPF programs

Two kprobes, one for each VFS function, share the same RingBuf and event format:

// tuxscope-ebpf/src/fileio.rs

use aya_ebpf::{
    macros::{kprobe, map},
    maps::RingBuf,
    programs::ProbeContext,
    helpers::{bpf_get_current_pid_tgid, bpf_ktime_get_ns, bpf_get_current_comm},
};
use tuxscope_common::FileIoEvent;

#[map]
static EVENTS: RingBuf = RingBuf::with_byte_size(256 * 1024, 0);

#[kprobe]
pub fn trace_vfs_read(ctx: ProbeContext) -> u32 {
    match try_file_io(&ctx, 0) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

#[kprobe]
pub fn trace_vfs_write(ctx: ProbeContext) -> u32 {
    match try_file_io(&ctx, 1) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

fn try_file_io(ctx: &ProbeContext, op: u8) -> Result<(), i64> {
    // The upper 32 bits hold the TGID, which is what userspace calls the PID.
    let pid = (bpf_get_current_pid_tgid() >> 32) as u32;
    let timestamp_ns = unsafe { bpf_ktime_get_ns() };
    let comm = bpf_get_current_comm().map_err(|e| e as i64)?;

    // Argument index 2 is the `count` parameter (size_t)
    let bytes: u64 = ctx.arg(2).ok_or(1i64)?;

    let event = FileIoEvent {
        pid,
        op,
        bytes,
        timestamp_ns,
        comm,
    };

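    // Reserve a slot in the ring buffer. If the buffer is full, reserve
    // returns None and this event is dropped rather than stalling the kernel.
    if let Some(mut entry) = EVENTS.reserve::<FileIoEvent>(0) {
        entry.write(event);
        entry.submit(0);
    }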

    Ok(())
}

Key points:

  1. Two #[kprobe] functions call the same inner function with a different op value. This avoids duplicating the event-building logic.
  2. ctx.arg(2) reads the third function argument (zero-indexed). On x86_64, this is the rdx register: the count parameter.
  3. Same RingBuf pattern as Lab 2: reserve, write, submit.

The userspace side attaches both kprobes:

// Attaching kprobes (simplified)
use aya::programs::KProbe;

let vfs_read: &mut KProbe = bpf.program_mut("trace_vfs_read").unwrap().try_into()?;
vfs_read.load()?;
vfs_read.attach("vfs_read", 0)?;

let vfs_write: &mut KProbe = bpf.program_mut("trace_vfs_write").unwrap().try_into()?;
vfs_write.load()?;
vfs_write.attach("vfs_write", 0)?;

The attach call takes the kernel function name as a string. If the function does not exist in the running kernel, this returns an error.
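Once the probes are attached, userspace has to turn each ring buffer record back into a FileIoEvent. The following is a minimal sketch of that step, assuming each record arrives as a byte slice holding one #[repr(C)] struct on a 64-bit target. decode_event and format_row are hypothetical names and may differ from what tuxscope actually does; the column widths are chosen to match the sample output in this lab.

```rust
use std::mem::size_of;

#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct FileIoEvent {
    pub pid: u32,
    pub op: u8, // 0 = read, 1 = write
    pub bytes: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

/// Copy one ring buffer record into a FileIoEvent.
/// Returns None if the record is shorter than the struct.
fn decode_event(record: &[u8]) -> Option<FileIoEvent> {
    if record.len() < size_of::<FileIoEvent>() {
        return None;
    }
    // SAFETY: the length was checked above, every bit pattern is valid for
    // these integer and byte-array fields, and read_unaligned accepts any
    // source alignment.
    Some(unsafe { std::ptr::read_unaligned(record.as_ptr() as *const FileIoEvent) })
}

/// Render one event as a row: PID, COMM, OP, BYTES, TIMESTAMP_NS.
fn format_row(event: &FileIoEvent) -> String {
    // comm is NUL-padded; keep only the bytes before the first NUL.
    let len = event.comm.iter().position(|&b| b == 0).unwrap_or(event.comm.len());
    let comm = String::from_utf8_lossy(&event.comm[..len]);
    let op = if event.op == 0 { "read" } else { "write" };
    format!(
        "{:<8} {:<16} {:<7} {:<9} {}",
        event.pid, comm, op, event.bytes, event.timestamp_ns
    )
}

fn main() {
    // Hand-built 40-byte record standing in for one ring buffer item
    // (native endianness, field offsets 0, 4, 8, 16, 24).
    let mut record = [0u8; 40];
    record[0..4].copy_from_slice(&1842u32.to_ne_bytes());
    record[4] = 0; // op = read
    record[8..16].copy_from_slice(&1u64.to_ne_bytes());
    record[16..24].copy_from_slice(&9_824_100_293_847u64.to_ne_bytes());
    record[24..28].copy_from_slice(b"bash");

    println!("{:<8} {:<16} {:<7} {:<9} {}", "PID", "COMM", "OP", "BYTES", "TIMESTAMP_NS");
    if let Some(event) = decode_event(&record) {
        println!("{}", format_row(&event));
    }
}
```

Checking the record length before the unaligned read means a truncated record is rejected instead of read out of bounds.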

Running it

sudo tuxscope fileio
PID      COMM             OP      BYTES     TIMESTAMP_NS
1842     bash             read    1         9824100293847
3217     sshd             read    16384     9824100312054
3217     sshd             write   128       9824100345891
2910     journald         read    8192      9824100398234
4501     Xwayland         write   32768     9824100402847
1842     bash             write   4         9824100456023
891      postgres         read    8192      9824100498412
891      postgres         write   8192      9824100512331

Filter to a specific process and watch it work:

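# Copy a file while tracing.
# The copy starts from a shell that waits briefly and then execs cp, so the
# PID is known before the copy begins. (pgrep -x cp would find nothing here,
# because cp has not started yet.) Run sudo -v first so sudo does not prompt.
sh -c 'sleep 2; exec cp /usr/share/dict/words /tmp/words-copy' &
sudo tuxscope fileio --pid $!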
PID      COMM             OP      BYTES     TIMESTAMP_NS
5102     cp               read    131072    9824200293847
5102     cp               write   131072    9824200312054
5102     cp               read    131072    9824200345891
5102     cp               write   131072    9824200378234
5102     cp               read    131072    9824200402847
5102     cp               write   131072    9824200435691
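5102     cp               read    131072    9824200468234

The cp command reads and writes in 128 KB chunks (131072 bytes). The final read requests another 131072 bytes but returns 0, which is how cp knows it reached the end of the file. Because the probe records the requested count rather than the return value, that last read looks like any other in the BYTES column. This chunked read-write pattern is how most file copy operations work.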
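The read-until-zero loop behind this trace can be sketched in a few lines of Rust. This is an illustration of the general pattern, not the actual implementation of cp: chunked_copy and CopyStats are hypothetical names, and an in-memory Cursor and Vec stand in for the source and destination files.

```rust
use std::io::{self, Cursor, Read, Write};

#[derive(Debug, Default, PartialEq)]
struct CopyStats {
    reads: usize,  // number of read calls, including the final one that returns 0
    writes: usize, // number of write_all calls
    bytes: usize,  // total bytes copied
}

/// Copy `src` to `dst` in fixed-size chunks until a read returns 0.
fn chunked_copy<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    chunk: usize,
) -> io::Result<CopyStats> {
    let mut buf = vec![0u8; chunk];
    let mut stats = CopyStats::default();
    loop {
        // Every call requests `chunk` bytes, whatever is left in the source.
        let n = src.read(&mut buf)?;
        stats.reads += 1;
        if n == 0 {
            break; // a return value of 0 signals end of file
        }
        // Write only the bytes that were actually read.
        dst.write_all(&buf[..n])?;
        stats.writes += 1;
        stats.bytes += n;
    }
    Ok(stats)
}

fn main() -> io::Result<()> {
    // 300 KiB of in-memory data stands in for the source file.
    let mut src = Cursor::new(vec![7u8; 300 * 1024]);
    let mut dst: Vec<u8> = Vec::new();
    let stats = chunked_copy(&mut src, &mut dst, 128 * 1024)?;
    println!("{:?}", stats);
    Ok(())
}
```

For 300 KiB of input and a 128 KiB buffer, the loop issues four reads (two full chunks, one partial chunk, and the terminating zero-length result) and three writes.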

JSON output:

sudo tuxscope fileio --pid 5102 --format json
{"pid":5102,"op":"read","bytes":131072,"comm":"cp","timestamp_ns":9824200293847}
{"pid":5102,"op":"write","bytes":131072,"comm":"cp","timestamp_ns":9824200312054}
{"pid":5102,"op":"read","bytes":131072,"comm":"cp","timestamp_ns":9824200345891}
{"pid":5102,"op":"write","bytes":131072,"comm":"cp","timestamp_ns":9824200378234}

Interpreting I/O patterns

Different workloads have recognizable I/O signatures:

Sequential copy (like cp): alternating read-write pairs of equal size, typically 128 KB or 256 KB. High throughput, low syscall count per byte.

Log writing (like journald): frequent small writes (tens to hundreds of bytes). High syscall count per byte. If you see a process doing many tiny writes, it may benefit from buffering.

Database workload (like postgres): mixed reads and writes at page-aligned sizes (typically 8 KB for PostgreSQL). Reads and writes are interleaved as the database processes queries.

Interactive shell: single-byte reads (one keystroke at a time) and small writes (prompt output, command output).
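One rough way to tell these signatures apart is to compare operation counts with byte totals per process. The sketch below is a hypothetical userspace post-processing step, not a tuxscope feature: summarize and IoSummary are illustrative names, and the input is a list of (pid, bytes) pairs taken from the sample output earlier in this lab.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct IoSummary {
    ops: u64,   // number of read/write events seen
    bytes: u64, // sum of the requested byte counts
}

impl IoSummary {
    /// Average requested bytes per operation (0 if there were no operations).
    fn avg_bytes_per_op(&self) -> u64 {
        if self.ops == 0 { 0 } else { self.bytes / self.ops }
    }
}

/// Group (pid, bytes) events by PID and accumulate counts and totals.
fn summarize(events: &[(u32, u64)]) -> BTreeMap<u32, IoSummary> {
    let mut out: BTreeMap<u32, IoSummary> = BTreeMap::new();
    for &(pid, bytes) in events {
        let entry = out.entry(pid).or_default();
        entry.ops += 1;
        entry.bytes += bytes;
    }
    out
}

fn main() {
    let events: [(u32, u64); 8] = [
        (1842, 1),
        (3217, 16384),
        (3217, 128),
        (2910, 8192),
        (4501, 32768),
        (1842, 4),
        (891, 8192),
        (891, 8192),
    ];
    println!("{:<8} {:<6} {:<10} {}", "PID", "OPS", "BYTES", "AVG");
    for (pid, s) in summarize(&events) {
        println!("{:<8} {:<6} {:<10} {}", pid, s.ops, s.bytes, s.avg_bytes_per_op());
    }
}
```

A low average with a high operation count points toward the small-write pattern described above, while a large average suggests bulk transfer. Any thresholds you pick are a judgment call for your workload.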

Warning

Noise from kernel-internal I/O: Not all VFS operations come from user-initiated syscalls. The kernel generates VFS reads and writes for page cache management, swap, and internal bookkeeping. If you see unexpected I/O from processes like kworker or kswapd, that is normal kernel background activity.

Exercises

  1. Add the filename. The first argument to vfs_read and vfs_write is a struct file *. From this pointer you can reach file->f_path.dentry->d_name.name to get the filename. Use bpf_probe_read_kernel to safely dereference each pointer in the chain. This is tricky because the BPF verifier is strict about pointer dereferences, but it makes the output dramatically more useful.

  2. Aggregate bytes per process. Instead of streaming every event, use a BPF HashMap to accumulate total bytes read and written per PID. In userspace, periodically dump the map to show a top-like view of I/O by process. Tools such as biotop (block layer) and filetop (VFS layer) use this aggregate-in-kernel, dump-periodically approach.

  3. Capture actual bytes with a kretprobe. Attach a kretprobe to vfs_read to capture the return value (actual bytes read, or a negative error code). Compare the requested count from the kprobe entry with the actual bytes from the return. How often do reads return fewer bytes than requested?

  4. Compare with iotop. Run sudo iotop -o alongside sudo tuxscope fileio and compare the data. Where do the numbers agree? Where do they diverge? The answer reveals the difference between VFS-level I/O and block-level I/O: reads served from the page cache never reach the block layer.

What’s next

In Lab 4: Network Monitoring, you will probe the TCP stack to observe network connections in real time. You will use kretprobes for the first time to capture return values, read socket structures with bpf_probe_read_kernel, and extract IP addresses and port numbers from kernel data structures.