Tuxscope Lab 7: Profiling Disk I/O Latency with eBPF HashMaps

Every file you read, every database query, every log line written eventually becomes a block I/O request that travels through the Linux block layer to a physical (or virtual) storage device. The time that request spends in flight, from the moment the kernel hands it to the device driver until the device signals completion, is the I/O latency, and it is one of the most critical performance metrics on any system.

This lab introduces a fundamental eBPF technique: stateful tracking with HashMaps. You will attach to two tracepoints, one where the kernel issues an I/O request and one where that request completes, and use an eBPF HashMap to correlate them. This is the same pattern used by production tools like BCC’s biolatency and biosnoop, and understanding it unlocks an entire category of latency measurement problems.

Note

Prerequisites: This tutorial is part of the Tuxscope series. You need a built tuxscope binary, Linux 5.8+, and root privileges.

The Linux block layer

When a filesystem (ext4, btrfs, XFS) needs to read or write data on disk, it does not talk to the hardware directly. It submits a request to the block layer, which manages queuing, merging, scheduling, and dispatching of I/O operations.

Request flow

  Application
      │
      │  write() / read()
      ▼
  VFS (Virtual File System)
      │
      │  page cache hit?  ──── yes ──► return immediately
      │        │
      │        no
      ▼
  Filesystem (ext4, btrfs, XFS)
      │
      │  map file offset to block number
      ▼
  Block Layer
      │
      │  ┌─────────────────────────┐
      │  │  I/O Scheduler          │
      │  │  (mq-deadline/BFQ/none) │
      │  │  merge, sort, prioritize│
      │  └─────────────────────────┘
      │
      │  block_rq_issue ◄──── tuxscope attaches here
      ▼
  Device Driver (NVMe, SCSI, virtio-blk)
      │
      │  DMA transfer
      ▼
  Hardware
      │
      │  completion interrupt
      ▼
  Block Layer
      │
      │  block_rq_complete ◄──── tuxscope attaches here
      ▼
  Filesystem / Page Cache

The two tracepoints we care about bracket the hardware I/O:

block/block_rq_issue fires when the block layer submits a request to the device driver
block/block_rq_complete fires when the device signals the request is done

The time between these two events is the device-level I/O latency; it excludes queuing and scheduling time but captures the actual hardware round trip.

I/O schedulers

The block layer can reorder and merge requests before dispatching them. Linux offers several I/O schedulers:

Scheduler	Best for	How it works
`none`	NVMe SSDs	No reordering, hardware has its own queues
`mq-deadline`	SATA SSDs, HDDs	Batches reads and writes with deadline guarantees
`bfq`	Interactive desktops	Fair bandwidth allocation per process

Check your current scheduler:

cat /sys/block/sda/queue/scheduler
# or for NVMe:
cat /sys/block/nvme0n1/queue/scheduler

Sectors and device numbers

Block I/O is addressed in sectors. Inside the kernel a sector is always 512 bytes, regardless of the device’s physical sector size, so a request to read 4096 bytes is an 8-sector request. Each device is identified by a major:minor number pair; for example, 259:0 is typically the first NVMe device.

Device major:minor    Meaning
─────────────────────────────
8:0                   /dev/sda
8:1                   /dev/sda1
259:0                 /dev/nvme0n1
259:1                 /dev/nvme0n1p1
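
To make the sector arithmetic and device numbering concrete, here is a minimal sketch in plain userspace Rust. The function names are illustrative and not part of tuxscope. The encoding assumed is the kernel's internal dev_t layout (a 12-bit major above a 20-bit minor), which is the form the block tracepoints report.

```rust
/// The block layer's sector unit: 512 bytes, independent of the
/// device's physical sector size.
const SECTOR_SIZE: u64 = 512;

/// Number of 512-byte sectors covered by a request of `bytes` bytes,
/// rounding up for sizes that are not sector-aligned.
fn sectors_for_bytes(bytes: u64) -> u64 {
    (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE
}

/// Byte offset on the device at which a given sector starts.
fn sector_to_byte_offset(sector: u64) -> u64 {
    sector * SECTOR_SIZE
}

/// Pack major:minor the way the kernel's internal dev_t does
/// (assumption: 12 bits of major, 20 bits of minor).
fn encode_dev(major: u32, minor: u32) -> u32 {
    ((major & 0xFFF) << 20) | (minor & 0xF_FFFF)
}
```

With these helpers, sectors_for_bytes(4096) is 8, matching the 8-sector request described above, and encode_dev(259, 0) is the raw value a tracepoint would carry for /dev/nvme0n1.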

The eBPF programs

The event struct

#[repr(C)]
pub struct DiskIoEvent {
    pub pid: u32,
    pub dev: u32,       // device major:minor encoded
    pub sector: u64,
    pub bytes: u32,
    pub rwbs: [u8; 8],  // kernel rwbs string, e.g. "R" or "WS"
    pub latency_ns: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

The latency_ns field is only populated by the completion handler. The issue handler does not emit events to userspace; it only records state. The pid, comm, and bytes fields should come from the issue event, because completion can run in interrupt or worker context rather than in the submitting process.
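
Because the struct is #[repr(C)], userspace can mirror it field for field and copy events straight out of the raw sample bytes. The sketch below shows one way to do that with only the standard library. The parse_event helper is hypothetical, not tuxscope's actual reader, and the 64-byte size noted in the comment assumes a common 64-bit target such as x86_64 or aarch64.

```rust
use std::mem::size_of;

/// Userspace mirror of the eBPF-side event. Field order and types must
/// match exactly. On x86_64 and aarch64 this is 64 bytes, including
/// 4 bytes of padding after `rwbs` so that `latency_ns` is 8-byte aligned.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct DiskIoEvent {
    pub pid: u32,
    pub dev: u32,
    pub sector: u64,
    pub bytes: u32,
    pub rwbs: [u8; 8],
    pub latency_ns: u64,
    pub timestamp_ns: u64,
    pub comm: [u8; 16],
}

/// Hypothetical helper: copy one event out of a raw sample buffer.
/// Returns None if the buffer is too short to hold a full event.
pub fn parse_event(buf: &[u8]) -> Option<DiskIoEvent> {
    if buf.len() < size_of::<DiskIoEvent>() {
        return None;
    }
    // SAFETY: the length was checked above, and DiskIoEvent consists only
    // of integers and byte arrays, so every bit pattern is a valid value.
    // read_unaligned copes with buffers that are not 8-byte aligned.
    Some(unsafe { std::ptr::read_unaligned(buf.as_ptr() as *const DiskIoEvent) })
}
```

Checking size_of::<DiskIoEvent>() on both sides of the kernel boundary is a cheap way to catch a struct that has drifted out of sync.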

The HashMap: stateful tracking

This is the key new concept in this lab. Previous labs attached to single tracepoints where all the information was available in one event. I/O latency requires correlating two events that happen at different times.

The solution is an eBPF HashMap that persists between tracepoint invocations:

#[map]
static DISKIO_START: HashMap<DiskIoKey, DiskIoStart> =
    HashMap::with_max_entries(10240, 0);

#[repr(C)]
#[derive(Clone, Copy, Eq, PartialEq)]
pub struct DiskIoKey {
    pub sector: u64,
    pub dev: u32,
    pub nr_sector: u32,
    pub rwbs: [u8; 8],
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct DiskIoStart {
    pub timestamp_ns: u64,
    pub pid: u32,
    pub bytes: u32,
    pub comm: [u8; 16],
}

This map uses a compound key derived from the block request: device, starting sector, sector count, and the kernel’s rwbs operation string. The value stores the issue timestamp plus the submitting process metadata. The issue handler writes to it; the completion handler reads from it and deletes the entry.

  block_rq_issue                          block_rq_complete
  ─────────────────                       ─────────────────
  dev = 259:0                             dev = 259:0
  sector = 123456                         sector = 123456
  nr_sector = 8                           nr_sector = 8
  rwbs = "W"                              rwbs = "W"
  timestamp = T1                          timestamp = T2
       │                                       │
       │  DISKIO_START.insert(key, start)      │  start = DISKIO_START.get(key)
       │                                       │  latency = T2 - T1
       │                                       │  DISKIO_START.remove(key)
       │                                       │  emit event with latency
       ▼                                       ▼
  ┌──────────────────────────────────────────────┐
  │  DISKIO_START HashMap                        │
  │  key: dev+sector+nr_sector+rwbs              │
  │  value: timestamp+pid+bytes+comm             │
  └──────────────────────────────────────────────┘

The limit of 10240 entries is generous for most workloads. If the map fills up, insertions for new keys fail; the issue handler discards that error, so those requests simply produce no latency event. On extremely I/O-heavy systems, you might need to increase the limit.
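
The insert, lookup, and remove cycle can be modeled in ordinary userspace Rust, which makes it easy to experiment with before touching eBPF. This is a sketch only: StartMap is a hypothetical stand-in for DISKIO_START built on std's HashMap, it stores just the timestamp, and its capacity check is an approximation of what a full eBPF hash map does.

```rust
use std::collections::HashMap;

/// Same fields as the lab's DiskIoKey. `Hash` is added because std's
/// HashMap requires it; the eBPF map hashes the raw key bytes instead.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct DiskIoKey {
    pub sector: u64,
    pub dev: u32,
    pub nr_sector: u32,
    pub rwbs: [u8; 8],
}

/// Hypothetical userspace model of the DISKIO_START map.
pub struct StartMap {
    entries: HashMap<DiskIoKey, u64>, // key -> issue timestamp in ns
    max_entries: usize,
}

impl StartMap {
    pub fn new(max_entries: usize) -> Self {
        Self { entries: HashMap::new(), max_entries }
    }

    /// Issue side. Returns false when the map is full and the key is new,
    /// modeling the insert error that the issue handler ignores.
    pub fn on_issue(&mut self, key: DiskIoKey, now_ns: u64) -> bool {
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(&key) {
            return false;
        }
        self.entries.insert(key, now_ns);
        true
    }

    /// Completion side. Removes the entry and returns the latency, or
    /// None when no issue was recorded for this key.
    pub fn on_complete(&mut self, key: &DiskIoKey, now_ns: u64) -> Option<u64> {
        let issued_ns = self.entries.remove(key)?;
        Some(now_ns.saturating_sub(issued_ns))
    }

    /// Number of requests currently tracked as in flight.
    pub fn in_flight(&self) -> usize {
        self.entries.len()
    }
}
```

Creating the model with a capacity of 1 and issuing two different requests shows the failure mode directly: the second on_issue returns false, and its completion later yields None instead of a latency.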

Warning

A sector-only key is not safe: it collides across devices, and it can be ambiguous when requests are split, merged, or retried. A PID key is worse, because one process can have many I/O requests in flight at once. The best correlation key is the request pointer, when your probe has access to struct request *. The classic tracepoints used here expose only the formatted event fields, not that pointer (raw or BTF-enabled tracepoints and kprobes can see it), so this lab uses a compound key and keeps the limitation explicit.
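
A small experiment illustrates the cross-device collision. The sketch below, with made-up request values, records two in-flight requests that start at the same sector on different devices and counts how many start timestamps survive under each keying scheme.

```rust
use std::collections::HashMap;

/// Record two in-flight requests at the same sector on different devices.
/// Returns (entries kept with a sector-only key, entries kept with a
/// compound dev + sector + nr_sector key).
fn surviving_entries() -> (usize, usize) {
    // (dev, sector, nr_sector, issue timestamp in ns); values are illustrative.
    let requests = [
        (8u32 << 20, 2048u64, 8u32, 1_000u64),   // 8:0   (/dev/sda)
        (259u32 << 20, 2048u64, 8u32, 1_500u64), // 259:0 (/dev/nvme0n1)
    ];

    let mut by_sector: HashMap<u64, u64> = HashMap::new();
    let mut by_compound: HashMap<(u32, u64, u32), u64> = HashMap::new();

    for &(dev, sector, nr_sector, ts) in requests.iter() {
        // The second insert overwrites the first: same sector, same key.
        by_sector.insert(sector, ts);
        // The device number keeps the two requests apart.
        by_compound.insert((dev, sector, nr_sector), ts);
    }

    (by_sector.len(), by_compound.len())
}
```

With the sector-only key, the first request's timestamp is silently replaced, so its completion would be matched against the wrong start time and report a bogus latency.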

The issue handler

#[repr(C)]
pub struct BlockRqIssue {
    pub common_type: u16,
    pub common_flags: u8,
    pub common_preempt_count: u8,
    pub common_pid: i32,
    pub dev: u32,
    pub sector: u64,
    pub nr_sector: u32,
    pub bytes: u32,
    pub ioprio: u16,
    pub rwbs: [u8; 8],
    pub comm: [u8; 16],
}

#[tracepoint]
pub fn handle_block_rq_issue(ctx: TracePointContext) -> i32 {
    let req: BlockRqIssue = match unsafe { ctx.read_at(0) } {
        Ok(req) => req,
        Err(_) => return 0,
    };
    let timestamp = unsafe { bpf_ktime_get_ns() };
    let tgid = (bpf_get_current_pid_tgid() >> 32) as u32;

    let key = DiskIoKey {
        sector: req.sector,
        dev: req.dev,
        nr_sector: req.nr_sector,
        rwbs: req.rwbs,
    };
    let start = DiskIoStart {
        timestamp_ns: timestamp,
        pid: tgid,
        bytes: req.bytes,
        comm: req.comm,
    };

    // Store submitter metadata at issue time; completion may run elsewhere.
    DISKIO_START.insert(&key, &start, 0).ok();
    0
}

This handler does not emit any event to userspace. Its only job is to record the timestamp and submitter metadata. This is a common eBPF pattern: not every probe needs to produce output. Some probes exist only to capture state for later correlation.

Note

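The BlockRqIssue layout mirrors the tracepoint format on recent kernels, but that format is not stable across versions. The ioprio field, for example, is a recent addition; on kernels without it (including older kernels that otherwise meet the 5.8+ prerequisite) rwbs and comm sit at different offsets. Always verify the struct against /sys/kernel/tracing/events/block/block_rq_issue/format, or generate bindings from BTF/tracepoint metadata instead of scattering numeric offsets through the program.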
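
One way to automate that check is to parse the format file and compare the reported offsets with the struct. The following is a minimal sketch of such a parser using only the standard library. It assumes each field line has the usual "field:<declaration>; offset:<n>; size:<n>;" shape, and the Field type and function names are hypothetical.

```rust
/// One field entry from a tracepoint `format` file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub offset: usize,
    pub size: usize,
}

/// Parse a single line; returns None for lines that are not field entries.
fn parse_field_line(line: &str) -> Option<Field> {
    let mut decl: Option<&str> = None;
    let mut offset: Option<usize> = None;
    let mut size: Option<usize> = None;
    for part in line.split(';') {
        let part = part.trim();
        if let Some(v) = part.strip_prefix("field:") {
            decl = Some(v.trim());
        } else if let Some(v) = part.strip_prefix("offset:") {
            offset = v.trim().parse().ok();
        } else if let Some(v) = part.strip_prefix("size:") {
            size = v.trim().parse().ok();
        }
    }
    // The field name is the last token of the C declaration; drop any
    // array suffix such as "[8]".
    let name = decl?.split_whitespace().last()?.split('[').next()?.to_string();
    Some(Field { name, offset: offset?, size: size? })
}

/// Collect every field entry found in the text of a format file.
pub fn parse_format(text: &str) -> Vec<Field> {
    text.lines().filter_map(parse_field_line).collect()
}

/// Look up the byte offset of a named field, if present.
pub fn offset_of(fields: &[Field], name: &str) -> Option<usize> {
    fields.iter().find(|f| f.name == name).map(|f| f.offset)
}
```

A loader could read the format file with std::fs::read_to_string, call parse_format, and refuse to attach if offset_of(&fields, "rwbs") disagrees with the offset implied by the Rust struct.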

The completion handler

#[repr(C)]
pub struct BlockRqComplete {
    pub common_type: u16,
    pub common_flags: u8,
    pub common_preempt_count: u8,
    pub common_pid: i32,
    pub dev: u32,
    pub sector: u64,
    pub nr_sector: u32,
    pub error: i32,
    pub ioprio: u16,
    pub rwbs: [u8; 8],
}

#[tracepoint]
pub fn handle_block_rq_complete(ctx: TracePointContext) -> i32 {
    let req: BlockRqComplete = match unsafe { ctx.read_at(0) } {
        Ok(req) => req,
        Err(_) => return 0,
    };
    let key = DiskIoKey {
        sector: req.sector,
        dev: req.dev,
        nr_sector: req.nr_sector,
        rwbs: req.rwbs,
    };

    // Look up the issue metadata for this request.
    let start = match unsafe { DISKIO_START.get(&key) } {
        Some(start) => *start,
        None => return 0, // no matching issue event; skip
    };

    let now = unsafe { bpf_ktime_get_ns() };
    let latency_ns = now - start.timestamp_ns;

    // Clean up the map entry
    DISKIO_START.remove(&key).ok();

    let event = DiskIoEvent {
        pid: start.pid,
        dev: req.dev,
        sector: req.sector,
        bytes: start.bytes,
        rwbs: req.rwbs,
        latency_ns,
        timestamp_ns: now,
        comm: start.comm,
    };
    EVENTS.output(&ctx, &event, 0);
    0
}

The handler first checks whether a matching issue timestamp exists. If not, it returns early; this can happen if the map was full when the issue handler ran, or if tuxscope was started after the request was issued. The remove() call is important: without it, the map would leak entries for every completed request.

Userspace formatting

The userspace code decodes the dev field into major:minor format and converts latency from nanoseconds to milliseconds:

let major = (event.dev >> 20) & 0xFFF;
let minor = event.dev & 0xFFFFF;
let latency_ms = event.latency_ns as f64 / 1_000_000.0;
let rw = rwbs_to_string(&event.rwbs);

println!(
    "[DISKIO] dev={}:{} {} sector={} bytes={} latency={:.3}ms pid={} comm={}",
    major, minor, rw, event.sector, event.bytes, latency_ms, event.pid, comm_str
);
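
The snippet above relies on rwbs_to_string and comm_str, which are not shown. Both rwbs and comm are fixed-size, NUL-padded byte arrays, so a plausible implementation trims at the first NUL byte. The sketch below is a guess at what such helpers might look like, not tuxscope's actual code; format_event is a hypothetical wrapper that reproduces the output line.

```rust
/// Decode a fixed-size, NUL-padded kernel byte array into a String,
/// stopping at the first NUL byte. Invalid UTF-8 is replaced, not rejected.
fn nul_padded_to_string(bytes: &[u8]) -> String {
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    String::from_utf8_lossy(&bytes[..end]).into_owned()
}

/// Hypothetical helper matching the call in the snippet above.
fn rwbs_to_string(rwbs: &[u8; 8]) -> String {
    nul_padded_to_string(rwbs)
}

/// Hypothetical helper producing the value printed as `comm_str`.
fn comm_to_string(comm: &[u8; 16]) -> String {
    nul_padded_to_string(comm)
}

/// Build one output line from the individual event fields.
fn format_event(
    dev: u32,
    rwbs: &[u8; 8],
    sector: u64,
    bytes: u32,
    latency_ns: u64,
    pid: u32,
    comm: &[u8; 16],
) -> String {
    let major = (dev >> 20) & 0xFFF;
    let minor = dev & 0xFFFFF;
    let latency_ms = latency_ns as f64 / 1_000_000.0;
    format!(
        "[DISKIO] dev={}:{} {} sector={} bytes={} latency={:.3}ms pid={} comm={}",
        major,
        minor,
        rwbs_to_string(rwbs),
        sector,
        bytes,
        latency_ms,
        pid,
        comm_to_string(comm)
    )
}
```

Using from_utf8_lossy keeps the formatter from failing on process names that contain bytes outside UTF-8.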

Running it

Start tracing disk I/O:

sudo tuxscope diskio

In another terminal, generate some write I/O:

dd if=/dev/zero of=/tmp/testfile bs=1M count=100 oflag=direct

The oflag=direct flag bypasses the page cache and forces actual device I/O, which means every write generates block layer events.

Example output:

[DISKIO] dev=259:0 W sector=12345678 bytes=4096 latency=0.042ms pid=7200 comm=dd
[DISKIO] dev=259:0 W sector=12345686 bytes=4096 latency=0.038ms pid=7200 comm=dd
[DISKIO] dev=259:0 W sector=12345694 bytes=4096 latency=0.051ms pid=7200 comm=dd
[DISKIO] dev=259:0 W sector=12345702 bytes=131072 latency=0.127ms pid=7200 comm=dd
[DISKIO] dev=259:0 W sector=12345958 bytes=131072 latency=0.134ms pid=7200 comm=dd

Observations:

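Latency: Small NVMe writes typically complete in under 0.1ms, and larger requests take longer (the 128KB writes above take about 0.13ms). SATA SSDs are typically 0.1-1ms. HDDs are typically 2-10ms.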
Sector increments: Sequential writes show steadily increasing sector numbers.
Merged and split requests: The block layer may merge adjacent small writes into larger requests, and it splits writes that exceed the device’s per-request size limit. Either can produce requests like the 131072-byte (128KB) ones above.
Device: 259:0 is the first NVMe device on most systems.

Read latency

Generate read I/O by reading back the file (drop caches first to avoid page cache hits):

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/tmp/testfile of=/dev/null bs=1M

Reads and writes typically have different latency profiles. On SSDs, reads are usually faster. On HDDs, the difference depends on whether the read is sequential or random.

JSON output

sudo tuxscope diskio --format json

{"pid":7200,"dev":"259:0","sector":12345678,"bytes":4096,"rwbs":"W","latency_ns":42000,"timestamp_ns":450123456789,"comm":"dd"}
{"pid":7200,"dev":"259:0","sector":12345686,"bytes":4096,"rwbs":"W","latency_ns":38000,"timestamp_ns":450123457120,"comm":"dd"}
{"pid":7200,"dev":"259:0","sector":12345694,"bytes":4096,"rwbs":"W","latency_ns":51000,"timestamp_ns":450123457890,"comm":"dd"}

Filtering

Filter by the submitting PID captured at block_rq_issue:

sudo tuxscope diskio --pid 7200

Note

The PID is still best-effort. Direct synchronous I/O usually attributes cleanly to the submitting process. Buffered writeback, async I/O, and filesystem background work may be submitted by kernel workers or helper threads, so the PID can describe the kernel path that issued the block request rather than the original application that dirtied the page.

The HashMap pattern

The issue-complete correlation pattern you learned in this lab is one of the most powerful techniques in eBPF observability. It applies to any situation where you need to measure the time between two events:

Use case	Start probe	End probe	Map key
Disk I/O latency	`block_rq_issue`	`block_rq_complete`	request pointer, or dev + sector + size + op
TCP connection latency	`tcp_v4_connect`	`tcp_rcv_state_process`	socket pointer
DNS resolution time	`udp_sendmsg`	`udp_recvmsg`	(sport, dport)
Syscall duration	`sys_enter_*`	`sys_exit_*`	thread ID (the full pid_tgid value)
Lock contention	`lock_acquire`	`lock_acquired`	lock address + thread ID

The pattern is always the same:

On the “start” event, store a timestamp in a HashMap keyed by something that uniquely identifies the operation
On the “end” event, look up the timestamp, compute the delta, clean up the entry, and emit the result

Choosing the right key is the design decision. It must uniquely identify an in-flight operation. For block I/O, a struct request * key is ideal when you attach where that pointer is available. With tracepoints, use the strongest compound key available and document the edge cases. For other use cases, pick a key that provides the same uniqueness guarantee.

Exercises

Build a latency histogram. Pipe JSON output into a script that buckets latencies into ranges (0-0.1ms, 0.1-0.5ms, 0.5-1ms, 1-5ms, 5ms+) and prints a histogram at the end. Run dd with direct I/O and then with buffered I/O. How do the distributions differ?
Compare I/O schedulers. Switch your device’s scheduler between none, mq-deadline, and bfq and measure latency under the same workload:
```
echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
```
Which scheduler gives the lowest latency? Which gives the most consistent latency?
Random vs sequential I/O. Use fio to generate random 4K reads and sequential 1M reads on the same device. Compare the latency distributions. For HDDs, the difference is dramatic. For NVMe, it is much smaller. Why?
Track I/O by filesystem. Modify the JSON pipeline to correlate device major:minor numbers with mount points (from lsblk -o NAME,MAJ:MIN,MOUNTPOINT). Build a report showing average latency per filesystem.

What’s next

You now have the tools to observe processes, memory, and disk I/O, the three foundational resources that every program on Linux consumes. The HashMap pattern from this lab is reusable across nearly any latency-measurement scenario. Future labs will build on this foundation to trace scheduler latency, security-relevant events, and container boundaries.

Check the Tuxscope repository for the latest labs and updates.