Part VIII — Drivers

§ 8.1 – 8.6 Linux Device Drivers & Hardware Interface

PCIe topology + BAR + MSI-X (§8.1) · Network driver TX/RX rings + NAPI (§8.2) · SR-IOV PF/VF + flow steering (§8.3) · DDIO cache-direct DMA (§8.4) · Character devices + UIO (§8.5) · Block layer + blk-mq + bio (§8.6)

1. Overview

Linux device drivers sit between the abstract kernel interfaces (VFS file operations, socket ops, block layer) and physical hardware. Most hardware drivers follow the same basic pattern: register with a bus (PCIe, USB, platform), get called back via probe() when a matching device appears, map hardware registers via memory-mapped I/O (BARs on PCIe), set up DMA ring buffers, and wire up interrupts (one MSI-X vector per queue). Character devices expose a file_operations vtable directly to user space; block devices add a request queue and an I/O scheduler; network drivers plug into NAPI.

2. § 8.1 — PCI / PCIe Device Driver

PCIe Topology — Root Complex → Switch → Endpoints

PCIe is a point-to-point serial interconnect organized as a tree. The Root Complex (inside the CPU package) is the root of the tree; it originates configuration and MMIO transactions on behalf of the CPU and routes device DMA into system memory. Switches route TLPs (Transaction Layer Packets): memory requests by address, configuration requests and completions by BDF (Bus:Device.Function). Each Endpoint (NIC, NVMe, GPU) exposes up to six BARs (Base Address Registers) in config space — each BAR describes a physical MMIO window that the kernel maps into virtual address space.

BAR Layout — MMIO Mapping

During its probe() callback, the driver calls pci_request_regions() to reserve the BARs, then pci_ioremap_bar() (or ioremap() directly) to create a kernel virtual mapping. All NIC register accesses thereafter are readl() / writel() on that virtual pointer — a writel() becomes a posted PCIe write TLP, while a readl() becomes a non-posted read TLP that stalls the CPU until the completion returns.

| Call | What it does |
| --- | --- |
| pci_enable_device(pdev) | Wake the device and enable memory/I/O decoding; pair with pci_set_master() so it can issue DMA |
| pci_request_regions(pdev, name) | Reserve all BARs so no other driver can map them |
| pci_ioremap_bar(pdev, bar_num) | ioremap the given BAR into kernel virtual space → returns void * |
| pci_enable_msix_range(pdev, entries, min, max) | Allocate MSI-X vectors; returns the actual count allocated |
| request_irq(vector, handler, 0, name, queue) | Hook a Linux IRQ handler to one MSI-X vector |
| dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)) | Tell the kernel the device supports 64-bit DMA addresses |
| dma_map_single(dev, ptr, size, dir) | Map an existing kernel buffer for streaming DMA; returns the dma_addr_t the device uses |

MSI-X — Per-Queue Interrupt Vectors

Legacy INTx uses a shared wire; under high PPS every queue fires the same IRQ and the handler must poll all queues. MSI-X gives each queue its own vector: the NIC DMA-writes a small message to a CPU LAPIC memory address, triggering a specific vector directly. The kernel can pin each vector to a different CPU core (via irq_set_affinity_hint()), eliminating cross-CPU cache bouncing.
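
As a user-space illustration of vector pinning, the sketch below writes a CPU mask to /proc/irq/<n>/smp_affinity, the procfs counterpart of what a driver requests with irq_set_affinity_hint(). The IRQ number and CPU are placeholders; look up the real per-queue vectors in /proc/interrupts (requires root).

```c
/* Hedged sketch: pin one MSI-X vector to one CPU from user space.
 * IRQ 125 and CPU 3 are hypothetical placeholders. Requires root. */
#include <stdio.h>

int main(void)
{
    int irq = 125;            /* hypothetical MSI-X vector for queue 3 */
    int cpu = 3;              /* pin it to CPU 3                       */
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fprintf(f, "%x\n", 1u << cpu);   /* hex bitmask: bit N selects CPU N */
    fclose(f);

    printf("IRQ %d pinned to CPU %d\n", irq, cpu);
    return 0;
}
```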

Minimal C Demo — PCI BAR Register Access

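Below is a minimal user-space sketch of the same idea: it mmaps a device's BAR0 through the sysfs resourceN file and performs one register read and one doorbell-style write. The BDF, register offsets, and mapping size are placeholders rather than a real device layout; running it needs root and a device whose BAR0 is safe to touch.

```c
/* Hedged sketch: map BAR0 of a PCIe device through sysfs and poke registers.
 * BDF and offsets are hypothetical -- substitute a real, safe device. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_PATH "/sys/bus/pci/devices/0000:03:00.0/resource0" /* hypothetical BDF */
#define REG_STATUS   0x0008   /* hypothetical status register offset  */
#define REG_DOORBELL 0x0818   /* hypothetical TX tail/doorbell offset */

int main(void)
{
    int fd = open(BAR0_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open resource0"); return 1; }

    /* Map the BAR; 4096 stands in for the real BAR size (see the resource file). */
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap BAR0"); return 1; }

    /* readl() equivalent: a load from MMIO -> non-posted PCIe read TLP. */
    uint32_t status = bar[REG_STATUS / 4];
    printf("status = 0x%08x\n", status);

    /* writel() equivalent: a store to MMIO -> posted PCIe write TLP.
     * Writing a new tail index here is the "doorbell" that wakes the DMA engine. */
    bar[REG_DOORBELL / 4] = 42;

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}
```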

3. § 8.2 — Network Driver Model

TX Descriptor Ring — Producer/Consumer

The NIC TX ring is a circular array of fixed-size descriptors in DMA-coherent memory. The driver (producer) fills descriptors with DMA addresses and writes the new tail to a BAR register (the doorbell). The NIC DMA engine (consumer) reads from the head, fetches the packet data over PCIe, transmits on the wire, and sets the DD (Descriptor Done) bit. The driver reclaims completed descriptors on the next TX completion interrupt, freeing the sk_buff and unmapping the DMA address.

RX Descriptor Ring — Pre-posted Buffers

The RX ring works in reverse: the driver pre-fills descriptors with empty sk_buff DMA addresses and advances the tail. When a packet arrives, the NIC DMA-writes directly into the pre-posted buffer, sets DD, and fires the MSI-X vector. The napi_poll() handler walks the ring, consumes completed descriptors, builds the sk_buff, calls netif_receive_skb(), then immediately refills the ring slot and advances the doorbell.
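
A user-space sketch of the RX side, assuming a simplified descriptor with just a buffer address, length, and DD bit: a fake NIC completes pre-posted slots, and a napi_poll()-style loop consumes and refills them. Nothing here touches real hardware.

```c
/* Hedged sketch: RX ring with pre-posted buffers + poll/refill loop.
 * Buffer addresses and the DD bit are simulated. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 4

struct rx_desc {
    uint64_t buf_addr;        /* pre-posted empty buffer (DMA address) */
    uint16_t len;             /* filled in by the "NIC" on completion  */
    uint8_t  dd;              /* Descriptor Done                        */
};

static struct rx_desc ring[RING_SIZE];
static unsigned next_to_clean;     /* where the poll loop resumes */

/* Simulated NIC: DMA a received frame into a pre-posted buffer. */
static void nic_receive(unsigned slot, uint16_t len)
{
    ring[slot].len = len;
    ring[slot].dd  = 1;
}

/* napi_poll()-style loop: consume completed slots, then refill them. */
static void rx_poll(void)
{
    while (ring[next_to_clean].dd) {
        struct rx_desc *d = &ring[next_to_clean];
        printf("poll: slot %u -> %u byte packet up the stack\n",
               next_to_clean, (unsigned)d->len);
        d->dd = 0;                                   /* refill: re-arm the slot */
        d->buf_addr = 0x9000 + next_to_clean * 0x800;
        next_to_clean = (next_to_clean + 1) % RING_SIZE;
    }
    /* a real driver would now write the new tail to the RX doorbell register */
}

int main(void)
{
    for (unsigned i = 0; i < RING_SIZE; i++)         /* pre-post empty buffers */
        ring[i].buf_addr = 0x9000 + i * 0x800;
    nic_receive(0, 64);
    nic_receive(1, 1500);
    rx_poll();
    return 0;
}
```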

TX Path — End-to-End

A transmitted packet travels dev_queue_xmit() → qdisc → ndo_start_xmit() → TX descriptor fill → doorbell write → NIC DMA → wire → completion interrupt → descriptor reclaim. The driver hooks into this path, and into the rest of the device lifecycle, through struct net_device_ops:

| Callback | When called | What it does |
| --- | --- | --- |
| ndo_open(dev) | ifconfig up / ip link set up | Enable the NIC, allocate TX/RX rings, request MSI-X IRQs, start NAPI |
| ndo_stop(dev) | ifconfig down | Free rings, release IRQs, disable DMA, quiesce NAPI |
| ndo_start_xmit(skb, dev) | dev_queue_xmit() | DMA-map the skb, fill a TX descriptor, ring the doorbell; returns NETDEV_TX_OK |
| ndo_get_stats64(dev, stats) | ip -s link / ethtool -S | Copy TX/RX counters from per-CPU ring stats |
| ndo_set_rx_mode(dev) | promisc / multicast list change | Update the NIC MAC filter table (unicast/multicast) |
| ndo_tx_timeout(dev, txqueue) | TX watchdog fires | Reset the hung TX queue, restart the NIC |

Minimal C Demo — TX Descriptor Ring

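A user-space model of the producer/consumer protocol, assuming a simplified descriptor (buffer address, length, command bits, DD) loosely modeled on Intel's legacy TX descriptor; the NIC side is a function call rather than real hardware.

```c
/* Hedged sketch: TX descriptor ring -- driver fills, "NIC" sends, driver reclaims. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8           /* power of two so index masking works */

struct tx_desc {
    uint64_t buf_addr;        /* DMA address of the packet buffer   */
    uint16_t len;             /* bytes to transmit                  */
    uint8_t  cmd;             /* command bits (EOP/IFCS/RS style)   */
    uint8_t  dd;              /* Descriptor Done, set by the "NIC"  */
};

static struct tx_desc ring[RING_SIZE];
static unsigned tail;         /* driver writes here (producer) */
static unsigned head;         /* NIC consumes from here        */

/* Driver side: fill desc[tail], advance tail, "ring the doorbell". */
static void xmit(uint64_t dma_addr, uint16_t len)
{
    struct tx_desc *d = &ring[tail & (RING_SIZE - 1)];
    d->buf_addr = dma_addr;
    d->len = len;
    d->cmd = 0x0b;            /* illustrative EOP|IFCS|RS-style bits */
    d->dd  = 0;
    tail++;
    printf("driver: posted desc %u (len=%u), doorbell tail=%u\n",
           (tail - 1) & (RING_SIZE - 1), (unsigned)len, tail);
}

/* Simulated NIC: consume from head, pretend to DMA + transmit, set DD. */
static void nic_run(void)
{
    while (head != tail) {
        struct tx_desc *d = &ring[head & (RING_SIZE - 1)];
        printf("nic:    sent %u bytes from 0x%llx\n",
               (unsigned)d->len, (unsigned long long)d->buf_addr);
        d->dd = 1;
        head++;
    }
}

/* Driver side again: TX completion -- reclaim every descriptor with DD set. */
static void reclaim(void)
{
    for (unsigned i = 0; i < RING_SIZE; i++)
        if (ring[i].dd) {
            printf("driver: reclaimed desc %u (unmap DMA, free skb)\n", i);
            ring[i].dd = 0;
        }
}

int main(void)
{
    xmit(0x1000, 64);
    xmit(0x2000, 1500);
    nic_run();      /* in real life the NIC works asynchronously       */
    reclaim();      /* in real life this runs from the completion IRQ  */
    return 0;
}
```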

4. § 8.3 — SR-IOV (Single Root I/O Virtualization)

Background — Why SR-IOV?

Without SR-IOV, VMs share a NIC through software (virtio + vhost-net), adding a vSwitch copy in the hypervisor. SR-IOV exposes multiple lightweight Virtual Functions (VFs) from a single physical card. Each VF looks like a real PCIe device with its own config space, BAR, queues, and MSI-X vectors. A VM or container binds a VF directly (via vfio-pci), gets near-native throughput, and the NIC hardware enforces isolation.

Flow Classification — MAC/VLAN Steering per VF

The PF driver programs the NIC's hardware filter table via ndo_set_vf_mac() and ndo_set_vf_vlan(). On RX, the NIC matches the destination MAC + VLAN tag against the filter table and DMA-writes the packet directly into the matching VF's RX ring — the PF kernel driver is never involved.

Minimal C Demo — SR-IOV MAC Filter Table

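A user-space sketch of the classification step, assuming an invented filter-table layout and VF numbering: match destination MAC + VLAN, steer to the owning VF, otherwise fall back to the PF/default queue. On real hardware this table is programmed by the PF driver through ndo_set_vf_mac() / ndo_set_vf_vlan().

```c
/* Hedged sketch: MAC/VLAN filter table steering received frames to a VF. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct vf_filter {
    uint8_t  mac[6];          /* destination MAC programmed for the VF */
    uint16_t vlan;            /* VLAN tag programmed for the VF        */
    int      vf;              /* destination VF index                  */
};

static const struct vf_filter table[] = {
    { {0x02,0x00,0x00,0x00,0x00,0x01}, 100, 0 },
    { {0x02,0x00,0x00,0x00,0x00,0x02}, 200, 1 },
};

/* Simulated RX classification: match dst MAC + VLAN, return VF (or -1 = PF). */
static int classify(const uint8_t *dst_mac, uint16_t vlan)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (!memcmp(table[i].mac, dst_mac, 6) && table[i].vlan == vlan)
            return table[i].vf;
    return -1;                /* no match: default queue / PF */
}

int main(void)
{
    uint8_t pkt1[6] = {0x02,0,0,0,0,0x02};   /* matches VF 1's filter */
    uint8_t pkt2[6] = {0x02,0,0,0,0,0x09};   /* matches nothing       */

    printf("frame to 02:..:02 vlan 200 -> VF %d\n", classify(pkt1, 200));
    printf("frame to 02:..:09 vlan 300 -> VF %d (PF/default)\n", classify(pkt2, 300));
    return 0;
}
```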

5. § 8.4 — DDIO (Intel Data Direct I/O)

Background

Normally, a NIC DMA write lands in DRAM (~60 ns); only when the CPU reads the data does it get promoted to the Last Level Cache (~10 ns). For packet processing at 40+ Gbps, that DRAM round-trip is the bottleneck. DDIO (Intel) and the equivalent AMD/ARM features instruct the PCIe root complex to route incoming DMA writes directly into the LLC. The CPU finds the packet data already warm in cache, cutting read latency from ~70 ns to ~10 ns.

(Diagram: NIC DMA write path, Without DDIO vs. With DDIO)

| Factor | Without DDIO | With DDIO |
| --- | --- | --- |
| DMA write target | DRAM | LLC (shared L3) |
| First CPU read latency | ~70 ns (DRAM + LLC fill) | ~10 ns (LLC hit) |
| DPDK mempool requirement | DRAM-backed; cache cold | Huge pages still needed, but data warm |
| CPU cache pressure | Low (NIC fills DRAM) | Higher — NIC evicts LLC lines on write |
| Throughput impact | Bottleneck at ~40 Gbps+ | Enables 100 Gbps+ line rate |
| Enable/disable | BIOS setting or PCIe config | Enabled by default on Xeon E5+ |

6. § 8.5 — Character Devices

Registration Flow — from Module Init to /dev Node

A character device is the simplest driver interface: user space opens a /dev node and calls file operations directly. The kernel identifies the driver by major:minor number, looks up the struct cdev, and dispatches to its file_operations vtable.

| file_operations callback | Triggered by | Typical use |
| --- | --- | --- |
| .open | open("/dev/mydev") | Allocate per-file state; verify permissions |
| .read | read(fd, buf, n) | Copy device data to user; update position |
| .write | write(fd, buf, n) | Accept commands from user; write to device register |
| .unlocked_ioctl | ioctl(fd, cmd, arg) | Device control; FIONREAD, custom commands |
| .mmap | mmap(NULL, sz, ..., fd, 0) | Map device MMIO or ring buffer into user VA (e.g. UIO BAR0) |
| .poll | select/poll/epoll on fd | Signal readability/writability for event-driven drivers |
| .release | close(fd) | Free per-file state; release hardware resources |

UIO — Character Device for DPDK

igb_uio.ko is a minimal kernel module that registers a PCIe driver, then exposes the NIC BARs as mappable regions via a UIO character device (/dev/uio0). DPDK's EAL opens this device, calls mmap() to get direct register access, and issues DMA from user space — no kernel network stack involvement.
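
A minimal user-space sketch of that flow, assuming a device already bound to a UIO driver: open /dev/uio0, mmap map 0 (BAR0) using the UIO convention that map N lives at offset N × page size, read a register, then block in read() until the next interrupt bumps the event counter. The device path, register offset, and mapping size are placeholders.

```c
/* Hedged sketch: open a UIO device, map BAR0, read a register, wait for an IRQ.
 * /dev/uio0 and the assumption that map 0 is BAR0 are placeholders; the real
 * map size is published in /sys/class/uio/uio0/maps/map0/size. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    long pg = sysconf(_SC_PAGESIZE);
    /* UIO ABI: map N of /dev/uioX is selected by mmap offset N * page size. */
    volatile uint32_t *bar0 = mmap(NULL, (size_t)pg, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0 * pg);
    if (bar0 == MAP_FAILED) { perror("mmap map0"); return 1; }

    printf("reg[0x0] = 0x%08x\n", bar0[0]);   /* direct MMIO read, no syscall */

    /* Interrupt wait: read() returns a 32-bit event count when the IRQ fires. */
    uint32_t events;
    if (read(fd, &events, sizeof(events)) == (ssize_t)sizeof(events))
        printf("interrupt count = %u\n", events);

    munmap((void *)bar0, (size_t)pg);
    close(fd);
    return 0;
}
```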

Minimal C Demo — file_operations Dispatch

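A user-space model of the vtable dispatch, with a simplified stand-in for struct file_operations rather than the kernel's real definition: the "VFS" builds a struct file pointing at the driver's ops table, and every syscall becomes an indirect call through it.

```c
/* Hedged sketch: file_operations-style vtable dispatch, entirely in user space. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

struct file;                                   /* forward declaration, like the kernel's */
struct file_operations {
    int     (*open)(struct file *f);
    ssize_t (*read)(struct file *f, char *buf, size_t n);
    ssize_t (*write)(struct file *f, const char *buf, size_t n);
    int     (*release)(struct file *f);
};

struct file { const struct file_operations *f_op; };

/* --- a toy "device" that echoes a fixed message -------------------------- */
static int dev_open(struct file *f)    { (void)f; puts("dev: open");  return 0; }
static int dev_release(struct file *f) { (void)f; puts("dev: close"); return 0; }

static ssize_t dev_read(struct file *f, char *buf, size_t n)
{
    const char msg[] = "hello from mydev\n";
    size_t len = n < sizeof(msg) ? n : sizeof(msg);
    memcpy(buf, msg, len);                     /* the kernel would copy_to_user() */
    (void)f;
    return (ssize_t)len;
}

static ssize_t dev_write(struct file *f, const char *buf, size_t n)
{
    printf("dev: got %zu bytes: %.*s", n, (int)n, buf);
    (void)f;
    return (ssize_t)n;
}

static const struct file_operations mydev_fops = {
    .open = dev_open, .read = dev_read, .write = dev_write, .release = dev_release,
};

int main(void)
{
    struct file f = { .f_op = &mydev_fops };   /* what the VFS builds at open() */
    char buf[64];

    f.f_op->open(&f);                          /* open(fd)  -> .open     */
    ssize_t n = f.f_op->read(&f, buf, sizeof(buf));
    fwrite(buf, 1, (size_t)n, stdout);         /* read(fd)  -> .read     */
    f.f_op->write(&f, "ping\n", 5);            /* write(fd) -> .write    */
    f.f_op->release(&f);                       /* close(fd) -> .release  */
    return 0;
}
```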

7. § 8.6 — Block Devices

Block Layer Stack — VFS to Hardware

Unlike character devices, block devices go through a multi-layer stack that enables read-ahead, write-back caching (page cache), and I/O scheduling. The page cache absorbs reads and batches writes; the I/O scheduler merges adjacent requests to maximize throughput; and blk-mq (multi-queue block layer) dispatches requests to per-CPU hardware queues, matching the parallelism of modern NVMe SSDs.

blk-mq — Multi-Queue Submission

The classic single-queue block layer was a bottleneck: one lock, one queue, one CPU dispatching all I/O. blk-mq mirrors NVMe's own architecture: per-CPU software queues batch local requests, then flush to hardware dispatch queues (typically one per CPU) that map 1:1 to NVMe submission queues. There is no global lock in the hot path.

struct bio — Scatter-Gather I/O Descriptor

A bio is the fundamental unit of block I/O. It describes where on disk (sector + size) and what memory pages to read/write (a bio_vec[] scatter-gather array). A single submit_bio() can span multiple non-contiguous physical pages — the block driver maps them to DMA and issues one NVMe command.

| Struct / Call | Purpose |
| --- | --- |
| struct gendisk | Represents a physical disk; alloc_disk() + add_disk() registers it |
| struct request_queue | Per-disk queue; holds I/O scheduler, blk-mq maps, plug/unplug |
| struct bio | One logical I/O: sector range + scatter-gather page list |
| struct bio_vec | One scatter-gather segment: page + offset + len |
| submit_bio(bio) | Entry point: caller submits a bio into the block layer |
| blk_mq_ops.queue_rq | Driver callback: receives a request, issues it to hardware (NVMe doorbell) |
| bio_for_each_segment(bv, bio, iter) | Iterate all bio_vec segments; driver maps each to DMA |

Minimal C Demo — bio Scatter-Gather + blk-mq

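A user-space model with simplified stand-ins for struct bio and struct bio_vec (not the kernel's real layouts): one bio spanning two non-contiguous pages is handed to a queue_rq()-style callback that walks the segments the way bio_for_each_segment() would and pretends to build one NVMe command.

```c
/* Hedged sketch: a bio with a bio_vec[] scatter-gather list + queue_rq dispatch. */
#include <stdint.h>
#include <stdio.h>

struct bio_vec {
    void     *page;           /* kernel: struct page *   */
    uint32_t  len;            /* bytes in this segment   */
    uint32_t  offset;         /* offset within the page  */
};

struct bio {
    uint64_t        sector;   /* starting LBA (512-byte sectors) */
    uint32_t        size;     /* total bytes                     */
    unsigned        vcnt;     /* number of segments              */
    struct bio_vec  vec[4];
};

/* blk_mq_ops.queue_rq stand-in: "DMA-map" each segment, build one command. */
static int queue_rq(const struct bio *bio)
{
    printf("queue_rq: LBA %llu, %u bytes, %u segment(s)\n",
           (unsigned long long)bio->sector, bio->size, bio->vcnt);
    for (unsigned i = 0; i < bio->vcnt; i++)      /* like bio_for_each_segment() */
        printf("  seg %u: page %p off %u len %u -> PRP/SGL entry\n",
               i, bio->vec[i].page, bio->vec[i].offset, bio->vec[i].len);
    printf("  -> ring NVMe SQ doorbell\n");
    return 0;
}

int main(void)
{
    static char page_a[4096], page_b[4096];       /* two non-contiguous pages */
    struct bio bio = {
        .sector = 2048, .size = 8192, .vcnt = 2,
        .vec = { { page_a, 4096, 0 }, { page_b, 4096, 0 } },
    };
    /* submit_bio() equivalent: in blk-mq this lands on this CPU's software
     * queue, then flushes to a hardware queue whose driver callback is
     * queue_rq(). */
    queue_rq(&bio);
    return 0;
}
```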

8. Kernel Source Pointers

| File / Symbol | What it contains |
| --- | --- |
| drivers/pci/pci.c — pci_enable_device() | Device enable and power management; pci_set_master() (bus mastering) lives here too |
| include/linux/pci.h — struct pci_driver | probe/remove callbacks, id_table |
| drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | Full TX/RX ring, NAPI, MSI-X setup for the ixgbe NIC |
| drivers/net/ethernet/intel/i40e/i40e_main.c | i40e (X710) driver — probe, SR-IOV, VF management |
| include/linux/netdevice.h — struct net_device_ops | All ndo_* callback declarations |
| drivers/vfio/pci/vfio_pci.c | VFIO PCI driver — maps BARs to user space for DPDK |
| drivers/uio/uio.c, drivers/uio/uio_pci_generic.c | UIO framework and generic PCI UIO driver; igb_uio is the out-of-tree DPDK counterpart |
| fs/char_dev.c — cdev_add(), cdev_get() | Char device registration and major:minor lookup |
| block/blk-mq.c — blk_mq_submit_bio() | blk-mq hot path: sw_queue → hw_queue dispatch |
| include/linux/bio.h (+ blk_types.h, bvec.h) — struct bio, bio_vec | bio / scatter-gather data structures and iteration helpers |
| block/bio.c — bio_alloc(); block/blk-core.c — submit_bio() | bio allocation and the submission entry point |
| drivers/nvme/host/pci.c — nvme_queue_rq() | NVMe blk-mq driver: maps a bio's segments to an NVMe SQ command |

9. Interview Prep

Q: What is a BAR and how does a kernel driver access NIC registers?
A: A BAR (Base Address Register) in PCIe config space describes a physical MMIO window. The driver calls pci_ioremap_bar() to map it into kernel virtual address space, then uses readl()/writel() on that mapping — a writel() generates a posted PCIe write TLP, a readl() a non-posted read TLP that waits for the completion.

Q: What is MSI-X and why is it better than legacy INTx for multi-queue NICs?
A: MSI-X allocates one interrupt vector per queue. The NIC DMA-writes a small message to a CPU LAPIC address (no shared IRQ wire). Each vector can be pinned to a different CPU core, so queue 0's IRQ runs on CPU 0, queue 1's on CPU 1 — no lock contention, no cross-CPU cache bouncing.

Q: Walk through a TX descriptor ring — what does the driver write and what does the NIC read?
A: The driver fills desc[tail]: buf = DMA addr, len, cmd = RS+EOP+IFCS, dd = 0; increments tail; writes the new tail to the TDT BAR register (doorbell). The NIC reads desc[head], DMA-reads the packet bytes, sends them on the wire, sets dd = 1, and increments head. The driver's next TX completion interrupt walks the ring looking for dd = 1 to free the skb and unmap the DMA buffer.

Q: What is SR-IOV and how does a VF differ from a PF?
A: SR-IOV creates multiple Virtual Functions from one Physical Function. The PF has full config space access and runs in the host kernel; VFs have limited config space (no BAR resize), their own TX/RX queues and MSI-X vectors, and bind to VM guest drivers via vfio-pci. The NIC hardware enforces isolation by per-VF MAC/VLAN filter steering.

Q: What is DDIO and how does it reduce latency for DPDK?
A: DDIO (Intel Data Direct I/O) routes PCIe DMA writes directly into the LLC instead of DRAM. Without DDIO: NIC → DRAM (~60 ns) → LLC fill → CPU ≈ 70 ns. With DDIO: NIC → LLC (~10 ns) → CPU. DPDK packet processing starts with data already cache-warm, which is critical for 100 Gbps line rate.

Q: How does igb_uio (UIO) work and why does DPDK use it?
A: igb_uio.ko registers as a PCIe driver and exposes a /dev/uio0 character device whose struct uio_info carries the BAR physical addresses. DPDK's EAL opens the device and mmaps it (offset 0 → BAR0) to get a user-space pointer to BAR0 MMIO — enabling direct register writes and doorbell rings without syscalls. Interrupts arrive via read() on the fd (the event counter changes).

Q: What is struct bio and how does blk-mq use per-CPU queues?
A: A bio describes one logical block I/O: bi_sector (start LBA), bi_size, and a bio_vec[] scatter-gather array of physical pages. submit_bio() enters the block layer: blk-mq enqueues it on the calling CPU's software queue, then flushes to a hardware queue that maps to one NVMe submission queue — no global lock, full SSD parallelism.