§ 8.1 – 8.6 Linux Device Drivers & Hardware Interface
PCIe topology + BAR + MSI-X (§8.1) · Network driver TX/RX rings + NAPI (§8.2) · SR-IOV PF/VF + flow steering (§8.3) · DDIO cache-direct DMA (§8.4) · Character devices + UIO (§8.5) · Block layer + blk-mq + bio (§8.6)
1. Overview
Linux device drivers sit between the abstract kernel interfaces (VFS file operations, socket ops, block layer) and physical hardware. Every driver follows the same pattern: register with a bus (PCIe, USB, platform), get called by the bus via probe(), map hardware registers via BAR memory-mapped I/O, set up DMA ring buffers, and wire interrupts (MSI-X per queue). Character devices expose a file_operations vtable directly to user space; block devices add a request queue and I/O scheduler; network drivers plug into NAPI.
2. § 8.1 — PCI / PCIe Device Driver
PCIe Topology — Root Complex → Switch → Endpoints
PCIe is a point-to-point serial interconnect organized as a tree. The Root Complex (inside the CPU package) is the root of the tree; it owns the PCIe config space address space and generates MMIO transactions. Switches route TLPs (Transaction Layer Packets) by BDF (Bus:Device.Function). Each Endpoint (NIC, NVMe, GPU) exposes up to 6 BARs (Base Address Registers) in config space — each BAR describes a physical MMIO window that the kernel maps into virtual address space.
BAR Layout — MMIO Mapping
During pci_probe(), the driver calls pci_request_regions() to reserve the BARs, then pci_ioremap_bar() (or ioremap() directly) to create a kernel virtual mapping. All NIC register accesses thereafter are readl() / writel() to that virtual pointer — readl() emits a non-posted PCIe memory-read TLP (the CPU waits for the completion), while writel() emits a posted memory-write TLP (fire-and-forget).
| Call | What it does |
|---|---|
| pci_enable_device(pdev) | Enable the device (power, MMIO/IO decoding); pair with pci_set_master() so it can issue DMA |
| pci_request_regions(pdev, name) | Reserve all BARs so no other driver can map them |
| pci_ioremap_bar(pdev, bar_num) | ioremap the given BAR into kernel virtual space → returns a void __iomem * |
| pci_enable_msix_range(pdev, entries, min, max) | Allocate MSI-X vectors; returns the actual count allocated |
| request_irq(vector, handler, 0, name, queue) | Hook a Linux IRQ handler to one MSI-X vector |
| dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)) | Tell the kernel the device supports 64-bit DMA addresses |
| dma_map_single(dev, ptr, size, dir) | Map (and pin) a single buffer, returning its DMA address |
MSI-X — Per-Queue Interrupt Vectors
Legacy INTx uses a shared wire; under high PPS every queue fires the same IRQ and the handler must poll all queues. MSI-X gives each queue its own vector: the NIC DMA-writes a small message to a CPU LAPIC memory address, triggering a specific vector directly. The kernel can pin each vector to a different CPU core (via irq_set_affinity_hint()), eliminating cross-CPU cache bouncing.
Minimal C Demo — PCI BAR Register Access
3. § 8.2 — Network Driver Model
TX Descriptor Ring — Producer/Consumer
The NIC TX ring is a circular array of fixed-size descriptors in DMA-coherent memory. The driver (producer) fills descriptors with DMA addresses and writes the new tail to a BAR register (the doorbell). The NIC DMA engine (consumer) reads from the head, fetches the packet data over PCIe, transmits on the wire, and sets the DD (Descriptor Done) bit. The driver reclaims completed descriptors on the next TX completion interrupt, freeing the sk_buff and unmapping the DMA address.
RX Descriptor Ring — Pre-posted Buffers
The RX ring works in reverse: the driver pre-fills descriptors with empty sk_buff DMA addresses and advances the tail. When a packet arrives, the NIC DMA-writes directly into the pre-posted buffer, sets DD, and fires the MSI-X vector. The napi_poll() handler walks the ring, consumes completed descriptors, builds the sk_buff, calls netif_receive_skb(), then immediately refills the ring slot and advances the doorbell.
TX Path — End-to-End
| Callback | When called | What it does |
|---|---|---|
| ndo_open(dev) | ifconfig up / ip link set up | Enable NIC, allocate TX/RX rings, request MSI-X IRQs, start NAPI |
| ndo_stop(dev) | ifconfig down | Free rings, release IRQs, disable DMA, quiesce NAPI |
| ndo_start_xmit(skb, dev) | dev_queue_xmit() | DMA-map skb, fill TX descriptor, ring doorbell; returns NETDEV_TX_OK |
| ndo_get_stats64(dev, stats) | ip -s link / ethtool -S | Copy TX/RX counters from per-CPU ring stats |
| ndo_set_rx_mode(dev) | promisc / multicast list change | Update NIC MAC filter table (unicast/multicast) |
| ndo_tx_timeout(dev, txqueue) | TX watchdog fires | Reset hung TX queue, restart NIC |
Minimal C Demo — TX Descriptor Ring
4. § 8.3 — SR-IOV (Single Root I/O Virtualization)
Background — Why SR-IOV?
Without SR-IOV, VMs share a NIC through software (virtio + vhost-net), adding a vSwitch copy in the hypervisor. SR-IOV exposes multiple lightweight Virtual Functions (VFs) from a single physical card. Each VF looks like a real PCIe device with its own config space, BAR, queues, and MSI-X vectors. A VM or container binds a VF directly (via vfio-pci), gets near-native throughput, and the NIC hardware enforces isolation.
Flow Classification — MAC/VLAN Steering per VF
The PF driver programs the NIC's hardware filter table via ndo_set_vf_mac() and ndo_set_vf_vlan(). On RX, the NIC matches the destination MAC + VLAN tag against the filter table and DMA-writes the packet directly into the matching VF's RX ring — the PF kernel driver is never involved.
Minimal C Demo — SR-IOV MAC Filter Table
5. § 8.4 — DDIO (Intel Data Direct I/O)
Background
Normally, a NIC DMA write lands in DRAM (~60 ns); only when the CPU reads the data does it get promoted to the Last Level Cache (~10 ns). For packet processing at 40+ Gbps, that DRAM round-trip is the bottleneck. DDIO (Intel) and the equivalent AMD/ARM features instruct the PCIe root complex to route incoming DMA writes directly into the LLC. The CPU finds the packet data already warm in cache, cutting read latency from ~70 ns to ~10 ns.
| Factor | Without DDIO | With DDIO |
|---|---|---|
| DMA write target | DRAM | LLC (shared L3) |
| First CPU read latency | ~70 ns (DRAM + LLC fill) | ~10 ns (LLC hit) |
| DPDK mempool requirement | DRAM-backed; cache cold | Huge pages still needed, but data warm |
| CPU cache pressure | Low (NIC fills DRAM) | Higher — NIC evicts LLC lines on write |
| Throughput impact | Bottleneck at ~40 Gbps+ | Enables 100 Gbps+ line-rate |
| Enable/disable | BIOS setting or PCIe config | Enabled by default on Xeon E5+ |
6. § 8.5 — Character Devices
Registration Flow — from Module Init to /dev Node
A character device is the simplest driver interface: user space opens a /dev node and calls file operations directly. The kernel identifies the driver by major:minor number, looks up the struct cdev, and dispatches to its file_operations vtable.
| file_operations callback | Triggered by | Typical use |
|---|---|---|
| .open | open("/dev/mydev") | Allocate per-file state; verify permissions |
| .read | read(fd, buf, n) | Copy device data to user; update position |
| .write | write(fd, buf, n) | Accept commands from user; write to device register |
| .unlocked_ioctl | ioctl(fd, cmd, arg) | Device control; FIONREAD, custom commands |
| .mmap | mmap(NULL, sz, ..., fd, 0) | Map device MMIO or ring buffer into user VA (e.g. UIO BAR0) |
| .poll | select/poll/epoll on fd | Signal readability/writability for event-driven drivers |
| .release | close(fd) | Free per-file state; release hardware resources |
UIO — Character Device for DPDK
igb_uio.ko is a minimal out-of-tree kernel module (shipped with DPDK, now in the dpdk-kmods repository) that registers a PCIe driver, then exposes the NIC BARs as mappable regions via a UIO character device (/dev/uio0). DPDK's EAL opens this device, calls mmap() to get direct register access, and issues DMA from user space — no kernel network stack involvement.
Minimal C Demo — file_operations Dispatch
7. § 8.6 — Block Devices
Block Layer Stack — VFS to Hardware
Unlike character devices, block devices go through a multi-layer stack that enables read-ahead, write-back caching (page cache), and I/O scheduling. The page cache absorbs reads and batches writes; the I/O scheduler merges adjacent requests to maximize throughput; and blk-mq (multi-queue block layer) dispatches requests to per-CPU hardware queues, matching the parallelism of modern NVMe SSDs.
blk-mq — Multi-Queue Submission
Classic single-queue block layer was a bottleneck: one lock, one queue, one CPU dispatching all I/O. blk-mq mirrors NVMe's own architecture: per-CPU software queues batch local requests, then flush to per-CPU hardware dispatch queues which map 1:1 to NVMe submission queues. No global lock in the hot path.
struct bio — Scatter-Gather I/O Descriptor
A bio is the fundamental unit of block I/O. It describes where on disk (sector + size) and what memory pages to read/write (a bio_vec[] scatter-gather array). A single submit_bio() can span multiple non-contiguous physical pages — the block driver maps them to DMA and issues one NVMe command.
| Struct / Call | Purpose |
|---|---|
| struct gendisk | Represents a physical disk; alloc_disk() + add_disk() registers it (modern kernels use blk_mq_alloc_disk()) |
| struct request_queue | Per-disk queue; holds I/O scheduler, blk-mq maps, plug/unplug |
| struct bio | One logical I/O: sector range + scatter-gather page list |
| struct bio_vec | One scatter-gather segment: page + offset + len |
| submit_bio(bio) | Entry point: caller submits a bio into the block layer |
| blk_mq_ops.queue_rq | Driver callback: receives a request, issues it to hardware (NVMe doorbell) |
| bio_for_each_segment(bv, bio, iter) | Iterate all bio_vec segments; driver maps each to DMA |
Minimal C Demo — bio Scatter-Gather + blk-mq
8. Kernel Source Pointers
| File / Symbol | What it contains |
|---|---|
| drivers/pci/pci.c — pci_enable_device(), pci_set_master() | Device enable, bus mastering, power management |
| include/linux/pci.h — struct pci_driver | probe/remove callbacks, id_table |
| drivers/net/ethernet/intel/ixgbe/ — ixgbe_main.c | Full TX/RX ring, NAPI, MSI-X setup for ixgbe NIC |
| drivers/net/ethernet/intel/i40e/ — i40e_main.c | i40e (X710) driver — probe, SR-IOV, VF management |
| include/linux/netdevice.h — struct net_device_ops | All ndo_* callback declarations |
| drivers/vfio/pci/vfio_pci.c | VFIO PCI driver — maps BAR to user space for DPDK |
| drivers/uio/uio.c (kernel) + igb_uio.c (out-of-tree, DPDK dpdk-kmods) | UIO framework + igb_uio NIC driver for DPDK |
| fs/char_dev.c — cdev_add(), cdev_get() | Char device registration and major:minor lookup |
| block/blk-mq.c — blk_mq_submit_bio() | blk-mq hot path: sw_queue → hw_queue dispatch |
| include/linux/bio.h — struct bio, bio_vec | bio / scatter-gather data structures |
| block/bio.c — bio_alloc(), submit_bio() | bio allocation and submission entry point |
| drivers/nvme/host/pci.c — nvme_queue_rq() | NVMe blk-mq driver: maps bio to NVMe SQ command |
9. Interview Prep
| Question | Concise Answer |
|---|---|
| What is a BAR and how does a kernel driver access NIC registers? | BAR (Base Address Register) in PCIe config space describes a physical MMIO window. The driver calls pci_ioremap_bar() to map it into kernel virtual address space, then uses readl()/writel() on NIC registers — readl() generates a non-posted PCIe memory-read TLP (CPU waits for the completion), writel() a posted memory-write TLP. |
| What is MSI-X and why is it better than legacy INTx for multi-queue NICs? | MSI-X allocates one interrupt vector per queue. The NIC DMA-writes a small message to a CPU LAPIC address (no shared IRQ wire). Each vector can be pinned to a different CPU core, so queue 0 IRQ runs on CPU 0, queue 1 on CPU 1 — no lock contention, no cross-CPU cache bouncing. |
| Walk through a TX descriptor ring — what does the driver write and what does the NIC read? | Driver fills desc[tail]: buf=DMA addr, len, cmd=RS\|EOP\|IFCS, dd=0; increments tail; writes new tail to TDT BAR register (doorbell). NIC reads desc[head], DMA-reads the packet bytes, sends on wire, sets dd=1, increments head. Driver's next TX completion interrupt walks the ring looking for dd=1 to free skb + unmap DMA. |
| What is SR-IOV and how does a VF differ from a PF? | SR-IOV creates multiple Virtual Functions from one Physical Function. The PF has full config space access and runs in the host kernel; VFs have limited config space (no BAR resize), their own TX/RX queues and MSI-X vectors, and bind to VM guest drivers via vfio-pci. The NIC hardware enforces isolation by MAC/VLAN filter steering per VF. |
| What is DDIO and how does it reduce latency for DPDK? | DDIO (Intel Data Direct I/O) routes PCIe DMA writes directly into the LLC instead of DRAM. Without DDIO: NIC → DRAM (~60 ns) → LLC fill → CPU = ~70 ns. With DDIO: NIC → LLC (~10 ns) → CPU. DPDK packet processing starts with data already cache-warm, critical for 100 Gbps line-rate. |
| How does igb_uio (UIO) work and why does DPDK use it? | igb_uio.ko registers as a PCIe driver and exposes a /dev/uio0 character device with struct uio_info containing BAR physical addresses. DPDK EAL calls mmap(fd, offset=0) to get a user-space pointer to BAR0 MMIO — enabling direct register writes and doorbell rings without syscalls. Interrupts arrive via read() on the fd (event counter changes). |
| What is struct bio and how does blk-mq use per-CPU queues? | A bio describes one logical block I/O: bi_sector (start LBA), bi_size, and a bio_vec[] scatter-gather array of physical pages. submit_bio() enters the block layer: blk-mq enqueues it on the calling CPU's per-CPU sw_queue, then flushes to a per-CPU hw_queue that maps to one NVMe submission queue — no global lock, full SSD parallelism. |