Part VIII — Drivers

§ 8.1 – 8.6 Linux Device Drivers & Hardware Interface

PCIe topology + BAR + MSI-X (§8.1) · Network driver TX/RX rings + NAPI (§8.2) · SR-IOV PF/VF + flow steering (§8.3) · DDIO cache-direct DMA (§8.4) · Character devices + UIO (§8.5) · Block layer + blk-mq + bio (§8.6)

1. Overview

Linux device drivers sit between the abstract kernel interfaces (VFS file operations, socket ops, block layer) and physical hardware. Most hardware drivers follow the same basic pattern: register with a bus (PCIe, USB, platform), get called back via probe() when a matching device appears, map hardware registers via memory-mapped I/O (BARs on PCIe), set up DMA ring buffers, and wire up interrupts (one MSI-X vector per queue). Character devices expose a file_operations vtable directly to user space; block devices add a request queue and an I/O scheduler; network drivers plug into NAPI.

2. § 8.1 — PCI / PCIe Device Driver

PCIe Topology — Root Complex → Switch → Endpoints

PCIe is a point-to-point serial interconnect organized as a tree. The Root Complex (inside the CPU package) is the root of the tree; it originates configuration and MMIO transactions on behalf of the CPU and routes device DMA into system memory. Switches route TLPs (Transaction Layer Packets): memory requests by address, configuration requests and completions by BDF (Bus:Device.Function). Each Endpoint (NIC, NVMe, GPU) exposes up to six BARs (Base Address Registers) in config space — each BAR describes a physical MMIO window that the kernel maps into virtual address space.

BAR Layout — MMIO Mapping

During its probe() callback, the driver calls pci_request_regions() to reserve the BARs, then pci_ioremap_bar() (or ioremap() directly) to create a kernel virtual mapping. All NIC register accesses thereafter are readl() / writel() on that virtual pointer — a writel() becomes a posted PCIe write TLP, while a readl() becomes a non-posted read TLP that stalls the CPU until the completion returns.

| Call | What it does |
| --- | --- |
| pci_enable_device(pdev) | Wake the device and enable memory/I/O decoding; pair with pci_set_master() so it can issue DMA |
| pci_request_regions(pdev, name) | Reserve all BARs so no other driver can map them |
| pci_ioremap_bar(pdev, bar_num) | ioremap the given BAR into kernel virtual space → returns void * |
| pci_enable_msix_range(pdev, entries, min, max) | Allocate MSI-X vectors; returns the actual count allocated |
| request_irq(vector, handler, 0, name, queue) | Hook a Linux IRQ handler to one MSI-X vector |
| dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)) | Tell the kernel the device supports 64-bit DMA addresses |
| dma_map_single(dev, ptr, size, dir) | Map an existing kernel buffer for streaming DMA; returns the dma_addr_t the device uses |

MSI-X — Per-Queue Interrupt Vectors

Legacy INTx uses a shared wire; under high PPS every queue fires the same IRQ and the handler must poll all queues. MSI-X gives each queue its own vector: the NIC DMA-writes a small message to a CPU LAPIC memory address, triggering a specific vector directly. The kernel can pin each vector to a different CPU core (via irq_set_affinity_hint()), eliminating cross-CPU cache bouncing.
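
As a user-space illustration of vector pinning, the sketch below writes a CPU mask to /proc/irq/<n>/smp_affinity, the procfs counterpart of what a driver requests with irq_set_affinity_hint(). The IRQ number and CPU are placeholders; look up the real per-queue vectors in /proc/interrupts (requires root).

```c
/* Hedged sketch: pin one MSI-X vector to one CPU from user space.
 * IRQ 125 and CPU 3 are hypothetical placeholders. Requires root. */
#include <stdio.h>

int main(void)
{
    int irq = 125;            /* hypothetical MSI-X vector for queue 3 */
    int cpu = 3;              /* pin it to CPU 3                       */
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fprintf(f, "%x\n", 1u << cpu);   /* hex bitmask: bit N selects CPU N */
    fclose(f);

    printf("IRQ %d pinned to CPU %d\n", irq, cpu);
    return 0;
}
```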

Minimal C Demo — PCI BAR Register Access

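Below is a minimal user-space sketch of the same idea: it mmaps a device's BAR0 through the sysfs resourceN file and performs one register read and one doorbell-style write. The BDF, register offsets, and mapping size are placeholders rather than a real device layout; running it needs root and a device whose BAR0 is safe to touch.

```c
/* Hedged sketch: map BAR0 of a PCIe device through sysfs and poke registers.
 * BDF and offsets are hypothetical -- substitute a real, safe device. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_PATH "/sys/bus/pci/devices/0000:03:00.0/resource0" /* hypothetical BDF */
#define REG_STATUS   0x0008   /* hypothetical status register offset  */
#define REG_DOORBELL 0x0818   /* hypothetical TX tail/doorbell offset */

int main(void)
{
    int fd = open(BAR0_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open resource0"); return 1; }

    /* Map the BAR; 4096 stands in for the real BAR size (see the resource file). */
    volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap BAR0"); return 1; }

    /* readl() equivalent: a load from MMIO -> non-posted PCIe read TLP. */
    uint32_t status = bar[REG_STATUS / 4];
    printf("status = 0x%08x\n", status);

    /* writel() equivalent: a store to MMIO -> posted PCIe write TLP.
     * Writing a new tail index here is the "doorbell" that wakes the DMA engine. */
    bar[REG_DOORBELL / 4] = 42;

    munmap((void *)bar, 4096);
    close(fd);
    return 0;
}
```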

3. § 8.2 — Network Driver Model

TX Descriptor Ring — Producer/Consumer

The NIC TX ring is a circular array of fixed-size descriptors in DMA-coherent memory. The driver (producer) fills descriptors with DMA addresses and writes the new tail to a BAR register (the doorbell). The NIC DMA engine (consumer) reads from the head, fetches the packet data over PCIe, transmits on the wire, and sets the DD (Descriptor Done) bit. The driver reclaims completed descriptors on the next TX completion interrupt, freeing the sk_buff and unmapping the DMA address.

RX Descriptor Ring — Pre-posted Buffers

The RX ring works in reverse: the driver pre-fills descriptors with empty sk_buff DMA addresses and advances the tail. When a packet arrives, the NIC DMA-writes directly into the pre-posted buffer, sets DD, and fires the MSI-X vector. The napi_poll() handler walks the ring, consumes completed descriptors, builds the sk_buff, calls netif_receive_skb(), then immediately refills the ring slot and advances the doorbell.
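
A user-space sketch of the RX side, assuming a simplified descriptor with just a buffer address, length, and DD bit: a fake NIC completes pre-posted slots, and a napi_poll()-style loop consumes and refills them. Nothing here touches real hardware.

```c
/* Hedged sketch: RX ring with pre-posted buffers + poll/refill loop.
 * Buffer addresses and the DD bit are simulated. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 4

struct rx_desc {
    uint64_t buf_addr;        /* pre-posted empty buffer (DMA address) */
    uint16_t len;             /* filled in by the "NIC" on completion  */
    uint8_t  dd;              /* Descriptor Done                        */
};

static struct rx_desc ring[RING_SIZE];
static unsigned next_to_clean;     /* where the poll loop resumes */

/* Simulated NIC: DMA a received frame into a pre-posted buffer. */
static void nic_receive(unsigned slot, uint16_t len)
{
    ring[slot].len = len;
    ring[slot].dd  = 1;
}

/* napi_poll()-style loop: consume completed slots, then refill them. */
static void rx_poll(void)
{
    while (ring[next_to_clean].dd) {
        struct rx_desc *d = &ring[next_to_clean];
        printf("poll: slot %u -> %u byte packet up the stack\n",
               next_to_clean, (unsigned)d->len);
        d->dd = 0;                                   /* refill: re-arm the slot */
        d->buf_addr = 0x9000 + next_to_clean * 0x800;
        next_to_clean = (next_to_clean + 1) % RING_SIZE;
    }
    /* a real driver would now write the new tail to the RX doorbell register */
}

int main(void)
{
    for (unsigned i = 0; i < RING_SIZE; i++)         /* pre-post empty buffers */
        ring[i].buf_addr = 0x9000 + i * 0x800;
    nic_receive(0, 64);
    nic_receive(1, 1500);
    rx_poll();
    return 0;
}
```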

TX Path — End-to-End

A transmitted packet travels dev_queue_xmit() → qdisc → ndo_start_xmit() → TX descriptor fill → doorbell write → NIC DMA → wire → completion interrupt → descriptor reclaim. The driver hooks into this path, and into the rest of the device lifecycle, through struct net_device_ops:

| Callback | When called | What it does |
| --- | --- | --- |
| ndo_open(dev) | ifconfig up / ip link set up | Enable the NIC, allocate TX/RX rings, request MSI-X IRQs, start NAPI |
| ndo_stop(dev) | ifconfig down | Free rings, release IRQs, disable DMA, quiesce NAPI |
| ndo_start_xmit(skb, dev) | dev_queue_xmit() | DMA-map the skb, fill a TX descriptor, ring the doorbell; returns NETDEV_TX_OK |
| ndo_get_stats64(dev, stats) | ip -s link / ethtool -S | Copy TX/RX counters from per-CPU ring stats |
| ndo_set_rx_mode(dev) | promisc / multicast list change | Update the NIC MAC filter table (unicast/multicast) |
| ndo_tx_timeout(dev, txqueue) | TX watchdog fires | Reset the hung TX queue, restart the NIC |

Minimal C Demo — TX Descriptor Ring

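A user-space model of the producer/consumer protocol, assuming a simplified descriptor (buffer address, length, command bits, DD) loosely modeled on Intel's legacy TX descriptor; the NIC side is a function call rather than real hardware.

```c
/* Hedged sketch: TX descriptor ring -- driver fills, "NIC" sends, driver reclaims. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8           /* power of two so index masking works */

struct tx_desc {
    uint64_t buf_addr;        /* DMA address of the packet buffer   */
    uint16_t len;             /* bytes to transmit                  */
    uint8_t  cmd;             /* command bits (EOP/IFCS/RS style)   */
    uint8_t  dd;              /* Descriptor Done, set by the "NIC"  */
};

static struct tx_desc ring[RING_SIZE];
static unsigned tail;         /* driver writes here (producer) */
static unsigned head;         /* NIC consumes from here        */

/* Driver side: fill desc[tail], advance tail, "ring the doorbell". */
static void xmit(uint64_t dma_addr, uint16_t len)
{
    struct tx_desc *d = &ring[tail & (RING_SIZE - 1)];
    d->buf_addr = dma_addr;
    d->len = len;
    d->cmd = 0x0b;            /* illustrative EOP|IFCS|RS-style bits */
    d->dd  = 0;
    tail++;
    printf("driver: posted desc %u (len=%u), doorbell tail=%u\n",
           (tail - 1) & (RING_SIZE - 1), (unsigned)len, tail);
}

/* Simulated NIC: consume from head, pretend to DMA + transmit, set DD. */
static void nic_run(void)
{
    while (head != tail) {
        struct tx_desc *d = &ring[head & (RING_SIZE - 1)];
        printf("nic:    sent %u bytes from 0x%llx\n",
               (unsigned)d->len, (unsigned long long)d->buf_addr);
        d->dd = 1;
        head++;
    }
}

/* Driver side again: TX completion -- reclaim every descriptor with DD set. */
static void reclaim(void)
{
    for (unsigned i = 0; i < RING_SIZE; i++)
        if (ring[i].dd) {
            printf("driver: reclaimed desc %u (unmap DMA, free skb)\n", i);
            ring[i].dd = 0;
        }
}

int main(void)
{
    xmit(0x1000, 64);
    xmit(0x2000, 1500);
    nic_run();      /* in real life the NIC works asynchronously       */
    reclaim();      /* in real life this runs from the completion IRQ  */
    return 0;
}
```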

4. § 8.3 — SR-IOV (Single Root I/O Virtualization)

Background — Why SR-IOV?

Without SR-IOV, VMs share a NIC through software (virtio + vhost-net), adding a vSwitch copy in the hypervisor. SR-IOV exposes multiple lightweight Virtual Functions (VFs) from a single physical card. Each VF looks like a real PCIe device with its own config space, BAR, queues, and MSI-X vectors. A VM or container binds a VF directly (via vfio-pci), gets near-native throughput, and the NIC hardware enforces isolation.

Flow Classification — MAC/VLAN Steering per VF

The PF driver programs the NIC's hardware filter table via ndo_set_vf_mac() and ndo_set_vf_vlan(). On RX, the NIC matches the destination MAC + VLAN tag against the filter table and DMA-writes the packet directly into the matching VF's RX ring — the PF kernel driver is never involved.

Minimal C Demo — SR-IOV MAC Filter Table

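A user-space sketch of the classification step, assuming an invented filter-table layout and VF numbering: match destination MAC + VLAN, steer to the owning VF, otherwise fall back to the PF/default queue. On real hardware this table is programmed by the PF driver through ndo_set_vf_mac() / ndo_set_vf_vlan().

```c
/* Hedged sketch: MAC/VLAN filter table steering received frames to a VF. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct vf_filter {
    uint8_t  mac[6];          /* destination MAC programmed for the VF */
    uint16_t vlan;            /* VLAN tag programmed for the VF        */
    int      vf;              /* destination VF index                  */
};

static const struct vf_filter table[] = {
    { {0x02,0x00,0x00,0x00,0x00,0x01}, 100, 0 },
    { {0x02,0x00,0x00,0x00,0x00,0x02}, 200, 1 },
};

/* Simulated RX classification: match dst MAC + VLAN, return VF (or -1 = PF). */
static int classify(const uint8_t *dst_mac, uint16_t vlan)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (!memcmp(table[i].mac, dst_mac, 6) && table[i].vlan == vlan)
            return table[i].vf;
    return -1;                /* no match: default queue / PF */
}

int main(void)
{
    uint8_t pkt1[6] = {0x02,0,0,0,0,0x02};   /* matches VF 1's filter */
    uint8_t pkt2[6] = {0x02,0,0,0,0,0x09};   /* matches nothing       */

    printf("frame to 02:..:02 vlan 200 -> VF %d\n", classify(pkt1, 200));
    printf("frame to 02:..:09 vlan 300 -> VF %d (PF/default)\n", classify(pkt2, 300));
    return 0;
}
```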

5. § 8.4 — DDIO (Intel Data Direct I/O)

Background

Normally, a NIC DMA write lands in DRAM (~60 ns); only when the CPU reads the data does it get promoted to the Last Level Cache (~10 ns). For packet processing at 40+ Gbps, that DRAM round-trip is the bottleneck. DDIO (Intel) and the equivalent AMD/ARM features instruct the PCIe root complex to route incoming DMA writes directly into the LLC. The CPU finds the packet data already warm in cache, cutting read latency from ~70 ns to ~10 ns.

(Diagram: NIC DMA write path, Without DDIO vs. With DDIO)

| Factor | Without DDIO | With DDIO |
| --- | --- | --- |
| DMA write target | DRAM | LLC (shared L3) |
| First CPU read latency | ~70 ns (DRAM + LLC fill) | ~10 ns (LLC hit) |
| DPDK mempool requirement | DRAM-backed; cache cold | Huge pages still needed, but data warm |
| CPU cache pressure | Low (NIC fills DRAM) | Higher — NIC evicts LLC lines on write |
| Throughput impact | Bottleneck at ~40 Gbps+ | Enables 100 Gbps+ line rate |
| Enable/disable | BIOS setting or PCIe config | Enabled by default on Xeon E5+ |

6. § 8.5 — Character Devices

Registration Flow — from Module Init to /dev Node

A character device is the simplest driver interface: user space opens a /dev node and calls file operations directly. The kernel identifies the driver by major:minor number, looks up the struct cdev, and dispatches to its file_operations vtable.

| file_operations callback | Triggered by | Typical use |
| --- | --- | --- |
| .open | open("/dev/mydev") | Allocate per-file state; verify permissions |
| .read | read(fd, buf, n) | Copy device data to user; update position |
| .write | write(fd, buf, n) | Accept commands from user; write to device register |
| .unlocked_ioctl | ioctl(fd, cmd, arg) | Device control; FIONREAD, custom commands |
| .mmap | mmap(NULL, sz, ..., fd, 0) | Map device MMIO or ring buffer into user VA (e.g. UIO BAR0) |
| .poll | select/poll/epoll on fd | Signal readability/writability for event-driven drivers |
| .release | close(fd) | Free per-file state; release hardware resources |

UIO — Character Device for DPDK

igb_uio.ko is a minimal kernel module that registers a PCIe driver, then exposes the NIC BARs as mappable regions via a UIO character device (/dev/uio0). DPDK's EAL opens this device, calls mmap() to get direct register access, and issues DMA from user space — no kernel network stack involvement.
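
A minimal user-space sketch of that flow, assuming a device already bound to a UIO driver: open /dev/uio0, mmap map 0 (BAR0) using the UIO convention that map N lives at offset N × page size, read a register, then block in read() until the next interrupt bumps the event counter. The device path, register offset, and mapping size are placeholders.

```c
/* Hedged sketch: open a UIO device, map BAR0, read a register, wait for an IRQ.
 * /dev/uio0 and the assumption that map 0 is BAR0 are placeholders; the real
 * map size is published in /sys/class/uio/uio0/maps/map0/size. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open /dev/uio0"); return 1; }

    long pg = sysconf(_SC_PAGESIZE);
    /* UIO ABI: map N of /dev/uioX is selected by mmap offset N * page size. */
    volatile uint32_t *bar0 = mmap(NULL, (size_t)pg, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0 * pg);
    if (bar0 == MAP_FAILED) { perror("mmap map0"); return 1; }

    printf("reg[0x0] = 0x%08x\n", bar0[0]);   /* direct MMIO read, no syscall */

    /* Interrupt wait: read() returns a 32-bit event count when the IRQ fires. */
    uint32_t events;
    if (read(fd, &events, sizeof(events)) == (ssize_t)sizeof(events))
        printf("interrupt count = %u\n", events);

    munmap((void *)bar0, (size_t)pg);
    close(fd);
    return 0;
}
```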

Minimal C Demo — file_operations Dispatch

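A user-space model of the vtable dispatch, with a simplified stand-in for struct file_operations rather than the kernel's real definition: the "VFS" builds a struct file pointing at the driver's ops table, and every syscall becomes an indirect call through it.

```c
/* Hedged sketch: file_operations-style vtable dispatch, entirely in user space. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

struct file;                                   /* forward declaration, like the kernel's */
struct file_operations {
    int     (*open)(struct file *f);
    ssize_t (*read)(struct file *f, char *buf, size_t n);
    ssize_t (*write)(struct file *f, const char *buf, size_t n);
    int     (*release)(struct file *f);
};

struct file { const struct file_operations *f_op; };

/* --- a toy "device" that echoes a fixed message -------------------------- */
static int dev_open(struct file *f)    { (void)f; puts("dev: open");  return 0; }
static int dev_release(struct file *f) { (void)f; puts("dev: close"); return 0; }

static ssize_t dev_read(struct file *f, char *buf, size_t n)
{
    const char msg[] = "hello from mydev\n";
    size_t len = n < sizeof(msg) ? n : sizeof(msg);
    memcpy(buf, msg, len);                     /* the kernel would copy_to_user() */
    (void)f;
    return (ssize_t)len;
}

static ssize_t dev_write(struct file *f, const char *buf, size_t n)
{
    printf("dev: got %zu bytes: %.*s", n, (int)n, buf);
    (void)f;
    return (ssize_t)n;
}

static const struct file_operations mydev_fops = {
    .open = dev_open, .read = dev_read, .write = dev_write, .release = dev_release,
};

int main(void)
{
    struct file f = { .f_op = &mydev_fops };   /* what the VFS builds at open() */
    char buf[64];

    f.f_op->open(&f);                          /* open(fd)  -> .open     */
    ssize_t n = f.f_op->read(&f, buf, sizeof(buf));
    fwrite(buf, 1, (size_t)n, stdout);         /* read(fd)  -> .read     */
    f.f_op->write(&f, "ping\n", 5);            /* write(fd) -> .write    */
    f.f_op->release(&f);                       /* close(fd) -> .release  */
    return 0;
}
```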

7. § 8.6 — Block Devices

Block Layer Stack — VFS to Hardware

Unlike character devices, block devices go through a multi-layer stack that enables read-ahead, write-back caching (page cache), and I/O scheduling. The page cache absorbs reads and batches writes; the I/O scheduler merges adjacent requests to maximize throughput; and blk-mq (multi-queue block layer) dispatches requests to per-CPU hardware queues, matching the parallelism of modern NVMe SSDs.

blk-mq — Multi-Queue Submission

The classic single-queue block layer was a bottleneck: one lock, one queue, one CPU dispatching all I/O. blk-mq mirrors NVMe's own architecture: per-CPU software queues batch local requests, then flush to hardware dispatch queues (typically one per CPU) that map 1:1 to NVMe submission queues. There is no global lock in the hot path.

struct bio — Scatter-Gather I/O Descriptor

A bio is the fundamental unit of block I/O. It describes where on disk (sector + size) and what memory pages to read/write (a bio_vec[] scatter-gather array). A single submit_bio() can span multiple non-contiguous physical pages — the block driver maps them to DMA and issues one NVMe command.

| Struct / Call | Purpose |
| --- | --- |
| struct gendisk | Represents a physical disk; alloc_disk() + add_disk() registers it |
| struct request_queue | Per-disk queue; holds I/O scheduler, blk-mq maps, plug/unplug |
| struct bio | One logical I/O: sector range + scatter-gather page list |
| struct bio_vec | One scatter-gather segment: page + offset + len |
| submit_bio(bio) | Entry point: caller submits a bio into the block layer |
| blk_mq_ops.queue_rq | Driver callback: receives a request, issues it to hardware (NVMe doorbell) |
| bio_for_each_segment(bv, bio, iter) | Iterate all bio_vec segments; driver maps each to DMA |

Minimal C Demo — bio Scatter-Gather + blk-mq

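A user-space model with simplified stand-ins for struct bio and struct bio_vec (not the kernel's real layouts): one bio spanning two non-contiguous pages is handed to a queue_rq()-style callback that walks the segments the way bio_for_each_segment() would and pretends to build one NVMe command.

```c
/* Hedged sketch: a bio with a bio_vec[] scatter-gather list + queue_rq dispatch. */
#include <stdint.h>
#include <stdio.h>

struct bio_vec {
    void     *page;           /* kernel: struct page *   */
    uint32_t  len;            /* bytes in this segment   */
    uint32_t  offset;         /* offset within the page  */
};

struct bio {
    uint64_t        sector;   /* starting LBA (512-byte sectors) */
    uint32_t        size;     /* total bytes                     */
    unsigned        vcnt;     /* number of segments              */
    struct bio_vec  vec[4];
};

/* blk_mq_ops.queue_rq stand-in: "DMA-map" each segment, build one command. */
static int queue_rq(const struct bio *bio)
{
    printf("queue_rq: LBA %llu, %u bytes, %u segment(s)\n",
           (unsigned long long)bio->sector, bio->size, bio->vcnt);
    for (unsigned i = 0; i < bio->vcnt; i++)      /* like bio_for_each_segment() */
        printf("  seg %u: page %p off %u len %u -> PRP/SGL entry\n",
               i, bio->vec[i].page, bio->vec[i].offset, bio->vec[i].len);
    printf("  -> ring NVMe SQ doorbell\n");
    return 0;
}

int main(void)
{
    static char page_a[4096], page_b[4096];       /* two non-contiguous pages */
    struct bio bio = {
        .sector = 2048, .size = 8192, .vcnt = 2,
        .vec = { { page_a, 4096, 0 }, { page_b, 4096, 0 } },
    };
    /* submit_bio() equivalent: in blk-mq this lands on this CPU's software
     * queue, then flushes to a hardware queue whose driver callback is
     * queue_rq(). */
    queue_rq(&bio);
    return 0;
}
```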

8. Kernel Source Pointers

| File / Symbol | What it contains |
| --- | --- |
| drivers/pci/pci.c — pci_enable_device() | Device enable and power management; pci_set_master() (bus mastering) lives here too |
| include/linux/pci.h — struct pci_driver | probe/remove callbacks, id_table |
| drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | Full TX/RX ring, NAPI, MSI-X setup for the ixgbe NIC |
| drivers/net/ethernet/intel/i40e/i40e_main.c | i40e (X710) driver — probe, SR-IOV, VF management |
| include/linux/netdevice.h — struct net_device_ops | All ndo_* callback declarations |
| drivers/vfio/pci/vfio_pci.c | VFIO PCI driver — maps BARs to user space for DPDK |
| drivers/uio/uio.c, drivers/uio/uio_pci_generic.c | UIO framework and generic PCI UIO driver; igb_uio is the out-of-tree DPDK counterpart |
| fs/char_dev.c — cdev_add(), cdev_get() | Char device registration and major:minor lookup |
| block/blk-mq.c — blk_mq_submit_bio() | blk-mq hot path: sw_queue → hw_queue dispatch |
| include/linux/bio.h (+ blk_types.h, bvec.h) — struct bio, bio_vec | bio / scatter-gather data structures and iteration helpers |
| block/bio.c — bio_alloc(); block/blk-core.c — submit_bio() | bio allocation and the submission entry point |
| drivers/nvme/host/pci.c — nvme_queue_rq() | NVMe blk-mq driver: maps a bio's segments to an NVMe SQ command |

9. Interview Prep

Q: What is a BAR and how does a kernel driver access NIC registers?
A: A BAR (Base Address Register) in PCIe config space describes a physical MMIO window. The driver calls pci_ioremap_bar() to map it into kernel virtual address space, then uses readl()/writel() on that mapping — a writel() generates a posted PCIe write TLP, a readl() a non-posted read TLP that waits for the completion.

Q: What is MSI-X and why is it better than legacy INTx for multi-queue NICs?
A: MSI-X allocates one interrupt vector per queue. The NIC DMA-writes a small message to a CPU LAPIC address (no shared IRQ wire). Each vector can be pinned to a different CPU core, so queue 0's IRQ runs on CPU 0, queue 1's on CPU 1 — no lock contention, no cross-CPU cache bouncing.

Q: Walk through a TX descriptor ring — what does the driver write and what does the NIC read?
A: The driver fills desc[tail]: buf = DMA addr, len, cmd = RS+EOP+IFCS, dd = 0; increments tail; writes the new tail to the TDT BAR register (doorbell). The NIC reads desc[head], DMA-reads the packet bytes, sends them on the wire, sets dd = 1, and increments head. The driver's next TX completion interrupt walks the ring looking for dd = 1 to free the skb and unmap the DMA buffer.

Q: What is SR-IOV and how does a VF differ from a PF?
A: SR-IOV creates multiple Virtual Functions from one Physical Function. The PF has full config space access and runs in the host kernel; VFs have limited config space (no BAR resize), their own TX/RX queues and MSI-X vectors, and bind to VM guest drivers via vfio-pci. The NIC hardware enforces isolation by per-VF MAC/VLAN filter steering.

Q: What is DDIO and how does it reduce latency for DPDK?
A: DDIO (Intel Data Direct I/O) routes PCIe DMA writes directly into the LLC instead of DRAM. Without DDIO: NIC → DRAM (~60 ns) → LLC fill → CPU ≈ 70 ns. With DDIO: NIC → LLC (~10 ns) → CPU. DPDK packet processing starts with data already cache-warm, which is critical for 100 Gbps line rate.

Q: How does igb_uio (UIO) work and why does DPDK use it?
A: igb_uio.ko registers as a PCIe driver and exposes a /dev/uio0 character device whose struct uio_info carries the BAR physical addresses. DPDK's EAL opens the device and mmaps it (offset 0 → BAR0) to get a user-space pointer to BAR0 MMIO — enabling direct register writes and doorbell rings without syscalls. Interrupts arrive via read() on the fd (the event counter changes).

Q: What is struct bio and how does blk-mq use per-CPU queues?
A: A bio describes one logical block I/O: bi_sector (start LBA), bi_size, and a bio_vec[] scatter-gather array of physical pages. submit_bio() enters the block layer: blk-mq enqueues it on the calling CPU's software queue, then flushes to a hardware queue that maps to one NVMe submission queue — no global lock, full SSD parallelism.