Part IX — DPDK

§ 9.1 – 9.3 DPDK: Kernel Bypass · EAL Init · Poll Mode Driver

Why kernel networking fails at line rate (§9.1) · EAL startup 10-step flow (§9.2) · PMD RX/TX descriptor rings + burst polling (§9.3)

1. Overview

DPDK (Data Plane Development Kit) eliminates every per-packet overhead in the Linux kernel networking stack: interrupts, system calls, sk_buff allocation, copy_to_user(), and lock contention in netfilter and the routing table. Instead, a Poll Mode Driver (PMD) runs entirely in userspace, mapping NIC registers directly via UIO or VFIO, and spins in a tight loop checking descriptor DD (Descriptor Done) bits — no interrupts, no syscalls, no copies.

The result: from ~1 Mpps maximum in the kernel to 14.88 Mpps line rate at 64-byte frames on a single 10 GbE port with a single core.

2. § 9.1 — Why DPDK: Kernel Bypass Architecture

The Six Kernel Networking Bottlenecks

At 10 Gbps with 64-byte frames, the NIC delivers 14.88 million packets per second. Each packet gets only 67 nanoseconds of CPU time. The Linux kernel path burns most of that budget before the application even sees the data.

Bottleneck | Cost | DPDK Elimination
IRQ per packet | ~2–5 µs total (handler + softirq schedule) | Poll mode — DD-bit check, zero interrupts
sk_buff allocation | slab allocator per packet, ~200 ns, cache miss | Pre-allocated mbuf pool in huge pages, O(1) from per-lcore cache
copy_to_user() | memcpy kernel→user, pollutes cache | Packet stays in huge-page mbuf, app reads in place — zero copy
recv() syscall | context switch, ~100–200 ns | No syscall — PMD loop is pure userspace
Lock contention | netfilter, routing table, socket hash under high PPS | No kernel stack at all — app owns the data path
Cache pollution | kernel stack traversal touches many cold lines | Huge pages + DDIO — packets land in LLC before the CPU reads them

UIO vs VFIO vs AF_XDP — Bypass Mechanisms Compared

The NIC must be detached from its kernel driver and handed to userspace. Three mechanisms exist, each with different security and performance trade-offs.

Mechanism | Module | IOMMU | Root needed | Container-safe | Best for
UIO (igb_uio) | igb_uio.ko | No — DMA unrestricted | Yes | No | Dev/test, trusted bare metal
VFIO (vfio-pci) | vfio-pci.ko | Yes — IOMMU group isolation | Yes (or CAP_SYS_ADMIN) | Yes | Production, SR-IOV VFs, containers
AF_XDP (kernel) | built-in | Kernel handles DMA | No (CAP_NET_RAW) | Yes | Keep kernel features, selectively accelerate
Kernel PMD (af_packet) | built-in | N/A | No | Yes | Debug, low PPS, compatibility

How VFIO Works

VFIO groups devices by IOMMU group (devices that share an IOMMU context must be in the same group). The workflow: unbind the NIC from its kernel driver → echo vfio-pci > /sys/bus/pci/.../driver_override → DPDK EAL opens /dev/vfio/<group> → calls the VFIO_IOMMU_MAP_DMA ioctl to register hugepage memory with the IOMMU → the NIC can only DMA into registered regions. UIO skips the IOMMU entirely: faster to set up, but a buggy (or malicious) userspace process can program the NIC to DMA to any physical address.

3. § 9.2 — DPDK Startup Flow: EAL Initialization

rte_eal_init() — 10-Step Sequence

Every DPDK application calls rte_eal_init(argc, argv) as its first act. This one call bootstraps the entire DPDK runtime: CPU topology, memory, devices, and worker threads. On failure it returns a negative number and the application must exit — the environment is not usable. On success it returns the number of argv entries it consumed, so the application can skip past the EAL flags to its own arguments.

CLI Flag | Purpose
--lcores 0-3 | Use logical cores 0–3 — EAL creates one pthread per lcore
--socket-mem 4096,4096 | Allocate 4 GB of huge pages on NUMA socket 0 and 4 GB on socket 1
--file-prefix myapp | Namespace for hugepage files — allows multiple DPDK instances on one host
--proc-type primary | This process owns hugepages and devices; secondary processes attach later
--proc-type secondary | Attach to an existing primary's shared memory (hot-upgrade pattern)
--allow 01:00.0 | Probe only this PCI device; all others are ignored
--vdev net_ring0 | Create a virtual device (ring PMD) — useful for testing without a real NIC

Multi-Process Mode — Shared Hugepage Memory

DPDK supports running multiple cooperating processes on one host. The primary process allocates hugepages and initializes devices; one or more secondary processes attach by mmap()-ing the same hugepage files. Data structures stored in named rte_memzone regions (mempool, rings, flow tables) are accessible from both — at the same virtual addresses, because DPDK maps the files at a fixed base address. This is the foundation of DPDK hot upgrade.

Minimal C Demo — EAL Startup Simulation

Real rte_eal_init() requires DPDK libraries and hardware. This simulation traces the same 10 steps in plain C so you can follow the sequence mentally.


4. § 9.3 — Poll Mode Driver (PMD) — Deep Dive

PMD Initialization Sequence

After EAL init, the application configures each NIC port in three steps: set queue counts and offload flags, allocate descriptor rings, then start the device. Each step maps directly to a NIC register write via the mapped BAR.

API Call | What it does to the NIC
rte_eth_dev_configure(port, nb_rxq, nb_txq, &conf) | Writes NIC control registers: queue counts, RSS enable, offload flags
rte_eth_rx_queue_setup(port, q, nb_desc, socket, &rxconf, mp) | Allocates the descriptor ring (DMA-coherent), fills each descriptor with a mempool mbuf's iova
rte_eth_tx_queue_setup(port, q, nb_desc, socket, &txconf) | Allocates the TX descriptor ring; sets tx_free_thresh (batch-free completed mbufs)
rte_eth_dev_start(port) | Enables the NIC: programs the MAC filter, enables RX/TX, brings the link up
rte_eth_rx_burst(port, q, mbufs, 32) | Hot path: scans DD bits, harvests up to 32 mbufs, refills the ring, rings the doorbell
rte_eth_tx_burst(port, q, mbufs, n) | Hot path: fills TX descriptors with mbuf iovas, rings the TDT doorbell, checks tx_free_thresh

RX Descriptor Ring — NIC Fills, PMD Drains

The RX ring is a circular array of fixed-size descriptors in DMA-accessible memory (inside huge pages). The PMD pre-fills every slot with the physical address (buf_iova) of an empty mbuf from the mempool. When a packet arrives, the NIC DMA-writes the packet bytes into the pointed-to buffer and sets DD=1. The PMD polls, harvests completed descriptors, and immediately refills each slot with a fresh mbuf before ringing the doorbell (writing the new tail index to the RDT BAR register).

TX Descriptor Ring — PMD Fills, NIC Drains

TX is symmetric. The PMD fills each descriptor with the outgoing mbuf's buf_iova, the packet length, and command flags (EOP = end of packet, RS = report status, IFCS = insert CRC). It then writes the new tail to the TDT BAR register (the TX doorbell). The NIC DMA-reads the packet bytes over PCIe and transmits. The PMD reclaims completed descriptors (DD=1) in batches of tx_free_thresh (default 32) to amortize the mempool free cost.

PMD RX Burst — Step-by-Step Code Path

Burst Design — Why 32 Packets?

rte_eth_rx_burst() processes up to 32 packets per call. This is not arbitrary:

  • Amortizes the doorbell write — one PCIe transaction to update RDT costs ~100 ns; doing it once per 32 packets costs 3 ns per packet amortized.
  • Fits in a cache line prefetch window — with a prefetch-ahead distance of 3–4, 32 descriptors keep the CPU pipeline full without exceeding L1 capacity.
  • Aligns with SIMD width — 32 × 64-bit status words = 256 bytes, scanned with eight 256-bit AVX2 loads for batch DD-bit checking.

Minimal C Demo — PMD RX Descriptor Ring


Minimal C Demo — PMD TX Path


5. DPDK Source Pointers

File / Symbol | What it contains
lib/eal/linux/eal.c — rte_eal_init() | The 10-step EAL initialization sequence; calls the sub-functions below
lib/eal/linux/eal_hugepage_info.c — eal_hugepage_init() | mmap() hugetlbfs pages, build the memseg list per NUMA socket
lib/eal/common/eal_common_lcore.c | Lcore thread creation, CPU affinity setup (pthread_setaffinity_np)
drivers/bus/pci/linux/pci.c — rte_pci_scan() | Enumerate /sys/bus/pci/devices/, match PCI IDs to the PMD table
drivers/net/ixgbe/ixgbe_rxtx.c — ixgbe_recv_pkts() | ixgbe PMD rx_burst: DD-bit scan, mbuf harvest, ring refill, doorbell
drivers/net/ixgbe/ixgbe_rxtx.c — ixgbe_xmit_pkts() | ixgbe PMD tx_burst: fill TX descriptors, TDT doorbell, tx_free_thresh cleanup
drivers/net/mlx5/mlx5_rx.c — mlx5_rx_burst() | Mellanox ConnectX PMD rx_burst using the Completion Queue (CQ) model
lib/ethdev/rte_ethdev.h — rte_eth_rx_burst() (inline) | Dispatch: calls the port's rx_pkt_burst function pointer (per-PMD hot path)
drivers/bus/pci/linux/pci_uio.c | igb_uio BAR mmap: maps NIC register space into the userspace process
drivers/bus/pci/linux/pci_vfio.c | VFIO device open, IOMMU group handling, DMA mapping via ioctl

6. Interview Prep

Q: Why does kernel networking fail above ~1 Mpps? Name 5 bottlenecks.
A: 1) IRQ per packet (~2–5 µs handler + softirq). 2) sk_buff slab allocation per packet (~200 ns, cache miss). 3) copy_to_user() memcpy polluting the cache. 4) recv() syscall context switch (~100–200 ns). 5) Lock contention in netfilter, the routing table, and the socket hash under high PPS. At 10 GbE with 64 B frames, each packet gets only 67 ns — the kernel stack alone exceeds that budget.

Q: Walk through rte_eal_init() — all 10 steps.
A: 1) Parse CLI (--lcores, --socket-mem). 2) Load PMD plugin .so files. 3) Read CPU topology from /proc/cpuinfo, build the core + NUMA map. 4) mmap()+mlock() huge pages, build the memseg list per NUMA socket. 5) rte_memzone_init() — named regions in huge pages. 6) Create one pthread per lcore, pin via pthread_setaffinity_np(). 7) Enumerate /sys/bus/pci/devices/, match PCI IDs to the PMD table. 8) rte_pci_probe() → PMD eth_dev_init() for each matched NIC. 9) Start service cores (timer, interrupt). 10) rte_eal_mp_remote_launch() — worker functions start polling.

Q: What is the DD bit and what is the doorbell in a DPDK PMD?
A: DD (Descriptor Done) is a status bit in each RX or TX descriptor that the NIC sets to 1 when it has finished with that slot (RX: packet DMA-written; TX: packet transmitted). The PMD polls DD instead of waiting for an interrupt. The doorbell is a write to a NIC BAR register (RDT for RX, TDT for TX) telling the NIC the new tail pointer — i.e., how many new buffers the PMD has made available. DPDK batches the doorbell once per rx_burst/tx_burst call to amortize the PCIe transaction cost.

Q: What is the difference between UIO and VFIO? When should you use VFIO?
A: UIO (igb_uio.ko) exposes /dev/uioN; DPDK mmap()s the BAR directly. There is no IOMMU — DMA addresses are physical, so a bug lets the NIC DMA anywhere. Requires root. VFIO (vfio-pci.ko) groups the device by IOMMU domain; DPDK registers hugepage memory with the IOMMU via the VFIO_IOMMU_MAP_DMA ioctl, so the NIC can only DMA into those regions. Preferred for production (SR-IOV VFs, containers, secure multi-tenant environments). Use VFIO whenever the system has an IOMMU and you care about isolation.

Q: Why does rte_eth_rx_burst() process packets in batches of 32?
A: Three reasons. 1) Amortize the doorbell write — one PCIe RDT update per 32 packets costs ~3 ns/pkt vs ~100 ns/pkt for per-packet updates. 2) Prefetch pipeline — prefetching 3–4 mbufs ahead while processing the current one hides DRAM latency; 32 fits without L1 overflow. 3) SIMD alignment — 32 × 64-bit status words = 256 bytes, scanned with eight 256-bit AVX2 loads in one pass for the DD bit.