§ 9.1 – 9.3 DPDK: Kernel Bypass · EAL Init · Poll Mode Driver
Why kernel networking fails at line rate (§9.1) · EAL startup 10-step flow (§9.2) · PMD RX/TX descriptor rings + burst polling (§9.3)
1. Overview
DPDK (Data Plane Development Kit) eliminates every per-packet overhead in the Linux kernel networking stack: interrupts, system calls, sk_buff allocation, copy_to_user(), and lock contention in netfilter and the routing table. Instead, a Poll Mode Driver (PMD) runs entirely in userspace, mapping NIC registers directly via UIO or VFIO, and spins in a tight loop checking descriptor DD (Descriptor Done) bits — no interrupts, no syscalls, no copies.
The result: from roughly 1 Mpps per core through the kernel stack to 14.88 Mpps line rate at 64-byte frames on a single 10 GbE port, still on a single core.
2. § 9.1 — Why DPDK: Kernel Bypass Architecture
The Six Kernel Networking Bottlenecks
At 10 Gbps with 64-byte frames, the NIC delivers 14.88 million packets per second. Each packet gets only 67 nanoseconds of CPU time. The Linux kernel path burns most of that budget before the application even sees the data.
| Bottleneck | Cost | DPDK Elimination |
|---|---|---|
| IRQ per packet | ~2–5 µs total (handler + softirq schedule) | Poll mode — DD bit check, zero interrupts |
| sk_buff allocation | slab allocator per packet, ~200 ns, cache miss | Pre-allocated mbuf pool in huge pages, O(1) from per-lcore cache |
| copy_to_user() | memcpy kernel→user, pollutes cache | Packet stays in huge-page mbuf, app reads in-place — zero copy |
| recv() syscall | context switch ~100–200 ns | No syscall — PMD loop is pure userspace |
| Lock contention | netfilter, routing table, socket hash under high PPS | No kernel stack at all — app owns the data path |
| Cache pollution | kernel stack traversal touches many cold lines | Huge pages + DDIO → packets land in LLC before CPU reads them |
UIO vs VFIO vs AF_XDP — Bypass Mechanisms Compared
The NIC must be detached from its kernel driver and handed to userspace. Three mechanisms exist, each with different security and performance trade-offs.
| Mechanism | Module | IOMMU | Root needed | Container-safe | Best for |
|---|---|---|---|---|---|
| UIO (igb_uio) | igb_uio.ko | No — DMA unrestricted | Yes | No | Dev/test, trusted bare-metal |
| VFIO (vfio-pci) | vfio-pci.ko | Yes — IOMMU group isolation | Yes (or CAP_SYS_ADMIN) | Yes | Production, SR-IOV VFs, containers |
| AF_XDP (kernel) | built-in | Kernel handles DMA | No (CAP_NET_RAW) | Yes | Keep kernel features, selectively accelerate |
| Kernel PMD (af_packet) | built-in | N/A | No | Yes | Debug, low PPS, compatibility |
How VFIO Works
VFIO groups devices by IOMMU group (devices that share an IOMMU context must be in the same group). The workflow: unbind the NIC from its kernel driver → echo vfio-pci > /sys/bus/pci/.../driver_override → DPDK EAL opens /dev/vfio/<group> → calls VFIO_MAP_DMA ioctl to register hugepage memory with the IOMMU → NIC can only DMA into registered regions. UIO skips the IOMMU entirely: faster to set up, but a buggy (or malicious) userspace process can DMA to any physical address.
3. § 9.2 — DPDK Startup Flow: EAL Initialization
rte_eal_init() — 10-Step Sequence
Every DPDK application calls rte_eal_init(argc, argv) as its first act. This one call bootstraps the entire DPDK runtime: CPU topology, memory, devices, and worker threads. If it returns a negative number, the application must exit — the environment is not usable.
| CLI Flag | Purpose |
|---|---|
| --lcores 0-3 | Use logical cores 0, 1, 2, 3 — EAL creates one pthread per lcore |
| --socket-mem 4096,4096 | Allocate 4 GB of huge pages on NUMA socket 0 and 4 GB on socket 1 |
| --file-prefix myapp | Namespace for hugepage files — allows multiple DPDK instances on the same host |
| --proc-type primary | This process owns hugepages and devices; secondary processes attach later |
| --proc-type secondary | Attach to an existing primary's shared memory (hot upgrade pattern) |
| --allow 01:00.0 | Probe only this PCI device (allow list); all others are ignored |
| --vdev net_ring0 | Create a virtual device (ring PMD) — useful for testing without a real NIC |
Multi-Process Mode — Shared Hugepage Memory
DPDK supports running multiple cooperating processes on one host. The primary process allocates hugepages and initializes devices; one or more secondary processes attach by mmap()-ing the same hugepage files. Data structures stored in named rte_memzone regions (mempool, rings, flow tables) are accessible from both — at the same virtual addresses, because DPDK maps the files at a fixed base address. This is the foundation of DPDK hot upgrade.
Minimal C Demo — EAL Startup Simulation
Real rte_eal_init() requires DPDK libraries and hardware. This simulation traces the same 10 steps in plain C so you can follow the sequence mentally.
4. § 9.3 — Poll Mode Driver (PMD) — Deep Dive
PMD Initialization Sequence
After EAL init, the application configures each NIC port in three steps: set queue counts and offload flags, allocate descriptor rings, then start the device. Each setup call maps directly to NIC register writes via the mapped BAR; the two burst calls in the table are the runtime hot path rather than setup.
| API Call | What it does to the NIC |
|---|---|
| rte_eth_dev_configure(port, nb_rxq, nb_txq, &conf) | Writes NIC control registers: queue count, RSS enable, offload flags |
| rte_eth_rx_queue_setup(port, q, nb_desc, socket, &rxconf, mp) | Allocates desc ring (DMA-coherent), fills each desc with a mempool mbuf's iova |
| rte_eth_tx_queue_setup(port, q, nb_desc, socket, &txconf) | Allocates TX desc ring; sets tx_free_thresh (batch-free completed mbufs) |
| rte_eth_dev_start(port) | Enables NIC, configures MAC filter, enables RX/TX, links up |
| rte_eth_rx_burst(port, q, mbufs, 32) | Hot path: scans DD bits, harvests up to 32 mbufs, refills ring, rings doorbell |
| rte_eth_tx_burst(port, q, mbufs, n) | Hot path: fills TX descs with mbuf iovas, rings TDT doorbell, checks tx_free_thresh |
RX Descriptor Ring — NIC Fills, PMD Drains
The RX ring is a circular array of fixed-size descriptors in DMA-accessible memory (inside huge pages). The PMD pre-fills every slot with the physical address (buf_iova) of an empty mbuf from the mempool. When a packet arrives, the NIC DMA-writes the packet bytes into the pointed-to buffer and sets DD=1. The PMD polls, harvests completed descriptors, and immediately refills each slot with a fresh mbuf before ringing the doorbell (writing the new tail index to the RDT BAR register).
TX Descriptor Ring — PMD Fills, NIC Drains
TX is symmetric. The PMD fills each descriptor with the outgoing mbuf's buf_iova, the packet length, and command flags (EOP = end of packet, RS = report status, IFCS = insert CRC). It then writes the new tail to the TDT BAR register (the TX doorbell). The NIC DMA-reads the packet bytes over PCIe and transmits. The PMD reclaims completed descriptors (DD=1) in batches of tx_free_thresh (default 32) to amortize the mempool free cost.
PMD RX Burst — Step-by-Step Code Path
For each ring slot, starting at the software head index, the PMD:
1. Reads the descriptor status word; if DD=0 the ring is drained and the burst ends.
2. Prefetches the next few descriptors and their mbufs to hide DRAM latency.
3. Copies packet length and offload metadata from the descriptor into the harvested mbuf.
4. Pulls a fresh mbuf from the per-lcore mempool cache, writes its buf_iova into the slot, and clears DD.
5. Advances the head index; after the whole burst, writes the new tail to RDT once (the doorbell).
Burst Design — Why 32 Packets?
rte_eth_rx_burst() takes the maximum burst size as its last argument, and 32 is the conventional choice. This is not arbitrary:
- Amortizes the doorbell write — one PCIe transaction to update RDT costs ~100 ns; doing it once per 32 packets costs 3 ns per packet amortized.
- Fits in a cache line prefetch window — with a prefetch-ahead distance of 3–4, 32 descriptors keep the CPU pipeline full without exceeding L1 capacity.
- Aligns with SIMD width — the vectorized PMDs check DD bits on groups of four 16-byte descriptors (one 64-byte cache line) per SSE/AVX pass; 32 is a clean multiple of that group size.
Minimal C Demo — PMD RX Descriptor Ring
Minimal C Demo — PMD TX Path
5. DPDK Source Pointers
| File / Symbol | What it contains |
|---|---|
| lib/eal/linux/eal.c — rte_eal_init() | The 10-step EAL initialization sequence; calls sub-functions below |
| lib/eal/linux/eal_hugepage_info.c — eal_hugepage_init() | mmap() hugetlbfs pages, build memseg list per NUMA socket |
| lib/eal/common/eal_common_lcore.c | Lcore thread creation, CPU affinity setup (pthread_setaffinity_np) |
| lib/eal/linux/eal_pci.c — rte_pci_scan() | Enumerate /sys/bus/pci/devices/, match PCI IDs to PMD table |
| drivers/net/ixgbe/ixgbe_rxtx.c — ixgbe_recv_pkts() | ixgbe PMD rx_burst: DD bit scan, mbuf harvest, ring refill, doorbell |
| drivers/net/ixgbe/ixgbe_rxtx.c — ixgbe_xmit_pkts() | ixgbe PMD tx_burst: fill TX descs, TDT doorbell, tx_free_thresh cleanup |
| drivers/net/mlx5/mlx5_rx.c — mlx5_rx_burst() | Mellanox ConnectX PMD rx_burst using Completion Queue (CQ) model |
| lib/ethdev/rte_ethdev.c — rte_eth_rx_burst() (inline) | Dispatch: calls port->rx_pkt_burst function pointer (per-PMD hot path) |
| drivers/bus/pci/linux/pci_uio.c | igb_uio BAR mmap: maps NIC register space into userspace process |
| drivers/bus/pci/linux/pci_vfio.c | VFIO device open, IOMMU group handling, DMA mapping via ioctl |
6. Interview Prep
| Question | Concise Answer |
|---|---|
| Why does kernel networking fail above ~1 Mpps? Name 5 bottlenecks. | 1) IRQ per packet (~2–5 µs handler + softirq). 2) sk_buff slab alloc per packet (~200 ns, cache miss). 3) copy_to_user() memcpy polluting cache. 4) recv() syscall context switch (~100–200 ns). 5) Lock contention in netfilter / routing table / socket hash under high PPS. At 10 GbE 64B frames, each packet gets only 67 ns — the kernel stack exceeds that budget. |
| Walk through rte_eal_init() — all 10 steps. | 1) Parse CLI (--lcores, --socket-mem). 2) Load PMD plugin .so files. 3) Read CPU topology from /proc/cpuinfo, build core+NUMA map. 4) mmap()+mlock() huge pages, build memseg list per NUMA socket. 5) rte_memzone_init() — named regions in huge pages. 6) Create one pthread per lcore, pin via pthread_setaffinity_np(). 7) Enumerate /sys/bus/pci/devices/, match PCI IDs to PMD table. 8) rte_pci_probe() → PMD eth_dev_init() for each matched NIC. 9) Start service cores (timer, interrupt). 10) rte_eal_mp_remote_launch() — worker functions start polling. |
| What is the DD bit and what is the doorbell in DPDK PMD? | DD (Descriptor Done): a status bit in each RX or TX descriptor that the NIC sets to 1 when it has finished with that slot (RX: packet DMA-written; TX: packet transmitted). The PMD polls DD instead of waiting for an interrupt. Doorbell: a write to a NIC BAR register (RDT for RX, TDT for TX) telling the NIC the new head/tail pointer — i.e., how many new buffers the PMD has made available. In DPDK, the doorbell is batched once per rx_burst/tx_burst call to amortize the PCIe transaction cost. |
| What is the difference between UIO and VFIO? When should you use VFIO? | UIO (igb_uio.ko): exposes /dev/uioN; DPDK mmap()s BAR directly. No IOMMU — the DMA address space is physical memory, so a bug lets the NIC DMA anywhere. Requires root. VFIO (vfio-pci.ko): groups the device by IOMMU domain; DPDK registers hugepage memory with the IOMMU via VFIO_MAP_DMA ioctl; the NIC can only DMA into those regions. Preferred for production (SR-IOV VFs, containers, secure multi-tenant environments). Use VFIO whenever the system has an IOMMU and you care about isolation. |
| Why does rte_eth_rx_burst() process packets in batches of 32? | Three reasons: 1) Amortize doorbell write — one PCIe RDT update per 32 packets costs ~3 ns/pkt vs ~100 ns/pkt for per-packet updates. 2) Prefetch pipeline — prefetching 3–4 mbufs ahead while processing the current one hides DRAM latency; 32 fits without L1 overflow. 3) SIMD alignment — vectorized PMDs check DD bits on four 16-byte descriptors (one 64-byte cache line) per SSE/AVX pass; 32 divides evenly into those groups. |