Part II — Memory

§ 2.1 – 2.2 Physical Layout & Buddy Allocator

How Linux carves physical RAM into NUMA nodes, zones, and per-frame struct page descriptors — and how the buddy allocator hands out contiguous blocks in O(log n) time.

1. Overview

When the kernel boots it uses the e820 memory map (x86) or Device Tree /memory nodes (ARM) to learn which physical address ranges are usable RAM. It then builds a three-level hierarchy: NUMA nodes → zones → page frames. Each 4 KB page frame gets one struct page descriptor stored in a contiguous array called mem_map. The buddy allocator manages free blocks within each zone using power-of-2 free lists — splitting large blocks when allocating and coalescing adjacent buddies when freeing.

2. Key Data Structures

struct page — one per physical frame

The kernel allocates one struct page for every physical page frame at boot. On a 64 GB machine that is 16 million descriptors — about 1 GB of memory just for metadata. The struct is carefully padded to exactly 64 bytes so that a page index directly maps to an array offset.

| Field | Type | Purpose |
|---|---|---|
| flags | unsigned long | Bitfield: PG_locked, PG_dirty, PG_uptodate, PG_referenced, PG_slab… |
| _refcount | atomic_t | Physical page reference count; free_pages() decrements; 0 → reclaim |
| mapping | struct address_space * | Page-cache owner; odd pointer = anon VMA; NULL = free |
| index | pgoff_t | Byte offset / PAGE_SIZE within the address_space |
| lru | struct list_head | Links the page into the zone's active or inactive LRU list |
| _mapcount | atomic_t | Number of user-space PTEs mapping this page (−1 = none) |
| private | unsigned long | Slab: freelist ptr; buffer_head: b_page; anon: swap entry |

Memory Zones & Watermarks

Each zone maintains three watermarks — min, low, and high — expressed as page counts. The watermarks control when background reclaim (kswapd) wakes up and when allocators must block for direct reclaim.

| Zone | Address Range (x86-64) | Purpose |
|---|---|---|
| ZONE_DMA | 0x0 – 0xFFFFFF (16 MB) | Legacy ISA DMA hardware that cannot address above 16 MB |
| ZONE_DMA32 | 16 MB – 4 GB | 32-bit PCI bus masters; also serves ordinary allocations on machines without 32-bit DMA devices |
| ZONE_NORMAL | 4 GB – end of RAM | Default zone; kernel virtual addresses map here 1:1 |
| ZONE_HIGHMEM | above kernel VA limit | 32-bit kernels only; requires kmap() for temporary access |
| ZONE_MOVABLE | configurable | Pages that can be migrated; enables memory hotplug |

Buddy Free Lists

Each zone holds MAX_ORDER (typically 11) free lists, indexed by order. Order k holds blocks of 2^k contiguous pages. A freshly booted machine might have one order-10 block (1024 pages = 4 MB) in ZONE_NORMAL.

| Order | Block size | Typical use |
|---|---|---|
| 0 | 4 KB (1 page) | Single page: slab objects, anonymous mappings |
| 1 | 8 KB (2 pages) | Small kernel stacks (pre-5.8) |
| 2 | 16 KB (4 pages) | Kernel stacks (CONFIG_THREAD_INFO_IN_TASK) |
| 3–5 | 32–128 KB | Large kmalloc requests |
| 9 | 2 MB | Huge page backing (before THP splits) |
| 10 | 4 MB | Maximum contiguous allocation; DMA buffers |

3. Core Mechanism — Buddy Allocator Split & Coalesce

Allocation path — alloc_pages()

Background: The kernel frequently needs 1–16 contiguous page frames for kernel stacks, DMA buffers, and slab pages. A simple bitmap scan would be O(n) in the number of frames. The buddy system maintains power-of-2 free lists so that both allocation and freeing touch at most MAX_ORDER lists — logarithmic in the size of the largest block.

Plan:

  1. Look in free_list[requested_order]. If non-empty, dequeue and return.
  2. Walk up to free_list[requested_order + 1], +2, … until a non-empty list is found.
  3. Dequeue the large block. Split it in half: lower half goes to the caller; upper half (“buddy”) is added to free_list[k-1].
  4. Repeat splitting until the block is exactly the requested order.

Example — allocating order-1 (2 pages) from a pool with only an order-2 block at 0x1000:

| Step | free_list[0] | free_list[1] | free_list[2] | Action |
|---|---|---|---|---|
| Initial | — | — | 0x1000 (4 pages) | One order-2 block |
| Check [1] | — | empty | 0x1000 | Miss → look higher |
| Take [2] | — | — | — | Dequeue 0x1000 from order-2 |
| Split | — | 0x3000 (buddy) | — | Upper half 0x3000 → free_list[1] |
| Return | — | 0x3000 | — | Lower half 0x1000 (2 pages) → caller |

Free path — free_pages() — coalescing

Background: Returning single pages to order-0 causes fragmentation over time. The buddy system coalesces adjacent same-size blocks (“buddies”) on every free, pushing the merged block up to the highest possible order.

Key insight — buddy address: For a block of size 2k pages at address addr, the buddy is at addr XOR (1 << (k + PAGE_SHIFT)). Because blocks are power-of-2 aligned, toggling that single bit flips between the block and its buddy.
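
Worked example: for the order-1 block at 0x3000 produced by the split above (k = 1, PAGE_SHIFT = 12), the buddy is 0x3000 XOR (1 << 13) = 0x3000 XOR 0x2000 = 0x1000 — the other half of the original order-2 block.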

Anti-fragmentation — migrate types: The buddy system groups pages by mobility within each free list to prevent unmovable pages from fragmenting the pool of movable pages.

| Migrate Type | Pages | Why separated |
|---|---|---|
| MIGRATE_UNMOVABLE | Kernel data, slab | Cannot be moved — if mixed with movable pages, they pin large regions forever |
| MIGRATE_MOVABLE | User-space anonymous, file pages | Can be migrated by memory compaction — freeing up contiguous ranges for huge pages |
| MIGRATE_RECLAIMABLE | Page cache, inode cache | Can be reclaimed (written back or dropped) — not moved, but frees memory under pressure |
| MIGRATE_HIGHATOMIC | Emergency reserve | Held back for GFP_ATOMIC allocations that cannot sleep |

4. Minimal C Demo

The demo below implements the core buddy split and coalesce algorithm with a 16-page pool (order-4). The same logic — walk up to find a large-enough block, split down, coalesce on free — is used by the kernel's __rmqueue() and __free_one_page().

Buddy Allocator — split & coalesce — C Demo
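A minimal user-space sketch of that logic — a simulation with an invented free_map bitmap standing in for the kernel's free_area lists, not kernel code:

```c
#include <stdio.h>
#include <stdbool.h>

#define MAX_ORDER  4                      /* highest order; pool = 2^4 = 16 pages */
#define POOL_PAGES (1 << MAX_ORDER)

/* free_map[k][pfn] == true: a free block of 2^k pages starts at pfn */
static bool free_map[MAX_ORDER + 1][POOL_PAGES];

static void mark_free(int order, int pfn) { free_map[order][pfn] = true; }

static int alloc_block(int order)
{
    /* 1. find the first non-empty list at or above the requested order */
    for (int k = order; k <= MAX_ORDER; k++)
        for (int pfn = 0; pfn < POOL_PAGES; pfn += 1 << k)
            if (free_map[k][pfn]) {
                free_map[k][pfn] = false;
                /* 2. split down, returning each upper buddy to a lower list */
                while (k > order) {
                    k--;
                    mark_free(k, pfn + (1 << k));  /* upper half → free_list[k] */
                }
                return pfn;                        /* lower half → caller */
            }
    return -1;                                     /* out of memory */
}

static void free_block(int order, int pfn)
{
    /* coalesce with the buddy as long as it is also free */
    while (order < MAX_ORDER) {
        int buddy = pfn ^ (1 << order);            /* toggle one bit to find buddy */
        if (!free_map[order][buddy])
            break;
        free_map[order][buddy] = false;            /* merge and move up one order */
        if (buddy < pfn)
            pfn = buddy;
        order++;
    }
    mark_free(order, pfn);
}

int main(void)
{
    mark_free(MAX_ORDER, 0);                       /* boot state: one 16-page block */
    int a = alloc_block(1);                        /* 2 pages → splits 16→8→4→2 */
    int b = alloc_block(0);                        /* 1 page */
    printf("order-1 block at pfn %d, order-0 page at pfn %d\n", a, b);
    free_block(0, b);
    free_block(1, a);                              /* coalesces back to order 4 */
    printf("pool restored: %s\n", free_map[MAX_ORDER][0] ? "yes" : "no");
    return 0;
}
```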

The next demo simulates the zone watermark check — the logic inside zone_watermark_ok() that decides whether an allocation can proceed or must wait for reclaim.

Zone Watermark Check — C Demo
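A sketch of the decision only — the real zone_watermark_ok() also accounts for lowmem_reserve and the cost of high-order blocks; the watermark values below are invented:

```c
#include <stdio.h>
#include <stdbool.h>

struct zone_sim {
    long free_pages, wmark_min, wmark_low, wmark_high;
};

/* true if an order-'order' allocation may proceed against watermark 'mark' */
static bool watermark_ok(struct zone_sim *z, int order, long mark)
{
    /* after taking 2^order pages, free pages must still exceed the mark */
    return z->free_pages - (1L << order) >= mark;
}

static void classify(struct zone_sim *z, int order)
{
    if (watermark_ok(z, order, z->wmark_low))
        printf("%6ld free: fast path, no reclaim\n", z->free_pages);
    else if (watermark_ok(z, order, z->wmark_min))
        printf("%6ld free: allocate, but wake kswapd (below low)\n", z->free_pages);
    else
        printf("%6ld free: direct reclaim or GFP_ATOMIC reserves (below min)\n",
               z->free_pages);
}

int main(void)
{
    struct zone_sim z = { .wmark_min = 1000, .wmark_low = 1250, .wmark_high = 1500 };
    long samples[] = { 10000, 1300, 1100, 900 };
    for (int i = 0; i < 4; i++) {
        z.free_pages = samples[i];
        classify(&z, 0);
    }
    return 0;
}
```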

5. Kernel Source Pointers

| File / Function | What it does |
|---|---|
| include/linux/mm_types.h :: struct page | Definition of the 64-byte page descriptor; union covers slab, compound, pagecache uses |
| include/linux/mmzone.h :: struct zone | Zone definition: free_area[] (buddy lists), watermarks, spanned/present pages |
| mm/page_alloc.c :: __alloc_pages_nodemask() | Main allocator entry; selects the zone list, calls get_page_from_freelist() |
| mm/page_alloc.c :: __rmqueue() | Dequeues a block from the buddy free list; calls expand() to split |
| mm/page_alloc.c :: __free_one_page() | Coalesces buddies on free; walks up to MAX_ORDER |
| mm/page_alloc.c :: zone_watermark_ok() | Checks free pages against watermarks before allocating |
| mm/compaction.c :: compact_zone() | Memory compaction: migrates MOVABLE pages to coalesce higher-order free blocks |
| mm/oom_kill.c :: out_of_memory() | Called when reclaim cannot bring zones above the min watermark; selects a victim process |

6. Interview Prep

| # | Question | Concise Answer |
|---|---|---|
| Q1 | Explain the buddy allocator split and coalesce algorithm. | Allocation: walk free_list[] from the requested order upward until a non-empty list is found, dequeue the block, split it in half repeatedly, return the lower half and add each upper half (buddy) to the next-lower free list. Free: compute the buddy address as addr XOR block_size; if the buddy is also free, remove it and merge into a double-size block; repeat until no buddy is free or the max order is reached. |
| Q2 | Why are memory zones separated (DMA, DMA32, NORMAL)? | Legacy ISA DMA controllers can only address 24-bit (16 MB) physical addresses; 32-bit PCI masters are limited to 4 GB. The kernel must satisfy DMA allocations from the correct zone; normal kernel allocations prefer ZONE_NORMAL. Without the separation, a DMA request might get a page above 16 MB that the device can never reach. |
| Q3 | What are zone watermarks and what happens at each level? | Three thresholds — high, low, min. Above low: allocate freely on the fast path. Below low: kswapd background reclaim wakes up; allocation still succeeds until free pages hit min. Below min: the calling thread enters direct reclaim (may sleep); GFP_ATOMIC callers draw on a reserved pool instead. If reclaim cannot recover, the OOM killer fires. kswapd keeps reclaiming until the zone is back above high. |
| Q4 | What is MIGRATE_MOVABLE and why does it help huge pages? | Pages classified MIGRATE_MOVABLE (user-space anonymous and file-backed) can be physically relocated by the compaction thread. By segregating them from UNMOVABLE kernel pages, the allocator can compact a region entirely of movable pages into one end, leaving a contiguous order-9/10 block free for huge page allocation. Without this, one pinned kernel page ruins a 2 MB range. |
| Q5 | How does struct page stay at 64 bytes despite holding so many fields? | Most fields are mutually exclusive: a page is either a slab object, a file-backed page, an anonymous page, or a compound page tail — never all at once. The kernel uses a union over the same 64 bytes; the page flags (PG_slab, etc.) tell code which union member is active. The lru list_head is reused as a freelist pointer inside slab. |
Part II — Memory

§ 2.3 – 2.4 Slab/Slub Allocator & Virtual Memory

How the kernel caches fixed-size objects with lock-free per-CPU freelists, and how 4-level page tables map virtual addresses to physical frames — with TLB tricks and huge pages to minimize translation overhead.

7. Overview — Slab/Slub + Paging

The buddy allocator gives out whole pages (4 KB minimum). But most kernel allocations are tiny: a struct inode is ~600 B, a struct sk_buff is ~232 B. Rounding each up to 4 KB wastes 80–99 % of every allocation. The slab allocator (modern Linux uses SLUB by default) carves each slab page into equal-size objects and keeps a per-CPU freelist — allocation is a single pointer pop, lock-free on the hot path.

Virtual memory adds translation: every load/store goes through a 4-level page table (PGD → PUD → PMD → PTE). A TLB caches recent VA→PA mappings; a miss triggers a hardware walk costing 100–300 cycles. Huge pages (2 MB / 1 GB) cover more address space per TLB entry — the core reason DPDK pins 1 GB pages at startup.

8. Key Data Structures — Slab

kmem_cache — one descriptor per object type

Every slab cache has one kmem_cache descriptor. Linux creates hundreds at boot — inode_cache, skbuff_head_cache, mm_struct, etc. List them with cat /proc/slabinfo.

| Field | Type | Purpose |
|---|---|---|
| name | char[] | Human-readable name shown in /proc/slabinfo and crash dumps |
| object_size | unsigned int | Actual bytes needed per object before alignment |
| size | unsigned int | Aligned size (may include SLAB_RED_ZONE padding in debug builds) |
| cpu_slab | struct kmem_cache_cpu * | Per-CPU state: pointer to the active slab page + embedded freelist head |
| node[NUMA_NODES] | struct kmem_cache_node * | Per-NUMA-node lists of partial slabs (some free, some in-use objects) |
| min_partial | unsigned long | Minimum partial slabs to keep per node before returning pages to buddy |
| ctor | void (*)(void *) | Optional constructor called once per object when a fresh slab is allocated |
| flags | slab_flags_t | SLAB_HWCACHE_ALIGN, SLAB_POISON (debug write 0x6b), SLAB_RECLAIM_ACCOUNT |

SLAB vs SLUB vs SLOB

| Allocator | Default kernel | Key trade-off |
|---|---|---|
| SLAB | Default before 2.6.23; removed in 6.8 | Per-CPU caches with cache-line coloring. Higher memory overhead per cache type. |
| SLUB | Linux ≥ 2.6.23 (today's default) | No coloring. Embeds the freelist pointer inside free objects. Lower overhead, better NUMA locality. Preferred for production. |
| SLOB | Tiny/embedded kernels; removed in 6.4 | Simple first-fit. Minimal metadata but poor cache behavior — not suitable for servers. |

9. Core Mechanism — Slab Allocation Fast & Slow Path

Background: The kernel allocates network buffers, dentries, and inodes millions of times per second. Taking a spinlock on every allocation serializes CPUs. SLUB avoids this with a per-CPU embedded freelist: the first sizeof(void *) bytes of every free object store a pointer to the next free object — the freelist has zero memory overhead, and fast-path alloc and free are each a single pointer update.

Plan:

  1. Fast path — check cpu_slab->freelist. If non-NULL: read the embedded next pointer, advance freelist head, return object. No lock, no atomic.
  2. Slow path — partial slab — lock node->list_lock, grab a partial slab from the NUMA-node list, set it as the CPU's active slab, unlock, retry fast path.
  3. Slow path — new slab — if no partial slab exists, call alloc_pages() to get a fresh page from the buddy allocator, thread the freelist through all objects in that page, and install it as the CPU's active slab.

Example — allocating 10 sk_buff objects from a fresh cache:

| Event | CPU freelist | Node partial list | Action |
|---|---|---|---|
| Cache created | NULL | empty | kmem_cache_create() — no pages yet, lazy init |
| 1st alloc | NULL → slow path | empty → must allocate | alloc_pages() → init 64 objects in page, install as CPU active slab |
| 2nd–64th alloc | obj[1]→obj[2]→… | 1 partial slab | Fast path each time: pop freelist head, zero-init, return |
| 65th alloc | NULL (slab exhausted) | slab now full | Slow path: alloc another buddy page, init new freelist |
| free(obj[10]) | obj[10] → old head | unchanged | Fast path: push to freelist head, no lock needed |

10. Virtual Memory & Paging

4-Level Page Table Walk (x86-64)

x86-64 uses a 4-level page table to map 48-bit virtual addresses to physical frames. The CPU register CR3 holds the physical address of the top-level Page Global Directory (PGD). Each level is a 4 KB page containing 512 8-byte entries — indexed by 9 bits of the VA. On every TLB miss the hardware walker reads all four levels, fills the TLB, and retries the instruction.

| Level | VA bits | Type | Points to |
|---|---|---|---|
| PGD — Page Global Directory | [47:39] | pgd_t | Physical base of the PUD table (512 × 8 B = 4 KB) |
| PUD — Page Upper Directory | [38:30] | pud_t | Physical base of the PMD table OR a 1 GB huge page (PS=1) |
| PMD — Page Middle Directory | [29:21] | pmd_t | Physical base of the PTE table OR a 2 MB huge page (PS=1) |
| PTE — Page Table Entry | [20:12] | pte_t | Physical frame number [51:12] + flags [11:0] |
| Page offset | [11:0] | — | Byte offset within the 4 KB physical page |

TLB — Translation Lookaside Buffer

Walking 4 page-table levels requires 4 memory accesses — up to 4 × 100 ns = 400 ns on a cold cache. The TLB is a small, highly associative hardware cache (64–1024 entries per core) storing recent VA→PA translations. A hit costs 1–4 cycles; a miss triggers the full hardware walk. flush_tlb_mm() invalidates all TLB entries for a process's address space (e.g. after munmap()); on SMP systems a TLB shootdown IPI is sent to remote CPUs.

Huge Pages & TLB Pressure

| Page size | TLB entries to cover 1 GB | Use case |
|---|---|---|
| 4 KB (normal) | 262,144 entries | General use; fine-grained reclaim; virtual desktop workloads |
| 2 MB (HugeTLB / THP) | 512 entries | Databases, JVMs, DPDK packet pools — 512× fewer TLB misses for the same range |
| 1 GB (1G pages) | 1 entry | DPDK 1G mempool; GPU framebuffers; any static long-lived DMA mapping |

Transparent Huge Pages (THP) — the khugepaged kernel thread promotes runs of 512 contiguous 4 KB anonymous pages to a single 2 MB page automatically; no application change required. HugeTLBfs — explicit huge pages reserved at boot via hugepages=N or at runtime via /proc/sys/vm/nr_hugepages. DPDK's rte_eal_init() uses mmap() with MAP_HUGETLB (and MAP_HUGE_1GB) to pin 1 GB pages, so its entire mempool is covered by a handful of TLB entries for the lifetime of the application.
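
As a minimal illustration of explicit huge pages — a hypothetical snippet that fails unless huge pages were reserved first (e.g. via /proc/sys/vm/nr_hugepages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;                 /* one 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");              /* no reserved huge pages? */
        return 1;
    }
    ((char *)p)[0] = 1;                           /* fault in the huge page */
    printf("2 MB huge page mapped at %p\n", p);
    munmap(p, len);
    return 0;
}
```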

11. Minimal C Demo

The first demo implements the SLUB embedded-freelist pattern. Every free object stores its "next free" pointer in its own first bytes — freelist overhead is zero, and alloc/free are each a single pointer update.

SLUB Embedded Freelist — slab_alloc / slab_free — C Demo
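A user-space sketch of the embedded-freelist pattern — the cache_sim type and sizes are invented; real SLUB adds per-CPU slabs, partial lists, and locking:

```c
#include <stdio.h>
#include <stdlib.h>

#define OBJ_SIZE      64              /* must be >= sizeof(void *) */
#define OBJS_PER_SLAB 8

struct cache_sim {
    void *freelist;                   /* head of the embedded freelist */
    char *slab;                       /* the backing "slab page" */
};

static void cache_init(struct cache_sim *c)
{
    c->slab = malloc(OBJ_SIZE * OBJS_PER_SLAB);
    c->freelist = c->slab;
    /* thread the freelist through the free objects themselves */
    for (int i = 0; i < OBJS_PER_SLAB - 1; i++)
        *(void **)(c->slab + i * OBJ_SIZE) = c->slab + (i + 1) * OBJ_SIZE;
    *(void **)(c->slab + (OBJS_PER_SLAB - 1) * OBJ_SIZE) = NULL;
}

static void *cache_alloc(struct cache_sim *c)
{
    void *obj = c->freelist;
    if (obj)
        c->freelist = *(void **)obj;  /* fast path: one pointer read */
    return obj;                       /* NULL → slow path (new slab) in real SLUB */
}

static void cache_free(struct cache_sim *c, void *obj)
{
    *(void **)obj = c->freelist;      /* fast path: one pointer write */
    c->freelist = obj;
}

int main(void)
{
    struct cache_sim c;
    cache_init(&c);
    void *a = cache_alloc(&c), *b = cache_alloc(&c);
    printf("alloc: %p %p\n", a, b);
    cache_free(&c, a);
    printf("freed head reused: %s\n", cache_alloc(&c) == a ? "yes" : "no");
    free(c.slab);
    return 0;
}
```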

The second demo extracts each page-table index from a 48-bit virtual address — the same decomposition the hardware page walker performs on every TLB miss.

4-Level Page Table Walk — VA → PGD/PUD/PMD/PTE indices — C Demo
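A sketch of the index extraction — the virtual address is invented; the shifts and masks follow the table in §10:

```c
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t va = 0x00007f8a12345678ULL;   /* hypothetical user address */
    unsigned pgd = (va >> 39) & 0x1ff;     /* bits [47:39] */
    unsigned pud = (va >> 30) & 0x1ff;     /* bits [38:30] */
    unsigned pmd = (va >> 21) & 0x1ff;     /* bits [29:21] */
    unsigned pte = (va >> 12) & 0x1ff;     /* bits [20:12] */
    unsigned off = va & 0xfff;             /* bits [11:0]  */
    printf("VA 0x%" PRIx64 " -> PGD[%u] PUD[%u] PMD[%u] PTE[%u] + offset 0x%x\n",
           va, pgd, pud, pmd, pte, off);
    return 0;
}
```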

12. Kernel Source Pointers

| File / Function | What it does |
|---|---|
| mm/slub.c :: kmem_cache_alloc() | SLUB fast path: loads cpu_slab->freelist, pops the head — inlined into a few instructions on the hot path |
| mm/slub.c :: __slab_alloc() | SLUB slow path: acquires the node lock, refills the CPU freelist from a partial slab or allocates a new slab page |
| mm/slub.c :: kmem_cache_free() | SLUB free: pushes the object back to the CPU freelist; if the slab becomes empty, may return pages to buddy |
| mm/slab_common.c :: kmem_cache_create() | Allocates and registers a new kmem_cache; sets object_size, alignment, NUMA node arrays |
| include/linux/slab.h :: kmalloc() | General-purpose allocator: routes to per-size SLUB caches for ≤ 8 KB, buddy for larger |
| arch/x86/mm/fault.c :: do_page_fault() | x86 #PF handler: checks VMA validity, triggers do_anonymous_page(), do_cow_fault(), or SIGSEGV |
| mm/memory.c :: handle_mm_fault() | Central fault dispatcher: delegates to huge-page or PTE-level fault handlers |
| arch/x86/include/asm/tlbflush.h :: flush_tlb_mm() | Invalidates all TLB entries for an mm_struct; sends IPIs to remote CPUs (TLB shootdown) |
| mm/hugetlb.c :: alloc_huge_page() | Allocates from the pre-reserved HugeTLB pool; used by mmap(MAP_HUGETLB) and DPDK |
| mm/khugepaged.c :: khugepaged_scan_mm_slot() | Scans for 512-page-aligned anonymous regions to promote to THP |

13. Interview Prep

| # | Question | Concise Answer |
|---|---|---|
| Q6 | What is the difference between SLAB and SLUB? | SLAB maintains per-CPU caches as separate arrays plus cache coloring to reduce false sharing. SLUB is simpler: it embeds the freelist pointer inside free objects (zero metadata overhead), uses per-CPU active slab pointers, and maintains per-NUMA-node partial lists. SLUB has lower memory usage and less code complexity; SLAB's coloring can slightly improve cache utilization on workloads with many small caches. |
| Q7 | Why is the SLUB fast path lock-free? | Each CPU has its own cpu_slab pointer and freelist head. As long as the active slab has free objects, alloc/free only update the freelist head — a per-CPU variable no other CPU touches. Lock contention only appears on the slow path, when the CPU must borrow a partial slab from the NUMA node list (guarded by node->list_lock). |
| Q8 | Walk me through a 4-level page table translation on x86-64. | The CPU extracts four 9-bit indices from VA bits [47:39], [38:30], [29:21], [20:12], plus a 12-bit offset. Hardware reads CR3 for the PGD base, dereferences PGD[idx] for the PUD base, PUD[idx] for the PMD base, PMD[idx] for the PTE base, and PTE[idx] for the physical frame number. Physical address = (frame << 12) + offset. Each level can short-circuit if the PS (huge page) bit is set — a PMD with PS=1 gives a 2 MB mapping, a PUD with PS=1 gives 1 GB. |
| Q9 | What is a TLB shootdown and when does it happen? | When a CPU modifies a page table entry (e.g. unmaps a page with munmap), other CPUs may hold stale TLB entries for that VA. The kernel sends an inter-processor interrupt (IPI) to every CPU that might hold the stale entry, instructing it to execute INVLPG or flush its TLB. This is a TLB shootdown. It is expensive — O(CPU count) IPI latency — which is one more reason THP and hugetlb help: fewer, larger entries mean fewer shootdowns. |
| Q10 | Why do huge pages reduce TLB pressure? How does DPDK use them? | A 4 KB TLB entry covers 4 KB; a 2 MB entry covers 512× more address space with a single TLB slot. A packet-processing loop that touches a 512 MB mbuf pool would need 131,072 TLB entries at 4 KB but just 256 at 2 MB. DPDK maps its mempool with MAP_HUGETLB and MAP_HUGE_1GB to pin 1 GB pages, so the entire pool fits in a handful of L2 TLB entries and TLB misses all but disappear from the packet-forwarding hot loop. |
Part II — Memory

§ 2.5 – 2.7 Process Address Space, Page Cache & DMA

How each process sees its own private virtual address space via mm_struct and VMAs, how the kernel caches file data in a two-list LRU page cache, and how devices perform DMA through the IOMMU.

14. Overview

Every Linux process has its own virtual address space described by a struct mm_struct. Within that space, each contiguous region with the same permissions is one struct vm_area_struct (VMA). The kernel builds the VMA set lazily — pages are physically allocated only on first access (demand paging). Copy-on-Write makes fork() nearly free: child and parent share physical pages until one writes.

File I/O is served from the page cache — a per-inode struct address_space backed by an XArray of cached pages. A two-list LRU (active / inactive) decides which pages to keep hot and which to reclaim. Dirty pages are written back by per-device worker threads on a configurable schedule.

DMA lets hardware write directly to physical RAM without CPU involvement. The IOMMU interposes on device bus addresses — protecting physical memory from buggy or malicious devices and enabling DPDK's kernel bypass via VFIO.

15. Key Data Structures

struct mm_struct — per-process address space

All threads of a process share one mm_struct. It holds the page-table root (loaded into CR3 on context switch), the ordered set of VMAs, and accounting counters. mmap_lock (an rwsem) serialises structural changes — taking it as a writer during mmap() / munmap() and as a reader during page fault handling.

struct vm_area_struct — one VMA per region

A call to mmap() creates one VMA. Historically the kernel kept VMAs in both a sorted linked list (for iteration) and a red-black tree (for O(log n) lookup by address — used by the page fault handler to find the VMA owning the faulting address); kernels since 6.1 replace both with a maple tree.

| Mapping type | vm_file | vm_flags | Page fault action |
|---|---|---|---|
| Anonymous (heap/stack) | NULL | VM_READ, VM_WRITE | alloc_page() + zero-fill (demand zero) |
| File-backed private | struct file * | VM_READ, VM_WRITE, VM_MAYWRITE | read from page cache; COW on write |
| File-backed shared | struct file * | VM_READ, VM_WRITE, VM_SHARED | read/write through to page cache |
| Executable (.text) | struct file * | VM_READ, VM_EXEC | read page from ELF file on first access |

struct address_space — page-cache anchor

Every file inode has one address_space embedded. The XArray i_pages maps page offsets (0, 1, 2…) to struct page *. A page fault on a file-backed VMA first looks here — cache hit costs a single XArray lookup; cache miss triggers a_ops->readpage() to load from disk.

16. Core Mechanism — Copy-on-Write (COW)

Background: fork() must give the child a full copy of the parent's address space. Physically copying all mapped pages upfront would take hundreds of milliseconds for a process with gigabytes of mappings. COW defers the copy to the moment of first write — and many child processes never write most pages (e.g., exec() immediately replaces the image).

Plan:

  1. fork() calls copy_page_range() — walks the parent PTEs and copies entries into the child page table, but marks all writable PTEs read-only in both parent and child. Increments physical page _refcount for each shared page.
  2. Either process writes → CPU sees PTE.W=0 → raises #PF (protection fault).
  3. do_cow_fault(): allocate a fresh page, copy 4 KB, update the faulting PTE to point to the new page with PTE.W=1, decrement the shared page's refcount.
  4. When a process writes to a page whose refcount has already dropped back to 1 (it is the sole remaining owner), the fault handler skips the copy and simply re-arms the PTE writable — the reuse path.

Example — fork then child writes one page:

| Step | Parent PTE | Child PTE | Phys page refcnt |
|---|---|---|---|
| Before fork | W=1 → 0xA000 | — | 1 |
| After fork() | W=0 → 0xA000 | W=0 → 0xA000 | 2 |
| Child writes VA | W=0 → 0xA000 | #PF fired | 2 |
| COW fault: alloc 0xB000 | W=0 → 0xA000 | W=1 → 0xB000 | 1 (parent) |
| Parent writes VA | #PF fired (refcnt=1) | W=1 → 0xB000 | 1 |
| Re-arm parent PTE | W=1 → 0xA000 | W=1 → 0xB000 | 1 each |

17. Core Mechanism — Page Cache & LRU Reclaim

Background: Modern servers have tens of gigabytes of RAM and workloads that touch far more data than fits. The kernel must decide which cached file pages to keep (hot) and which to discard (cold) under memory pressure. A simple FIFO evicts pages that were loaded recently but haven't been reused — killing streaming workloads. A pure LRU has a thrashing problem: a single sequential scan of a large file evicts all hot pages.

Plan — two-list strategy:

  1. New pages start on the inactive list (cold). They are protected from immediate eviction but not from reclaim under pressure.
  2. If a page is accessed a second time while still on the inactive list, it is promoted to the active list (hot). Only truly reused pages earn active status.
  3. Over time, active pages that are not accessed are demoted back to the inactive list tail by kswapd's shrink_active_list().
  4. Reclaim always takes from the inactive list tail — oldest cold pages first. If dirty, they are written back; if clean, dropped immediately.

Dirty Page Writeback Lifecycle

write() modifies the page cache, marks pages PG_dirty, and returns — it does not wait for the disk. A per-device kernel writeback worker flushes dirty pages after dirty_expire_centisecs (default 3000, i.e. 30 s) or when the dirty ratio exceeds dirty_background_ratio (default 10 %). If dirty pages reach dirty_ratio (default 20 %), the writing task itself is throttled inside balance_dirty_pages().

| API | Behaviour | When to use |
|---|---|---|
| write() | Writes to page cache; returns immediately (buffered) | Normal file writes; OS handles flushing |
| fsync(fd) | Blocks until all dirty pages for fd are on disk | After committing a database transaction |
| sync_file_range() | Flushes a specific byte range without metadata | Large sequential writes (avoid stalling on metadata) |
| O_DIRECT | Bypasses page cache entirely; DMA directly to/from the user buffer | Databases with their own cache (avoids double-buffering) |
| O_SYNC / O_DSYNC | Each write() blocks until data reaches disk | Journals, WAL files — highest durability, highest overhead |
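
A minimal sketch of the durability pattern from the table — a buffered write followed by fsync(); the file name wal.log is invented for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "commit 42\n";
    if (write(fd, rec, strlen(rec)) < 0)  /* dirties page cache, returns at once */
        perror("write");
    if (fsync(fd) < 0)                    /* durability point: data + metadata on disk */
        perror("fsync");
    close(fd);
    return 0;
}
```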

18. Memory-Mapped I/O & DMA

Three Address Spaces

Hardware operates in three distinct address spaces. Confusing them is one of the most common bugs in driver code.

| Address space | Who uses it | Kernel API |
|---|---|---|
| Virtual address (VA) | CPU, kernel code, user-space pointers | Ordinary C pointers; kmalloc() returns a VA |
| Physical address (PA) | RAM chips; no software entity uses it directly | __pa(va) / __va(pa) on x86; page_to_phys() |
| Bus address / IOVA | PCIe devices — what they DMA to/from | dma_map_single() → dma_addr_t; programmed into the device BAR/ring |

DMA API

| Function | What it does | Use case |
|---|---|---|
| dma_alloc_coherent(dev, size, &dma_addr, GFP_KERNEL) | Allocates memory plus a coherent DMA mapping; CPU and device see each other's writes without explicit cache maintenance | Descriptor rings, control structures accessed by both CPU and device |
| dma_map_single(dev, va, size, dir) | Maps an existing kernel VA for one DMA transfer; returns the bus address | Single-buffer DMA from a kmalloc'd buffer |
| dma_unmap_single(dev, dma_addr, size, dir) | Tears down the IOMMU mapping; allows the CPU to read the result | After the device signals DMA completion (via IRQ/doorbell) |
| dma_map_sg(dev, sg, count, dir) | Maps a scatter-gather list — non-contiguous physical pages as one DMA transfer | Network sk_buff, storage bi_io_vec chains |
| ioremap(phys_addr, size) | Maps device MMIO registers into kernel VA (non-cacheable) | Accessing PCIe BAR registers from a driver's probe() |

IOMMU — DMA Remapping

Without an IOMMU, a buggy or malicious PCIe device can DMA to any physical address — including kernel code. The IOMMU (Intel VT-d, AMD-Vi) interposes on every device DMA access: it maintains a per-device page table mapping bus addresses to physical pages, and faults on unmapped accesses. DPDK's VFIO mode uses the IOMMU to give user-space drivers safe, direct DMA without a kernel driver in the forwarding path.

19. Minimal C Demo

The first demo simulates Copy-on-Write: parent and child share a physical page after fork; the child's write triggers a copy. The same logic lives in do_cow_fault() in mm/memory.c.

Copy-on-Write — fork + write simulation — C Demo
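A user-space sketch of that flow — simulated PTEs and refcounts, not real page tables; the reuse step mirrors the fault-handler behavior described above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct phys_page { char data[64]; int refcount; };
struct pte       { struct phys_page *page; int writable; };

static void cow_write(struct pte *p, const char *msg)
{
    if (!p->writable) {                       /* simulated #PF: PTE.W == 0 */
        if (p->page->refcount > 1) {          /* still shared: copy the page */
            struct phys_page *n = malloc(sizeof *n);
            memcpy(n->data, p->page->data, sizeof n->data);
            n->refcount = 1;
            p->page->refcount--;
            p->page = n;
        }                                     /* refcount == 1: reuse in place */
        p->writable = 1;                      /* re-arm PTE writable */
    }
    strcpy(p->page->data, msg);
}

int main(void)
{
    struct phys_page *shared = malloc(sizeof *shared);
    strcpy(shared->data, "original");
    shared->refcount = 2;                     /* fork(): parent + child share it */
    struct pte parent = { shared, 0 }, child = { shared, 0 };

    cow_write(&child, "child copy");          /* fault → copy page */
    printf("parent sees '%s' (refcnt %d), child sees '%s'\n",
           parent.page->data, parent.page->refcount, child.page->data);
    cow_write(&parent, "parent reuse");       /* fault → refcnt 1 → reuse path */
    printf("parent page %s child page\n",
           parent.page == child.page ? "==" : "!=");
    return 0;
}
```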

The second demo simulates the two-list LRU: pages start in the inactive list and are promoted to active on a second access. Memory pressure evicts from the inactive tail — the same policy used by shrink_inactive_list().

Two-List LRU — active / inactive page reclaim — C Demo
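A sketch of the promotion/eviction policy — arrays stand in for the kernel's list_heads, and a counter stands in for list position:

```c
#include <stdio.h>

#define NPAGES 8
enum state { UNUSED, INACTIVE, ACTIVE };
static enum state st[NPAGES];
static int age[NPAGES];                /* larger = listed more recently */
static int tick;

static void touch(int pfn)
{
    if (st[pfn] == UNUSED)        st[pfn] = INACTIVE;  /* first access: cold */
    else if (st[pfn] == INACTIVE) st[pfn] = ACTIVE;    /* reuse: promote */
    age[pfn] = ++tick;
}

static int reclaim(void)               /* evict the oldest inactive page */
{
    int victim = -1;
    for (int i = 0; i < NPAGES; i++)
        if (st[i] == INACTIVE && (victim < 0 || age[i] < age[victim]))
            victim = i;
    if (victim >= 0)
        st[victim] = UNUSED;
    return victim;
}

int main(void)
{
    touch(0); touch(0);                /* page 0 is reused → active list */
    for (int i = 1; i < 6; i++)        /* sequential scan touches 1..5 once */
        touch(i);
    printf("evicted pfn %d (a scan page, not the hot page)\n", reclaim());
    printf("page 0 state: %s\n", st[0] == ACTIVE ? "active" : "not active");
    return 0;
}
```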

The third demo shows how dma_map_single() converts a kernel virtual address to a bus address. On x86 without an IOMMU the bus address equals the physical address; with an IOMMU the device sees a remapped address.

DMA address translation — VA → PA → bus address — C Demo
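A sketch of the address chain — the direct-map base matches x86-64 4-level paging without KASLR, and the IOMMU remapping is an invented toy, not a real translation:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET 0xffff888000000000ULL   /* x86-64 direct-map base (no KASLR) */

static uint64_t virt_to_phys(uint64_t va) { return va - PAGE_OFFSET; }

/* with an IOMMU, the device sees a remapped IOVA, not the raw PA */
static uint64_t map_iova(uint64_t pa, bool iommu)
{
    return iommu ? (pa & 0xfffULL) | 0x100000000ULL : pa;  /* toy remap */
}

int main(void)
{
    uint64_t va = PAGE_OFFSET + 0x1234000;  /* pretend kmalloc() result */
    uint64_t pa = virt_to_phys(va);
    printf("VA  %#llx\n", (unsigned long long)va);
    printf("PA  %#llx\n", (unsigned long long)pa);
    printf("bus %#llx (no IOMMU: bus == PA)\n",
           (unsigned long long)map_iova(pa, false));
    printf("bus %#llx (IOMMU remap)\n",
           (unsigned long long)map_iova(pa, true));
    return 0;
}
```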

20. Kernel Source Pointers

| File / Function | What it does |
|---|---|
| include/linux/mm_types.h :: struct mm_struct | Process address space: PGD root, VMA list/tree, mmap_lock, accounting |
| include/linux/mm_types.h :: struct vm_area_struct | Single VMA: VA range, flags, backing file, vm_ops fault handler |
| mm/mmap.c :: do_mmap() | Core mmap() implementation: allocates and inserts a new VMA, merges adjacent compatible VMAs |
| mm/memory.c :: do_cow_fault() | COW fault handler: allocates a page, copy_user_highpage(), updates the PTE to the new private page |
| mm/memory.c :: handle_mm_fault() | Central page fault dispatcher: routes to do_anonymous_page, do_fault, do_swap_page |
| mm/filemap.c :: filemap_fault() | File-backed page fault: looks up the page in the address_space XArray, triggers readpage() on a miss |
| mm/vmscan.c :: shrink_inactive_list() | LRU reclaim: scans the inactive list, writes back dirty pages, frees clean pages to buddy |
| mm/vmscan.c :: shrink_active_list() | Demotes cold active pages to the inactive list; balances the active/inactive ratio |
| fs/fs-writeback.c :: wb_writeback() | Per-device writeback: flushes dirty pages for one bdi_writeback; called from the writeback workqueue |
| include/linux/dma-mapping.h :: dma_map_single() | Maps a kernel VA for DMA; installs an IOMMU mapping; returns the bus address for the device |
| drivers/iommu/intel/iommu.c :: intel_iommu_map() | Intel VT-d: installs a (device, bus_addr → PA) mapping into the device's IOMMU page table |
| kernel/dma/direct.c :: dma_direct_map_page() | No-IOMMU path: bus_addr = phys_addr; may flush caches on non-coherent architectures (ARM) |

21. Interview Prep

| # | Question | Concise Answer |
|---|---|---|
| Q11 | What is Copy-on-Write and when does it trigger? | After fork(), parent and child share physical pages. Both PTEs are marked read-only. The first write to a shared page raises a protection fault. The kernel allocates a new physical page, copies the content, updates the faulter's PTE to the new page with write permission, and decrements the old page's refcount. Once the refcount is 1, the remaining owner's next write fault takes the reuse path: the PTE is re-armed writable with no copy. |
| Q12 | How does the Linux two-list LRU prevent thrashing from sequential scans? | New pages go to the inactive list. Promotion to the active list requires a second access — a sequential scan that touches each page once never promotes any page, so it doesn't evict hot working-set pages. Only genuinely reused pages reach the active list. Under memory pressure, reclaim takes only from the inactive tail. |
| Q13 | What is the difference between O_DIRECT and buffered I/O? | Buffered I/O goes through the page cache: write() copies to a cached page and returns; the kernel flushes asynchronously. O_DIRECT bypasses the page cache — DMA transfers go directly from/to an aligned user-space buffer. O_DIRECT avoids double-buffering (useful when the application has its own cache, e.g. a database) but requires aligned buffers and loses the read-ahead benefit. fsync() flushes buffered dirty pages; with O_DIRECT the data is already on disk when write() returns. |
| Q14 | What is the difference between a physical address, virtual address, and bus address? | Virtual addresses (VA) are what user-space and kernel code use — translated to physical addresses (PA) by the MMU via page tables. Physical addresses are what DRAM responds to. Bus addresses (IOVA) are what PCIe devices use when they initiate DMA — translated to PA by the IOMMU. Without an IOMMU, bus address == PA on x86. dma_map_single() returns the bus address after installing an IOMMU mapping. |
| Q15 | What is the IOMMU and why does DPDK's VFIO mode require it? | The IOMMU (Intel VT-d / AMD-Vi) translates device bus addresses to physical addresses, enforcing per-device access control. Without it, any PCIe device can DMA to arbitrary physical memory — a security and safety risk. DPDK's VFIO mode moves the NIC driver to user space. VFIO uses the IOMMU to give user space safe, direct DMA: the NIC can only access pre-registered memory regions (its huge-page mempools). Without the IOMMU, VFIO would allow the user-space process to compromise the kernel. |