Part II — Memory

§ 2.1 – 2.2 Physical Layout & Buddy Allocator

How Linux carves physical RAM into NUMA nodes, zones, and per-frame struct page descriptors — and how the buddy allocator hands out contiguous blocks in O(log n) time.

1. Overview

When the kernel boots it uses the e820 memory map (x86) or Device Tree /memory nodes (ARM) to learn which physical address ranges are usable RAM. It then builds a three-level hierarchy: NUMA nodes → zones → page frames. Each 4 KB page frame gets one struct page descriptor stored in a contiguous array called mem_map. The buddy allocator manages free blocks within each zone using power-of-2 free lists — splitting large blocks when allocating and coalescing adjacent buddies when freeing.

2. Key Data Structures

struct page — one per physical frame

The kernel allocates one struct page for every physical page frame at boot. On a 64 GB machine that is 16 million descriptors — about 1 GB of memory just for metadata. The struct is carefully padded to exactly 64 bytes so that a page index directly maps to an array offset.

| Field | Type | Purpose |
|---|---|---|
| flags | unsigned long | Bitfield: PG_locked, PG_dirty, PG_uptodate, PG_referenced, PG_slab… |
| _refcount | atomic_t | Physical page reference count; free_pages() decrements; 0 → reclaim |
| mapping | struct address_space * | Page-cache owner; odd pointer = anon VMA; NULL = free |
| index | pgoff_t | Byte offset / PAGE_SIZE within the address_space |
| lru | struct list_head | Links the page into the zone's active or inactive LRU list |
| _mapcount | atomic_t | Number of user-space PTEs mapping this page (−1 = none) |
| private | unsigned long | Slab: freelist ptr; buffer_head: b_page; anon: swap entry |

Memory Zones & Watermarks

Each zone maintains three watermarks — min, low, and high — expressed as page counts. The watermarks control when background reclaim (kswapd) wakes up and when allocators must block for direct reclaim.

| Zone | Address Range (x86-64) | Purpose |
|---|---|---|
| ZONE_DMA | 0x0 – 0xFFFFFF (16 MB) | Legacy ISA DMA hardware that cannot address above 16 MB |
| ZONE_DMA32 | 16 MB – 4 GB | 32-bit PCI bus masters; also serves ordinary allocations on machines without 32-bit DMA devices |
| ZONE_NORMAL | 4 GB – end of RAM | Default zone; kernel virtual addresses map here 1:1 |
| ZONE_HIGHMEM | above kernel VA limit | 32-bit kernels only; requires kmap() for temporary access |
| ZONE_MOVABLE | configurable | Pages that can be migrated; enables memory hotplug |

Buddy Free Lists

Each zone holds MAX_ORDER (typically 11) free lists, indexed by order. Order k holds blocks of 2^k contiguous pages. A freshly booted machine might have one order-10 block (1024 pages = 4 MB) in ZONE_NORMAL.

| Order | Block size | Typical use |
|---|---|---|
| 0 | 4 KB (1 page) | Single page: slab objects, anonymous mappings |
| 1 | 8 KB (2 pages) | Small kernel stacks (pre-5.8) |
| 2 | 16 KB (4 pages) | Kernel stacks (CONFIG_THREAD_INFO_IN_TASK) |
| 3–5 | 32–128 KB | Large kmalloc requests |
| 9 | 2 MB | Huge page backing (before THP splits) |
| 10 | 4 MB | Maximum contiguous allocation; DMA buffers |

3. Core Mechanism — Buddy Allocator Split & Coalesce

Allocation path — alloc_pages()

Background: The kernel frequently needs 1–16 contiguous page frames for kernel stacks, DMA buffers, and slab pages. A simple bitmap scan would be O(n) in the number of frames. The buddy system maintains power-of-2 free lists so that both allocation and freeing touch at most MAX_ORDER lists — logarithmic in the size of the largest block.

Plan:

  1. Look in free_list[requested_order]. If non-empty, dequeue and return.
  2. Walk up to free_list[requested_order + 1], +2, … until a non-empty list is found.
  3. Dequeue the large block. Split it in half: lower half goes to the caller; upper half (“buddy”) is added to free_list[k-1].
  4. Repeat splitting until the block is exactly the requested order.

Example — allocating order-1 (2 pages) from a pool with only an order-2 block at 0x1000:

| Step | free_list[0] | free_list[1] | free_list[2] | Action |
|---|---|---|---|---|
| Initial | — | — | 0x1000 (4 pages) | One order-2 block |
| Check [1] | — | empty | 0x1000 | Miss → look higher |
| Take [2] | — | — | — | Dequeue 0x1000 from order-2 |
| Split | — | 0x3000 (buddy) | — | Upper half 0x3000 → free_list[1] |
| Return | — | 0x3000 | — | Lower half 0x1000 (2 pages) → caller |

Free path — free_pages() — coalescing

Background: Returning single pages to order-0 causes fragmentation over time. The buddy system coalesces adjacent same-size blocks (“buddies”) on every free, pushing the merged block up to the highest possible order.

Key insight — buddy address: For a block of size 2k pages at address addr, the buddy is at addr XOR (1 << (k + PAGE_SHIFT)). Because blocks are power-of-2 aligned, toggling that single bit flips between the block and its buddy.
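
Worked example: for the order-1 block at 0x3000 produced by the split above (k = 1, PAGE_SHIFT = 12), the buddy is 0x3000 XOR (1 << 13) = 0x3000 XOR 0x2000 = 0x1000 — the other half of the original order-2 block.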

Anti-fragmentation — migrate types: The buddy system groups pages by mobility within each free list to prevent unmovable pages from fragmenting the pool of movable pages.

| Migrate Type | Pages | Why separated |
|---|---|---|
| MIGRATE_UNMOVABLE | Kernel data, slab | Cannot be moved — if mixed with movable pages, they pin large regions forever |
| MIGRATE_MOVABLE | User-space anonymous, file pages | Can be migrated by memory compaction — freeing up contiguous ranges for huge pages |
| MIGRATE_RECLAIMABLE | Page cache, inode cache | Can be reclaimed (written back or dropped) — not moved, but frees memory under pressure |
| MIGRATE_HIGHATOMIC | Emergency reserve | Held back for GFP_ATOMIC allocations that cannot sleep |

4. Minimal C Demo

The demo below implements the core buddy split and coalesce algorithm with a 16-page pool (order-4). The same logic — walk up to find a large-enough block, split down, coalesce on free — is used by the kernel's __rmqueue() and __free_one_page().

Buddy Allocator — split & coalesce — C Demo
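A minimal user-space sketch of that logic — a simulation with an invented free_map bitmap standing in for the kernel's free_area lists, not kernel code:

```c
#include <stdio.h>
#include <stdbool.h>

#define MAX_ORDER  4                      /* highest order; pool = 2^4 = 16 pages */
#define POOL_PAGES (1 << MAX_ORDER)

/* free_map[k][pfn] == true: a free block of 2^k pages starts at pfn */
static bool free_map[MAX_ORDER + 1][POOL_PAGES];

static void mark_free(int order, int pfn) { free_map[order][pfn] = true; }

static int alloc_block(int order)
{
    /* 1. find the first non-empty list at or above the requested order */
    for (int k = order; k <= MAX_ORDER; k++)
        for (int pfn = 0; pfn < POOL_PAGES; pfn += 1 << k)
            if (free_map[k][pfn]) {
                free_map[k][pfn] = false;
                /* 2. split down, returning each upper buddy to a lower list */
                while (k > order) {
                    k--;
                    mark_free(k, pfn + (1 << k));  /* upper half → free_list[k] */
                }
                return pfn;                        /* lower half → caller */
            }
    return -1;                                     /* out of memory */
}

static void free_block(int order, int pfn)
{
    /* coalesce with the buddy as long as it is also free */
    while (order < MAX_ORDER) {
        int buddy = pfn ^ (1 << order);            /* toggle one bit to find buddy */
        if (!free_map[order][buddy])
            break;
        free_map[order][buddy] = false;            /* merge and move up one order */
        if (buddy < pfn)
            pfn = buddy;
        order++;
    }
    mark_free(order, pfn);
}

int main(void)
{
    mark_free(MAX_ORDER, 0);                       /* boot state: one 16-page block */
    int a = alloc_block(1);                        /* 2 pages → splits 16→8→4→2 */
    int b = alloc_block(0);                        /* 1 page */
    printf("order-1 block at pfn %d, order-0 page at pfn %d\n", a, b);
    free_block(0, b);
    free_block(1, a);                              /* coalesces back to order 4 */
    printf("pool restored: %s\n", free_map[MAX_ORDER][0] ? "yes" : "no");
    return 0;
}
```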

The next demo simulates the zone watermark check — the logic inside zone_watermark_ok() that decides whether an allocation can proceed or must wait for reclaim.

Zone Watermark Check — C Demo
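A sketch of the decision only — the real zone_watermark_ok() also accounts for lowmem_reserve and the cost of high-order blocks; the watermark values below are invented:

```c
#include <stdio.h>
#include <stdbool.h>

struct zone_sim {
    long free_pages, wmark_min, wmark_low, wmark_high;
};

/* true if an order-'order' allocation may proceed against watermark 'mark' */
static bool watermark_ok(struct zone_sim *z, int order, long mark)
{
    /* after taking 2^order pages, free pages must still exceed the mark */
    return z->free_pages - (1L << order) >= mark;
}

static void classify(struct zone_sim *z, int order)
{
    if (watermark_ok(z, order, z->wmark_low))
        printf("%6ld free: fast path, no reclaim\n", z->free_pages);
    else if (watermark_ok(z, order, z->wmark_min))
        printf("%6ld free: allocate, but wake kswapd (below low)\n", z->free_pages);
    else
        printf("%6ld free: direct reclaim or GFP_ATOMIC reserves (below min)\n",
               z->free_pages);
}

int main(void)
{
    struct zone_sim z = { .wmark_min = 1000, .wmark_low = 1250, .wmark_high = 1500 };
    long samples[] = { 10000, 1300, 1100, 900 };
    for (int i = 0; i < 4; i++) {
        z.free_pages = samples[i];
        classify(&z, 0);
    }
    return 0;
}
```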

5. Kernel Source Pointers

| File / Function | What it does |
|---|---|
| include/linux/mm_types.h :: struct page | Definition of the 64-byte page descriptor; union covers slab, compound, pagecache uses |
| include/linux/mmzone.h :: struct zone | Zone definition: free_area[] (buddy lists), watermarks, spanned/present pages |
| mm/page_alloc.c :: __alloc_pages_nodemask() | Main allocator entry; selects the zone list, calls get_page_from_freelist() |
| mm/page_alloc.c :: __rmqueue() | Dequeues a block from the buddy free list; calls expand() to split |
| mm/page_alloc.c :: __free_one_page() | Coalesces buddies on free; walks up to MAX_ORDER |
| mm/page_alloc.c :: zone_watermark_ok() | Checks free pages against watermarks before allocating |
| mm/compaction.c :: compact_zone() | Memory compaction: migrates MOVABLE pages to coalesce higher-order free blocks |
| mm/oom_kill.c :: out_of_memory() | Called when reclaim cannot bring zones above the min watermark; selects a victim process |

6. Interview Prep

| # | Question | Concise Answer |
|---|---|---|
| Q1 | Explain the buddy allocator split and coalesce algorithm. | Allocation: walk free_list[] from the requested order upward until a non-empty list is found, dequeue the block, split it in half repeatedly, return the lower half and add each upper half (buddy) to the next-lower free list. Free: compute the buddy address as addr XOR block_size; if the buddy is also free, remove it and merge into a double-size block; repeat until no buddy is free or the max order is reached. |
| Q2 | Why are memory zones separated (DMA, DMA32, NORMAL)? | Legacy ISA DMA controllers can only address 24-bit (16 MB) physical addresses; 32-bit PCI masters are limited to 4 GB. The kernel must satisfy DMA allocations from the correct zone; normal kernel allocations prefer ZONE_NORMAL. Without the separation, a DMA request might get a page above 16 MB that the device can never reach. |
| Q3 | What are zone watermarks and what happens at each level? | Three thresholds — high, low, min. Above low: allocate freely on the fast path. Below low: kswapd background reclaim wakes up; allocation still succeeds until free pages hit min. Below min: the calling thread enters direct reclaim (may sleep); GFP_ATOMIC callers draw on a reserved pool instead. If reclaim cannot recover, the OOM killer fires. kswapd keeps reclaiming until the zone is back above high. |
| Q4 | What is MIGRATE_MOVABLE and why does it help huge pages? | Pages classified MIGRATE_MOVABLE (user-space anonymous and file-backed) can be physically relocated by the compaction thread. By segregating them from UNMOVABLE kernel pages, the allocator can compact a region entirely of movable pages into one end, leaving a contiguous order-9/10 block free for huge page allocation. Without this, one pinned kernel page ruins a 2 MB range. |
| Q5 | How does struct page stay at 64 bytes despite holding so many fields? | Most fields are mutually exclusive: a page is either a slab object, a file-backed page, an anonymous page, or a compound page tail — never all at once. The kernel uses a union over the same 64 bytes; the page flags (PG_slab, etc.) tell code which union member is active. The lru list_head is reused as a freelist pointer inside slab. |
Part II — Memory

§ 2.3 – 2.4 Slab/Slub Allocator & Virtual Memory

How the kernel caches fixed-size objects with lock-free per-CPU freelists, and how 4-level page tables map virtual addresses to physical frames — with TLB tricks and huge pages to minimize translation overhead.

7. Overview — Slab/Slub + Paging

The buddy allocator gives out whole pages (4 KB minimum). But most kernel allocations are tiny: a struct inode is ~600 B, a struct sk_buff is ~232 B. Rounding each up to 4 KB wastes 80–99 % of every allocation. The slab allocator (modern Linux uses SLUB by default) carves each slab page into equal-size objects and keeps a per-CPU freelist — allocation is a single pointer pop, lock-free on the hot path.

Virtual memory adds translation: every load/store goes through a 4-level page table (PGD → PUD → PMD → PTE). A TLB caches recent VA→PA mappings; a miss triggers a hardware walk costing 100–300 cycles. Huge pages (2 MB / 1 GB) cover more address space per TLB entry — the core reason DPDK pins 1 GB pages at startup.

8. Key Data Structures — Slab

kmem_cache — one descriptor per object type

Every slab cache has one kmem_cache descriptor. Linux creates hundreds at boot — inode_cache, skbuff_head_cache, mm_struct, etc. List them with cat /proc/slabinfo.

| Field | Type | Purpose |
|---|---|---|
| name | char[] | Human-readable name shown in /proc/slabinfo and crash dumps |
| object_size | unsigned int | Actual bytes needed per object before alignment |
| size | unsigned int | Aligned size (may include SLAB_RED_ZONE padding in debug builds) |
| cpu_slab | struct kmem_cache_cpu * | Per-CPU state: pointer to the active slab page + embedded freelist head |
| node[NUMA_NODES] | struct kmem_cache_node * | Per-NUMA-node lists of partial slabs (some free, some in-use objects) |
| min_partial | unsigned long | Minimum partial slabs to keep per node before returning pages to buddy |
| ctor | void (*)(void *) | Optional constructor called once per object when a fresh slab is allocated |
| flags | slab_flags_t | SLAB_HWCACHE_ALIGN, SLAB_POISON (debug write 0x6b), SLAB_RECLAIM_ACCOUNT |

SLAB vs SLUB vs SLOB

| Allocator | Default kernel | Key trade-off |
|---|---|---|
| SLAB | Default before 2.6.23; removed in 6.8 | Per-CPU caches with cache-line coloring. Higher memory overhead per cache type. |
| SLUB | Linux ≥ 2.6.23 (today's default) | No coloring. Embeds the freelist pointer inside free objects. Lower overhead, better NUMA locality. Preferred for production. |
| SLOB | Tiny/embedded kernels; removed in 6.4 | Simple first-fit. Minimal metadata but poor cache behavior — not suitable for servers. |

9. Core Mechanism — Slab Allocation Fast & Slow Path

Background: The kernel allocates network buffers, dentries, and inodes millions of times per second. Taking a spinlock on every allocation serializes CPUs. SLUB avoids this with a per-CPU embedded freelist: the first sizeof(void *) bytes of every free object store a pointer to the next free object — the freelist has zero memory overhead, and fast-path alloc and free are each a single pointer update.

Plan:

  1. Fast path — check cpu_slab->freelist. If non-NULL: read the embedded next pointer, advance freelist head, return object. No lock, no atomic.
  2. Slow path — partial slab — lock node->list_lock, grab a partial slab from the NUMA-node list, set it as the CPU's active slab, unlock, retry fast path.
  3. Slow path — new slab — if no partial slab exists, call alloc_pages() to get a fresh page from the buddy allocator, thread the freelist through all objects in that page, and install it as the CPU's active slab.

Example — allocating 10 sk_buff objects from a fresh cache:

| Event | CPU freelist | Node partial list | Action |
|---|---|---|---|
| Cache created | NULL | empty | kmem_cache_create() — no pages yet, lazy init |
| 1st alloc | NULL → slow path | empty → must allocate | alloc_pages() → init 64 objects in page, install as CPU active slab |
| 2nd–64th alloc | obj[1]→obj[2]→… | 1 partial slab | Fast path each time: pop freelist head, zero-init, return |
| 65th alloc | NULL (slab exhausted) | slab now full | Slow path: alloc another buddy page, init new freelist |
| free(obj[10]) | obj[10] → old head | unchanged | Fast path: push to freelist head, no lock needed |

10. Virtual Memory & Paging

4-Level Page Table Walk (x86-64)

x86-64 uses a 4-level page table to map 48-bit virtual addresses to physical frames. The CPU register CR3 holds the physical address of the top-level Page Global Directory (PGD). Each level is a 4 KB page containing 512 8-byte entries — indexed by 9 bits of the VA. On every TLB miss the hardware walker reads all four levels, fills the TLB, and retries the instruction.

| Level | VA bits | Type | Points to |
|---|---|---|---|
| PGD — Page Global Directory | [47:39] | pgd_t | Physical base of the PUD table (512 × 8 B = 4 KB) |
| PUD — Page Upper Directory | [38:30] | pud_t | Physical base of the PMD table OR a 1 GB huge page (PS=1) |
| PMD — Page Middle Directory | [29:21] | pmd_t | Physical base of the PTE table OR a 2 MB huge page (PS=1) |
| PTE — Page Table Entry | [20:12] | pte_t | Physical frame number [51:12] + flags [11:0] |
| Page offset | [11:0] | — | Byte offset within the 4 KB physical page |

TLB — Translation Lookaside Buffer

Walking 4 page-table levels requires 4 memory accesses — up to 4 × 100 ns = 400 ns on a cold cache. The TLB is a small, highly associative hardware cache (64–1024 entries per core) storing recent VA→PA translations. A hit costs 1–4 cycles; a miss triggers the full hardware walk. flush_tlb_mm() invalidates all TLB entries for a process's address space (e.g. after munmap()); on SMP systems a TLB shootdown IPI is sent to remote CPUs.

Huge Pages & TLB Pressure

| Page size | TLB entries to cover 1 GB | Use case |
|---|---|---|
| 4 KB (normal) | 262,144 entries | General use; fine-grained reclaim; virtual desktop workloads |
| 2 MB (HugeTLB / THP) | 512 entries | Databases, JVMs, DPDK packet pools — 512× fewer TLB misses for the same range |
| 1 GB (1G pages) | 1 entry | DPDK 1G mempool; GPU framebuffers; any static long-lived DMA mapping |

Transparent Huge Pages (THP) — the khugepaged kernel thread promotes runs of 512 contiguous 4 KB anonymous pages to a single 2 MB page automatically; no application change required. HugeTLBfs — explicit huge pages reserved at boot via hugepages=N or at runtime via /proc/sys/vm/nr_hugepages. DPDK's rte_eal_init() uses mmap() with MAP_HUGETLB (and MAP_HUGE_1GB) to pin 1 GB pages, so its entire mempool is covered by a handful of TLB entries for the lifetime of the application.
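
As a minimal illustration of explicit huge pages — a hypothetical snippet that fails unless huge pages were reserved first (e.g. via /proc/sys/vm/nr_hugepages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;                 /* one 2 MB huge page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");              /* no reserved huge pages? */
        return 1;
    }
    ((char *)p)[0] = 1;                           /* fault in the huge page */
    printf("2 MB huge page mapped at %p\n", p);
    munmap(p, len);
    return 0;
}
```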

11. Minimal C Demo

The first demo implements the SLUB embedded-freelist pattern. Every free object stores its "next free" pointer in its own first bytes — freelist overhead is zero, and alloc/free are each a single pointer update.

SLUB Embedded Freelist — slab_alloc / slab_free — C Demo
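A user-space sketch of the embedded-freelist pattern — the cache_sim type and sizes are invented; real SLUB adds per-CPU slabs, partial lists, and locking:

```c
#include <stdio.h>
#include <stdlib.h>

#define OBJ_SIZE      64              /* must be >= sizeof(void *) */
#define OBJS_PER_SLAB 8

struct cache_sim {
    void *freelist;                   /* head of the embedded freelist */
    char *slab;                       /* the backing "slab page" */
};

static void cache_init(struct cache_sim *c)
{
    c->slab = malloc(OBJ_SIZE * OBJS_PER_SLAB);
    c->freelist = c->slab;
    /* thread the freelist through the free objects themselves */
    for (int i = 0; i < OBJS_PER_SLAB - 1; i++)
        *(void **)(c->slab + i * OBJ_SIZE) = c->slab + (i + 1) * OBJ_SIZE;
    *(void **)(c->slab + (OBJS_PER_SLAB - 1) * OBJ_SIZE) = NULL;
}

static void *cache_alloc(struct cache_sim *c)
{
    void *obj = c->freelist;
    if (obj)
        c->freelist = *(void **)obj;  /* fast path: one pointer read */
    return obj;                       /* NULL → slow path (new slab) in real SLUB */
}

static void cache_free(struct cache_sim *c, void *obj)
{
    *(void **)obj = c->freelist;      /* fast path: one pointer write */
    c->freelist = obj;
}

int main(void)
{
    struct cache_sim c;
    cache_init(&c);
    void *a = cache_alloc(&c), *b = cache_alloc(&c);
    printf("alloc: %p %p\n", a, b);
    cache_free(&c, a);
    printf("freed head reused: %s\n", cache_alloc(&c) == a ? "yes" : "no");
    free(c.slab);
    return 0;
}
```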

The second demo extracts each page-table index from a 48-bit virtual address — the same decomposition the hardware page walker performs on every TLB miss.

4-Level Page Table Walk — VA → PGD/PUD/PMD/PTE indices — C Demo
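A sketch of the index extraction — the virtual address is invented; the shifts and masks follow the table in §10:

```c
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t va = 0x00007f8a12345678ULL;   /* hypothetical user address */
    unsigned pgd = (va >> 39) & 0x1ff;     /* bits [47:39] */
    unsigned pud = (va >> 30) & 0x1ff;     /* bits [38:30] */
    unsigned pmd = (va >> 21) & 0x1ff;     /* bits [29:21] */
    unsigned pte = (va >> 12) & 0x1ff;     /* bits [20:12] */
    unsigned off = va & 0xfff;             /* bits [11:0]  */
    printf("VA 0x%" PRIx64 " -> PGD[%u] PUD[%u] PMD[%u] PTE[%u] + offset 0x%x\n",
           va, pgd, pud, pmd, pte, off);
    return 0;
}
```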

12. Kernel Source Pointers

| File / Function | What it does |
|---|---|
| mm/slub.c :: kmem_cache_alloc() | SLUB fast path: loads cpu_slab->freelist, pops the head — inlined into a few instructions on the hot path |
| mm/slub.c :: __slab_alloc() | SLUB slow path: acquires the node lock, refills the CPU freelist from a partial slab or allocates a new slab page |
| mm/slub.c :: kmem_cache_free() | SLUB free: pushes the object back to the CPU freelist; if the slab becomes empty, may return pages to buddy |
| mm/slab_common.c :: kmem_cache_create() | Allocates and registers a new kmem_cache; sets object_size, alignment, NUMA node arrays |
| include/linux/slab.h :: kmalloc() | General-purpose allocator: routes to per-size SLUB caches for ≤ 8 KB, buddy for larger |
| arch/x86/mm/fault.c :: do_page_fault() | x86 #PF handler: checks VMA validity, triggers do_anonymous_page(), do_cow_fault(), or SIGSEGV |
| mm/memory.c :: handle_mm_fault() | Central fault dispatcher: delegates to huge-page or PTE-level fault handlers |
| arch/x86/include/asm/tlbflush.h :: flush_tlb_mm() | Invalidates all TLB entries for an mm_struct; sends IPIs to remote CPUs (TLB shootdown) |
| mm/hugetlb.c :: alloc_huge_page() | Allocates from the pre-reserved HugeTLB pool; used by mmap(MAP_HUGETLB) and DPDK |
| mm/khugepaged.c :: khugepaged_scan_mm_slot() | Scans for 512-page-aligned anonymous regions to promote to THP |

13. Interview Prep

| # | Question | Concise Answer |
|---|---|---|
| Q6 | What is the difference between SLAB and SLUB? | SLAB maintains per-CPU caches as separate arrays plus cache coloring to reduce false sharing. SLUB is simpler: it embeds the freelist pointer inside free objects (zero metadata overhead), uses per-CPU active slab pointers, and maintains per-NUMA-node partial lists. SLUB has lower memory usage and less code complexity; SLAB's coloring can slightly improve cache utilization on workloads with many small caches. |
| Q7 | Why is the SLUB fast path lock-free? | Each CPU has its own cpu_slab pointer and freelist head. As long as the active slab has free objects, alloc/free only update the freelist head — a per-CPU variable no other CPU touches. Lock contention only appears on the slow path, when the CPU must borrow a partial slab from the NUMA node list (guarded by node->list_lock). |
| Q8 | Walk me through a 4-level page table translation on x86-64. | The CPU extracts four 9-bit indices from VA bits [47:39], [38:30], [29:21], [20:12], plus a 12-bit offset. Hardware reads CR3 for the PGD base, dereferences PGD[idx] for the PUD base, PUD[idx] for the PMD base, PMD[idx] for the PTE base, and PTE[idx] for the physical frame number. Physical address = (frame << 12) + offset. Each level can short-circuit if the PS (huge page) bit is set — a PMD with PS=1 gives a 2 MB mapping, a PUD with PS=1 gives 1 GB. |
| Q9 | What is a TLB shootdown and when does it happen? | When a CPU modifies a page table entry (e.g. unmaps a page with munmap), other CPUs may hold stale TLB entries for that VA. The kernel sends an inter-processor interrupt (IPI) to every CPU that might hold the stale entry, instructing it to execute INVLPG or flush its TLB. This is a TLB shootdown. It is expensive — O(CPU count) IPI latency — which is one more reason THP and hugetlb help: fewer, larger entries mean fewer shootdowns. |
| Q10 | Why do huge pages reduce TLB pressure? How does DPDK use them? | A 4 KB TLB entry covers 4 KB; a 2 MB entry covers 512× more address space with a single TLB slot. A packet-processing loop that touches a 512 MB mbuf pool would need 131,072 TLB entries at 4 KB but just 256 at 2 MB. DPDK maps its mempool with MAP_HUGETLB and MAP_HUGE_1GB to pin 1 GB pages, so the entire pool fits in a handful of L2 TLB entries and TLB misses all but disappear from the packet-forwarding hot loop. |
Part II — Memory

§ 2.5 – 2.7 Process Address Space, Page Cache & DMA

How each process sees its own private virtual address space via mm_struct and VMAs, how the kernel caches file data in a two-list LRU page cache, and how devices perform DMA through the IOMMU.

14. Overview

Every Linux process has its own virtual address space described by a struct mm_struct. Within that space, each contiguous region with the same permissions is one struct vm_area_struct (VMA). The kernel builds the VMA set lazily — pages are physically allocated only on first access (demand paging). Copy-on-Write makes fork() nearly free: child and parent share physical pages until one writes.

File I/O is served from the page cache — a per-inode struct address_space backed by an XArray of cached pages. A two-list LRU (active / inactive) decides which pages to keep hot and which to reclaim. Dirty pages are written back by per-device worker threads on a configurable schedule.

DMA lets hardware write directly to physical RAM without CPU involvement. The IOMMU interposes on device bus addresses — protecting physical memory from buggy or malicious devices and enabling DPDK's kernel bypass via VFIO.

15. Key Data Structures

struct mm_struct — per-process address space

All threads of a process share one mm_struct. It holds the page-table root (loaded into CR3 on context switch), the ordered set of VMAs, and accounting counters. mmap_lock (an rwsem) serialises structural changes — taking it as a writer during mmap() / munmap() and as a reader during page fault handling.

struct vm_area_struct — one VMA per region

A call to mmap() creates one VMA. Historically the kernel kept VMAs in both a sorted linked list (for iteration) and a red-black tree (for O(log n) lookup by address — used by the page fault handler to find the VMA owning the faulting address); kernels since 6.1 replace both with a maple tree.

| Mapping type | vm_file | vm_flags | Page fault action |
|---|---|---|---|
| Anonymous (heap/stack) | NULL | VM_READ, VM_WRITE | alloc_page() + zero-fill (demand zero) |
| File-backed private | struct file * | VM_READ, VM_WRITE, VM_MAYWRITE | read from page cache; COW on write |
| File-backed shared | struct file * | VM_READ, VM_WRITE, VM_SHARED | read/write through to page cache |
| Executable (.text) | struct file * | VM_READ, VM_EXEC | read page from ELF file on first access |

struct address_space — page-cache anchor

Every file inode has one address_space embedded. The XArray i_pages maps page offsets (0, 1, 2…) to struct page *. A page fault on a file-backed VMA first looks here — cache hit costs a single XArray lookup; cache miss triggers a_ops->readpage() to load from disk.

16. Core Mechanism — Copy-on-Write (COW)

Background: fork() must give the child a full copy of the parent's address space. Physically copying all mapped pages upfront would take hundreds of milliseconds for a process with gigabytes of mappings. COW defers the copy to the moment of first write — and many child processes never write most pages (e.g., exec() immediately replaces the image).

Plan:

  1. fork() calls copy_page_range() — walks the parent PTEs and copies entries into the child page table, but marks all writable PTEs read-only in both parent and child. Increments physical page _refcount for each shared page.
  2. Either process writes → CPU sees PTE.W=0 → raises #PF (protection fault).
  3. do_cow_fault(): allocate a fresh page, copy 4 KB, update the faulting PTE to point to the new page with PTE.W=1, decrement the shared page's refcount.
  4. When a process writes to a page whose refcount has already dropped back to 1 (it is the sole remaining owner), the fault handler skips the copy and simply re-arms the PTE writable — the reuse path.

Example — fork then child writes one page:

| Step | Parent PTE | Child PTE | Phys page refcnt |
|---|---|---|---|
| Before fork | W=1 → 0xA000 | — | 1 |
| After fork() | W=0 → 0xA000 | W=0 → 0xA000 | 2 |
| Child writes VA | W=0 → 0xA000 | #PF fired | 2 |
| COW fault: alloc 0xB000 | W=0 → 0xA000 | W=1 → 0xB000 | 1 (parent) |
| Parent writes VA | #PF fired (refcnt=1) | W=1 → 0xB000 | 1 |
| Re-arm parent PTE | W=1 → 0xA000 | W=1 → 0xB000 | 1 each |

17. Core Mechanism — Page Cache & LRU Reclaim

Background: Modern servers have tens of gigabytes of RAM and workloads that touch far more data than fits. The kernel must decide which cached file pages to keep (hot) and which to discard (cold) under memory pressure. A simple FIFO evicts pages that were loaded recently but haven't been reused — killing streaming workloads. A pure LRU has a thrashing problem: a single sequential scan of a large file evicts all hot pages.

Plan — two-list strategy:

  1. New pages start on the inactive list (cold). They are protected from immediate eviction but not from reclaim under pressure.
  2. If a page is accessed a second time while still on the inactive list, it is promoted to the active list (hot). Only truly reused pages earn active status.
  3. Over time, active pages that are not accessed are demoted back to the inactive list tail by kswapd's shrink_active_list().
  4. Reclaim always takes from the inactive list tail — oldest cold pages first. If dirty, they are written back; if clean, dropped immediately.

Dirty Page Writeback Lifecycle

write() modifies the page cache, marks pages PG_dirty, and returns — it does not wait for the disk. A per-device kernel writeback worker flushes dirty pages after dirty_expire_centisecs (default 3000, i.e. 30 s) or when the dirty ratio exceeds dirty_background_ratio (default 10 %). If dirty pages reach dirty_ratio (default 20 %), the writing task itself is throttled inside balance_dirty_pages().

| API | Behaviour | When to use |
|---|---|---|
| write() | Writes to page cache; returns immediately (buffered) | Normal file writes; OS handles flushing |
| fsync(fd) | Blocks until all dirty pages for fd are on disk | After committing a database transaction |
| sync_file_range() | Flushes a specific byte range without metadata | Large sequential writes (avoid stalling on metadata) |
| O_DIRECT | Bypasses page cache entirely; DMA directly to/from the user buffer | Databases with their own cache (avoids double-buffering) |
| O_SYNC / O_DSYNC | Each write() blocks until data reaches disk | Journals, WAL files — highest durability, highest overhead |
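
A minimal sketch of the durability pattern from the table — a buffered write followed by fsync(); the file name wal.log is invented for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "commit 42\n";
    if (write(fd, rec, strlen(rec)) < 0)  /* dirties page cache, returns at once */
        perror("write");
    if (fsync(fd) < 0)                    /* durability point: data + metadata on disk */
        perror("fsync");
    close(fd);
    return 0;
}
```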

18. Memory-Mapped I/O & DMA

Three Address Spaces

Hardware operates in three distinct address spaces. Confusing them is one of the most common bugs in driver code.

| Address space | Who uses it | Kernel API |
|---|---|---|
| Virtual address (VA) | CPU, kernel code, user-space pointers | Ordinary C pointers; kmalloc() returns a VA |
| Physical address (PA) | RAM chips; no software entity uses it directly | __pa(va) / __va(pa) on x86; page_to_phys() |
| Bus address / IOVA | PCIe devices — what they DMA to/from | dma_map_single() → dma_addr_t; programmed into the device BAR/ring |

DMA API

| Function | What it does | Use case |
|---|---|---|
| dma_alloc_coherent(dev, size, &dma_addr, GFP_KERNEL) | Allocates memory plus a coherent DMA mapping; CPU and device see each other's writes without explicit cache maintenance | Descriptor rings, control structures accessed by both CPU and device |
| dma_map_single(dev, va, size, dir) | Maps an existing kernel VA for one DMA transfer; returns the bus address | Single-buffer DMA from a kmalloc'd buffer |
| dma_unmap_single(dev, dma_addr, size, dir) | Tears down the IOMMU mapping; allows the CPU to read the result | After the device signals DMA completion (via IRQ/doorbell) |
| dma_map_sg(dev, sg, count, dir) | Maps a scatter-gather list — non-contiguous physical pages as one DMA transfer | Network sk_buff, storage bi_io_vec chains |
| ioremap(phys_addr, size) | Maps device MMIO registers into kernel VA (non-cacheable) | Accessing PCIe BAR registers from a driver's probe() |

IOMMU — DMA Remapping

Without an IOMMU, a buggy or malicious PCIe device can DMA to any physical address — including kernel code. The IOMMU (Intel VT-d, AMD-Vi) interposes on every device DMA access: it maintains a per-device page table mapping bus addresses to physical pages, and faults on unmapped accesses. DPDK's VFIO mode uses the IOMMU to give user-space drivers safe, direct DMA without a kernel driver in the forwarding path.

19. Minimal C Demo

The first demo simulates Copy-on-Write: parent and child share a physical page after fork; the child's write triggers a copy. The same logic lives in do_cow_fault() in mm/memory.c.

Copy-on-Write — fork + write simulation — C Demo
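A user-space sketch of that flow — simulated PTEs and refcounts, not real page tables; the reuse step mirrors the fault-handler behavior described above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct phys_page { char data[64]; int refcount; };
struct pte       { struct phys_page *page; int writable; };

static void cow_write(struct pte *p, const char *msg)
{
    if (!p->writable) {                       /* simulated #PF: PTE.W == 0 */
        if (p->page->refcount > 1) {          /* still shared: copy the page */
            struct phys_page *n = malloc(sizeof *n);
            memcpy(n->data, p->page->data, sizeof n->data);
            n->refcount = 1;
            p->page->refcount--;
            p->page = n;
        }                                     /* refcount == 1: reuse in place */
        p->writable = 1;                      /* re-arm PTE writable */
    }
    strcpy(p->page->data, msg);
}

int main(void)
{
    struct phys_page *shared = malloc(sizeof *shared);
    strcpy(shared->data, "original");
    shared->refcount = 2;                     /* fork(): parent + child share it */
    struct pte parent = { shared, 0 }, child = { shared, 0 };

    cow_write(&child, "child copy");          /* fault → copy page */
    printf("parent sees '%s' (refcnt %d), child sees '%s'\n",
           parent.page->data, parent.page->refcount, child.page->data);
    cow_write(&parent, "parent reuse");       /* fault → refcnt 1 → reuse path */
    printf("parent page %s child page\n",
           parent.page == child.page ? "==" : "!=");
    return 0;
}
```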

The second demo simulates the two-list LRU: pages start in the inactive list and are promoted to active on a second access. Memory pressure evicts from the inactive tail — the same policy used by shrink_inactive_list().

Two-List LRU — active / inactive page reclaim — C Demo
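A sketch of the promotion/eviction policy — arrays stand in for the kernel's list_heads, and a counter stands in for list position:

```c
#include <stdio.h>

#define NPAGES 8
enum state { UNUSED, INACTIVE, ACTIVE };
static enum state st[NPAGES];
static int age[NPAGES];                /* larger = listed more recently */
static int tick;

static void touch(int pfn)
{
    if (st[pfn] == UNUSED)        st[pfn] = INACTIVE;  /* first access: cold */
    else if (st[pfn] == INACTIVE) st[pfn] = ACTIVE;    /* reuse: promote */
    age[pfn] = ++tick;
}

static int reclaim(void)               /* evict the oldest inactive page */
{
    int victim = -1;
    for (int i = 0; i < NPAGES; i++)
        if (st[i] == INACTIVE && (victim < 0 || age[i] < age[victim]))
            victim = i;
    if (victim >= 0)
        st[victim] = UNUSED;
    return victim;
}

int main(void)
{
    touch(0); touch(0);                /* page 0 is reused → active list */
    for (int i = 1; i < 6; i++)        /* sequential scan touches 1..5 once */
        touch(i);
    printf("evicted pfn %d (a scan page, not the hot page)\n", reclaim());
    printf("page 0 state: %s\n", st[0] == ACTIVE ? "active" : "not active");
    return 0;
}
```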

The third demo shows how dma_map_single() converts a kernel virtual address to a bus address. On x86 without an IOMMU the bus address equals the physical address; with an IOMMU the device sees a remapped address.

DMA address translation — VA → PA → bus address — C Demo
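A sketch of the address chain — the direct-map base matches x86-64 4-level paging without KASLR, and the IOMMU remapping is an invented toy, not a real translation:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET 0xffff888000000000ULL   /* x86-64 direct-map base (no KASLR) */

static uint64_t virt_to_phys(uint64_t va) { return va - PAGE_OFFSET; }

/* with an IOMMU, the device sees a remapped IOVA, not the raw PA */
static uint64_t map_iova(uint64_t pa, bool iommu)
{
    return iommu ? (pa & 0xfffULL) | 0x100000000ULL : pa;  /* toy remap */
}

int main(void)
{
    uint64_t va = PAGE_OFFSET + 0x1234000;  /* pretend kmalloc() result */
    uint64_t pa = virt_to_phys(va);
    printf("VA  %#llx\n", (unsigned long long)va);
    printf("PA  %#llx\n", (unsigned long long)pa);
    printf("bus %#llx (no IOMMU: bus == PA)\n",
           (unsigned long long)map_iova(pa, false));
    printf("bus %#llx (IOMMU remap)\n",
           (unsigned long long)map_iova(pa, true));
    return 0;
}
```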

20. Kernel Source Pointers

| File / Function | What it does |
|---|---|
| include/linux/mm_types.h :: struct mm_struct | Process address space: PGD root, VMA list/tree, mmap_lock, accounting |
| include/linux/mm_types.h :: struct vm_area_struct | Single VMA: VA range, flags, backing file, vm_ops fault handler |
| mm/mmap.c :: do_mmap() | Core mmap() implementation: allocates and inserts a new VMA, merges adjacent compatible VMAs |
| mm/memory.c :: do_cow_fault() | COW fault handler: allocates a page, copy_user_highpage(), updates the PTE to the new private page |
| mm/memory.c :: handle_mm_fault() | Central page fault dispatcher: routes to do_anonymous_page, do_fault, do_swap_page |
| mm/filemap.c :: filemap_fault() | File-backed page fault: looks up the page in the address_space XArray, triggers readpage() on a miss |
| mm/vmscan.c :: shrink_inactive_list() | LRU reclaim: scans the inactive list, writes back dirty pages, frees clean pages to buddy |
| mm/vmscan.c :: shrink_active_list() | Demotes cold active pages to the inactive list; balances the active/inactive ratio |
| fs/fs-writeback.c :: wb_writeback() | Per-device writeback: flushes dirty pages for one bdi_writeback; called from the writeback workqueue |
| include/linux/dma-mapping.h :: dma_map_single() | Maps a kernel VA for DMA; installs an IOMMU mapping; returns the bus address for the device |
| drivers/iommu/intel/iommu.c :: intel_iommu_map() | Intel VT-d: installs a (device, bus_addr → PA) mapping into the device's IOMMU page table |
| kernel/dma/direct.c :: dma_direct_map_page() | No-IOMMU path: bus_addr = phys_addr; may flush caches on non-coherent architectures (ARM) |

21. Interview Prep

| # | Question | Concise Answer |
|---|---|---|
| Q11 | What is Copy-on-Write and when does it trigger? | After fork(), parent and child share physical pages. Both PTEs are marked read-only. The first write to a shared page raises a protection fault. The kernel allocates a new physical page, copies the content, updates the faulter's PTE to the new page with write permission, and decrements the old page's refcount. Once the refcount is 1, the remaining owner's next write fault takes the reuse path: the PTE is re-armed writable with no copy. |
| Q12 | How does the Linux two-list LRU prevent thrashing from sequential scans? | New pages go to the inactive list. Promotion to the active list requires a second access — a sequential scan that touches each page once never promotes any page, so it doesn't evict hot working-set pages. Only genuinely reused pages reach the active list. Under memory pressure, reclaim takes only from the inactive tail. |
| Q13 | What is the difference between O_DIRECT and buffered I/O? | Buffered I/O goes through the page cache: write() copies to a cached page and returns; the kernel flushes asynchronously. O_DIRECT bypasses the page cache — DMA transfers go directly from/to an aligned user-space buffer. O_DIRECT avoids double-buffering (useful when the application has its own cache, e.g. a database) but requires aligned buffers and loses the read-ahead benefit. fsync() flushes buffered dirty pages; with O_DIRECT the data is already on disk when write() returns. |
| Q14 | What is the difference between a physical address, virtual address, and bus address? | Virtual addresses (VA) are what user-space and kernel code use — translated to physical addresses (PA) by the MMU via page tables. Physical addresses are what DRAM responds to. Bus addresses (IOVA) are what PCIe devices use when they initiate DMA — translated to PA by the IOMMU. Without an IOMMU, bus address == PA on x86. dma_map_single() returns the bus address after installing an IOMMU mapping. |
| Q15 | What is the IOMMU and why does DPDK's VFIO mode require it? | The IOMMU (Intel VT-d / AMD-Vi) translates device bus addresses to physical addresses, enforcing per-device access control. Without it, any PCIe device can DMA to arbitrary physical memory — a security and safety risk. DPDK's VFIO mode moves the NIC driver to user space. VFIO uses the IOMMU to give user space safe, direct DMA: the NIC can only access pre-registered memory regions (its huge-page mempools). Without the IOMMU, VFIO would allow the user-space process to compromise the kernel. |