Part VI — Parallel Programming & Synchronization
§6.1 – 6.10 · Hardware Foundations · MESI · Barriers · Atomics · Counting · Locking · Spinlock · Mutex · Seqlock · RCU
Based on Paul McKenney's Is Parallel Programming Hard (PPH) Ch.2–3, 15 and Linux kernel locking internals (LKD Ch.9–10).
Overview
Correct concurrent programming starts at the hardware level. A CPU does not execute instructions in the order you wrote them — it reorders stores and loads for performance. Multiple CPUs sharing memory introduce cache coherence traffic that can make a naive counter slower than a single-threaded one. Every kernel synchronization primitive — from a spinlock to RCU — exists to manage these hardware-level effects. This part builds the foundation: hardware behavior (§6.1), the MESI protocol that keeps caches consistent (§6.2), and the memory barriers that enforce ordering across CPUs (§6.3).
1. §6.1 — Hardware Foundations for Parallelism
CPU Pipeline: Store Buffer & Invalidation Queue
A modern CPU core does not write to the cache immediately. When a store instruction executes, the value goes into the store buffer — a small, private FIFO inside the core. The cache line is only updated when the store buffer drains, which can be many nanoseconds later. Meanwhile, other CPUs cannot see the write yet.
Similarly, when a cache line is invalidated by another CPU's write, the invalidation message is first dropped into the invalidation queue rather than applied immediately. This batching improves throughput, but it means a CPU can read a stale value from its L1 cache even after the invalidation message has arrived.
Memory Hierarchy & Latency
Cache misses are the dominant cost in concurrent code. A lock acquisition that transfers a cache line from a remote CPU costs ~100–300 ns — as slow as a DRAM access. Designing concurrent code means designing to minimize cache-line transfers.
| Level | Size | Latency | Scope | Kernel API |
|---|---|---|---|---|
| L1 | 32–64 KB | ~1 ns / 4 cycles | Per-core, private | prefetch(), cache-aligned structs |
| L2 | 256 KB – 1 MB | ~3–5 ns / 12 cy | Per-core, private | __builtin_prefetch() |
| L3 / LLC | 8–64 MB | ~10–30 ns / 40 cy | All cores share | NUMA-aware allocation |
| DRAM | GBs | ~60–100 ns | Global | get_free_pages(), vmalloc() |
| Remote NUMA | GBs | ~150–300 ns | Other NUMA node | alloc_pages_node() |
False Sharing — The Hidden Performance Killer
Consider two statistics fields, counter_A (owned by CPU0) and counter_B (owned by CPU1), that happen to sit on the same 64-byte cache line. The CPUs never touch each other's counter, yet a benchmark shows them running 10× slower than expected. The reason: when CPU0 writes counter_A, the MESI protocol marks CPU1's copy of that line Invalid — forcing CPU1 to re-fetch it before it can write counter_B. The line ping-pongs back and forth even though no data is actually shared.
| Detection / Prevention | Tool / Technique |
|---|---|
| Detect false sharing | perf c2c report — shows cache-to-cache transfer hotspots |
| Pad struct to cache line | struct { int val; char pad[60]; } — ensures one field per line |
| Kernel annotation | __cacheline_aligned_in_smp — aligns struct start to 64 B |
| Per-CPU variables | DEFINE_PER_CPU — each CPU has own copy, zero coherence traffic |
| DPDK equivalent | RTE_CACHE_LINE_SIZE (64), rte_cache_aligned attribute |
2. §6.2 — Cache Coherence: MESI Protocol
Every cache line in every L1/L2 cache is tagged with one of four MESI states. The protocol guarantees a globally consistent view of memory across all cores — but at the cost of bus traffic on every state transition.
| State | Meaning | In Memory? | Other Copies? |
|---|---|---|---|
| Modified (M) | Dirty — this CPU wrote to it, memory is stale | No (stale) | No |
| Exclusive (E) | Clean — only this CPU has it, matches memory | Yes | No |
| Shared (S) | Clean — multiple CPUs have read-only copies | Yes | Yes |
| Invalid (I) | Not in this cache; must fetch before use | — | Maybe |
Key Transitions to Memorize
| Trigger | Transition | Bus Traffic? |
|---|---|---|
| Local write to Exclusive line | E → M | None (silent) |
| Local write to Shared line | S → M | Send invalidation to all sharers |
| Remote CPU reads Modified line | M → S | Writeback to memory, share |
| Remote CPU writes any cached line | M/E/S → I | Receive invalidation |
| Read miss, no other copy | I → E | Fetch from memory |
| Read miss, other copies exist | I → S | Fetch + notify sharers |
Cache-Line Bouncing: atomic_inc() Under Contention
Consider every CPU calling atomic_inc() on the same global counter — a common pattern in reference counting and statistics collection. Each increment requires exclusive ownership of the cache line, so the result is a round trip across the coherence bus for every single increment. In practice, an atomic_inc() loop on a single shared counter saturates at ~50 million ops/sec across all cores — the same throughput as 1–2 cores — while a per-CPU counter scales linearly to 800+ million ops/sec because there is no coherence traffic.
MOESI Extension
AMD CPUs add an Owned (O) state — a line that is dirty but shared. The owner responds to snoops directly (CPU-to-CPU transfer) without writing back to DRAM first. This reduces memory bus traffic in read-heavy workloads. Intel uses a similar Forward state in some implementations.
3. §6.3 — Memory Barriers & Ordering
A producer writes data = 99 and then ready = 1. A consumer polls ready and then reads data. On x86 this usually works — but on ARM/POWER the CPU can reorder the two stores, and the consumer can see ready = 1 while data is still 0. Even on x86, the compiler can reorder them.
Two Levels of Reordering
| Level | Cause | Fix |
|---|---|---|
| Compiler reordering | Compiler assumes single-thread view; may hoist/sink loads and stores | barrier() macro — empty asm volatile block forces compiler fence |
| CPU reordering | Out-of-order execution, store buffer, invalidation queue | Hardware memory fence instruction (mfence / dmb / sync) |
| Speculative loads | CPU prefetches data before branch resolves | READ_ONCE() prevents compiler from caching in register |
| Torn writes | Compiler splits 64-bit write into two 32-bit stores | WRITE_ONCE() forces single atomic-width store |
Linux Kernel Barrier API
| API | Direction | x86 instruction | ARM instruction | C11 equivalent |
|---|---|---|---|---|
smp_mb() | Full (loads + stores) | mfence | dmb ish | memory_order_seq_cst |
smp_wmb() | Write (stores only) | sfence (or nop on TSO) | dmb ishst | memory_order_release (store) |
smp_rmb() | Read (loads only) | lfence (or nop on TSO) | dmb ishld | memory_order_acquire (load) |
smp_store_release() | Write + release | MOV (TSO) | stlr | atomic_store(release) |
smp_load_acquire() | Read + acquire | MOV (TSO) | ldar | atomic_load(acquire) |
READ_ONCE(x) | Compiler barrier + single load | — | — | volatile load |
WRITE_ONCE(x,v) | Compiler barrier + single store | — | — | volatile store |
x86 TSO (Total Store Order)
x86 guarantees that stores become visible to other CPUs in program order (no store-store reordering is observable). It does allow one reordering: a later load may complete before an earlier store is globally visible, because the store can still be sitting in the store buffer (and a CPU sees its own stores early via store forwarding). In practice this means smp_wmb() and smp_rmb() compile to nothing but a compiler barrier on x86. Only smp_mb() compiles to mfence.
On ARM and POWER the memory model is much weaker — any reordering is possible. Correct code must use explicit barriers on every platform, not just x86.
Pairing Rules — the Golden Rule
An smp_wmb() on the writer must pair with an smp_rmb() on the reader. An smp_store_release() must pair with an smp_load_acquire(). Unpaired barriers provide no ordering guarantee — you get the fence overhead with none of the safety.
Minimal C Demo — Memory Barrier Simulation
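A minimal user-space sketch of the flag-publication pattern from §6.3, using C11 fences in place of the kernel's smp_wmb()/smp_rmb(). File name, values, and thread structure are illustrative.

```c
/* barrier_demo.c — message passing with explicit fences (user-space sketch).
 * Build: gcc -O2 -pthread barrier_demo.c -o barrier_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;           /* payload, plain variable                 */
static atomic_int ready;   /* flag published after the payload        */

static void *producer(void *arg)
{
    data = 99;                                           /* 1. write payload */
    atomic_thread_fence(memory_order_release);           /* 2. ~smp_wmb()    */
    atomic_store_explicit(&ready, 1, memory_order_relaxed); /* 3. publish    */
    return NULL;
}

static void *consumer(void *arg)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                                /* 1. poll flag     */
    atomic_thread_fence(memory_order_acquire);           /* 2. ~smp_rmb()    */
    printf("data = %d\n", data);                         /* 3. guaranteed 99 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```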
Minimal C Demo — READ_ONCE / WRITE_ONCE
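A sketch of user-space stand-ins for the kernel's READ_ONCE/WRITE_ONCE (volatile-cast accessors). This only demonstrates the compiler-barrier aspect — forcing a single, un-cached, un-torn access — not CPU ordering; macro definitions mirror the kernel's but are written here for illustration.

```c
/* read_write_once.c — user-space stand-ins for READ_ONCE / WRITE_ONCE.
 * Build: gcc -O2 -pthread read_write_once.c -o read_write_once
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Volatile casts force exactly one full-width load/store and forbid the
 * compiler from caching the value in a register or tearing the access. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int stop_flag;

static void *worker(void *arg)
{
    long spins = 0;
    /* Without READ_ONCE, -O2 may hoist the load out of the loop and spin
     * forever on a stale register copy of stop_flag. */
    while (!READ_ONCE(stop_flag))
        spins++;
    printf("worker saw stop after %ld spins\n", spins);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);
    WRITE_ONCE(stop_flag, 1);   /* single, un-torn store the worker will see */
    pthread_join(t, NULL);
    return 0;
}
```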
Minimal C Demo — C11 memory_order: release/acquire
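A sketch of release/acquire publication in standard C11 — the portable analogue of smp_store_release()/smp_load_acquire(). Struct and field names are illustrative.

```c
/* acq_rel.c — publish an initialized object with release/acquire ordering.
 * Build: gcc -O2 -pthread acq_rel.c -o acq_rel
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct msg { int a, b; };

static struct msg payload;
static _Atomic(struct msg *) published;   /* starts NULL */

static void *writer(void *arg)
{
    payload.a = 1;
    payload.b = 2;
    /* release: all stores above become visible before the pointer is seen */
    atomic_store_explicit(&published, &payload, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    struct msg *m;
    /* acquire: once the pointer is seen, the initialized fields are too */
    while (!(m = atomic_load_explicit(&published, memory_order_acquire)))
        ;
    printf("a=%d b=%d\n", m->a, m->b);    /* always prints a=1 b=2 */
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```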
4. §6.4 — Atomic Operations
Memory barriers tell the CPU and compiler about ordering. But they do not make an increment atomic — two CPUs can still read the same value, both add 1, and write back the same result (lost update). Atomic operations solve this by combining the read-modify-write into a single indivisible bus transaction. On x86 this uses the LOCK prefix; on ARM it uses ldxr/stxr (load-linked / store-conditional).
atomic_t, atomic64_t, refcount_t
| Operation | API | Notes |
|---|---|---|
| Read | atomic_read(&v) | READ_ONCE wrapper — prevents compiler from caching |
| Write | atomic_set(&v, i) | WRITE_ONCE wrapper |
| Increment | atomic_inc(&v) | Returns void; for RMW with result use atomic_inc_return() |
| Decrement & test | atomic_dec_and_test(&v) | Returns true if result is zero — used in kref release check |
| Add & return | atomic_add_return(i, &v) | Returns new value after add; implies full barrier |
| Compare & exchange | atomic_cmpxchg(&v, old, new) | Returns old value; swap only if *v == old |
| Fetch & add | atomic_fetch_add(i, &v) | Returns value before add (C11: fetch_add) |
| OR / AND / XOR | atomic_or(mask, &v) | Bitwise ops; useful for flag bitmaps |
Core Mechanism: Compare-and-Swap (CAS) Loop
You want to apply f(old) → new to a shared variable without taking a lock. CAS lets you read the old value, compute the new one, then attempt to commit — succeeding only if no other CPU changed the variable in between.
- Load the current value into old
- Compute new = f(old)
- Execute cmpxchg(&v, old, new): atomically swap only if v == old
- If cmpxchg returns old → success; otherwise another CPU raced us — retry from step 1
Minimal C Demo — CAS Lock-Free Counter
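A user-space sketch of the four-step CAS loop above, using C11 compare_exchange in place of the kernel's cmpxchg(). Thread and iteration counts are illustrative.

```c
/* cas_counter.c — lock-free increment via a compare-and-swap retry loop.
 * Build: gcc -O2 -pthread cas_counter.c -o cas_counter
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static atomic_long counter;

static void cas_add(atomic_long *v, long delta)
{
    long old = atomic_load_explicit(v, memory_order_relaxed);
    long new;
    do {
        new = old + delta;                     /* new = f(old)              */
        /* On failure, compare_exchange_weak reloads 'old' for us and we
         * retry — exactly the cmpxchg loop described in the text.          */
    } while (!atomic_compare_exchange_weak(v, &old, new));
}

static void *worker(void *arg)
{
    for (int i = 0; i < NITER; i++)
        cas_add(&counter, 1);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n",
           atomic_load(&counter), NTHREADS * NITER);
    return 0;
}
```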
kref / refcount_t — Reference Counting
kref wraps a single atomic_t with a lifecycle discipline: every caller that uses an object takes a reference (kref_get) and releases it when done (kref_put). When the count reaches zero, kref_put calls the release function to free the object. refcount_t (since kernel 4.11) adds saturation — on overflow or underflow the counter sticks at a saturation value and warns instead of wrapping around, preventing use-after-free bugs from counter wraparound.
Minimal C Demo — kref / refcount_t Lifecycle
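A kernel-module-style sketch of the kref lifecycle described above. The module name, struct my_obj, and its fields are illustrative, not from the kernel tree.

```c
/* kref_demo.c — take, share and drop references on a kref-counted object. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/kref.h>

struct my_obj {
    struct kref ref;
    int payload;
};

static void my_obj_release(struct kref *kref)
{
    struct my_obj *obj = container_of(kref, struct my_obj, ref);
    pr_info("kref_demo: last reference dropped, freeing\n");
    kfree(obj);
}

static int __init kref_demo_init(void)
{
    struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);
    if (!obj)
        return -ENOMEM;

    kref_init(&obj->ref);                 /* count = 1: creator's reference  */
    kref_get(&obj->ref);                  /* count = 2: a second user        */

    kref_put(&obj->ref, my_obj_release);  /* count = 1                       */
    kref_put(&obj->ref, my_obj_release);  /* count = 0 -> release() runs     */
    return 0;
}

static void __exit kref_demo_exit(void) { }

module_init(kref_demo_init);
module_exit(kref_demo_exit);
MODULE_LICENSE("GPL");
```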
5. §6.5 — Counting at Scale
A single atomic_inc() call on a shared counter causes a cache-line round trip on every call (§6.2). At scale — 16 CPUs each doing millions of increments per second — that single cache line becomes the bottleneck for the entire machine. The kernel solves this with per-CPU counters.
percpu_counter vs local_t vs atomic_t
| Type | Write cost | Read cost | Exact? | Use case |
|---|---|---|---|---|
atomic_t | ~100 ns (bus lock) | ~1 ns | Yes | Reference counts, small shared flags |
percpu_counter | ~1 ns (local inc) | O(NR_CPUS) | Approximate (batch threshold) | FS block/inode usage, network stats |
local_t | ~1 ns (no atomic) | Per-CPU only | Per-CPU exact, global inexact | Perf event local counts |
DEFINE_PER_CPU(long) | ~1 ns | O(NR_CPUS) sum | Exact if summed carefully | Kernel stats, DPDK per-core counters |
percpu_counter_add() accumulates increments locally until the per-CPU delta hits ±percpu_counter_batch (default 32 on small systems, scaling with NR_CPUS); only then does it propagate to the global sum. A reader using the fast percpu_counter_read() may therefore see a value off by up to NR_CPUS × batch (the exact but expensive alternative, percpu_counter_sum(), adds up every CPU's slot). For filesystem block usage the approximation is acceptable; for reference counts it is not.
Minimal C Demo — Per-CPU Counter
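A kernel-module-style sketch of the percpu_counter API described above; the counter name and iteration count are illustrative.

```c
/* pcpu_counter_demo.c — batched per-CPU counter: cheap adds, two read paths. */
#include <linux/module.h>
#include <linux/percpu_counter.h>

static struct percpu_counter events;

static int __init pcpu_demo_init(void)
{
    int i, ret;

    ret = percpu_counter_init(&events, 0, GFP_KERNEL);
    if (ret)
        return ret;

    for (i = 0; i < 1000; i++)
        percpu_counter_add(&events, 1);   /* usually a local, lock-free add */

    /* fast approximate read of the global count vs. exact (expensive) sum */
    pr_info("pcpu_demo: approx=%lld exact=%lld\n",
            percpu_counter_read(&events),
            percpu_counter_sum(&events));
    return 0;
}

static void __exit pcpu_demo_exit(void)
{
    percpu_counter_destroy(&events);
}

module_init(pcpu_demo_init);
module_exit(pcpu_demo_exit);
MODULE_LICENSE("GPL");
```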
6. §6.6 — Locking Patterns
Choosing the wrong synchronization primitive is one of the most common performance mistakes in kernel and systems code. The decision tree below encodes the key questions — each primitive has a specific niche, and using a heavier one than necessary wastes CPU cycles on every call.
| Primitive | Context | Sleep inside? | Overhead | Kernel example |
|---|---|---|---|---|
spinlock_t | Any (including IRQ) | Never | ~5–20 ns (uncontended) | sk_buff queues, run-queue lock |
mutex | Process context only | Yes (blocking) | ~100 ns + ctx switch | inode i_mutex, mm mmap_lock |
rwlock_t | Any (no sleep) | Never | ~10–30 ns | tasklist_lock, file table |
seqlock | Any, read-heavy numerics | Never | Read: near-zero if no writer | jiffies, gettimeofday |
RCU | Any (read side) | Never (read side) | Read: near-zero; update: ms-range grace period | network routing table, task list |
atomic_t | Any | Never | ~5 ns (bus lock) | file descriptor refcount |
per-CPU | Any (preempt disabled) | Never | ~1 ns (local) | FS stats, net device stats |
Deadlock Prevention: Lock Ordering
Thread A holds L1 and waits for L2. Thread B holds L2 and waits for L1. Neither can proceed — circular wait. The fix: every code path that needs both L1 and L2 must acquire L1 first. The kernel documents this in comments (e.g. "always take i_lock before page lock"). The tool lockdep tracks acquisition order at runtime and fires a warning on any cycle.
Minimal C Demo — Lock Ordering
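A user-space sketch of enforcing a global order by lock address, so the AB-BA cycle can never form. The account structure and transfer function are illustrative.

```c
/* lock_order.c — deadlock avoidance by acquiring locks in address order.
 * Build: gcc -O2 -pthread lock_order.c -o lock_order
 */
#include <pthread.h>
#include <stdio.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Always lock the lower-addressed account first; every caller follows the
 * same rule, so no two paths can take the pair in opposite orders. */
static void transfer(struct account *from, struct account *to, long amount)
{
    struct account *first  = (from < to) ? from : to;
    struct account *second = (from < to) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    from->balance -= amount;
    to->balance   += amount;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

int main(void)
{
    struct account a = { PTHREAD_MUTEX_INITIALIZER, 100 };
    struct account b = { PTHREAD_MUTEX_INITIALIZER, 100 };

    transfer(&a, &b, 30);   /* locks taken in one global order...           */
    transfer(&b, &a, 10);   /* ...and the same order again despite the swap */
    printf("a=%ld b=%ld\n", a.balance, b.balance);
    return 0;
}
```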
trylock + Release-on-Fail Pattern
When a code path must acquire two locks but cannot enforce a global order (e.g., the locks belong to different subsystems with incompatible hierarchies), use trylock: attempt to take the second lock; on failure, release the first and restart. Add a short backoff (cpu_relax() or cond_resched()) to avoid livelock.
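A user-space sketch of the trylock-and-back-off pattern just described, with pthread mutexes standing in for the two subsystem locks.

```c
/* trylock_backoff.c — take A, try B; on failure drop A, back off, retry.
 * Build: gcc -O2 -pthread trylock_backoff.c -o trylock_backoff
 */
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void lock_both(void)
{
    for (;;) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0)
            return;                       /* got both — done                 */
        /* B is busy: release A so its holder can make progress, back off,
         * then restart to avoid livelock.                                   */
        pthread_mutex_unlock(&lock_a);
        sched_yield();                    /* stand-in for cpu_relax()        */
    }
}

static void unlock_both(void)
{
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

int main(void)
{
    lock_both();
    unlock_both();
    return 0;
}
```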
Priority Inversion & PI Mutexes
A high-priority task blocks on a mutex held by a low-priority task. A medium-priority task preempts the low-priority holder, delaying the high-priority task indefinitely. The kernel's rt_mutex uses priority inheritance: while a task holds an rt_mutex, it temporarily runs at the highest priority of any waiter. This prevents medium priority tasks from causing unbounded delays. POSIX threads support this via PTHREAD_PRIO_INHERIT.
7. §6.7 — Spinlock Internals & Variants
A spinlock is the lowest-level mutual exclusion primitive in the kernel — it busy-waits instead of sleeping, making it safe in interrupt context where sleeping is forbidden. Three generations exist in the kernel, each fixing the previous one's scalability problem.
Generation 1: Test-and-Set — Simple but Unfair & Slow
Every waiting CPU repeatedly executes an atomic write (test-and-set) to the shared lock variable. Even when the lock is held and the attempt will fail, it still writes — generating a MESI invalidation that forces every other waiter to refetch the cache line before its next spin. With N waiters, each spin triggers N−1 invalidation round trips.
Generation 2: Ticket Spinlock — Fair but Still Bounces
A ticket spinlock assigns each caller a number (like a deli counter). The lock holds next (next ticket to issue) and owner (currently serving). Waiters spin on read-only copies of owner — no write per spin. Fair: CPUs are served FIFO. But unlock() increments owner, invalidating all spinners at once.
Generation 3: MCS Lock — Fully Local Spinning
Each waiting CPU allocates its own mcs_spinlock node and spins on its own locked field. On unlock, the holder writes only the direct successor's node — exactly one cache-line transfer.
qspinlock — Linux Default (kernel ≥ 4.2)
qspinlock combines all three ideas into a single 32-bit word. Uncontended acquires cost one CAS. A second waiter sets the pending bit and spins on the owner bit (no node allocation). The third and beyond join an MCS queue. This makes the common case (no contention) extremely cheap while scaling gracefully under high contention.
Spinlock Variants — When to Use Which
| Variant | Also disables | Restores? | Use when |
|---|---|---|---|
spin_lock(l) | Nothing extra | — | You know IRQs and BH are already off |
spin_lock_irqsave(l, flags) | Local IRQs | Yes (flags) | Lock is also used in IRQ handlers on this CPU |
spin_lock_bh(l) | Softirqs / tasklets | Yes | Lock is also used in softirq / BH context |
spin_trylock(l) | Nothing | — | Non-blocking try — return 0 on failure, no deadlock risk |
Rule of thumb: any lock taken in both process context and IRQ context must use spin_lock_irqsave(). Without it, the IRQ can fire after the process-context code takes the lock and deadlock on the same lock.
Minimal C Demo — Test-and-Set Spinlock
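A user-space sketch of the generation-1 test-and-set lock: every spin is an atomic write, so waiters keep invalidating each other's copy of the lock's cache line. Thread and iteration counts are illustrative.

```c
/* tas_spinlock.c — naive test-and-set spinlock.
 * Build: gcc -O2 -pthread tas_spinlock.c -o tas_spinlock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void tas_lock(void)
{
    /* test_and_set returns the previous value: loop until it was clear */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin — each iteration writes the shared cache line */
}

static void tas_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        tas_lock();
        counter++;                 /* critical section */
        tas_unlock();
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 */
    return 0;
}
```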
Minimal C Demo — MCS Lock Acquire / Release
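A user-space sketch of an MCS queue lock: each waiter spins on its own node's locked flag, and unlock touches only the successor's node. This is a textbook MCS implementation, not the kernel's mcs_spinlock code; seq_cst atomics are used for simplicity.

```c
/* mcs_lock.c — MCS queue lock with local spinning.
 * Build: gcc -O2 -pthread mcs_lock.c -o mcs_lock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

static _Atomic(struct mcs_node *) tail;   /* NULL = lock is free */
static long counter;

static void mcs_lock(struct mcs_node *me)
{
    struct mcs_node *prev;

    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    prev = atomic_exchange(&tail, me);        /* join the queue             */
    if (prev) {
        atomic_store(&prev->next, me);        /* link behind predecessor    */
        while (atomic_load(&me->locked))      /* spin on OWN cache line     */
            ;
    }
}

static void mcs_unlock(struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);

    if (!succ) {
        struct mcs_node *expected = me;
        /* no successor visible: try to reset the lock to empty */
        if (atomic_compare_exchange_strong(&tail, &expected, NULL))
            return;
        while (!(succ = atomic_load(&me->next)))  /* successor is linking   */
            ;
    }
    atomic_store(&succ->locked, false);       /* hand off: one line touched */
}

static void *worker(void *arg)
{
    struct mcs_node node;
    for (int i = 0; i < 100000; i++) {
        mcs_lock(&node);
        counter++;
        mcs_unlock(&node);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```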
8. §6.8 — Sleeping Locks: Mutex & Semaphore
When a critical section may hold a lock for milliseconds (e.g., disk I/O, copying from user space), busy-spinning wastes an entire CPU. A mutex sleeps the waiter and wakes it on unlock. The kernel's struct mutex adds adaptive (optimistic) spinning: if the lock owner is currently running on another CPU, a short spin can avoid the costly context switch.
struct mutex Key Fields
| Field | Type | Purpose |
|---|---|---|
owner | atomic_long_t | Task pointer of current owner + flag bits (hand-off, pickup, wait) |
wait_lock | raw_spinlock_t | Protects the wait_list; held briefly inside mutex_lock/unlock |
wait_list | struct list_head | FIFO queue of mutex_waiter structs (sleeping waiters) |
osq | struct optimistic_spin_queue | MCS queue for optimistic spinners (avoids thundering herd on wakeup) |
rw_semaphore — Reader-Writer Sleeping Lock
struct rw_semaphore allows multiple concurrent readers but exclusive writers. Key policy: once a writer is waiting, new readers are blocked (writer-preferring) to prevent writer starvation. percpu_rw_semaphore goes further — readers take a per-CPU lock (fast path), writers do a stop-machine-like barrier (slow path). Used for mm->mmap_lock, task_list_lock, and kernel module loading.
Minimal C Demo — Mutex Adaptive Spinning
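A user-space sketch of the spin-then-sleep idea. The real kernel mutex spins only while the owner is running on another CPU; this sketch approximates that with a fixed spin budget (SPIN_LIMIT is illustrative, not a kernel constant).

```c
/* adaptive_mutex.c — try briefly without sleeping, then block.
 * Build: gcc -O2 -pthread adaptive_mutex.c -o adaptive_mutex
 */
#include <pthread.h>
#include <stdio.h>

#define SPIN_LIMIT 1000   /* illustrative bound */

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long shared;

static void adaptive_lock(void)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(&m) == 0)
            return;                             /* acquired while spinning   */
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier, like cpu_relax() */
    }
    pthread_mutex_lock(&m);                     /* give up spinning: block   */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        adaptive_lock();
        shared++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("shared = %ld\n", shared);
    return 0;
}
```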
9. §6.9 — Sequence Locks
A seqlock gives readers a lock-free fast path at the cost of possible retries. Writers hold an internal spinlock and increment a sequence counter (even → odd → even). Readers snapshot the counter before and after reading — if it changed, a writer was active and they retry. Used pervasively for time-keeping: jiffies, xtime, and the VDSO clock_gettime() fast path.
seqlock vs RCU vs rwlock
| Primitive | Read cost (no writer) | Read cost (writer active) | Protects pointers? | Writer waits for readers? |
|---|---|---|---|---|
seqlock | Near-zero (no lock) | Retry (cheap) | No — never use for pointers | No — readers retry |
rwlock | ~10 ns (bus lock on read) | Waits for writer or readers | Yes | Yes (writer waits for readers) |
RCU | Near-zero (preempt disable) | No wait — reads old version | Yes — designed for this | No — readers never block |
Never use a seqlock to protect pointers: between read_seqbegin() and read_seqretry(), a writer can free the object the reader just dereferenced. The reader will detect the retry condition — but only after it has already accessed freed memory. Use RCU when the protected data contains pointers.
Minimal C Demo — seqlock Reader & Writer
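A user-space sketch of the seqlock protocol (odd = write in progress; readers retry on a changed sequence). It uses conservative seq_cst fences; the plain-field accesses are technically data races in strict C11, which the kernel version handles with READ_ONCE/WRITE_ONCE and its barrier macros. The invariant check and iteration counts are illustrative.

```c
/* seqlock_demo.c — writer bumps the sequence around an update; readers retry.
 * Build: gcc -O2 -pthread seqlock_demo.c -o seqlock_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;                        /* even = stable, odd = writing */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static long tv_sec, tv_nsec;                   /* small plain-value payload    */

static void write_time(long s, long ns)
{
    pthread_mutex_lock(&write_lock);           /* serialize writers            */
    atomic_fetch_add(&seq, 1);                 /* -> odd                       */
    atomic_thread_fence(memory_order_seq_cst);
    tv_sec = s;
    tv_nsec = ns;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add(&seq, 1);                 /* -> even                      */
    pthread_mutex_unlock(&write_lock);
}

static void read_time(long *s, long *ns)
{
    unsigned start;
    do {
        while ((start = atomic_load(&seq)) & 1)
            ;                                  /* wait out an active writer   */
        atomic_thread_fence(memory_order_seq_cst);
        *s  = tv_sec;
        *ns = tv_nsec;
        atomic_thread_fence(memory_order_seq_cst);
    } while (atomic_load(&seq) != start);      /* writer raced us: retry      */
}

static void *writer(void *arg)
{
    for (long i = 1; i <= 100000; i++)
        write_time(i, i + 1);                  /* maintain invariant ns == s+1 */
    return NULL;
}

int main(void)
{
    pthread_t w;
    long s, ns;

    pthread_create(&w, NULL, writer, NULL);
    for (int i = 0; i < 100000; i++) {
        read_time(&s, &ns);
        if (ns != s + 1) {                     /* a torn snapshot would break this */
            printf("inconsistent snapshot!\n");
            return 1;
        }
    }
    pthread_join(w, NULL);
    printf("all snapshots consistent, last = %ld\n", s);
    return 0;
}
```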
10. §6.10 — RCU: Read-Copy-Update
RCU is the most important synchronization primitive introduced in the Linux kernel (2.5.43, 2002). The core insight: allow reads without any synchronization at all, and defer the cost of synchronization to the write side. Readers pay only a preemption disable (no cache line transfer, no atomic). Writers publish a new version atomically and wait for a grace period before freeing the old one.
RCU API — at a Glance
| API | Side | What it does |
|---|---|---|
rcu_read_lock() | Reader | Opens an RCU read-side critical section; on non-preemptible RCU configurations this is just preempt_disable(). |
rcu_read_unlock() | Reader | Re-enables preemption. The CPU has now passed a quiescent state. |
rcu_dereference(p) | Reader | READ_ONCE + dependency ordering barrier. Safe pointer load from RCU-protected location. |
rcu_assign_pointer(p, v) | Writer | smp_store_release: ensures new object is fully initialized before pointer is visible. |
synchronize_rcu() | Writer | Blocks until all CPUs have passed through at least one quiescent state (grace period). |
call_rcu(head, fn) | Writer | Schedules fn(head) to run after the grace period. Non-blocking — safe in IRQ context. |
kfree_rcu(ptr, field) | Writer | Helper: calls call_rcu then kfree. Common pattern for simple object reclamation. |
Core Mechanism: Publish-Subscribe + Grace Period
- Allocate and initialize the new object
- rcu_assign_pointer() — atomically publish; all readers that start after this see the new object
- Call synchronize_rcu() or call_rcu() — wait for all readers that still hold the old pointer to finish
- Free the old object — now safe because no reader can reach it
Grace Period: How Does the Kernel Know Readers Are Done?
A quiescent state is a point at which a CPU cannot be inside an RCU read-side critical section: a context switch, entry to userspace, or the idle loop. The kernel's RCU subsystem watches for each CPU to pass through at least one quiescent state after the writer called synchronize_rcu(). Once every CPU has done so, the grace period is complete and the old object can be freed.
RCU-Protected Linked Lists
The kernel provides a full set of RCU-safe list operations in include/linux/rculist.h. list_del_rcu() unlinks the node but does not free it — a concurrent reader might still hold a pointer to it. Only after the grace period is it safe to free.
SRCU — Sleepable RCU
Standard RCU forbids sleeping in read-side critical sections. When sleeping is needed (e.g., waiting for user-space to respond), srcu_struct provides a per-domain sleepable variant. Each domain has its own grace-period tracking, so a slow reader only delays reclamation in its own domain. Used in KVM, eBPF, and file system notifications.
Practical Uses in the Linux Kernel
| Subsystem | What is RCU-protected | Write path |
|---|---|---|
| IP routing | fib_table route entries | netlink route add/del with write lock |
| Network namespace | net namespace list traversal | unshare(CLONE_NEWNET) |
| PID lookup | find_task_by_vpid() — task_struct | fork/exit with tasklist_lock write |
| Dentry cache | dcache negative entry lookup | d_add() / d_drop() with d_lock |
| Module loading | module list traversal | module_mutex + RCU publish |
Minimal C Demo — RCU Reader & Writer
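A kernel-module-style sketch of the publish/read/reclaim cycle for a single RCU-protected object. struct config, the field names, and the single-updater assumption are illustrative.

```c
/* rcu_demo.c — copy, publish, wait for a grace period, free. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>

struct config {
    int threshold;
};

static struct config __rcu *cur_config;

/* Reader: lock-free, may run concurrently with an update */
static int read_threshold(void)
{
    struct config *c;
    int val = -1;

    rcu_read_lock();
    c = rcu_dereference(cur_config);
    if (c)
        val = c->threshold;
    rcu_read_unlock();
    return val;
}

/* Writer (assumed single updater): copy-update-publish, then reclaim */
static int update_threshold(int value)
{
    struct config *newc, *oldc;

    newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    if (!newc)
        return -ENOMEM;
    newc->threshold = value;

    oldc = rcu_dereference_protected(cur_config, 1); /* we are the only updater */
    rcu_assign_pointer(cur_config, newc);   /* publish fully-initialized object */
    synchronize_rcu();                      /* wait for pre-existing readers    */
    kfree(oldc);                            /* now unreachable — safe to free   */
    return 0;
}

static int __init rcu_demo_init(void)
{
    update_threshold(10);
    pr_info("rcu_demo: threshold=%d\n", read_threshold());
    return 0;
}

static void __exit rcu_demo_exit(void)
{
    struct config *c = rcu_dereference_protected(cur_config, 1);

    rcu_assign_pointer(cur_config, NULL);
    synchronize_rcu();
    kfree(c);
}

module_init(rcu_demo_init);
module_exit(rcu_demo_exit);
MODULE_LICENSE("GPL");
```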
Minimal C Demo — RCU-Protected Linked List
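A kernel-module-style sketch of an RCU-protected list: readers traverse locklessly, the writer serializes with a spinlock, and frees are deferred with kfree_rcu(). struct item and the key values are illustrative.

```c
/* rculist_demo.c — lockless lookup, locked update, deferred free. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

struct item {
    int key;
    struct list_head node;
    struct rcu_head rcu;
};

static LIST_HEAD(items);
static DEFINE_SPINLOCK(items_lock);           /* serializes writers only */

static bool lookup(int key)
{
    struct item *it;
    bool found = false;

    rcu_read_lock();
    list_for_each_entry_rcu(it, &items, node) {
        if (it->key == key) {
            found = true;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

static int add_item(int key)
{
    struct item *it = kmalloc(sizeof(*it), GFP_KERNEL);

    if (!it)
        return -ENOMEM;
    it->key = key;

    spin_lock(&items_lock);
    list_add_rcu(&it->node, &items);          /* publish to concurrent readers */
    spin_unlock(&items_lock);
    return 0;
}

static void del_item(struct item *it)
{
    spin_lock(&items_lock);
    list_del_rcu(&it->node);                  /* unlink — readers may still see it */
    spin_unlock(&items_lock);
    kfree_rcu(it, rcu);                       /* free only after the grace period  */
}

static int __init rculist_demo_init(void)
{
    add_item(1);
    add_item(2);
    pr_info("rculist_demo: lookup(2)=%d\n", lookup(2));
    return 0;
}

static void __exit rculist_demo_exit(void)
{
    struct item *it, *tmp;

    list_for_each_entry_safe(it, tmp, &items, node)
        del_item(it);
    rcu_barrier();   /* wait for pending kfree_rcu callbacks before unload */
}

module_init(rculist_demo_init);
module_exit(rculist_demo_exit);
MODULE_LICENSE("GPL");
```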
11. §6.11 — Lock-Free Data Structures
Lock-free data structures use compare-and-swap (CAS) loops instead of locks. If a CAS fails (another CPU changed the value), the thread retries — no thread can be blocked forever by the scheduler. Lock-free structures appear throughout hot paths: the kernel's RCU-protected lists and per-CPU run-queue handling, and DPDK's rte_ring in user space.
Treiber Stack — Lock-Free Push & Pop
The simplest lock-free structure. A single pointer top is updated atomically. Both push and pop are CAS loops: read the current top, compute the new value, CAS — retry if another CPU changed top first.
The ABA Problem
CAS checks value equality, not identity. If a pointer value is reused (address A is freed then reallocated back to A), a stale CAS will succeed even though the structure changed underneath. The standard fix is to embed a version counter in the same word as the pointer (double-word CAS, or tagged pointers using low bits). The kernel's kfree_rcu sidesteps ABA entirely by deferring frees until all readers finish.
Minimal C Demo — Treiber Stack
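A user-space sketch of the Treiber stack's CAS loops. To stay out of the ABA trap discussed in the next subsection, this sketch deliberately never frees popped nodes; safe reclamation needs RCU or hazard pointers.

```c
/* treiber_stack.c — lock-free LIFO push/pop.
 * Build: gcc -O2 -pthread treiber_stack.c -o treiber_stack
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

static _Atomic(struct node *) top;

static void push(int value)
{
    struct node *n = malloc(sizeof(*n));
    n->value = value;
    n->next = atomic_load(&top);
    /* retry until nobody changed 'top' between our load and the CAS;
     * on failure, n->next is reloaded with the current top for us. */
    while (!atomic_compare_exchange_weak(&top, &n->next, n))
        ;
}

static int pop(int *value)
{
    struct node *old = atomic_load(&top);
    while (old) {
        if (atomic_compare_exchange_weak(&top, &old, old->next)) {
            *value = old->value;
            /* NOTE: not freed — safe reclamation needs RCU/hazard pointers */
            return 1;
        }
        /* 'old' was reloaded by the failed CAS — retry */
    }
    return 0;   /* stack empty */
}

int main(void)
{
    int v;
    push(1);
    push(2);
    push(3);
    while (pop(&v))
        printf("%d\n", v);   /* 3 2 1 */
    return 0;
}
```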
Minimal C Demo — ABA Problem
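A user-space sketch of the tagged-word defense against ABA: the "pointer" is a 32-bit slot index packed next to a 32-bit generation tag in one 64-bit atomic, so a slot that is popped and recycled carries a different tag and a stale CAS fails. The pool size and helper names are illustrative.

```c
/* aba_tagged.c — index + generation tag in a single 64-bit CAS word.
 * Build: gcc -O2 aba_tagged.c -o aba_tagged
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NIL 0xFFFFFFFFu

struct node { uint32_t next; int value; };

static struct node pool[8];
static atomic_uint_fast64_t top;   /* low 32 bits: index, high 32 bits: tag */

static uint64_t pack(uint32_t idx, uint32_t tag) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t v) { return (uint32_t)v; }
static uint32_t tag_of(uint64_t v) { return (uint32_t)(v >> 32); }

static void push(uint32_t idx)
{
    uint64_t old = atomic_load(&top);
    do {
        pool[idx].next = idx_of(old);
        /* bump the tag on every successful update: a recycled slot gets a
         * different tag, so a stale CAS remembering the old pair fails. */
    } while (!atomic_compare_exchange_weak(&top, &old,
                                           pack(idx, tag_of(old) + 1)));
}

static int pop(uint32_t *idx)
{
    uint64_t old = atomic_load(&top);
    while (idx_of(old) != NIL) {
        uint32_t i = idx_of(old);
        if (atomic_compare_exchange_weak(&top, &old,
                                         pack(pool[i].next, tag_of(old) + 1))) {
            *idx = i;
            return 1;
        }
    }
    return 0;   /* empty */
}

int main(void)
{
    uint32_t i;

    atomic_store(&top, pack(NIL, 0));
    push(0); push(1); push(2);
    while (pop(&i))
        printf("popped slot %u\n", i);   /* 2 1 0 */
    return 0;
}
```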
Michael-Scott Queue (MPMC Lock-Free)
A lock-free FIFO queue with a head (consumer side) and tail (producer side), each independently CAS-ed. Enqueue: CAS the last node's next pointer to the new node, then swing tail forward. Dequeue: CAS head forward. Used in DPDK's rte_ring and Java's ConcurrentLinkedQueue.
| Property | Treiber Stack | MS Queue | rte_ring (DPDK) |
|---|---|---|---|
| Ordering | LIFO | FIFO | FIFO |
| Producers | MPSC or MPMC | MPMC | SPSC / MPMC modes |
| Consumers | MPMC | MPMC | SPSC / MPMC modes |
| ABA risk | Yes — use tagged ptr | Yes — use tagged ptr | Index-based (no ptr) — immune |
| Memory reclamation | Hazard pointers / RCU | Hazard pointers / RCU | N/A — ring slots pre-allocated |
| Kernel usage | Lock-free lists (rculist) | Rare — RCU preferred | DPDK data plane |
12. §6.12 — Per-CPU Data
Per-CPU variables give each CPU its own private copy of a variable. Accesses require no synchronization at all — no lock, no atomic, no cache-line bounce. The kernel uses them for run-queues, statistics counters, softirq work queues, and slab allocator magazines.
API at a Glance
| Macro / Function | What it does | Preemption safe? |
|---|---|---|
DEFINE_PER_CPU(type, name) | Declare a per-CPU variable in the per-CPU section | — |
DEFINE_PER_CPU_ALIGNED(type, name) | Same, but aligned to cache line — prevents false sharing | — |
get_cpu_var(name) | Disable preemption + return reference to this CPU's copy | Yes — must call put_cpu_var |
put_cpu_var(name) | Re-enable preemption after get_cpu_var | Yes |
this_cpu_read(name) | Read own CPU's copy — preemption must already be disabled | No — caller responsible |
this_cpu_inc(name) | Increment own copy atomically (non-preemptible context) | No |
per_cpu(name, cpu) | Access another CPU's copy — needs lock or irqsave | No — use carefully |
alloc_percpu(type) | Dynamic per-CPU allocation (returns __percpu pointer) | — |
free_percpu(ptr) | Free dynamic per-CPU allocation | — |
Why Per-CPU Beats atomic_t at Scale
Every atomic_inc() generates a bus-locked LOCK XADD on x86. With 64 CPUs all hitting the same counter, the cache line bounces between them — throughput degrades as O(N). With per-CPU: each CPU writes its own cache line, zero coherence traffic — throughput scales as O(1). The read cost is O(NR_CPUS) (summing all slots), but reads are rare for statistics counters.
Minimal C Demo — Per-CPU Counter Simulation
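A user-space simulation of the per-CPU idea: each thread owns one cache-line-padded slot, increments never contend, and a read sums every slot. Thread and iteration counts are illustrative.

```c
/* percpu_sim.c — per-thread counter slots, padded to avoid false sharing.
 * Build: gcc -O2 -pthread percpu_sim.c -o percpu_sim
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

/* one slot per thread, padded/aligned to 64 bytes to avoid false sharing */
struct slot {
    long count;
    char pad[64 - sizeof(long)];
} __attribute__((aligned(64)));

static struct slot slots[NTHREADS];

static void *worker(void *arg)
{
    struct slot *mine = &slots[(long)arg];
    for (int i = 0; i < NITER; i++)
        mine->count++;            /* plain increment: no atomic, no bouncing */
    return NULL;
}

static long read_total(void)
{
    long sum = 0;
    for (int i = 0; i < NTHREADS; i++)   /* O(NR_CPUS)-style summation */
        sum += slots[i].count;
    return sum;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld (expected %d)\n", read_total(), NTHREADS * NITER);
    return 0;
}
```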
Always bracket the access with get_cpu_var / put_cpu_var — they disable preemption, preventing the scheduler from migrating you to another CPU between the get and the access. Never use this_cpu_inc in preemptible context — the CPU number can change after the read.
13. §6.13 — Synchronization Design Patterns
Choosing the wrong primitive costs performance or correctness. The decision tree below encodes the kernel community's conventional wisdom: start with per-CPU data, escalate to RCU for read-heavy, and only reach for a sleeping lock when the critical section is long or blocking.
Producer-Consumer: Ring Buffer Pattern
The most common real-world pattern in network drivers and audio subsystems.
| Variant | Sync needed | Kernel example |
|---|---|---|
| SPSC (1 producer, 1 consumer) | smp_wmb / smp_rmb only — no lock | pipe, kfifo (SPSC mode) |
| MPSC (N producers, 1 consumer) | Atomic CAS on head index | io_uring submission ring |
| MPMC (N producers, N consumers) | qspinlock or lock-free (MS queue) | rte_ring DPDK |
SPSC Ring — Barrier-Only Synchronization
The producer writes data, then advances head; the consumer reads data, then advances tail. No lock is needed because only one thread touches each index.
Plan:
- Producer: write data into buf[head % N]
- Producer: smp_wmb() — ensure data visible before index
- Producer: WRITE_ONCE(head, head + 1)
- Consumer: h = READ_ONCE(head)
- Consumer: smp_rmb() — ensure index seen before data
- Consumer: read buf[tail % N]
Why barriers: without smp_wmb() the CPU or compiler may publish the index before the data — the consumer would see the new index and read stale data. A sketch follows below.
An alternative to RCU for lock-free structures. Each thread announces the pointer it is currently accessing in a public hazard pointer slot. The reclaimer scans all hazard pointer slots before freeing — if any slot holds the address, it defers the free. Advantages over RCU: no grace-period concept, works without preemption semantics. Disadvantages: O(N) scan per free, and hazard pointers must be carefully ordered with the access (store-load barrier).
Common Anti-Patterns to Avoid
| Anti-pattern | Problem | Fix |
|---|---|---|
| Lock in IRQ handler for data shared with process context | If process context holds the lock, IRQ handler deadlocks on same CPU | Use spin_lock_irqsave in process context |
| mutex_lock inside spinlock critical section | mutex_lock may sleep → BUG: scheduling while atomic | Use spinlock for both, or restructure to avoid nesting |
| Bare pointer read under RCU without rcu_dereference() | Compiler may split the load; Alpha CPUs may speculate through pointer | Always use rcu_dereference() inside rcu_read_lock() |
| WRITE_ONCE missing on published flag | Compiler may cache flag in register; reader never sees update | Pair WRITE_ONCE (publisher) with READ_ONCE (subscriber) |
| Taking two locks in different orders in different code paths | Classic AB-BA deadlock; lockdep catches it at runtime | Document and enforce global lock ordering |
14. §6.14 — Debugging Concurrency Issues
Concurrency bugs (data races, deadlocks, priority inversion, false sharing) are the hardest class of kernel bugs to reproduce. The kernel ships with four major dynamic analysis tools — enable them during development, not production.
Tool Overview
| Tool | Detects | Config | Overhead |
|---|---|---|---|
| lockdep | Lock order cycles, missing unlock, irq-safe violations | CONFIG_LOCKDEP | ~10% CPU, ~1 MB memory |
| KCSAN | Data races (concurrent conflicting accesses) | CONFIG_KCSAN | ~10× slowdown — dev only |
| KASAN | Use-after-free, out-of-bounds (also useful for UAF in lock-free code) | CONFIG_KASAN | ~2× slowdown |
| perf c2c | False sharing (cache-to-cache transfers on shared lines) | perf c2c record/report | ~5% with sampling |
| ThreadSanitizer (user) | Data races in user-space lock-free code | gcc/clang -fsanitize=thread | ~5–15× slowdown |
lockdep — Lock Dependency Graph
lockdep builds a directed graph of lock acquisitions at runtime. An edge A → B means "lock A was held when lock B was acquired." If a cycle is ever detected (A → B and B → A from different code paths), lockdep fires WARN_ON with a full stack trace before the deadlock actually occurs — it detects the potential for deadlock, not just when it happens.
KCSAN — Kernel Concurrency Sanitizer
KCSAN instruments every load and store at compile time. At runtime it records concurrent accesses to the same address and reports a data race if at least one is a write. It uses a watchpoint-based approach — each CPU can hold a small number of watchpoints at a time, providing statistical coverage without full shadow memory.
- data_race(x) — suppress a known benign race (e.g., a statistical sample)
- READ_ONCE(x) / WRITE_ONCE(x) — suppress races that are intentionally race-free at the C level
- kcsan_nestable_atomic_begin/end() — mark atomic regions KCSAN should not instrument
perf c2c — False Sharing Detection
perf c2c record -- ./workload captures cache-to-cache transfer events (HITM — Hit In Modified state from another core). The report shows which cache lines are hot and which source lines produced the conflicting accesses. Typical fix: pad the hot variables to separate cache lines with __cacheline_aligned_in_smp or restructure structs so writer and reader fields do not share a line.
Minimal C Demo — lockdep Cycle Detection
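A kernel-module-style sketch that deliberately acquires two spinlocks in A→B order and then B→A. With CONFIG_LOCKDEP enabled, the second ordering should produce a "possible circular locking dependency" report even though no deadlock actually occurs here. Module and lock names are illustrative.

```c
/* lockdep_abba.c — provoke lockdep's AB-BA cycle detection. */
#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

static int __init abba_init(void)
{
    /* code path 1: A -> B (records the edge A -> B) */
    spin_lock(&lock_a);
    spin_lock(&lock_b);
    spin_unlock(&lock_b);
    spin_unlock(&lock_a);

    /* code path 2: B -> A (creates the cycle; lockdep warns here) */
    spin_lock(&lock_b);
    spin_lock(&lock_a);
    spin_unlock(&lock_a);
    spin_unlock(&lock_b);

    return 0;
}

static void __exit abba_exit(void) { }

module_init(abba_init);
module_exit(abba_exit);
MODULE_LICENSE("GPL");
```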
Debugging Workflow
Step 1 — Enable lockdep (CONFIG_LOCKDEP) and run your workload. All lock-ordering violations are caught here without reproducing the race.
Step 2 — Enable KCSAN (CONFIG_KCSAN). Run with multiple CPUs and high load. KCSAN samples statistically — run for minutes, not seconds.
Step 3 — perf c2c if throughput regressed. Profile, identify hot shared lines, add padding.
Step 4 — KASAN if UAF suspected in lock-free code (RCU callback frees before last reader, ABA pointer recycle).
15. Kernel Source Pointers
| File | Symbol / Macro | What it does |
|---|---|---|
include/linux/compiler.h | barrier() | Compiler-only fence — stops compiler reordering; no CPU instruction |
include/linux/compiler.h | READ_ONCE(), WRITE_ONCE() | Force single volatile-width load/store; prevent compiler optimizations |
include/asm-generic/barrier.h | smp_mb(), smp_rmb(), smp_wmb() | Full / read / write memory barriers; arch-specific implementations |
arch/x86/include/asm/barrier.h | smp_mb() → mfence | x86 TSO: smp_rmb/wmb are barriers only (no fence instruction) |
arch/arm64/include/asm/barrier.h | smp_mb() → dmb ish | ARM64 full inner-shareable domain barrier |
include/linux/atomic.h | smp_load_acquire(), smp_store_release() | One-directional barriers for publish-subscribe patterns |
tools/memory-model/ | LKMM litmus tests | Formal memory model; run with herd7 to verify ordering assumptions |
include/linux/atomic.h | atomic_t / atomic_inc / atomic_cmpxchg / atomic_dec_and_test | Core atomic API; arch-specific impls in arch/*/include/asm/atomic.h |
include/linux/refcount.h | refcount_t / refcount_inc / refcount_dec_and_test | Saturating refcount; CONFIG_REFCOUNT_FULL adds overflow detection |
include/linux/kref.h | kref / kref_init / kref_get / kref_put | Reference-counted object lifecycle; uses atomic_t internally |
lib/percpu_counter.c | percpu_counter_add / percpu_counter_sum | Per-CPU batched counter; batch size = percpu_counter_batch |
include/linux/spinlock.h | spin_lock / spin_unlock / spin_lock_irqsave | Spinlock API; arch-specific ticket/queued spinlock implementations |
include/linux/mutex.h | mutex_lock / mutex_unlock / mutex_trylock | Sleeping mutex; waiters sleep on the mutex wait_list; not usable in IRQ context
kernel/locking/lockdep.c | lockdep — lock dependency tracker | Detects lock order cycles and missing unlock at runtime (CONFIG_LOCKDEP) |
kernel/locking/rtmutex.c | rt_mutex — priority inheritance mutex | Prevents priority inversion; used by PREEMPT_RT and futex PI |
kernel/locking/qspinlock.c | queued_spin_lock() — fast / medium / slow paths | Fast: single CAS; medium: pending bit; slow: MCS queue. Default spinlock since kernel 4.2 |
kernel/locking/mcs_spinlock.h | mcs_spinlock / mcs_lock / mcs_unlock | Per-CPU queue node; each waiter spins on own locked field — O(1) bus traffic per unlock |
include/linux/spinlock.h | spin_lock_irqsave / spin_lock_bh / spin_trylock | irqsave: disables local IRQs + acquires lock; bh: disables softirqs + lock |
kernel/locking/mutex.c | mutex_lock / mutex_lock_interruptible / mutex_trylock | Adaptive spinning: optimistic spin if owner on CPU; sleep if owner blocked. mutex_osq for spinner queue |
include/linux/mutex.h | struct mutex — owner, wait_lock, wait_list, osq | owner field stores task pointer + flag bits (HANDOFF, PICKUP, WAITERS) |
include/linux/seqlock.h | seqlock_t / read_seqbegin / read_seqretry / write_seqlock | Seq counter: even=stable, odd=write active; seqcount_t is the lockless variant |
include/linux/rcupdate.h | rcu_read_lock / rcu_read_unlock / rcu_dereference / rcu_assign_pointer | Core RCU API; read-side is preemption-disable only — never acquires a lock |
include/linux/rcupdate.h | synchronize_rcu() / call_rcu() / kfree_rcu() | synchronize_rcu: blocking grace period; call_rcu: async callback; kfree_rcu: call_rcu + kfree |
kernel/rcu/tree.c | RCU tree hierarchy — rcu_node / rcu_data | Organizes CPUs into a fanout tree; grace period completes when all leaf nodes report quiescent |
include/linux/rculist.h | list_add_rcu / list_del_rcu / list_for_each_entry_rcu | RCU-safe list ops; del unlinks but does not free — free only after synchronize_rcu / call_rcu |
include/linux/srcu.h | srcu_struct / srcu_read_lock / synchronize_srcu | Sleepable RCU: allows sleeping in read-side critical section; per-domain grace period tracking |
include/linux/percpu.h | DEFINE_PER_CPU / get_cpu_var / put_cpu_var / this_cpu_inc | Per-CPU variable API; get_cpu_var disables preemption; this_cpu_* are preemption-unsafe shortcuts |
include/linux/percpu.h | alloc_percpu() / free_percpu() | Dynamic per-CPU allocation; returns __percpu-annotated pointer; use per_cpu_ptr() to dereference |
lib/percpu-refcount.c | percpu_ref — per-CPU reference count | Scalable refcount: writers use per-CPU slots; percpu_ref_kill() switches to atomic for teardown |
include/linux/kfifo.h | kfifo — kernel lock-free SPSC ring buffer | Single-producer single-consumer ring; uses only memory barriers (no lock) for synchronization |
kernel/locking/lockdep.c | lock_acquire / lock_release / lockdep_init_map | lockdep core: records lock class graph, detects cycles. CONFIG_LOCKDEP enables at boot |
kernel/kcsan/core.c | kcsan_setup_watchpoint / kcsan_check_access | KCSAN runtime: instruments each access; CONFIG_KCSAN; data_race() macro suppresses benign races
tools/perf/builtin-c2c.c | perf c2c — cache-to-cache false sharing profiler | Reports HITM counts per cache line; identifies false sharing between CPUs |
16. Interview Prep
False sharing occurs when two CPUs write to different variables that share the same 64-byte cache line. Every write causes an MESI invalidation, forcing the other CPU to refetch the line. Detection: perf c2c shows cache-to-cache transfers. Fix: pad each hot variable to a full cache line, use __cacheline_aligned_in_smp, or use per-CPU data (DEFINE_PER_CPU).
Four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, multiple copies), Invalid. A write-back occurs when a Modified line transitions to Shared (another CPU requests a read — the modified CPU writes back and both go to Shared) or to Invalid (another CPU requests a write — the modified CPU writes back then gives up the line).
smp_mb() is a full barrier — no load or store on either side may cross it. smp_wmb() orders only stores. Use smp_wmb() when a producer writes data then a flag, paired with smp_rmb() on the consumer. Use smp_mb() when you need to prevent both load and store reordering — e.g., in lock implementations that must flush the store buffer and drain the invalidation queue.
x86 has TSO (Total Store Order): stores are observed in program order by other CPUs — no store-store reordering is visible. smp_wmb() still contains a compiler barrier (asm volatile), preventing the compiler from reordering. The CPU guarantee handles the hardware side. On ARM/POWER these become real fence instructions (dmb ishst / dmb ishld).
READ_ONCE() prevents the compiler from caching a variable in a register, tearing a read, or speculating it — it forces a single volatile-width load. smp_rmb() is a CPU-level read barrier that prevents the CPU's out-of-order engine from completing loads below the barrier before loads above it. READ_ONCE is about the compiler; smp_rmb is about the hardware. Both are usually needed together in a publish-subscribe pattern.
Use acquire/release (one-directional barriers) for most publish-subscribe synchronization — they are cheaper (no full fence on x86, just ldar/stlr on ARM). Use seq_cst only when you need a total global order across multiple atomic variables — e.g., Dekker-style mutual exclusion where both sides write and both sides read. seq_cst is significantly more expensive on ARM and POWER.
cmpxchg(&v, old, new) returns the value that was in v at the time of the operation. If the returned value equals old, the swap succeeded. If not, another CPU changed v before us — we reload, recompute, and retry. The pattern is: old = READ_ONCE(v); new = f(old); while (cmpxchg(&v, old, new) != old) { old = READ_ONCE(v); new = f(old); }. This is the basis for all lock-free data structures.
refcount_t adds saturation semantics: an increment on a zero-count object fires a warning (kref_get on a dead object is a use-after-free bug), and an overflow saturates the counter (with a warning) rather than wrapping around, preventing an attacker from wrapping a counter to manufacture a fake reference. With CONFIG_REFCOUNT_FULL, every operation is checked. atomic_t has none of these protections.
Use per-CPU counters when the write rate is high and readers can tolerate approximate values (filesystem usage, network statistics). Each CPU increments its local slot — zero cross-CPU coherence traffic — so throughput scales linearly with core count. Trade-offs: reads must sum all per-CPU slots (O(NR_CPUS) cost) and the total may be off by up to NR_CPUS × batch. Not suitable for reference counts or anything requiring exact, immediately consistent values.
Lock ordering: every code path that needs locks A and B must always acquire them in the same global order (e.g., by address, by inode number, or by documented hierarchy). The lockdep subsystem tracks the directed graph of 'lock A was held when lock B was acquired' at runtime and fires a BUG if a cycle is detected. Alternative: trylock with release-on-fail — attempt to take the second lock; on failure release the first and retry with a backoff.
Priority inversion: a high-priority task blocks on a mutex held by a low-priority task; a medium-priority task then preempts the low-priority holder, indirectly blocking the high task for an unbounded time. Solution: priority inheritance (rt_mutex). While a task holds an rt_mutex, its scheduling priority is temporarily boosted to the maximum priority of any waiter. This prevents medium-priority tasks from cutting in. Configured with PTHREAD_PRIO_INHERIT in userspace; built into rt_mutex in the kernel.
Fast path: a single CAS on the 32-bit lock word acquires the lock when uncontended. Medium path: if the lock is held and no one is waiting, the second waiter sets a pending bit and spins reading the owner bit — no node allocation, no queue. Slow path: the third and subsequent waiters join an MCS queue, each spinning on their own node's locked field. MCS is needed because without it, all waiters would spin on the same cache line, generating O(N) invalidations on every unlock — a bus storm that degrades with core count.
If you call spin_lock() in process context and an IRQ fires on the same CPU while you hold the lock, the IRQ handler will try to acquire the same lock — deadlock, because the interrupted code cannot release it while the handler is running. spin_lock_irqsave() disables local IRQs before acquiring, preventing the IRQ from firing until unlock. The irqsave variant saves the previous IRQ flags so nested calls restore the correct state.
A spinlock disables preemption on the current CPU. If the holder calls schedule() (directly or via a blocking call like kmalloc(GFP_KERNEL), copy_from_user(), mutex_lock()), the scheduler will try to context-switch away — but it cannot safely do so, because preemption is disabled. With CONFIG_DEBUG_ATOMIC_SLEEP this triggers a "BUG: scheduling while atomic" / "sleeping function called from invalid context" report; without it, the result is a hard-to-diagnose lockup. Additionally, sleeping on a spinlock-held CPU blocks any other waiter for that spinlock on that CPU forever.
It enters the optimistic spinning phase. The waiter spins in a tight loop (cpu_relax()), re-checking the lock owner. If the owner releases the lock during this spin, the waiter acquires without ever going to sleep — saving the ~5–10 µs cost of a context switch. The waiter gives up spinning if: (1) the owner leaves the CPU (blocks or gets preempted), (2) a higher-priority task becomes runnable, or (3) an internal spin limit is hit. At that point it adds itself to the mutex wait_list and calls schedule().
A seqlock uses a sequence counter (even = stable, odd = write in progress). Writers increment the counter at the start and end of the update; readers snapshot it before and after — if the snapshots differ or the first was odd, they retry. Use seqlock when: (1) reads are far more frequent than writes (e.g., jiffies, clock_gettime), (2) the protected data is small plain numeric values, and (3) retrying is cheap. Prefer it over rwlock for such read-mostly data: rwlock requires a bus-locked atomic on every read acquisition, while seqlock readers pay almost nothing. But seqlock cannot protect pointers — a reader may dereference a freed pointer between seqbegin and seqretry.
On non-preemptible RCU configurations, rcu_read_lock() expands to preempt_disable() — it increments the per-CPU preempt_count field. No lock is acquired, no atomic operation is performed, no cache line is transferred. This is why the read side is essentially free. On preemptible kernels (CONFIG_PREEMPT_RCU, which PREEMPT_RT uses), readers are instead tracked with a lightweight per-task nesting counter and may even be preempted inside the critical section. Either way the effect is the same: the CPU (or task) cannot be counted as quiescent while a reader is active, and the quiescent states RCU watches for — context switches, userspace execution, idle — cannot be (mis)reported under an active reader.
A grace period is the time after a writer calls synchronize_rcu() until every CPU has passed through at least one quiescent state. A quiescent state is a point at which a CPU is guaranteed to not be inside any RCU read-side critical section. Three examples: (1) a context switch — schedule() runs, so preempt_count must be 0; (2) entry to userspace — the kernel is not running any RCU-protected code; (3) the idle loop — the CPU is doing nothing. Once every CPU has passed a quiescent state, any reader that was running before the writer's rcu_assign_pointer() must have finished, making it safe to free the old object.
synchronize_rcu() blocks the caller until the grace period ends — it may sleep for milliseconds. Use it when you can afford to block (process context, no locks held). call_rcu(head, fn) registers a callback fn to be called after the grace period and returns immediately. Use it when you cannot block: inside an IRQ handler, a softirq, a BH, or while holding a spinlock. call_rcu is also more efficient when freeing many objects: callbacks from multiple writers are batched into a single grace period. kfree_rcu(ptr, field) is a convenience wrapper that calls call_rcu then kfree, avoiding boilerplate callback functions.
No. rcu_dereference() expands to READ_ONCE(p) plus a dependency ordering barrier. READ_ONCE prevents the compiler from re-reading the pointer from memory twice (TOCTOU) or from caching it in a register across the read-side critical section. The dependency barrier ensures that on architectures with weak memory ordering (Alpha, theoretically), the CPU does not prefetch data through the pointer before the pointer load is observed. On x86 (TSO) the hardware part is a no-op, but the compiler barrier is always needed. Without rcu_dereference(), a compiler optimization or a weak-arch CPU can produce use-after-free bugs that are impossible to reproduce on x86.
ABA: a CAS loop reads pointer A, gets preempted. Another thread pops A, frees it, allocates a new object that gets the same address A, then pushes it. When the original thread resumes, CAS(&top, A, next) succeeds — it sees the same address — but 'next' is now a dangling pointer. The kernel avoids ABA in lock-free lists primarily through RCU: kfree_rcu and call_rcu defer the free until all concurrent readers have finished, so the old address cannot be reallocated and reused while any CAS loop is still running. DPDK's rte_ring avoids ABA entirely by using integer indices instead of raw pointers — indices can wrap but the ring size prevents a false match.
CAS fails when another thread changed 'top' between our read and our compare. We retry with the new top value — our thread makes progress because at least one thread (the one that beat our CAS) successfully pushed. This is lock-free: the system as a whole always makes progress, no thread can be blocked indefinitely by the scheduler. It is not wait-free: a single thread can theoretically retry infinitely if it keeps losing races. Wait-free algorithms guarantee each thread completes in a bounded number of steps — they require more complex constructions (fetch-and-increment, combining trees).
Use per-CPU when: write rate is high and the value is per-thread statistics (counters, allocator magazines, run-queue state). Correctness rules: (1) always use get_cpu_var/put_cpu_var or preempt_disable/enable around accesses — if the task is migrated between the read and the write, you corrupt two CPUs' data. (2) Never use this_cpu_inc in preemptible context. (3) Reading another CPU's slot (per_cpu(var, cpu)) requires that CPU to not be updating it simultaneously — use spin_lock or memory barriers. (4) Totals computed by summing all slots may be transiently negative or larger than the true value — never use per-CPU for reference counts.
lockdep tracks 'lock A was held when lock B was acquired' as directed edges in a graph. If it detects a cycle (A→B and B→A from different code paths), it fires WARN_ON before the deadlock occurs. False positive example: two different inode mutexes with the same lockdep class (same static lock_class_key). Code path 1 locks inode-A then inode-B; code path 2 locks inode-B then inode-A. lockdep sees A→B and B→A and reports a cycle — but in practice both A and B are always acquired in address order (lock_two_nondirectories), so no real deadlock. Fix: use lockdep_set_subclass() or nest_lock() to annotate the ordering guarantee.
KCSAN instruments every load/store with __tsan_read/__tsan_write calls. Each CPU registers a watchpoint (address, type) for a short window. If another CPU concurrently accesses the same address and at least one access is a write, KCSAN reports a data race with stack traces from both sides. A benign race is a concurrent read of a value that is not protected by a lock but where the torn or stale read is intentional and harmless — e.g., reading a statistics counter without holding the update lock. Suppress with data_race(x) macro, which tells KCSAN 'this concurrent access is intentional.' For single-load/single-store races that are intentionally unsynchronized, use READ_ONCE/WRITE_ONCE instead — they silence KCSAN and also prevent compiler optimizations.
Run: perf c2c record -- ./workload; perf c2c report. HITM = Hit In Modified — a load from CPU A that hits a cache line in the Modified state on CPU B, forcing a cache-to-cache transfer. This is the signature of false sharing or true sharing under write-write contention. The report shows per-cache-line HITM counts and the source code lines responsible for the reader and writer accesses. Fix: pad the struct so the hot writer and hot reader fields land on separate 64-byte cache lines, using __cacheline_aligned_in_smp or ____cacheline_aligned between fields.