Part VI — Parallel Programming & Synchronization
§6.1 – 6.10 · Hardware Foundations · MESI · Barriers · Atomics · Counting · Locking · Spinlock · Mutex · Seqlock · RCU
Based on Paul McKenney's Is Parallel Programming Hard (PPH) Ch.2–3, 15 and Linux kernel locking internals (LKD Ch.9–10).
Overview
Correct concurrent programming starts at the hardware level. A CPU does not execute instructions in the order you wrote them — it reorders stores and loads for performance. Multiple CPUs sharing memory introduce cache coherence traffic that can make a naive counter slower than a single-threaded one. Every kernel synchronization primitive — from a spinlock to RCU — exists to manage these hardware-level effects. This part builds the foundation: hardware behavior (§6.1), the MESI protocol that keeps caches consistent (§6.2), and the memory barriers that enforce ordering across CPUs (§6.3).
1. §6.1 — Hardware Foundations for Parallelism
CPU Pipeline: Store Buffer & Invalidation Queue
A modern CPU core does not write to the cache immediately. When a store instruction executes, the value goes into the store buffer — a small, private FIFO inside the core. The cache line is only updated when the store buffer drains, which can be many nanoseconds later. Meanwhile, other CPUs cannot see the write yet.
Similarly, when a cache line is invalidated by another CPU's write, the invalidation message is first dropped into the invalidation queue rather than applied immediately. This batching improves throughput, but it means a CPU can read a stale value from its L1 cache even after the invalidation message has arrived.
Memory Hierarchy & Latency
Cache misses are the dominant cost in concurrent code. A lock acquisition that transfers a cache line from a remote CPU costs ~100–300 ns — as slow as a DRAM access. Designing concurrent code means designing to minimize cache-line transfers.
| Level | Size | Latency | Scope | Kernel API |
|---|---|---|---|---|
| L1 | 32–64 KB | ~1 ns / 4 cycles | Per-core, private | prefetch(), cache-aligned structs |
| L2 | 256 KB – 1 MB | ~3–5 ns / 12 cy | Per-core, private | __builtin_prefetch() |
| L3 / LLC | 8–64 MB | ~10–30 ns / 40 cy | All cores share | NUMA-aware allocation |
| DRAM | GBs | ~60–100 ns | Global | get_free_pages(), vmalloc() |
| Remote NUMA | GBs | ~150–300 ns | Other NUMA node | alloc_pages_node() |
False Sharing — The Hidden Performance Killer
Consider two statistics fields, counter_A (owned by CPU0) and counter_B (owned by CPU1), that happen to sit on the same 64-byte cache line. The CPUs never touch each other's counter, yet a benchmark shows them running 10× slower than expected. The reason: when CPU0 writes counter_A, the MESI protocol marks CPU1's copy of that line Invalid — forcing CPU1 to re-fetch it before it can write counter_B. The line ping-pongs back and forth even though no data is actually shared.
| Detection / Prevention | Tool / Technique |
|---|---|
| Detect false sharing | perf c2c report — shows cache-to-cache transfer hotspots |
| Pad struct to cache line | struct { int val; char pad[60]; } — ensures one field per line |
| Kernel annotation | __cacheline_aligned_in_smp — aligns struct start to 64 B |
| Per-CPU variables | DEFINE_PER_CPU — each CPU has own copy, zero coherence traffic |
| DPDK equivalent | RTE_CACHE_LINE_SIZE (64), rte_cache_aligned attribute |
2. §6.2 — Cache Coherence: MESI Protocol
Every cache line in every L1/L2 cache is tagged with one of four MESI states. The protocol guarantees a globally consistent view of memory across all cores — but at the cost of bus traffic on every state transition.
| State | Meaning | In Memory? | Other Copies? |
|---|---|---|---|
| Modified (M) | Dirty — this CPU wrote to it, memory is stale | No (stale) | No |
| Exclusive (E) | Clean — only this CPU has it, matches memory | Yes | No |
| Shared (S) | Clean — multiple CPUs have read-only copies | Yes | Yes |
| Invalid (I) | Not in this cache; must fetch before use | — | Maybe |
Key Transitions to Memorize
| Trigger | Transition | Bus Traffic? |
|---|---|---|
| Local write to Exclusive line | E → M | None (silent) |
| Local write to Shared line | S → M | Send invalidation to all sharers |
| Remote CPU reads Modified line | M → S | Writeback to memory, share |
| Remote CPU writes any cached line | M/E/S → I | Receive invalidation |
| Read miss, no other copy | I → E | Fetch from memory |
| Read miss, other copies exist | I → S | Fetch + notify sharers |
Cache-Line Bouncing: atomic_inc() Under Contention
Consider every CPU calling atomic_inc() on the same global counter — a common pattern in reference counting and statistics collection. Each increment requires exclusive ownership of the cache line, so the result is a round trip across the coherence bus for every single increment. In practice, an atomic_inc() loop on a single shared counter saturates at ~50 million ops/sec across all cores — the same throughput as 1–2 cores — while a per-CPU counter scales linearly to 800+ million ops/sec because there is no coherence traffic.
MOESI Extension
AMD CPUs add an Owned (O) state — a line that is dirty but shared. The owner responds to snoops directly (CPU-to-CPU transfer) without writing back to DRAM first. This reduces memory bus traffic in read-heavy workloads. Intel uses a similar Forward state in some implementations.
3. §6.3 — Memory Barriers & Ordering
A producer writes data = 99 and then ready = 1. A consumer polls ready and then reads data. On x86 this usually works — but on ARM/POWER the CPU can reorder the two stores, and the consumer can see ready = 1 while data is still 0. Even on x86, the compiler can reorder them.
Two Levels of Reordering
| Level | Cause | Fix |
|---|---|---|
| Compiler reordering | Compiler assumes single-thread view; may hoist/sink loads and stores | barrier() macro — empty asm volatile block forces compiler fence |
| CPU reordering | Out-of-order execution, store buffer, invalidation queue | Hardware memory fence instruction (mfence / dmb / sync) |
| Speculative loads | CPU prefetches data before branch resolves | READ_ONCE() prevents compiler from caching in register |
| Torn writes | Compiler splits 64-bit write into two 32-bit stores | WRITE_ONCE() forces single atomic-width store |
Linux Kernel Barrier API
| API | Direction | x86 instruction | ARM instruction | C11 equivalent |
|---|---|---|---|---|
smp_mb() | Full (loads + stores) | mfence | dmb ish | memory_order_seq_cst |
smp_wmb() | Write (stores only) | sfence (or nop on TSO) | dmb ishst | memory_order_release (store) |
smp_rmb() | Read (loads only) | lfence (or nop on TSO) | dmb ishld | memory_order_acquire (load) |
smp_store_release() | Write + release | MOV (TSO) | stlr | atomic_store(release) |
smp_load_acquire() | Read + acquire | MOV (TSO) | ldar | atomic_load(acquire) |
READ_ONCE(x) | Compiler barrier + single load | — | — | volatile load |
WRITE_ONCE(x,v) | Compiler barrier + single store | — | — | volatile store |
x86 TSO (Total Store Order)
x86 guarantees that stores become visible to other CPUs in program order (no store-store reordering is observable). It does allow one reordering: a later load may complete before an earlier store is globally visible, because the store can still be sitting in the store buffer (and a CPU sees its own stores early via store forwarding). In practice this means smp_wmb() and smp_rmb() compile to nothing but a compiler barrier on x86. Only smp_mb() compiles to mfence.
On ARM and POWER the memory model is much weaker — any reordering is possible. Correct code must use explicit barriers on every platform, not just x86.
Pairing Rules — the Golden Rule
An smp_wmb() on the writer must pair with an smp_rmb() on the reader. An smp_store_release() must pair with an smp_load_acquire(). Unpaired barriers provide no ordering guarantee — you get the fence overhead with none of the safety.
Minimal C Demo — Memory Barrier Simulation
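A minimal user-space sketch of the flag-publication pattern from §6.3, using C11 fences in place of the kernel's smp_wmb()/smp_rmb(). File name, values, and thread structure are illustrative.

```c
/* barrier_demo.c — message passing with explicit fences (user-space sketch).
 * Build: gcc -O2 -pthread barrier_demo.c -o barrier_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;           /* payload, plain variable                 */
static atomic_int ready;   /* flag published after the payload        */

static void *producer(void *arg)
{
    data = 99;                                           /* 1. write payload */
    atomic_thread_fence(memory_order_release);           /* 2. ~smp_wmb()    */
    atomic_store_explicit(&ready, 1, memory_order_relaxed); /* 3. publish    */
    return NULL;
}

static void *consumer(void *arg)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                                /* 1. poll flag     */
    atomic_thread_fence(memory_order_acquire);           /* 2. ~smp_rmb()    */
    printf("data = %d\n", data);                         /* 3. guaranteed 99 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```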
Minimal C Demo — READ_ONCE / WRITE_ONCE
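A sketch of user-space stand-ins for the kernel's READ_ONCE/WRITE_ONCE (volatile-cast accessors). This only demonstrates the compiler-barrier aspect — forcing a single, un-cached, un-torn access — not CPU ordering; macro definitions mirror the kernel's but are written here for illustration.

```c
/* read_write_once.c — user-space stand-ins for READ_ONCE / WRITE_ONCE.
 * Build: gcc -O2 -pthread read_write_once.c -o read_write_once
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Volatile casts force exactly one full-width load/store and forbid the
 * compiler from caching the value in a register or tearing the access. */
#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int stop_flag;

static void *worker(void *arg)
{
    long spins = 0;
    /* Without READ_ONCE, -O2 may hoist the load out of the loop and spin
     * forever on a stale register copy of stop_flag. */
    while (!READ_ONCE(stop_flag))
        spins++;
    printf("worker saw stop after %ld spins\n", spins);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);
    WRITE_ONCE(stop_flag, 1);   /* single, un-torn store the worker will see */
    pthread_join(t, NULL);
    return 0;
}
```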
Minimal C Demo — C11 memory_order: release/acquire
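A sketch of release/acquire publication in standard C11 — the portable analogue of smp_store_release()/smp_load_acquire(). Struct and field names are illustrative.

```c
/* acq_rel.c — publish an initialized object with release/acquire ordering.
 * Build: gcc -O2 -pthread acq_rel.c -o acq_rel
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct msg { int a, b; };

static struct msg payload;
static _Atomic(struct msg *) published;   /* starts NULL */

static void *writer(void *arg)
{
    payload.a = 1;
    payload.b = 2;
    /* release: all stores above become visible before the pointer is seen */
    atomic_store_explicit(&published, &payload, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    struct msg *m;
    /* acquire: once the pointer is seen, the initialized fields are too */
    while (!(m = atomic_load_explicit(&published, memory_order_acquire)))
        ;
    printf("a=%d b=%d\n", m->a, m->b);    /* always prints a=1 b=2 */
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&r, NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```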
4. §6.4 — Atomic Operations
Memory barriers tell the CPU and compiler about ordering. But they do not make an increment atomic — two CPUs can still read the same value, both add 1, and write back the same result (lost update). Atomic operations solve this by combining the read-modify-write into a single indivisible bus transaction. On x86 this uses the LOCK prefix; on ARM it uses ldxr/stxr (load-linked / store-conditional).
atomic_t, atomic64_t, refcount_t
| Operation | API | Notes |
|---|---|---|
| Read | atomic_read(&v) | READ_ONCE wrapper — prevents compiler from caching |
| Write | atomic_set(&v, i) | WRITE_ONCE wrapper |
| Increment | atomic_inc(&v) | Returns void; for RMW with result use atomic_inc_return() |
| Decrement & test | atomic_dec_and_test(&v) | Returns true if result is zero — used in kref release check |
| Add & return | atomic_add_return(i, &v) | Returns new value after add; implies full barrier |
| Compare & exchange | atomic_cmpxchg(&v, old, new) | Returns old value; swap only if *v == old |
| Fetch & add | atomic_fetch_add(i, &v) | Returns value before add (C11: fetch_add) |
| OR / AND / XOR | atomic_or(mask, &v) | Bitwise ops; useful for flag bitmaps |
Core Mechanism: Compare-and-Swap (CAS) Loop
You want to apply f(old) → new to a shared variable without taking a lock. CAS lets you read the old value, compute the new one, then attempt to commit — succeeding only if no other CPU changed the variable in between.
- Load the current value into old
- Compute new = f(old)
- Execute cmpxchg(&v, old, new): atomically swap only if v == old
- If cmpxchg returns old → success; otherwise another CPU raced us — retry from step 1
Minimal C Demo — CAS Lock-Free Counter
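A user-space sketch of the four-step CAS loop above, using C11 compare_exchange in place of the kernel's cmpxchg(). Thread and iteration counts are illustrative.

```c
/* cas_counter.c — lock-free increment via a compare-and-swap retry loop.
 * Build: gcc -O2 -pthread cas_counter.c -o cas_counter
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

static atomic_long counter;

static void cas_add(atomic_long *v, long delta)
{
    long old = atomic_load_explicit(v, memory_order_relaxed);
    long new;
    do {
        new = old + delta;                     /* new = f(old)              */
        /* On failure, compare_exchange_weak reloads 'old' for us and we
         * retry — exactly the cmpxchg loop described in the text.          */
    } while (!atomic_compare_exchange_weak(v, &old, new));
}

static void *worker(void *arg)
{
    for (int i = 0; i < NITER; i++)
        cas_add(&counter, 1);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n",
           atomic_load(&counter), NTHREADS * NITER);
    return 0;
}
```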
kref / refcount_t — Reference Counting
kref wraps a single atomic_t with a lifecycle discipline: every caller that uses an object takes a reference (kref_get) and releases it when done (kref_put). When the count reaches zero, kref_put calls the release function to free the object. refcount_t (since kernel 4.11) adds saturation — on overflow or underflow the counter sticks at a saturation value and warns instead of wrapping around, preventing use-after-free bugs from counter wraparound.
Minimal C Demo — kref / refcount_t Lifecycle
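A kernel-module-style sketch of the kref lifecycle described above. The module name, struct my_obj, and its fields are illustrative, not from the kernel tree.

```c
/* kref_demo.c — take, share and drop references on a kref-counted object. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/kref.h>

struct my_obj {
    struct kref ref;
    int payload;
};

static void my_obj_release(struct kref *kref)
{
    struct my_obj *obj = container_of(kref, struct my_obj, ref);
    pr_info("kref_demo: last reference dropped, freeing\n");
    kfree(obj);
}

static int __init kref_demo_init(void)
{
    struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);
    if (!obj)
        return -ENOMEM;

    kref_init(&obj->ref);                 /* count = 1: creator's reference  */
    kref_get(&obj->ref);                  /* count = 2: a second user        */

    kref_put(&obj->ref, my_obj_release);  /* count = 1                       */
    kref_put(&obj->ref, my_obj_release);  /* count = 0 -> release() runs     */
    return 0;
}

static void __exit kref_demo_exit(void) { }

module_init(kref_demo_init);
module_exit(kref_demo_exit);
MODULE_LICENSE("GPL");
```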
5. §6.5 — Counting at Scale
A single atomic_inc() call on a shared counter causes a cache-line round trip on every call (§6.2). At scale — 16 CPUs each doing millions of increments per second — that single cache line becomes the bottleneck for the entire machine. The kernel solves this with per-CPU counters.
percpu_counter vs local_t vs atomic_t
| Type | Write cost | Read cost | Exact? | Use case |
|---|---|---|---|---|
atomic_t | ~100 ns (bus lock) | ~1 ns | Yes | Reference counts, small shared flags |
percpu_counter | ~1 ns (local inc) | O(NR_CPUS) | Approximate (batch threshold) | FS block/inode usage, network stats |
local_t | ~1 ns (no atomic) | Per-CPU only | Per-CPU exact, global inexact | Perf event local counts |
DEFINE_PER_CPU(long) | ~1 ns | O(NR_CPUS) sum | Exact if summed carefully | Kernel stats, DPDK per-core counters |
percpu_counter_add() accumulates increments locally until the per-CPU delta hits ±percpu_counter_batch (default 32 on small systems, scaling with NR_CPUS); only then does it propagate to the global sum. A reader using the fast percpu_counter_read() may therefore see a value off by up to NR_CPUS × batch (the exact but expensive alternative, percpu_counter_sum(), adds up every CPU's slot). For filesystem block usage the approximation is acceptable; for reference counts it is not.
Minimal C Demo — Per-CPU Counter
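A kernel-module-style sketch of the percpu_counter API described above; the counter name and iteration count are illustrative.

```c
/* pcpu_counter_demo.c — batched per-CPU counter: cheap adds, two read paths. */
#include <linux/module.h>
#include <linux/percpu_counter.h>

static struct percpu_counter events;

static int __init pcpu_demo_init(void)
{
    int i, ret;

    ret = percpu_counter_init(&events, 0, GFP_KERNEL);
    if (ret)
        return ret;

    for (i = 0; i < 1000; i++)
        percpu_counter_add(&events, 1);   /* usually a local, lock-free add */

    /* fast approximate read of the global count vs. exact (expensive) sum */
    pr_info("pcpu_demo: approx=%lld exact=%lld\n",
            percpu_counter_read(&events),
            percpu_counter_sum(&events));
    return 0;
}

static void __exit pcpu_demo_exit(void)
{
    percpu_counter_destroy(&events);
}

module_init(pcpu_demo_init);
module_exit(pcpu_demo_exit);
MODULE_LICENSE("GPL");
```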
6. §6.6 — Locking Patterns
Choosing the wrong synchronization primitive is one of the most common performance mistakes in kernel and systems code. The decision tree below encodes the key questions — each primitive has a specific niche, and using a heavier one than necessary wastes CPU cycles on every call.
| Primitive | Context | Sleep inside? | Overhead | Kernel example |
|---|---|---|---|---|
spinlock_t | Any (including IRQ) | Never | ~5–20 ns (uncontended) | sk_buff queues, run-queue lock |
mutex | Process context only | Yes (blocking) | ~100 ns + ctx switch | inode i_mutex, mm mmap_lock |
rwlock_t | Any (no sleep) | Never | ~10–30 ns | tasklist_lock, file table |
seqlock | Any, read-heavy numerics | Never | Read: near-zero if no writer | jiffies, gettimeofday |
RCU | Any (read side) | Never (read side) | Read: near-zero; update: ms-range grace period | network routing table, task list |
atomic_t | Any | Never | ~5 ns (bus lock) | file descriptor refcount |
per-CPU | Any (preempt disabled) | Never | ~1 ns (local) | FS stats, net device stats |
Deadlock Prevention: Lock Ordering
Thread A holds L1 and waits for L2. Thread B holds L2 and waits for L1. Neither can proceed — circular wait. The fix: every code path that needs both L1 and L2 must acquire L1 first. The kernel documents this in comments (e.g. "always take i_lock before page lock"). The tool lockdep tracks acquisition order at runtime and fires a warning on any cycle.
Minimal C Demo — Lock Ordering
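A user-space sketch of enforcing a global order by lock address, so the AB-BA cycle can never form. The account structure and transfer function are illustrative.

```c
/* lock_order.c — deadlock avoidance by acquiring locks in address order.
 * Build: gcc -O2 -pthread lock_order.c -o lock_order
 */
#include <pthread.h>
#include <stdio.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Always lock the lower-addressed account first; every caller follows the
 * same rule, so no two paths can take the pair in opposite orders. */
static void transfer(struct account *from, struct account *to, long amount)
{
    struct account *first  = (from < to) ? from : to;
    struct account *second = (from < to) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    from->balance -= amount;
    to->balance   += amount;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

int main(void)
{
    struct account a = { PTHREAD_MUTEX_INITIALIZER, 100 };
    struct account b = { PTHREAD_MUTEX_INITIALIZER, 100 };

    transfer(&a, &b, 30);   /* locks taken in one global order...           */
    transfer(&b, &a, 10);   /* ...and the same order again despite the swap */
    printf("a=%ld b=%ld\n", a.balance, b.balance);
    return 0;
}
```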
trylock + Release-on-Fail Pattern
When a code path must acquire two locks but cannot enforce a global order (e.g., the locks belong to different subsystems with incompatible hierarchies), use trylock: attempt to take the second lock; on failure, release the first and restart. Add a short backoff (cpu_relax() or cond_resched()) to avoid livelock.
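A user-space sketch of the trylock-and-back-off pattern just described, with pthread mutexes standing in for the two subsystem locks.

```c
/* trylock_backoff.c — take A, try B; on failure drop A, back off, retry.
 * Build: gcc -O2 -pthread trylock_backoff.c -o trylock_backoff
 */
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void lock_both(void)
{
    for (;;) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0)
            return;                       /* got both — done                 */
        /* B is busy: release A so its holder can make progress, back off,
         * then restart to avoid livelock.                                   */
        pthread_mutex_unlock(&lock_a);
        sched_yield();                    /* stand-in for cpu_relax()        */
    }
}

static void unlock_both(void)
{
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

int main(void)
{
    lock_both();
    unlock_both();
    return 0;
}
```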
Priority Inversion & PI Mutexes
A high-priority task blocks on a mutex held by a low-priority task. A medium-priority task preempts the low-priority holder, delaying the high-priority task indefinitely. The kernel's rt_mutex uses priority inheritance: while a task holds an rt_mutex, it temporarily runs at the highest priority of any waiter. This prevents medium priority tasks from causing unbounded delays. POSIX threads support this via PTHREAD_PRIO_INHERIT.
7. §6.7 — Spinlock Internals & Variants
A spinlock is the lowest-level mutual exclusion primitive in the kernel — it busy-waits instead of sleeping, making it safe in interrupt context where sleeping is forbidden. Three generations exist in the kernel, each fixing the previous one's scalability problem.
Generation 1: Test-and-Set — Simple but Unfair & Slow
Every waiting CPU repeatedly executes an atomic write (test-and-set) to the shared lock variable. Even when the lock is held and the attempt will fail, it still writes — generating a MESI invalidation that forces every other waiter to refetch the cache line before its next spin. With N waiters, each spin triggers N−1 invalidation round trips.
Generation 2: Ticket Spinlock — Fair but Still Bounces
A ticket spinlock assigns each caller a number (like a deli counter). The lock holds next (next ticket to issue) and owner (currently serving). Waiters spin on read-only copies of owner — no write per spin. Fair: CPUs are served FIFO. But unlock() increments owner, invalidating all spinners at once.
Generation 3: MCS Lock — Fully Local Spinning
Each waiting CPU allocates its own mcs_spinlock node and spins on its own locked field. On unlock, the holder writes only the direct successor's node — exactly one cache-line transfer.
qspinlock — Linux Default (kernel ≥ 4.2)
qspinlock combines all three ideas into a single 32-bit word. Uncontended acquires cost one CAS. A second waiter sets the pending bit and spins on the owner bit (no node allocation). The third and beyond join an MCS queue. This makes the common case (no contention) extremely cheap while scaling gracefully under high contention.
Spinlock Variants — When to Use Which
| Variant | Also disables | Restores? | Use when |
|---|---|---|---|
spin_lock(l) | Nothing extra | — | You know IRQs and BH are already off |
spin_lock_irqsave(l, flags) | Local IRQs | Yes (flags) | Lock is also used in IRQ handlers on this CPU |
spin_lock_bh(l) | Softirqs / tasklets | Yes | Lock is also used in softirq / BH context |
spin_trylock(l) | Nothing | — | Non-blocking try — return 0 on failure, no deadlock risk |
Rule of thumb: any lock taken in both process context and IRQ context must use spin_lock_irqsave(). Without it, the IRQ can fire after the process-context code takes the lock and deadlock on the same lock.
Minimal C Demo — Test-and-Set Spinlock
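A user-space sketch of the generation-1 test-and-set lock: every spin is an atomic write, so waiters keep invalidating each other's copy of the lock's cache line. Thread and iteration counts are illustrative.

```c
/* tas_spinlock.c — naive test-and-set spinlock.
 * Build: gcc -O2 -pthread tas_spinlock.c -o tas_spinlock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void tas_lock(void)
{
    /* test_and_set returns the previous value: loop until it was clear */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin — each iteration writes the shared cache line */
}

static void tas_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        tas_lock();
        counter++;                 /* critical section */
        tas_unlock();
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 */
    return 0;
}
```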
Minimal C Demo — MCS Lock Acquire / Release
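A user-space sketch of an MCS queue lock: each waiter spins on its own node's locked flag, and unlock touches only the successor's node. This is a textbook MCS implementation, not the kernel's mcs_spinlock code; seq_cst atomics are used for simplicity.

```c
/* mcs_lock.c — MCS queue lock with local spinning.
 * Build: gcc -O2 -pthread mcs_lock.c -o mcs_lock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

static _Atomic(struct mcs_node *) tail;   /* NULL = lock is free */
static long counter;

static void mcs_lock(struct mcs_node *me)
{
    struct mcs_node *prev;

    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    prev = atomic_exchange(&tail, me);        /* join the queue             */
    if (prev) {
        atomic_store(&prev->next, me);        /* link behind predecessor    */
        while (atomic_load(&me->locked))      /* spin on OWN cache line     */
            ;
    }
}

static void mcs_unlock(struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);

    if (!succ) {
        struct mcs_node *expected = me;
        /* no successor visible: try to reset the lock to empty */
        if (atomic_compare_exchange_strong(&tail, &expected, NULL))
            return;
        while (!(succ = atomic_load(&me->next)))  /* successor is linking   */
            ;
    }
    atomic_store(&succ->locked, false);       /* hand off: one line touched */
}

static void *worker(void *arg)
{
    struct mcs_node node;
    for (int i = 0; i < 100000; i++) {
        mcs_lock(&node);
        counter++;
        mcs_unlock(&node);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```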
8. §6.8 — Sleeping Locks: Mutex & Semaphore
When a critical section may hold a lock for milliseconds (e.g., disk I/O, copying from user space), busy-spinning wastes an entire CPU. A mutex sleeps the waiter and wakes it on unlock. The kernel's struct mutex adds adaptive (optimistic) spinning: if the lock owner is currently running on another CPU, a short spin can avoid the costly context switch.
struct mutex Key Fields
| Field | Type | Purpose |
|---|---|---|
owner | atomic_long_t | Task pointer of current owner + flag bits (hand-off, pickup, wait) |
wait_lock | raw_spinlock_t | Protects the wait_list; held briefly inside mutex_lock/unlock |
wait_list | struct list_head | FIFO queue of mutex_waiter structs (sleeping waiters) |
osq | struct optimistic_spin_queue | MCS queue for optimistic spinners (avoids thundering herd on wakeup) |
rw_semaphore — Reader-Writer Sleeping Lock
struct rw_semaphore allows multiple concurrent readers but exclusive writers. Key policy: once a writer is waiting, new readers are blocked (writer-preferring) to prevent writer starvation. percpu_rw_semaphore goes further — readers take a per-CPU lock (fast path), writers do a stop-machine-like barrier (slow path). Used for mm->mmap_lock, task_list_lock, and kernel module loading.
Minimal C Demo — Mutex Adaptive Spinning
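A user-space sketch of the spin-then-sleep idea. The real kernel mutex spins only while the owner is running on another CPU; this sketch approximates that with a fixed spin budget (SPIN_LIMIT is illustrative, not a kernel constant).

```c
/* adaptive_mutex.c — try briefly without sleeping, then block.
 * Build: gcc -O2 -pthread adaptive_mutex.c -o adaptive_mutex
 */
#include <pthread.h>
#include <stdio.h>

#define SPIN_LIMIT 1000   /* illustrative bound */

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long shared;

static void adaptive_lock(void)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(&m) == 0)
            return;                             /* acquired while spinning   */
        __asm__ __volatile__("" ::: "memory");  /* compiler barrier, like cpu_relax() */
    }
    pthread_mutex_lock(&m);                     /* give up spinning: block   */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        adaptive_lock();
        shared++;
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("shared = %ld\n", shared);
    return 0;
}
```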
9. §6.9 — Sequence Locks
A seqlock gives readers a lock-free fast path at the cost of possible retries. Writers hold an internal spinlock and increment a sequence counter (even → odd → even). Readers snapshot the counter before and after reading — if it changed, a writer was active and they retry. Used pervasively for time-keeping: jiffies, xtime, and the VDSO clock_gettime() fast path.
seqlock vs RCU vs rwlock
| Primitive | Read cost (no writer) | Read cost (writer active) | Protects pointers? | Writer waits for readers? |
|---|---|---|---|---|
seqlock | Near-zero (no lock) | Retry (cheap) | No — never use for pointers | No — readers retry |
rwlock | ~10 ns (bus lock on read) | Waits for writer or readers | Yes | Yes (writer waits for readers) |
RCU | Near-zero (preempt disable) | No wait — reads old version | Yes — designed for this | No — readers never block |
Never use a seqlock to protect pointers: between read_seqbegin() and read_seqretry(), a writer can free the object the reader just dereferenced. The reader will detect the retry condition — but only after it has already accessed freed memory. Use RCU when the protected data contains pointers.
Minimal C Demo — seqlock Reader & Writer
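A user-space sketch of the seqlock protocol (odd = write in progress; readers retry on a changed sequence). It uses conservative seq_cst fences; the plain-field accesses are technically data races in strict C11, which the kernel version handles with READ_ONCE/WRITE_ONCE and its barrier macros. The invariant check and iteration counts are illustrative.

```c
/* seqlock_demo.c — writer bumps the sequence around an update; readers retry.
 * Build: gcc -O2 -pthread seqlock_demo.c -o seqlock_demo
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;                        /* even = stable, odd = writing */
static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;
static long tv_sec, tv_nsec;                   /* small plain-value payload    */

static void write_time(long s, long ns)
{
    pthread_mutex_lock(&write_lock);           /* serialize writers            */
    atomic_fetch_add(&seq, 1);                 /* -> odd                       */
    atomic_thread_fence(memory_order_seq_cst);
    tv_sec = s;
    tv_nsec = ns;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_fetch_add(&seq, 1);                 /* -> even                      */
    pthread_mutex_unlock(&write_lock);
}

static void read_time(long *s, long *ns)
{
    unsigned start;
    do {
        while ((start = atomic_load(&seq)) & 1)
            ;                                  /* wait out an active writer   */
        atomic_thread_fence(memory_order_seq_cst);
        *s  = tv_sec;
        *ns = tv_nsec;
        atomic_thread_fence(memory_order_seq_cst);
    } while (atomic_load(&seq) != start);      /* writer raced us: retry      */
}

static void *writer(void *arg)
{
    for (long i = 1; i <= 100000; i++)
        write_time(i, i + 1);                  /* maintain invariant ns == s+1 */
    return NULL;
}

int main(void)
{
    pthread_t w;
    long s, ns;

    pthread_create(&w, NULL, writer, NULL);
    for (int i = 0; i < 100000; i++) {
        read_time(&s, &ns);
        if (ns != s + 1) {                     /* a torn snapshot would break this */
            printf("inconsistent snapshot!\n");
            return 1;
        }
    }
    pthread_join(w, NULL);
    printf("all snapshots consistent, last = %ld\n", s);
    return 0;
}
```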
10. §6.10 — RCU: Read-Copy-Update
RCU is the most important synchronization primitive introduced in the Linux kernel (2.5.43, 2002). The core insight: allow reads without any synchronization at all, and defer the cost of synchronization to the write side. Readers pay only a preemption disable (no cache line transfer, no atomic). Writers publish a new version atomically and wait for a grace period before freeing the old one.
RCU API — at a Glance
| API | Side | What it does |
|---|---|---|
rcu_read_lock() | Reader | Opens an RCU read-side critical section; on non-preemptible RCU configurations this is just preempt_disable(). |
rcu_read_unlock() | Reader | Re-enables preemption. The CPU has now passed a quiescent state. |
rcu_dereference(p) | Reader | READ_ONCE + dependency ordering barrier. Safe pointer load from RCU-protected location. |
rcu_assign_pointer(p, v) | Writer | smp_store_release: ensures new object is fully initialized before pointer is visible. |
synchronize_rcu() | Writer | Blocks until all CPUs have passed through at least one quiescent state (grace period). |
call_rcu(head, fn) | Writer | Schedules fn(head) to run after the grace period. Non-blocking — safe in IRQ context. |
kfree_rcu(ptr, field) | Writer | Helper: calls call_rcu then kfree. Common pattern for simple object reclamation. |
Core Mechanism: Publish-Subscribe + Grace Period
- Allocate and initialize the new object
- rcu_assign_pointer() — atomically publish; all readers that start after this see the new object
- Call synchronize_rcu() or call_rcu() — wait for all readers that still hold the old pointer to finish
- Free the old object — now safe because no reader can reach it
Grace Period: How Does the Kernel Know Readers Are Done?
A quiescent state is a point at which a CPU cannot be inside an RCU read-side critical section: a context switch, entry to userspace, or the idle loop. The kernel's RCU subsystem watches for each CPU to pass through at least one quiescent state after the writer called synchronize_rcu(). Once every CPU has done so, the grace period is complete and the old object can be freed.
RCU-Protected Linked Lists
The kernel provides a full set of RCU-safe list operations in include/linux/rculist.h. list_del_rcu() unlinks the node but does not free it — a concurrent reader might still hold a pointer to it. Only after the grace period is it safe to free.
SRCU — Sleepable RCU
Standard RCU forbids sleeping in read-side critical sections. When sleeping is needed (e.g., waiting for user-space to respond), srcu_struct provides a per-domain sleepable variant. Each domain has its own grace-period tracking, so a slow reader only delays reclamation in its own domain. Used in KVM, eBPF, and file system notifications.
Practical Uses in the Linux Kernel
| Subsystem | What is RCU-protected | Write path |
|---|---|---|
| IP routing | fib_table route entries | netlink route add/del with write lock |
| Network namespace | net namespace list traversal | unshare(CLONE_NEWNET) |
| PID lookup | find_task_by_vpid() — task_struct | fork/exit with tasklist_lock write |
| Dentry cache | dcache negative entry lookup | d_add() / d_drop() with d_lock |
| Module loading | module list traversal | module_mutex + RCU publish |
Minimal C Demo — RCU Reader & Writer
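A kernel-module-style sketch of the publish/read/reclaim cycle for a single RCU-protected object. struct config, the field names, and the single-updater assumption are illustrative.

```c
/* rcu_demo.c — copy, publish, wait for a grace period, free. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/rcupdate.h>

struct config {
    int threshold;
};

static struct config __rcu *cur_config;

/* Reader: lock-free, may run concurrently with an update */
static int read_threshold(void)
{
    struct config *c;
    int val = -1;

    rcu_read_lock();
    c = rcu_dereference(cur_config);
    if (c)
        val = c->threshold;
    rcu_read_unlock();
    return val;
}

/* Writer (assumed single updater): copy-update-publish, then reclaim */
static int update_threshold(int value)
{
    struct config *newc, *oldc;

    newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    if (!newc)
        return -ENOMEM;
    newc->threshold = value;

    oldc = rcu_dereference_protected(cur_config, 1); /* we are the only updater */
    rcu_assign_pointer(cur_config, newc);   /* publish fully-initialized object */
    synchronize_rcu();                      /* wait for pre-existing readers    */
    kfree(oldc);                            /* now unreachable — safe to free   */
    return 0;
}

static int __init rcu_demo_init(void)
{
    update_threshold(10);
    pr_info("rcu_demo: threshold=%d\n", read_threshold());
    return 0;
}

static void __exit rcu_demo_exit(void)
{
    struct config *c = rcu_dereference_protected(cur_config, 1);

    rcu_assign_pointer(cur_config, NULL);
    synchronize_rcu();
    kfree(c);
}

module_init(rcu_demo_init);
module_exit(rcu_demo_exit);
MODULE_LICENSE("GPL");
```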
Minimal C Demo — RCU-Protected Linked List
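A kernel-module-style sketch of an RCU-protected list: readers traverse locklessly, the writer serializes with a spinlock, and frees are deferred with kfree_rcu(). struct item and the key values are illustrative.

```c
/* rculist_demo.c — lockless lookup, locked update, deferred free. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

struct item {
    int key;
    struct list_head node;
    struct rcu_head rcu;
};

static LIST_HEAD(items);
static DEFINE_SPINLOCK(items_lock);           /* serializes writers only */

static bool lookup(int key)
{
    struct item *it;
    bool found = false;

    rcu_read_lock();
    list_for_each_entry_rcu(it, &items, node) {
        if (it->key == key) {
            found = true;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

static int add_item(int key)
{
    struct item *it = kmalloc(sizeof(*it), GFP_KERNEL);

    if (!it)
        return -ENOMEM;
    it->key = key;

    spin_lock(&items_lock);
    list_add_rcu(&it->node, &items);          /* publish to concurrent readers */
    spin_unlock(&items_lock);
    return 0;
}

static void del_item(struct item *it)
{
    spin_lock(&items_lock);
    list_del_rcu(&it->node);                  /* unlink — readers may still see it */
    spin_unlock(&items_lock);
    kfree_rcu(it, rcu);                       /* free only after the grace period  */
}

static int __init rculist_demo_init(void)
{
    add_item(1);
    add_item(2);
    pr_info("rculist_demo: lookup(2)=%d\n", lookup(2));
    return 0;
}

static void __exit rculist_demo_exit(void)
{
    struct item *it, *tmp;

    list_for_each_entry_safe(it, tmp, &items, node)
        del_item(it);
    rcu_barrier();   /* wait for pending kfree_rcu callbacks before unload */
}

module_init(rculist_demo_init);
module_exit(rculist_demo_exit);
MODULE_LICENSE("GPL");
```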
11. §6.11 — Lock-Free Data Structures
Lock-free data structures use compare-and-swap (CAS) loops instead of locks. If a CAS fails (another CPU changed the value), the thread retries — no thread can be blocked forever by the scheduler. Lock-free structures appear throughout hot paths: the kernel's RCU-protected lists and per-CPU run-queue handling, and DPDK's rte_ring in user space.
Treiber Stack — Lock-Free Push & Pop
The simplest lock-free structure. A single pointer top is updated atomically. Both push and pop are CAS loops: read the current top, compute the new value, CAS — retry if another CPU changed top first.
The ABA Problem
CAS checks value equality, not identity. If a pointer value is reused (address A is freed then reallocated back to A), a stale CAS will succeed even though the structure changed underneath. The standard fix is to embed a version counter in the same word as the pointer (double-word CAS, or tagged pointers using low bits). The kernel's kfree_rcu sidesteps ABA entirely by deferring frees until all readers finish.
Minimal C Demo — Treiber Stack
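A user-space sketch of the Treiber stack's CAS loops. To stay out of the ABA trap discussed in the next subsection, this sketch deliberately never frees popped nodes; safe reclamation needs RCU or hazard pointers.

```c
/* treiber_stack.c — lock-free LIFO push/pop.
 * Build: gcc -O2 -pthread treiber_stack.c -o treiber_stack
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

static _Atomic(struct node *) top;

static void push(int value)
{
    struct node *n = malloc(sizeof(*n));
    n->value = value;
    n->next = atomic_load(&top);
    /* retry until nobody changed 'top' between our load and the CAS;
     * on failure, n->next is reloaded with the current top for us. */
    while (!atomic_compare_exchange_weak(&top, &n->next, n))
        ;
}

static int pop(int *value)
{
    struct node *old = atomic_load(&top);
    while (old) {
        if (atomic_compare_exchange_weak(&top, &old, old->next)) {
            *value = old->value;
            /* NOTE: not freed — safe reclamation needs RCU/hazard pointers */
            return 1;
        }
        /* 'old' was reloaded by the failed CAS — retry */
    }
    return 0;   /* stack empty */
}

int main(void)
{
    int v;
    push(1);
    push(2);
    push(3);
    while (pop(&v))
        printf("%d\n", v);   /* 3 2 1 */
    return 0;
}
```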
Minimal C Demo — ABA Problem
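A user-space sketch of the tagged-word defense against ABA: the "pointer" is a 32-bit slot index packed next to a 32-bit generation tag in one 64-bit atomic, so a slot that is popped and recycled carries a different tag and a stale CAS fails. The pool size and helper names are illustrative.

```c
/* aba_tagged.c — index + generation tag in a single 64-bit CAS word.
 * Build: gcc -O2 aba_tagged.c -o aba_tagged
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NIL 0xFFFFFFFFu

struct node { uint32_t next; int value; };

static struct node pool[8];
static atomic_uint_fast64_t top;   /* low 32 bits: index, high 32 bits: tag */

static uint64_t pack(uint32_t idx, uint32_t tag) { return ((uint64_t)tag << 32) | idx; }
static uint32_t idx_of(uint64_t v) { return (uint32_t)v; }
static uint32_t tag_of(uint64_t v) { return (uint32_t)(v >> 32); }

static void push(uint32_t idx)
{
    uint64_t old = atomic_load(&top);
    do {
        pool[idx].next = idx_of(old);
        /* bump the tag on every successful update: a recycled slot gets a
         * different tag, so a stale CAS remembering the old pair fails. */
    } while (!atomic_compare_exchange_weak(&top, &old,
                                           pack(idx, tag_of(old) + 1)));
}

static int pop(uint32_t *idx)
{
    uint64_t old = atomic_load(&top);
    while (idx_of(old) != NIL) {
        uint32_t i = idx_of(old);
        if (atomic_compare_exchange_weak(&top, &old,
                                         pack(pool[i].next, tag_of(old) + 1))) {
            *idx = i;
            return 1;
        }
    }
    return 0;   /* empty */
}

int main(void)
{
    uint32_t i;

    atomic_store(&top, pack(NIL, 0));
    push(0); push(1); push(2);
    while (pop(&i))
        printf("popped slot %u\n", i);   /* 2 1 0 */
    return 0;
}
```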
Michael-Scott Queue (MPMC Lock-Free)
A lock-free FIFO queue with a head (consumer side) and tail (producer side), each independently CAS-ed. Enqueue: CAS the last node's next pointer to the new node, then swing tail forward. Dequeue: CAS head forward. Used in DPDK's rte_ring and Java's ConcurrentLinkedQueue.
| Property | Treiber Stack | MS Queue | rte_ring (DPDK) |
|---|---|---|---|
| Ordering | LIFO | FIFO | FIFO |
| Producers | MPSC or MPMC | MPMC | SPSC / MPMC modes |
| Consumers | MPMC | MPMC | SPSC / MPMC modes |
| ABA risk | Yes — use tagged ptr | Yes — use tagged ptr | Index-based (no ptr) — immune |
| Memory reclamation | Hazard pointers / RCU | Hazard pointers / RCU | N/A — ring slots pre-allocated |
| Kernel usage | Lock-free lists (rculist) | Rare — RCU preferred | DPDK data plane |
12. §6.12 — Per-CPU Data
Per-CPU variables give each CPU its own private copy of a variable. Accesses require no synchronization at all — no lock, no atomic, no cache-line bounce. The kernel uses them for run-queues, statistics counters, softirq work queues, and slab allocator magazines.
API at a Glance
| Macro / Function | What it does | Preemption safe? |
|---|---|---|
DEFINE_PER_CPU(type, name) | Declare a per-CPU variable in the per-CPU section | — |
DEFINE_PER_CPU_ALIGNED(type, name) | Same, but aligned to cache line — prevents false sharing | — |
get_cpu_var(name) | Disable preemption + return reference to this CPU's copy | Yes — must call put_cpu_var |
put_cpu_var(name) | Re-enable preemption after get_cpu_var | Yes |
this_cpu_read(name) | Read own CPU's copy — preemption must already be disabled | No — caller responsible |
this_cpu_inc(name) | Increment own copy atomically (non-preemptible context) | No |
per_cpu(name, cpu) | Access another CPU's copy — needs lock or irqsave | No — use carefully |
alloc_percpu(type) | Dynamic per-CPU allocation (returns __percpu pointer) | — |
free_percpu(ptr) | Free dynamic per-CPU allocation | — |
Why Per-CPU Beats atomic_t at Scale
Every atomic_inc() generates a bus-locked LOCK XADD on x86. With 64 CPUs all hitting the same counter, the cache line bounces between them — throughput degrades as O(N). With per-CPU: each CPU writes its own cache line, zero coherence traffic — throughput scales as O(1). The read cost is O(NR_CPUS) (summing all slots), but reads are rare for statistics counters.
Minimal C Demo — Per-CPU Counter Simulation
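A user-space simulation of the per-CPU idea: each thread owns one cache-line-padded slot, increments never contend, and a read sums every slot. Thread and iteration counts are illustrative.

```c
/* percpu_sim.c — per-thread counter slots, padded to avoid false sharing.
 * Build: gcc -O2 -pthread percpu_sim.c -o percpu_sim
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    1000000

/* one slot per thread, padded/aligned to 64 bytes to avoid false sharing */
struct slot {
    long count;
    char pad[64 - sizeof(long)];
} __attribute__((aligned(64)));

static struct slot slots[NTHREADS];

static void *worker(void *arg)
{
    struct slot *mine = &slots[(long)arg];
    for (int i = 0; i < NITER; i++)
        mine->count++;            /* plain increment: no atomic, no bouncing */
    return NULL;
}

static long read_total(void)
{
    long sum = 0;
    for (int i = 0; i < NTHREADS; i++)   /* O(NR_CPUS)-style summation */
        sum += slots[i].count;
    return sum;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld (expected %d)\n", read_total(), NTHREADS * NITER);
    return 0;
}
```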
Always bracket the access with get_cpu_var / put_cpu_var — they disable preemption, preventing the scheduler from migrating you to another CPU between the get and the access. Never use this_cpu_inc in preemptible context — the CPU number can change after the read.
13. §6.13 — Synchronization Design Patterns
Choosing the wrong primitive costs performance or correctness. The decision tree below encodes the kernel community's conventional wisdom: start with per-CPU data, escalate to RCU for read-heavy, and only reach for a sleeping lock when the critical section is long or blocking.
Producer-Consumer: Ring Buffer Pattern
The most common real-world pattern in network drivers and audio subsystems.
| Variant | Sync needed | Kernel example |
|---|---|---|
| SPSC (1 producer, 1 consumer) | smp_wmb / smp_rmb only — no lock | pipe, kfifo (SPSC mode) |
| MPSC (N producers, 1 consumer) | Atomic CAS on head index | io_uring submission ring |
| MPMC (N producers, N consumers) | qspinlock or lock-free (MS queue) | rte_ring DPDK |
SPSC Ring — Barrier-Only Synchronization
The producer writes data, then advances head; the consumer reads data, then advances tail. No lock is needed because only one thread touches each index.
Plan:
- Producer: write data into buf[head % N]
- Producer: smp_wmb() — ensure data visible before index
- Producer: WRITE_ONCE(head, head + 1)
- Consumer: h = READ_ONCE(head)
- Consumer: smp_rmb() — ensure index seen before data
- Consumer: read buf[tail % N]
Why barriers: without smp_wmb() the CPU or compiler may publish the index before the data — the consumer would see the new index and read stale data. A sketch follows below.
An alternative to RCU for lock-free structures. Each thread announces the pointer it is currently accessing in a public hazard pointer slot. The reclaimer scans all hazard pointer slots before freeing — if any slot holds the address, it defers the free. Advantages over RCU: no grace-period concept, works without preemption semantics. Disadvantages: O(N) scan per free, and hazard pointers must be carefully ordered with the access (store-load barrier).
Common Anti-Patterns to Avoid
| Anti-pattern | Problem | Fix |
|---|---|---|
| Lock in IRQ handler for data shared with process context | If process context holds the lock, IRQ handler deadlocks on same CPU | Use spin_lock_irqsave in process context |
| mutex_lock inside spinlock critical section | mutex_lock may sleep → BUG: scheduling while atomic | Use spinlock for both, or restructure to avoid nesting |
| Bare pointer read under RCU without rcu_dereference() | Compiler may split the load; Alpha CPUs may speculate through pointer | Always use rcu_dereference() inside rcu_read_lock() |
| WRITE_ONCE missing on published flag | Compiler may cache flag in register; reader never sees update | Pair WRITE_ONCE (publisher) with READ_ONCE (subscriber) |
| Taking two locks in different orders in different code paths | Classic AB-BA deadlock; lockdep catches it at runtime | Document and enforce global lock ordering |
14. §6.14 — Debugging Concurrency Issues
Concurrency bugs (data races, deadlocks, priority inversion, false sharing) are the hardest class of kernel bugs to reproduce. The kernel ships with four major dynamic analysis tools — enable them during development, not production.
Tool Overview
| Tool | Detects | Config | Overhead |
|---|---|---|---|
| lockdep | Lock order cycles, missing unlock, irq-safe violations | CONFIG_LOCKDEP | ~10% CPU, ~1 MB memory |
| KCSAN | Data races (concurrent conflicting accesses) | CONFIG_KCSAN | ~10× slowdown — dev only |
| KASAN | Use-after-free, out-of-bounds (also useful for UAF in lock-free code) | CONFIG_KASAN | ~2× slowdown |
| perf c2c | False sharing (cache-to-cache transfers on shared lines) | perf c2c record/report | ~5% with sampling |
| ThreadSanitizer (user) | Data races in user-space lock-free code | gcc/clang -fsanitize=thread | ~5–15× slowdown |
lockdep — Lock Dependency Graph
lockdep builds a directed graph of lock acquisitions at runtime. An edge A → B means "lock A was held when lock B was acquired." If a cycle is ever detected (A → B and B → A from different code paths), lockdep fires WARN_ON with a full stack trace before the deadlock actually occurs — it detects the potential for deadlock, not just when it happens.
KCSAN — Kernel Concurrency Sanitizer
KCSAN instruments every load and store at compile time. At runtime it records concurrent accesses to the same address and reports a data race if at least one is a write. It uses a watchpoint-based approach — each CPU can hold a small number of watchpoints at a time, providing statistical coverage without full shadow memory.
- data_race(x) — suppress a known benign race (e.g., a statistical sample)
- READ_ONCE(x) / WRITE_ONCE(x) — suppress races that are intentionally race-free at the C level
- kcsan_nestable_atomic_begin/end() — mark atomic regions KCSAN should not instrument
perf c2c — False Sharing Detection
perf c2c record -- ./workload captures cache-to-cache transfer events (HITM — Hit In Modified state from another core). The report shows which cache lines are hot and which source lines produced the conflicting accesses. Typical fix: pad the hot variables to separate cache lines with __cacheline_aligned_in_smp or restructure structs so writer and reader fields do not share a line.
Minimal C Demo — lockdep Cycle Detection
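A kernel-module-style sketch that deliberately acquires two spinlocks in A→B order and then B→A. With CONFIG_LOCKDEP enabled, the second ordering should produce a "possible circular locking dependency" report even though no deadlock actually occurs here. Module and lock names are illustrative.

```c
/* lockdep_abba.c — provoke lockdep's AB-BA cycle detection. */
#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock_a);
static DEFINE_SPINLOCK(lock_b);

static int __init abba_init(void)
{
    /* code path 1: A -> B (records the edge A -> B) */
    spin_lock(&lock_a);
    spin_lock(&lock_b);
    spin_unlock(&lock_b);
    spin_unlock(&lock_a);

    /* code path 2: B -> A (creates the cycle; lockdep warns here) */
    spin_lock(&lock_b);
    spin_lock(&lock_a);
    spin_unlock(&lock_a);
    spin_unlock(&lock_b);

    return 0;
}

static void __exit abba_exit(void) { }

module_init(abba_init);
module_exit(abba_exit);
MODULE_LICENSE("GPL");
```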
Debugging Workflow
Step 1 — Enable lockdep (CONFIG_LOCKDEP) and run your workload. All lock-ordering violations are caught here without reproducing the race.
Step 2 — Enable KCSAN (CONFIG_KCSAN). Run with multiple CPUs and high load. KCSAN samples statistically — run for minutes, not seconds.
Step 3 — perf c2c if throughput regressed. Profile, identify hot shared lines, add padding.
Step 4 — KASAN if UAF suspected in lock-free code (RCU callback frees before last reader, ABA pointer recycle).
15. Kernel Source Pointers
| File | Symbol / Macro | What it does |
|---|---|---|
include/linux/compiler.h | barrier() | Compiler-only fence — stops compiler reordering; no CPU instruction |
include/linux/compiler.h | READ_ONCE(), WRITE_ONCE() | Force single volatile-width load/store; prevent compiler optimizations |
include/asm-generic/barrier.h | smp_mb(), smp_rmb(), smp_wmb() | Full / read / write memory barriers; arch-specific implementations |
arch/x86/include/asm/barrier.h | smp_mb() → mfence | x86 TSO: smp_rmb/wmb are barriers only (no fence instruction) |
arch/arm64/include/asm/barrier.h | smp_mb() → dmb ish | ARM64 full inner-shareable domain barrier |
include/linux/atomic.h | smp_load_acquire(), smp_store_release() | One-directional barriers for publish-subscribe patterns |
tools/memory-model/ | LKMM litmus tests | Formal memory model; run with herd7 to verify ordering assumptions |
include/linux/atomic.h | atomic_t / atomic_inc / atomic_cmpxchg / atomic_dec_and_test | Core atomic API; arch-specific impls in arch/*/include/asm/atomic.h |
include/linux/refcount.h | refcount_t / refcount_inc / refcount_dec_and_test | Saturating refcount; CONFIG_REFCOUNT_FULL adds overflow detection |
include/linux/kref.h | kref / kref_init / kref_get / kref_put | Reference-counted object lifecycle; uses atomic_t internally |
lib/percpu_counter.c | percpu_counter_add / percpu_counter_sum | Per-CPU batched counter; batch size = percpu_counter_batch |
include/linux/spinlock.h | spin_lock / spin_unlock / spin_lock_irqsave | Spinlock API; arch-specific ticket/queued spinlock implementations |
include/linux/mutex.h | mutex_lock / mutex_unlock / mutex_trylock | Sleeping mutex; waiters sleep on the mutex wait_list; not usable in IRQ context
kernel/locking/lockdep.c | lockdep — lock dependency tracker | Detects lock order cycles and missing unlock at runtime (CONFIG_LOCKDEP) |
kernel/locking/rtmutex.c | rt_mutex — priority inheritance mutex | Prevents priority inversion; used by PREEMPT_RT and futex PI |
kernel/locking/qspinlock.c | queued_spin_lock() — fast / medium / slow paths | Fast: single CAS; medium: pending bit; slow: MCS queue. Default spinlock since kernel 4.2 |
kernel/locking/mcs_spinlock.h | mcs_spinlock / mcs_lock / mcs_unlock | Per-CPU queue node; each waiter spins on own locked field — O(1) bus traffic per unlock |
include/linux/spinlock.h | spin_lock_irqsave / spin_lock_bh / spin_trylock | irqsave: disables local IRQs + acquires lock; bh: disables softirqs + lock |
kernel/locking/mutex.c | mutex_lock / mutex_lock_interruptible / mutex_trylock | Adaptive spinning: optimistic spin if owner on CPU; sleep if owner blocked. mutex_osq for spinner queue |
include/linux/mutex.h | struct mutex — owner, wait_lock, wait_list, osq | owner field stores task pointer + flag bits (HANDOFF, PICKUP, WAITERS) |
include/linux/seqlock.h | seqlock_t / read_seqbegin / read_seqretry / write_seqlock | Seq counter: even=stable, odd=write active; seqcount_t is the lockless variant |
include/linux/rcupdate.h | rcu_read_lock / rcu_read_unlock / rcu_dereference / rcu_assign_pointer | Core RCU API; read-side is preemption-disable only — never acquires a lock |
include/linux/rcupdate.h | synchronize_rcu() / call_rcu() / kfree_rcu() | synchronize_rcu: blocking grace period; call_rcu: async callback; kfree_rcu: call_rcu + kfree |
kernel/rcu/tree.c | RCU tree hierarchy — rcu_node / rcu_data | Organizes CPUs into a fanout tree; grace period completes when all leaf nodes report quiescent |
include/linux/rculist.h | list_add_rcu / list_del_rcu / list_for_each_entry_rcu | RCU-safe list ops; del unlinks but does not free — free only after synchronize_rcu / call_rcu |
include/linux/srcu.h | srcu_struct / srcu_read_lock / synchronize_srcu | Sleepable RCU: allows sleeping in read-side critical section; per-domain grace period tracking |
include/linux/percpu.h | DEFINE_PER_CPU / get_cpu_var / put_cpu_var / this_cpu_inc | Per-CPU variable API; get_cpu_var disables preemption; this_cpu_* are preemption-unsafe shortcuts |
include/linux/percpu.h | alloc_percpu() / free_percpu() | Dynamic per-CPU allocation; returns __percpu-annotated pointer; use per_cpu_ptr() to dereference |
lib/percpu-refcount.c | percpu_ref — per-CPU reference count | Scalable refcount: writers use per-CPU slots; percpu_ref_kill() switches to atomic for teardown |
include/linux/kfifo.h | kfifo — kernel lock-free SPSC ring buffer | Single-producer single-consumer ring; uses only memory barriers (no lock) for synchronization |
kernel/locking/lockdep.c | lock_acquire / lock_release / lockdep_init_map | lockdep core: records lock class graph, detects cycles. CONFIG_LOCKDEP enables at boot |
kernel/kcsan/core.c | kcsan_setup_watchpoint / kcsan_check_access | KCSAN runtime: instruments each access; CONFIG_KCSAN; data_race() macro suppresses benign races
tools/perf/builtin-c2c.c | perf c2c — cache-to-cache false sharing profiler | Reports HITM counts per cache line; identifies false sharing between CPUs |
16. Interview Prep
False sharing occurs when two CPUs write to different variables that share the same 64-byte cache line. Every write causes an MESI invalidation, forcing the other CPU to refetch the line. Detection: perf c2c shows cache-to-cache transfers. Fix: pad each hot variable to a full cache line, use __cacheline_aligned_in_smp, or use per-CPU data (DEFINE_PER_CPU).
Four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, multiple copies), Invalid. A write-back occurs when a Modified line transitions to Shared (another CPU requests a read — the modified CPU writes back and both go to Shared) or to Invalid (another CPU requests a write — the modified CPU writes back then gives up the line).
smp_mb() is a full barrier — no load or store on either side may cross it. smp_wmb() orders only stores. Use smp_wmb() when a producer writes data then a flag, paired with smp_rmb() on the consumer. Use smp_mb() when you need to prevent both load and store reordering — e.g., in lock implementations that must flush the store buffer and drain the invalidation queue.
x86 has TSO (Total Store Order): stores are observed in program order by other CPUs — no store-store reordering is visible. smp_wmb() still contains a compiler barrier (asm volatile), preventing the compiler from reordering. The CPU guarantee handles the hardware side. On ARM/POWER these become real fence instructions (dmb ishst / dmb ishld).
READ_ONCE() prevents the compiler from caching a variable in a register, tearing a read, or speculating it — it forces a single volatile-width load. smp_rmb() is a CPU-level read barrier that prevents the CPU's out-of-order engine from completing loads below the barrier before loads above it. READ_ONCE is about the compiler; smp_rmb is about the hardware. Both are usually needed together in a publish-subscribe pattern.
Use acquire/release (one-directional barriers) for most publish-subscribe synchronization — they are cheaper (no full fence on x86, just ldar/stlr on ARM). Use seq_cst only when you need a total global order across multiple atomic variables — e.g., Dekker-style mutual exclusion where both sides write and both sides read. seq_cst is significantly more expensive on ARM and POWER.
cmpxchg(&v, old, new) returns the value that was in v at the time of the operation. If the returned value equals old, the swap succeeded. If not, another CPU changed v before us — we reload, recompute, and retry. The pattern is: old = READ_ONCE(v); new = f(old); while (cmpxchg(&v, old, new) != old) { old = READ_ONCE(v); new = f(old); }. This is the basis for all lock-free data structures.
refcount_t adds saturation semantics: an increment on a zero-count object fires a warning (kref_get on a dead object is a use-after-free bug), and an overflow saturates the counter (with a warning) rather than wrapping around, preventing an attacker from wrapping a counter to manufacture a fake reference. With CONFIG_REFCOUNT_FULL, every operation is checked. atomic_t has none of these protections.
Use per-CPU counters when the write rate is high and readers can tolerate approximate values (filesystem usage, network statistics). Each CPU increments its local slot — zero cross-CPU coherence traffic — so throughput scales linearly with core count. Trade-offs: reads must sum all per-CPU slots (O(NR_CPUS) cost) and the total may be off by up to NR_CPUS × batch. Not suitable for reference counts or anything requiring exact, immediately consistent values.
Lock ordering: every code path that needs locks A and B must always acquire them in the same global order (e.g., by address, by inode number, or by documented hierarchy). The lockdep subsystem tracks the directed graph of 'lock A was held when lock B was acquired' at runtime and fires a BUG if a cycle is detected. Alternative: trylock with release-on-fail — attempt to take the second lock; on failure release the first and retry with a backoff.
Priority inversion: a high-priority task blocks on a mutex held by a low-priority task; a medium-priority task then preempts the low-priority holder, indirectly blocking the high task for an unbounded time. Solution: priority inheritance (rt_mutex). While a task holds an rt_mutex, its scheduling priority is temporarily boosted to the maximum priority of any waiter. This prevents medium-priority tasks from cutting in. Configured with PTHREAD_PRIO_INHERIT in userspace; built into rt_mutex in the kernel.
Fast path: a single CAS on the 32-bit lock word acquires the lock when uncontended. Medium path: if the lock is held and no one is waiting, the second waiter sets a pending bit and spins reading the owner bit — no node allocation, no queue. Slow path: the third and subsequent waiters join an MCS queue, each spinning on their own node's locked field. MCS is needed because without it, all waiters would spin on the same cache line, generating O(N) invalidations on every unlock — a bus storm that degrades with core count.
If you call spin_lock() in process context and an IRQ fires on the same CPU while you hold the lock, the IRQ handler will try to acquire the same lock — deadlock, because the interrupted code cannot release it while the handler is running. spin_lock_irqsave() disables local IRQs before acquiring, preventing the IRQ from firing until unlock. The irqsave variant saves the previous IRQ flags so nested calls restore the correct state.
A spinlock disables preemption on the current CPU. If the holder calls schedule() (directly or via a blocking call like kmalloc(GFP_KERNEL), copy_from_user(), mutex_lock()), the scheduler will try to context-switch away — but it cannot safely do so, because preemption is disabled. With CONFIG_DEBUG_ATOMIC_SLEEP this triggers a "BUG: scheduling while atomic" / "sleeping function called from invalid context" report; without it, the result is a hard-to-diagnose lockup. Additionally, sleeping on a spinlock-held CPU blocks any other waiter for that spinlock on that CPU forever.
It enters the optimistic spinning phase. The waiter spins in a tight loop (cpu_relax()), re-checking the lock owner. If the owner releases the lock during this spin, the waiter acquires without ever going to sleep — saving the ~5–10 µs cost of a context switch. The waiter gives up spinning if: (1) the owner leaves the CPU (blocks or gets preempted), (2) a higher-priority task becomes runnable, or (3) an internal spin limit is hit. At that point it adds itself to the mutex wait_list and calls schedule().
A seqlock uses a sequence counter (even = stable, odd = write in progress). Writers increment the counter at the start and end of the update; readers snapshot it before and after — if the snapshots differ or the first was odd, they retry. Use seqlock when: (1) reads are far more frequent than writes (e.g., jiffies, clock_gettime), (2) the protected data is small plain numeric values, and (3) retrying is cheap. Prefer it over rwlock for such read-mostly data: rwlock requires a bus-locked atomic on every read acquisition, while seqlock readers pay almost nothing. But seqlock cannot protect pointers — a reader may dereference a freed pointer between seqbegin and seqretry.
On non-preemptible RCU configurations, rcu_read_lock() expands to preempt_disable() — it increments the per-CPU preempt_count field. No lock is acquired, no atomic operation is performed, no cache line is transferred. This is why the read side is essentially free. On preemptible kernels (CONFIG_PREEMPT_RCU, which PREEMPT_RT uses), readers are instead tracked with a lightweight per-task nesting counter and may even be preempted inside the critical section. Either way the effect is the same: the CPU (or task) cannot be counted as quiescent while a reader is active, and the quiescent states RCU watches for — context switches, userspace execution, idle — cannot be (mis)reported under an active reader.
A grace period is the time after a writer calls synchronize_rcu() until every CPU has passed through at least one quiescent state. A quiescent state is a point at which a CPU is guaranteed to not be inside any RCU read-side critical section. Three examples: (1) a context switch — schedule() runs, so preempt_count must be 0; (2) entry to userspace — the kernel is not running any RCU-protected code; (3) the idle loop — the CPU is doing nothing. Once every CPU has passed a quiescent state, any reader that was running before the writer's rcu_assign_pointer() must have finished, making it safe to free the old object.
synchronize_rcu() blocks the caller until the grace period ends — it may sleep for milliseconds. Use it when you can afford to block (process context, no locks held). call_rcu(head, fn) registers a callback fn to be called after the grace period and returns immediately. Use it when you cannot block: inside an IRQ handler, a softirq, a BH, or while holding a spinlock. call_rcu is also more efficient when freeing many objects: callbacks from multiple writers are batched into a single grace period. kfree_rcu(ptr, field) is a convenience wrapper that calls call_rcu then kfree, avoiding boilerplate callback functions.
No. rcu_dereference() expands to READ_ONCE(p) plus a dependency ordering barrier. READ_ONCE prevents the compiler from re-reading the pointer from memory twice (TOCTOU) or from caching it in a register across the read-side critical section. The dependency barrier ensures that on architectures with weak memory ordering (Alpha, theoretically), the CPU does not prefetch data through the pointer before the pointer load is observed. On x86 (TSO) the hardware part is a no-op, but the compiler barrier is always needed. Without rcu_dereference(), a compiler optimization or a weak-arch CPU can produce use-after-free bugs that are impossible to reproduce on x86.
ABA: a CAS loop reads pointer A, gets preempted. Another thread pops A, frees it, allocates a new object that gets the same address A, then pushes it. When the original thread resumes, CAS(&top, A, next) succeeds — it sees the same address — but 'next' is now a dangling pointer. The kernel avoids ABA in lock-free lists primarily through RCU: kfree_rcu and call_rcu defer the free until all concurrent readers have finished, so the old address cannot be reallocated and reused while any CAS loop is still running. DPDK's rte_ring avoids ABA entirely by using integer indices instead of raw pointers — indices can wrap but the ring size prevents a false match.
CAS fails when another thread changed 'top' between our read and our compare. We retry with the new top value — our thread makes progress because at least one thread (the one that beat our CAS) successfully pushed. This is lock-free: the system as a whole always makes progress, no thread can be blocked indefinitely by the scheduler. It is not wait-free: a single thread can theoretically retry infinitely if it keeps losing races. Wait-free algorithms guarantee each thread completes in a bounded number of steps — they require more complex constructions (fetch-and-increment, combining trees).
Use per-CPU when: write rate is high and the value is per-thread statistics (counters, allocator magazines, run-queue state). Correctness rules: (1) always use get_cpu_var/put_cpu_var or preempt_disable/enable around accesses — if the task is migrated between the read and the write, you corrupt two CPUs' data. (2) Never use this_cpu_inc in preemptible context. (3) Reading another CPU's slot (per_cpu(var, cpu)) requires that CPU to not be updating it simultaneously — use spin_lock or memory barriers. (4) Totals computed by summing all slots may be transiently negative or larger than the true value — never use per-CPU for reference counts.
lockdep tracks 'lock A was held when lock B was acquired' as directed edges in a graph. If it detects a cycle (A→B and B→A from different code paths), it fires WARN_ON before the deadlock occurs. False positive example: two different inode mutexes with the same lockdep class (same static lock_class_key). Code path 1 locks inode-A then inode-B; code path 2 locks inode-B then inode-A. lockdep sees A→B and B→A and reports a cycle — but in practice both A and B are always acquired in address order (lock_two_nondirectories), so no real deadlock. Fix: use lockdep_set_subclass() or nest_lock() to annotate the ordering guarantee.
KCSAN instruments every load/store with __tsan_read/__tsan_write calls. Each CPU registers a watchpoint (address, type) for a short window. If another CPU concurrently accesses the same address and at least one access is a write, KCSAN reports a data race with stack traces from both sides. A benign race is a concurrent read of a value that is not protected by a lock but where the torn or stale read is intentional and harmless — e.g., reading a statistics counter without holding the update lock. Suppress with data_race(x) macro, which tells KCSAN 'this concurrent access is intentional.' For single-load/single-store races that are intentionally unsynchronized, use READ_ONCE/WRITE_ONCE instead — they silence KCSAN and also prevent compiler optimizations.
Run: perf c2c record -- ./workload; perf c2c report. HITM = Hit In Modified — a load from CPU A that hits a cache line in the Modified state on CPU B, forcing a cache-to-cache transfer. This is the signature of false sharing or true sharing under write-write contention. The report shows per-cache-line HITM counts and the source code lines responsible for the reader and writer accesses. Fix: pad the struct so the hot writer and hot reader fields land on separate 64-byte cache lines, using __cacheline_aligned_in_smp or ____cacheline_aligned between fields.