Part VI — Parallel Programming & Synchronization

§6.1 – 6.14 · Hardware Foundations · MESI · Barriers · Atomics · Counting · Locking · Spinlock · Mutex · Seqlock · RCU · Lock-Free · Per-CPU · Patterns · Debugging

Based on Paul McKenney's Is Parallel Programming Hard (PPH) Ch.2–3, 15 and Linux kernel locking internals (LKD Ch.9–10).

Overview

Correct concurrent programming starts at the hardware level. A CPU does not execute instructions in the order you wrote them — it reorders stores and loads for performance. Multiple CPUs sharing memory introduce cache coherence traffic that can make a naive counter slower than a single-threaded one. Every kernel synchronization primitive — from a spinlock to RCU — exists to manage these hardware-level effects. This part builds the foundation: hardware behavior (§6.1), the MESI protocol that keeps caches consistent (§6.2), and the memory barriers that enforce ordering across CPUs (§6.3).

1. §6.1 — Hardware Foundations for Parallelism

CPU Pipeline: Store Buffer & Invalidation Queue

A modern CPU core does not write to the cache immediately. When a store instruction executes, the value goes into the store buffer — a small, private FIFO inside the core. The cache line is only updated when the store buffer drains, which can be many nanoseconds later. Meanwhile, other CPUs cannot see the write yet.

Similarly, when a cache line is invalidated by another CPU's write, the invalidation message is first dropped into the invalidation queue rather than applied immediately. This batching improves throughput but means a CPU might read a stale value from its L1 cache even after the invalidation message has arrived.

Consequence for software: Even on a strongly ordered machine like x86 (TSO), the store buffer creates a window where a value written by CPU0 is invisible to CPU1. Memory barriers flush the store buffer, closing this window.

Memory Hierarchy & Latency

Cache misses are the dominant cost in concurrent code. A lock acquisition that transfers a cache line from a remote CPU costs ~100–300 ns — as slow as a DRAM access. Designing concurrent code means designing to minimize cache-line transfers.

| Level | Size | Latency | Scope | Kernel API |
| --- | --- | --- | --- | --- |
| L1 | 32–64 KB | ~1 ns / 4 cycles | Per-core, private | prefetch(), cache-aligned structs |
| L2 | 256 KB – 1 MB | ~3–5 ns / 12 cy | Per-core, private | __builtin_prefetch() |
| L3 / LLC | 8–64 MB | ~10–30 ns / 40 cy | All cores share | NUMA-aware allocation |
| DRAM | GBs | ~60–100 ns | Global | get_free_pages(), vmalloc() |
| Remote NUMA | GBs | ~150–300 ns | Other NUMA node | alloc_pages_node() |

False Sharing — The Hidden Performance Killer

Background: Two threads each have their own counter — counter_A (owned by CPU0) and counter_B (owned by CPU1). They never touch each other's counter. Yet a benchmark shows they are 10× slower than expected.
Root cause: Both counters happen to sit within the same 64-byte cache line. Every time CPU0 writes counter_A, the MESI protocol marks CPU1's copy of that line Invalid — forcing CPU1 to re-fetch it before it can write counter_B. They ping-pong the line back and forth even though they never share data.
| Detection / Prevention | Tool / Technique |
| --- | --- |
| Detect false sharing | perf c2c report — shows cache-to-cache transfer hotspots |
| Pad struct to cache line | struct { int val; char pad[60]; } — ensures one field per line |
| Kernel annotation | __cacheline_aligned_in_smp — aligns struct start to 64 B |
| Per-CPU variables | DEFINE_PER_CPU — each CPU has own copy, zero coherence traffic |
| DPDK equivalent | RTE_CACHE_LINE_SIZE (64), rte_cache_aligned attribute |

2. §6.2 — Cache Coherence: MESI Protocol

Every cache line in every L1/L2 cache is tagged with one of four MESI states. The protocol guarantees a globally consistent view of memory across all cores — but at the cost of bus traffic on every state transition.

| State | Meaning | In Memory? | Other Copies? |
| --- | --- | --- | --- |
| Modified (M) | Dirty — this CPU wrote to it, memory is stale | No (stale) | No |
| Exclusive (E) | Clean — only this CPU has it, matches memory | Yes | No |
| Shared (S) | Clean — multiple CPUs have read-only copies | Yes | Yes |
| Invalid (I) | Not in this cache; must fetch before use | — | Maybe |

Key Transitions to Memorize

| Trigger | Transition | Bus Traffic? |
| --- | --- | --- |
| Local write to Exclusive line | E → M | None (silent) |
| Local write to Shared line | S → M | Send invalidation to all sharers |
| Remote CPU reads Modified line | M → S | Writeback to memory, share |
| Remote CPU writes any cached line | M/E/S → I | Receive invalidation |
| Read miss, no other copy | I → E | Fetch from memory |
| Read miss, other copies exist | I → S | Fetch + notify sharers |

Cache-Line Bouncing: atomic_inc() Under Contention

Background: Two CPUs repeatedly call atomic_inc() on the same global counter — a common pattern in reference counting or statistics collection. Each increment requires exclusive ownership of the cache line. The result is a round trip across the coherence bus for every single increment.
Real measurement: On a 16-core machine, a tight atomic_inc() loop on a single shared counter saturates at ~50 million ops/sec across all cores — the same throughput as 1–2 cores. A per-CPU counter scales linearly to 800+ million ops/sec because there is no coherence traffic.

MOESI Extension

AMD CPUs add an Owned (O) state — a line that is dirty but shared. The owner responds to snoops directly (CPU-to-CPU transfer) without writing back to DRAM first. This reduces memory bus traffic in read-heavy workloads. Intel uses a similar Forward state in some implementations.

3. §6.3 — Memory Barriers & Ordering

Background: A producer writes data = 99 then ready = 1. A consumer polls ready and then reads data. On x86 this usually works — but on ARM/POWER the CPU can reorder the two stores, and the consumer can see ready = 1 while data is still 0. Even on x86, the compiler can reorder them.

Two Levels of Reordering

| Level | Cause | Fix |
| --- | --- | --- |
| Compiler reordering | Compiler assumes single-thread view; may hoist/sink loads and stores | barrier() macro — empty asm volatile block forces compiler fence |
| CPU reordering | Out-of-order execution, store buffer, invalidation queue | Hardware memory fence instruction (mfence / dmb / sync) |
| Speculative loads | CPU prefetches data before branch resolves | READ_ONCE() prevents compiler from caching in register |
| Torn writes | Compiler splits 64-bit write into two 32-bit stores | WRITE_ONCE() forces single atomic-width store |

Linux Kernel Barrier API

| API | Direction | x86 instruction | ARM instruction | C11 equivalent |
| --- | --- | --- | --- | --- |
| smp_mb() | Full (loads + stores) | mfence | dmb ish | memory_order_seq_cst |
| smp_wmb() | Write (stores only) | sfence (or nop on TSO) | dmb ishst | memory_order_release (store) |
| smp_rmb() | Read (loads only) | lfence (or nop on TSO) | dmb ishld | memory_order_acquire (load) |
| smp_store_release() | Write + release | MOV (TSO) | stlr | atomic_store(release) |
| smp_load_acquire() | Read + acquire | MOV (TSO) | ldar | atomic_load(acquire) |
| READ_ONCE(x) | Compiler barrier + single load | — | — | volatile load |
| WRITE_ONCE(x,v) | Compiler barrier + single store | — | — | volatile store |

x86 TSO (Total Store Order)

x86 guarantees that stores are observed by other CPUs in program order (no store-store reordering visible to observers). It does allow store-load reordering: because of the store buffer, a later load may complete before an earlier store becomes globally visible, and a CPU sees its own stores early via store forwarding. In practice this means smp_wmb() and smp_rmb() compile to nothing on x86 (the compiler barrier is still needed). Only smp_mb() compiles to mfence.

On ARM and POWER the memory model is much weaker — any reordering is possible. Correct code must use explicit barriers on every platform, not just x86.

Pairing Rules — the Golden Rule

A smp_wmb() on the writer must pair with a smp_rmb() on the reader. A smp_store_release() must pair with a smp_load_acquire(). Unpaired barriers provide no ordering guarantee — you get the fence overhead with none of the safety.

Minimal C Demo — Memory Barrier Simulation

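A minimal userspace sketch of the producer/consumer pattern from §6.3 (the original demo embed was lost). C11 fences stand in for smp_wmb()/smp_rmb(); thread structure and names are illustrative, not kernel API.

```c
/* Producer writes data, then publishes a flag; a release fence plays
 * the role of smp_wmb(), an acquire fence plays smp_rmb().
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;            /* payload, written before the flag */
static atomic_int ready;    /* flag published after the payload */

static void *producer(void *arg)
{
    data = 99;                                         /* 1. write data */
    atomic_thread_fence(memory_order_release);         /* ~ smp_wmb()   */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *consumer(void *arg)
{
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                              /* poll the flag */
    atomic_thread_fence(memory_order_acquire);         /* ~ smp_rmb()   */
    printf("data = %d (must be 99)\n", data);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```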

Minimal C Demo — READ_ONCE / WRITE_ONCE

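A sketch of how READ_ONCE/WRITE_ONCE can be modeled in plain C with volatile casts — close in spirit to the kernel's definitions, though the real macros handle more cases. The demo is single-threaded; it only shows that every access compiles to a real load or store.

```c
/* Volatile-cast model of the kernel's one-shot accessors. These are
 * compiler-level only: no CPU fence is emitted. */
#include <stdio.h>

#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int flag;

/* Without READ_ONCE the compiler may hoist the load out of the loop
 * and spin forever on a stale register copy. */
static void wait_for_flag(void)
{
    while (!READ_ONCE(flag))
        ;  /* each iteration performs a fresh load from memory */
}

int main(void)
{
    WRITE_ONCE(flag, 1);   /* single, untorn store */
    wait_for_flag();       /* returns immediately: flag is already 1 */
    printf("flag = %d\n", READ_ONCE(flag));
    return 0;
}
```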

Minimal C Demo — C11 memory_order: release/acquire

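The same publish/subscribe idea expressed directly with a C11 release/acquire pair; a sketch with illustrative names.

```c
/* The release store on `ready` makes the earlier plain write to `data`
 * visible to any thread whose acquire load observes ready == 1.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data;
static atomic_int ready;

static void *producer(void *arg)
{
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    printf("data = %d (must be 42)\n", data);  /* ordered by acquire */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```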

4. §6.4 — Atomic Operations

Memory barriers tell the CPU and compiler about ordering. But they do not make an increment atomic — two CPUs can still read the same value, both add 1, and write back the same result (lost update). Atomic operations solve this by combining the read-modify-write into a single indivisible bus transaction. On x86 this uses the LOCK prefix; on ARM it uses ldxr/stxr (load-linked / store-conditional).

atomic_t, atomic64_t, refcount_t

| Operation | API | Notes |
| --- | --- | --- |
| Read | atomic_read(&v) | READ_ONCE wrapper — prevents compiler from caching |
| Write | atomic_set(&v, i) | WRITE_ONCE wrapper |
| Increment | atomic_inc(&v) | Returns void; for RMW with result use atomic_inc_return() |
| Decrement & test | atomic_dec_and_test(&v) | Returns true if result is zero — used in kref release check |
| Add & return | atomic_add_return(i, &v) | Returns new value after add; implies full barrier |
| Compare & exchange | atomic_cmpxchg(&v, old, new) | Returns old value; swap only if *v == old |
| Fetch & add | atomic_fetch_add(i, &v) | Returns value before add (C11: fetch_add) |
| OR / AND / XOR | atomic_or(mask, &v) | Bitwise ops; useful for flag bitmaps |

Core Mechanism: Compare-and-Swap (CAS) Loop

Background: You need to atomically apply a function f(old) → new to a shared variable without taking a lock. CAS lets you: read the old value, compute new, then attempt to commit — only succeeding if no other CPU changed the variable in between.
Plan:
  1. Load current value into old
  2. Compute new = f(old)
  3. Execute cmpxchg(&v, old, new): atomically swap only if v == old
  4. If cmpxchg returns old → success; else another CPU raced us — retry from step 1

Minimal C Demo — CAS Lock-Free Counter

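A userspace CAS loop following the four-step plan above. C11 atomic_compare_exchange_weak plays the role of cmpxchg(); it conveniently reloads the current value into `old` on failure.

```c
/* Lock-free counter: read, compute, attempt to commit, retry on race.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_long counter;

static void cas_add(atomic_long *v, long delta)
{
    long old = atomic_load_explicit(v, memory_order_relaxed);
    /* On failure, compare_exchange updates `old` with the value another
     * CPU just wrote, so the loop retries with fresh data. */
    while (!atomic_compare_exchange_weak(v, &old, old + delta))
        ;
}

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        cas_add(&counter, 1);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expect 4000000)\n", atomic_load(&counter));
    return 0;
}
```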

kref / refcount_t — Reference Counting

kref wraps a single atomic counter with a lifecycle discipline: every caller that uses an object takes a reference (kref_get), and releases it when done (kref_put). When the count reaches zero, kref_put calls the release function to free the object. refcount_t (since kernel 4.11) adds saturation — on overflow the counter sticks at a saturation value and warns instead of wrapping past zero, preventing use-after-free bugs from counter wraparound.

Minimal C Demo — kref / refcount_t Lifecycle

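A userspace sketch of the get/put lifecycle. The struct and function names are illustrative; the kernel's kref additionally warns on misuse and uses refcount_t saturation.

```c
/* Reference-counted object: last put runs the release step. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
    atomic_int refcount;
    int payload;
};

static struct obj *obj_create(int val)
{
    struct obj *o = malloc(sizeof(*o));
    atomic_init(&o->refcount, 1);       /* creator holds one reference */
    o->payload = val;
    return o;
}

static void obj_get(struct obj *o)      /* ~ kref_get() */
{
    atomic_fetch_add(&o->refcount, 1);
}

static void obj_put(struct obj *o)      /* ~ kref_put() */
{
    /* fetch_sub returns the value *before* the subtraction, so 1 means
     * we just dropped the last reference */
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        printf("last ref dropped, freeing payload=%d\n", o->payload);
        free(o);
    }
}

int main(void)
{
    struct obj *o = obj_create(7);
    obj_get(o);     /* second user takes a reference    */
    obj_put(o);     /* second user done: count 2 -> 1   */
    obj_put(o);     /* creator done: count 1 -> 0, free */
    return 0;
}
```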

5. §6.5 — Counting at Scale

A single atomic_inc() call on a shared counter causes a cache-line round trip on every call (§6.2). At scale — 16 CPUs each doing millions of increments per second — that single cache line becomes the bottleneck for the entire machine. The kernel solves this with per-CPU counters.

percpu_counter vs local_t vs atomic_t

| Type | Write cost | Read cost | Exact? | Use case |
| --- | --- | --- | --- | --- |
| atomic_t | ~100 ns (bus lock) | ~1 ns | Yes | Reference counts, small shared flags |
| percpu_counter | ~1 ns (local inc) | O(NR_CPUS) | Approximate (batch threshold) | FS block/inode usage, network stats |
| local_t | ~1 ns (no atomic) | Per-CPU only | Per-CPU exact, global inexact | Perf event local counts |
| DEFINE_PER_CPU(long) | ~1 ns | O(NR_CPUS) sum | Exact if summed carefully | Kernel stats, DPDK per-core counters |

percpu_counter batch trick: percpu_counter_add() accumulates increments locally until the per-CPU batch hits ± percpu_counter_batch (default: 32 on small systems, scales with NR_CPUS). Only then does it propagate to the global sum. This means a reader calling percpu_counter_sum() may see a value off by up to NR_CPUS × batch. For filesystem block usage this is acceptable; for reference counts it is not.

Minimal C Demo — Per-CPU Counter

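Per-CPU counting simulated with one cache-line-padded slot per thread: writers touch only their own line, and a reader pays O(N) to sum all slots. All names are illustrative.

```c
/* Zero-contention counting: each thread increments a private slot.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

struct slot {
    atomic_long count;
    char pad[64 - sizeof(atomic_long)];  /* one slot per cache line */
};

static _Alignas(64) struct slot slots[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (int i = 0; i < ITERS; i++)   /* relaxed: no ordering needed */
        atomic_fetch_add_explicit(&slots[id].count, 1,
                                  memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    long sum = 0;                      /* O(NTHREADS) read side */
    for (int i = 0; i < NTHREADS; i++)
        sum += atomic_load(&slots[i].count);
    printf("sum = %ld (expect %d)\n", sum, NTHREADS * ITERS);
    return 0;
}
```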

6. §6.6 — Locking Patterns

Choosing the wrong synchronization primitive is one of the most common performance mistakes in kernel and systems code. The table below encodes the key questions — each primitive has a specific niche, and using a heavier one than necessary wastes CPU cycles on every call.

| Primitive | Context | Sleep inside? | Overhead | Kernel example |
| --- | --- | --- | --- | --- |
| spinlock_t | Any (including IRQ) | Never | ~5–20 ns (uncontended) | sk_buff queues, run-queue lock |
| mutex | Process context only | Yes (blocking) | ~100 ns + ctx switch | inode i_mutex, mm mmap_lock |
| rwlock_t | Any (no sleep) | Never | ~10–30 ns | tasklist_lock, file table |
| seqlock | Any, read-heavy numerics | Never | Read: near-zero if no writer | jiffies, gettimeofday |
| RCU | Any (read side) | Never (read side) | Read: near-zero; update: ms-range grace period | network routing table, task list |
| atomic_t | Any | Never | ~5 ns (bus lock) | file descriptor refcount |
| per-CPU | Any (preempt disabled) | Never | ~1 ns (local) | FS stats, net device stats |

Deadlock Prevention: Lock Ordering

Background: Thread A holds L1 and waits for L2. Thread B holds L2 and waits for L1. Neither can proceed — circular wait.
Fix: Enforce a global lock order. Every code path that needs both L1 and L2 must acquire L1 first. The kernel documents this in comments (e.g. "always take i_lock before page lock"). The tool lockdep tracks acquisition order at runtime and fires a warning on any cycle.

Minimal C Demo — Lock Ordering

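A sketch of address-ordered acquisition: the two threads request the locks in opposite orders, but the helper canonicalizes to one global order, so no AB-BA cycle can form. pthread mutexes stand in for kernel locks.

```c
/* Deadlock prevention via a global (address-based) lock order.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

/* Always acquire the lower-addressed mutex first */
static void lock_both(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a > b) { pthread_mutex_t *tmp = a; a = b; b = tmp; }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

static void unlock_both(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

static void *thread_a(void *arg)
{
    lock_both(&L1, &L2);        /* asks for L1 then L2 */
    puts("A holds both");
    unlock_both(&L1, &L2);
    return NULL;
}

static void *thread_b(void *arg)
{
    lock_both(&L2, &L1);        /* asks in the opposite order...     */
    puts("B holds both");       /* ...but still acquires canonically */
    unlock_both(&L2, &L1);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```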

trylock + Release-on-Fail Pattern

When a code path must acquire two locks but cannot enforce a global order (e.g., the locks belong to different subsystems with incompatible hierarchies), use trylock: attempt to take the second lock; on failure, release the first and restart. Add a short backoff (cpu_relax() or cond_resched()) to avoid livelock.

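A sketch of the trylock pattern with pthread mutexes; sched_yield() stands in for cpu_relax()/cond_resched().

```c
/* Take the first lock, try the second; on failure drop the first,
 * back off, and restart — no global order needed, no deadlock.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;

static void lock_pair(pthread_mutex_t *first, pthread_mutex_t *second)
{
    for (;;) {
        pthread_mutex_lock(first);
        if (pthread_mutex_trylock(second) == 0)
            return;                   /* got both */
        pthread_mutex_unlock(first);  /* release-on-fail */
        sched_yield();                /* backoff to avoid livelock */
    }
}

static void *worker(void *arg)
{
    int swap = (int)(long)arg;
    for (int i = 0; i < 100000; i++) {
        if (swap) lock_pair(&L2, &L1);
        else      lock_pair(&L1, &L2);
        pthread_mutex_unlock(&L1);
        pthread_mutex_unlock(&L2);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)0);
    pthread_create(&b, NULL, worker, (void *)1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock despite opposite lock orders");
    return 0;
}
```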

Priority Inversion & PI Mutexes

A high-priority task blocks on a mutex held by a low-priority task. A medium-priority task preempts the low-priority holder, delaying the high-priority task indefinitely. The kernel's rt_mutex uses priority inheritance: while a task holds an rt_mutex, it temporarily runs at the highest priority of any waiter. This prevents medium priority tasks from causing unbounded delays. POSIX threads support this via PTHREAD_PRIO_INHERIT.

7. §6.7 — Spinlock Internals & Variants

A spinlock is the lowest-level mutual exclusion primitive in the kernel — it busy-waits instead of sleeping, making it safe in interrupt context where sleeping is forbidden. Three generations exist in the kernel, each fixing the previous one's scalability problem.

Generation 1: Test-and-Set — Simple but Unfair & Slow

Every waiting CPU repeatedly executes an atomic write (test-and-set) to the shared lock variable. Even when the lock is held and the attempt will fail, it still writes — generating a MESI invalidation that forces every other waiter to refetch the cache line before its next spin. With N waiters, each spin triggers N−1 invalidation round trips.

Generation 2: Ticket Spinlock — Fair but Still Bounces

A ticket spinlock assigns each caller a number (like a deli counter). The lock holds next (next ticket to issue) and owner (currently serving). Waiters spin on read-only copies of owner — no write per spin. Fair: CPUs are served FIFO. But unlock() increments owner, invalidating all spinners at once.

Generation 3: MCS Lock — Fully Local Spinning

Background: Even with ticket locks, unlock causes N cache-line invalidations (one per waiter). MCS eliminates this: each waiter allocates a private mcs_spinlock node and spins on its own locked field. On unlock, the holder writes only the direct successor's node — exactly one cache-line transfer.

qspinlock — Linux Default (kernel ≥ 4.2)

qspinlock combines all three ideas into a single 32-bit word. Uncontended acquires cost one CAS. A second waiter sets the pending bit and spins on the locked byte (no node allocation). The third and beyond join an MCS queue. This makes the common case (no contention) extremely cheap while scaling gracefully under high contention.

Spinlock Variants — When to Use Which

| Variant | Also disables | Restores IRQ state? | Use when |
| --- | --- | --- | --- |
| spin_lock(l) | Nothing extra | — | The lock is never taken from IRQ or BH context |
| spin_lock_irqsave(l, flags) | Local IRQs | Yes (flags) | Lock is also used in IRQ handlers on this CPU |
| spin_lock_bh(l) | Softirqs / tasklets | Yes | Lock is also used in softirq / BH context |
| spin_trylock(l) | Nothing | — | Non-blocking try — returns 0 on failure, no deadlock risk |

Golden rule: If the same lock is acquired in an interrupt handler, the process-context acquisition must use spin_lock_irqsave(). Without it, the IRQ can fire while process context holds the lock, and the handler will spin forever on a lock its own CPU can never release.

Minimal C Demo — Test-and-Set Spinlock

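A generation-1 test-and-set lock built on C11 atomic_flag. Each failed test_and_set is still an atomic write — exactly what causes the cache-line bouncing described above.

```c
/* Test-and-set spinlock: correct but unfair, and every spin writes.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void tas_lock(void)
{
    /* test_and_set writes even when it fails -> MESI invalidations */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void tas_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        tas_lock();
        counter++;              /* protected critical section */
        tas_unlock();
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expect 4000000)\n", counter);
    return 0;
}
```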

Minimal C Demo — MCS Lock Acquire / Release

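A compact MCS sketch in C11: tail is the lock word, each thread enqueues a private node and spins only on its own locked field. The unlock path handles the brief window where a successor has swapped tail but not yet linked itself.

```c
/* MCS lock: local spinning, one cache-line transfer per handoff.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;                    /* 1 = wait, 0 = my turn */
};

static _Atomic(struct mcs_node *) tail;   /* lock word: queue tail */
static long counter;

static void mcs_lock(struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, 1);
    /* enqueue self; the predecessor (if any) hands the lock over */
    struct mcs_node *prev = atomic_exchange(&tail, me);
    if (prev) {
        atomic_store(&prev->next, me);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                             /* spin on OWN node only */
    }
}

static void mcs_unlock(struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        /* no visible successor: try to swing tail back to empty */
        struct mcs_node *expect = me;
        if (atomic_compare_exchange_strong(&tail, &expect, NULL))
            return;
        while (!(succ = atomic_load(&me->next)))
            ;                             /* successor mid-enqueue */
    }
    atomic_store_explicit(&succ->locked, 0, memory_order_release);
}

static void *worker(void *arg)
{
    struct mcs_node node;                 /* private queue node */
    for (int i = 0; i < 100000; i++) {
        mcs_lock(&node);
        counter++;
        mcs_unlock(&node);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expect 400000)\n", counter);
    return 0;
}
```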

8. §6.8 — Sleeping Locks: Mutex & Semaphore

When a critical section may hold a lock for milliseconds (e.g., disk I/O, copying from user space), busy-spinning wastes an entire CPU. A mutex sleeps the waiter and wakes it on unlock. The kernel's struct mutex adds adaptive (optimistic) spinning: if the lock owner is currently running on another CPU, a short spin can avoid the costly context switch.

struct mutex Key Fields

| Field | Type | Purpose |
| --- | --- | --- |
| owner | atomic_long_t | Task pointer of current owner + flag bits (hand-off, pickup, wait) |
| wait_lock | raw_spinlock_t | Protects the wait_list; held briefly inside mutex_lock/unlock |
| wait_list | struct list_head | FIFO queue of mutex_waiter structs (sleeping waiters) |
| osq | struct optimistic_spin_queue | MCS queue for optimistic spinners (avoids thundering herd on wakeup) |

rw_semaphore — Reader-Writer Sleeping Lock

struct rw_semaphore allows multiple concurrent readers but exclusive writers; mm->mmap_lock is its best-known user. Key policy: once a writer is waiting, new readers are blocked (writer-preferring) to prevent writer starvation. percpu_rw_semaphore goes further — readers take a per-CPU lock (fast path), while writers force an expensive synchronization across all CPUs (slow path); it backs cgroup threadgroup locking and superblock freeze protection.

Minimal C Demo — Mutex Adaptive Spinning

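A toy adaptive lock showing the three phases — fast-path CAS, bounded optimistic spin, then giving up the CPU. sched_yield() approximates the real slow path, which queues the task on wait_list and calls schedule(). All names and limits are illustrative.

```c
/* Fast / optimistic-spin / sleep phases of a mutex_lock()-style lock.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

#define SPIN_LIMIT 100

static atomic_int owner;   /* 0 = free, else holder's id */
static long counter;

static void adaptive_lock(int me)
{
    int expected = 0;
    /* fast path: one CAS when uncontended */
    if (atomic_compare_exchange_strong(&owner, &expected, me))
        return;

    for (;;) {
        /* medium path: bounded spin, hoping the owner is running */
        for (int spins = 0; spins < SPIN_LIMIT; spins++) {
            expected = 0;
            if (atomic_compare_exchange_weak(&owner, &expected, me))
                return;
        }
        sched_yield();   /* slow path: yield the CPU (~ schedule()) */
    }
}

static void adaptive_unlock(void)
{
    atomic_store_explicit(&owner, 0, memory_order_release);
}

static void *worker(void *arg)
{
    int me = (int)(long)arg;
    for (int i = 0; i < 200000; i++) {
        adaptive_lock(me);
        counter++;
        adaptive_unlock();
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expect 800000)\n", counter);
    return 0;
}
```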

9. §6.9 — Sequence Locks

A seqlock gives readers a lock-free fast path at the cost of possible retries. Writers hold an internal spinlock and increment a sequence counter (even → odd → even). Readers snapshot the counter before and after reading — if it changed, a writer was active and they retry. Used pervasively for time-keeping: jiffies, xtime, and the VDSO clock_gettime() fast path.

seqlock vs RCU vs rwlock

| Primitive | Read cost (no writer) | Read cost (writer active) | Protects pointers? | Writer waits for readers? |
| --- | --- | --- | --- | --- |
| seqlock | Near-zero (no lock) | Retry (cheap) | No — never use for pointers | No — readers retry |
| rwlock | ~10 ns (bus lock on read) | Waits for writer or readers | Yes | Yes (writer waits for readers) |
| RCU | Near-zero (preempt disable) | No wait — reads old version | Yes — designed for this | No — readers never block |

Why no pointers? Between read_seqbegin() and read_seqretry(), a writer can free the object the reader just dereferenced. The reader will detect the retry condition — but only after it has already accessed freed memory. Use RCU when the protected data contains pointers.

Minimal C Demo — seqlock Reader & Writer

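A userspace seqlock sketch guarding two plain values with the invariant a == b. The reader retries on an odd or changed sequence. The mix of plain and atomic accesses mirrors kernel practice rather than strict C11 rules — this is a sketch, not a formally race-free program.

```c
/* Writer: seq -> odd, update, seq -> even. Reader: snapshot, read,
 * re-check; retry if a writer was active. Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint seq;
static long a, b;                    /* invariant: a == b */

static void write_pair(long v)
{
    atomic_fetch_add_explicit(&seq, 1, memory_order_relaxed); /* odd  */
    atomic_thread_fence(memory_order_release);
    a = v;
    b = v;
    atomic_thread_fence(memory_order_release);
    atomic_fetch_add_explicit(&seq, 1, memory_order_relaxed); /* even */
}

static void read_pair(long *x, long *y)
{
    unsigned s1, s2;
    do {
        s1 = atomic_load_explicit(&seq, memory_order_acquire);
        *x = a;
        *y = b;
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&seq, memory_order_relaxed);
    } while ((s1 & 1) || s1 != s2);  /* writer active or raced: retry */
}

static void *reader(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        long x, y;
        read_pair(&x, &y);
        if (x != y) { puts("torn read!"); break; }  /* never happens */
    }
    return NULL;
}

int main(void)
{
    pthread_t r;
    pthread_create(&r, NULL, reader, NULL);
    for (long v = 0; v < 100000; v++)
        write_pair(v);
    pthread_join(r, NULL);
    puts("all reads consistent");
    return 0;
}
```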

10. §6.10 — RCU: Read-Copy-Update

RCU is the most important synchronization primitive introduced in the Linux kernel (2.5.43, 2002). The core insight: allow reads without any synchronization at all, and defer the cost of synchronization to the write side. Readers pay only a preemption disable (no cache line transfer, no atomic). Writers publish a new version atomically and wait for a grace period before freeing the old one.

RCU API — at a Glance

| API | Side | What it does |
| --- | --- | --- |
| rcu_read_lock() | Reader | Opens an RCU read-side critical section. Expands to preempt_disable() on non-preemptible kernels, or a per-task nesting counter increment with CONFIG_PREEMPT_RCU. |
| rcu_read_unlock() | Reader | Closes the critical section; the CPU can again pass through a quiescent state. |
| rcu_dereference(p) | Reader | READ_ONCE + dependency ordering barrier. Safe pointer load from RCU-protected location. |
| rcu_assign_pointer(p, v) | Writer | smp_store_release: ensures new object is fully initialized before pointer is visible. |
| synchronize_rcu() | Writer | Blocks until all CPUs have passed through at least one quiescent state (grace period). |
| call_rcu(head, fn) | Writer | Schedules fn(head) to run after the grace period. Non-blocking — safe in IRQ context. |
| kfree_rcu(ptr, field) | Writer | Helper: queues a call_rcu callback that kfrees the object. Common pattern for simple object reclamation. |

Core Mechanism: Publish-Subscribe + Grace Period

Background: A routing table entry must be updated while thousands of packets per second are looking it up. Taking a write lock would block all readers during the update — unacceptable latency. RCU lets all existing lookups complete against the old entry while new lookups use the new entry.
Plan:
  1. Allocate and initialize the new object
  2. rcu_assign_pointer() — atomically publish; all readers that start after this see the new object
  3. Call synchronize_rcu() or call_rcu() — wait for all readers that still hold the old pointer to finish
  4. Free the old object — now safe because no reader can reach it

Grace Period: How Does the Kernel Know Readers Are Done?

A quiescent state is a point at which a CPU cannot be inside an RCU read-side critical section: a context switch, entry to userspace, or the idle loop. The kernel's RCU subsystem watches for each CPU to pass through at least one quiescent state after the writer called synchronize_rcu(). Once every CPU has done so, the grace period is complete and the old object can be freed.

RCU-Protected Linked Lists

The kernel provides a full set of RCU-safe list operations in include/linux/rculist.h. list_del_rcu() unlinks the node but does not free it — a concurrent reader might still hold a pointer to it. Only after the grace period is it safe to free.

SRCU — Sleepable RCU

Standard RCU forbids sleeping in read-side critical sections. When sleeping is needed (e.g., waiting for user-space to respond), srcu_struct provides a per-domain sleepable variant. Each domain has its own grace-period tracking, so a slow reader only delays reclamation in its own domain. Used in KVM, eBPF, and file system notifications.

Practical Uses in the Linux Kernel

| Subsystem | What is RCU-protected | Write path |
| --- | --- | --- |
| IP routing | fib_table route entries | netlink route add/del with write lock |
| Network namespace | net namespace list traversal | unshare(CLONE_NEWNET) |
| PID lookup | find_task_by_vpid() — task_struct | fork/exit with tasklist_lock write |
| Dentry cache | dcache negative entry lookup | d_add() / d_drop() with d_lock |
| Module loading | module list traversal | module_mutex + RCU publish |

Minimal C Demo — RCU Reader & Writer

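A toy RCU: each reader raises a flag for the duration of its critical section, and the writer publishes a new version, then waits for all flags to drop before freeing the old one. Real RCU never scans readers like this — it infers quiescence from context switches — but the publish / grace period / reclaim lifecycle is the same. Sequentially consistent atomics are used throughout for simplicity.

```c
/* Publish-subscribe with a (toy) grace period.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NREADERS 3

struct cfg { int val; };

static _Atomic(struct cfg *) gp;        /* RCU-protected pointer */
static atomic_int nesting[NREADERS];    /* reader-side flags */
static atomic_int stop;

static void *reader(void *arg)
{
    int id = (int)(long)arg;
    while (!atomic_load(&stop)) {
        atomic_store(&nesting[id], 1);        /* ~ rcu_read_lock()   */
        struct cfg *c = atomic_load(&gp);     /* ~ rcu_dereference() */
        if (c && c->val < 0)
            abort();             /* would mean we saw freed data */
        atomic_store(&nesting[id], 0);        /* ~ rcu_read_unlock() */
    }
    return NULL;
}

static void synchronize_toy_rcu(void)        /* ~ synchronize_rcu() */
{
    for (int i = 0; i < NREADERS; i++)
        while (atomic_load(&nesting[i]))
            ;                                /* wait out old readers */
}

int main(void)
{
    pthread_t t[NREADERS];
    for (long i = 0; i < NREADERS; i++)
        pthread_create(&t[i], NULL, reader, (void *)i);

    for (int v = 1; v <= 1000; v++) {
        struct cfg *neu = malloc(sizeof(*neu));
        neu->val = v;                              /* 1. init      */
        struct cfg *old = atomic_exchange(&gp, neu); /* 2. publish */
        synchronize_toy_rcu();                     /* 3. grace     */
        if (old) { old->val = -1; free(old); }     /* 4. reclaim   */
    }
    atomic_store(&stop, 1);
    for (int i = 0; i < NREADERS; i++)
        pthread_join(t[i], NULL);
    puts("no reader ever saw freed data");
    return 0;
}
```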

Minimal C Demo — RCU-Protected Linked List

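A sketch of list_del_rcu() semantics: the node is unlinked but its next pointer is left intact, so a reader already standing on it can still walk off safely; the free is deferred past a toy grace period (a single reader flag here, standing in for the real quiescent-state machinery).

```c
/* Delete from a list under a concurrent lockless reader.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int val;
    _Atomic(struct node *) next;
};

static _Atomic(struct node *) head;
static atomic_int in_cs, stop;

static void *reader(void *arg)
{
    while (!atomic_load(&stop)) {
        atomic_store(&in_cs, 1);                  /* rcu_read_lock()   */
        for (struct node *n = atomic_load(&head); n;
             n = atomic_load(&n->next))
            if (n->val < 0)
                abort();                          /* freed node = bug  */
        atomic_store(&in_cs, 0);                  /* rcu_read_unlock() */
    }
    return NULL;
}

int main(void)
{
    /* build list: 1 -> 2 -> 3 */
    struct node *n3 = calloc(1, sizeof(*n3)); n3->val = 3;
    struct node *n2 = calloc(1, sizeof(*n2)); n2->val = 2;
    struct node *n1 = calloc(1, sizeof(*n1)); n1->val = 1;
    atomic_store(&n2->next, n3);
    atomic_store(&n1->next, n2);
    atomic_store(&head, n1);

    pthread_t r;
    pthread_create(&r, NULL, reader, NULL);

    /* ~ list_del_rcu(n2): unlink, but do NOT clear n2->next */
    atomic_store(&n1->next, n3);

    while (atomic_load(&in_cs))      /* toy synchronize_rcu() */
        ;
    n2->val = -1;                    /* poison, then reclaim  */
    free(n2);

    atomic_store(&stop, 1);
    pthread_join(r, NULL);
    puts("reader survived a concurrent delete");
    return 0;
}
```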

11. §6.11 — Lock-Free Data Structures

Lock-free data structures use Compare-and-Swap (CAS) loops instead of locks. If a CAS fails (another CPU changed the value), the thread retries — no thread can be blocked forever by the scheduler. Lock-free techniques appear throughout hot paths: RCU-protected lists and the kfifo ring in the kernel, and rte_ring in DPDK userspace.

Treiber Stack — Lock-Free Push & Pop

The simplest lock-free structure. A single pointer top is updated atomically. Both push and pop are CAS loops: read the current top, compute the new value, CAS — retry if another CPU changed top first.

The ABA Problem

CAS checks value equality, not identity. If a pointer value is reused (address A is freed then reallocated back to A), a stale CAS will succeed even though the structure changed underneath. The standard fix is to embed a version counter in the same word as the pointer (double-word CAS, or tagged pointers using low bits). The kernel's kfree_rcu sidesteps ABA entirely by deferring frees until all readers finish.

ABA Rule: Never free a node that may still be referenced by a concurrent CAS loop. Use hazard pointers, epoch-based reclamation, or RCU to delay reclamation until all readers have observed the removal.

Minimal C Demo — Treiber Stack

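A Treiber stack sketch. Nodes popped here are deliberately never freed, which sidesteps the ABA/reclamation problem discussed below — a real implementation needs tagged pointers, hazard pointers, or RCU before it may free.

```c
/* Lock-free push/pop via CAS loops on a single top pointer.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    int val;
    struct node *next;
};

static _Atomic(struct node *) top;

static void push(int val)
{
    struct node *n = malloc(sizeof(*n));
    n->val = val;
    n->next = atomic_load(&top);
    /* retry until no other thread changed top under us */
    while (!atomic_compare_exchange_weak(&top, &n->next, n))
        ;
}

static int pop(int *out)
{
    struct node *n = atomic_load(&top);
    while (n && !atomic_compare_exchange_weak(&top, &n, n->next))
        ;                       /* on failure, n is reloaded */
    if (!n)
        return 0;               /* empty */
    *out = n->val;              /* node leaked on purpose: see ABA */
    return 1;
}

static void *worker(void *arg)
{
    int v;
    for (int i = 0; i < 100000; i++) {
        push(i);
        pop(&v);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    int v, n = 0;
    while (pop(&v))
        n++;
    printf("drained %d leftover nodes\n", n);
    return 0;
}
```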

Minimal C Demo — ABA Problem

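A scripted, single-threaded interleaving that acts out the race: thread 1 snapshots top and next, thread 2 pops both nodes and pushes a recycled node at the same address, and thread 1's CAS then succeeds on a stale view.

```c
/* ABA: CAS compares addresses, not object identity. */
#include <stdatomic.h>
#include <stdio.h>

struct node { int val; struct node *next; };

static _Atomic(struct node *) top;

int main(void)
{
    struct node B = { 2, NULL };
    struct node A = { 1, &B };
    atomic_store(&top, &A);

    /* Thread 1: begins pop — snapshots top and top->next */
    struct node *seen = atomic_load(&top);   /* seen = A */
    struct node *next = seen->next;          /* next = B */

    /* Thread 2 interleaves: pop A, pop B, then reuse A's memory */
    atomic_store(&top, &B);                  /* pop A */
    atomic_store(&top, NULL);                /* pop B */
    A.val = 99; A.next = NULL;               /* "realloc" at same addr */
    atomic_store(&top, &A);                  /* push recycled A */

    /* Thread 1 resumes: same address, so the CAS succeeds anyway */
    if (atomic_compare_exchange_strong(&top, &seen, next))
        printf("CAS succeeded; top now = %p (stale B — ABA!)\n",
               (void *)atomic_load(&top));
    /* A versioned/tagged pointer or deferred reclamation (RCU)
     * would have made this CAS fail. */
    return 0;
}
```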

Michael-Scott Queue (MPMC Lock-Free)

A lock-free FIFO queue with a head (consumer side) and a tail (producer side), each CAS-ed independently. Enqueue: CAS the tail node's next pointer to the new node, then swing tail forward with a second CAS. Dequeue: CAS head forward. The design appears in Java's ConcurrentLinkedQueue; DPDK's rte_ring delivers similar MPMC FIFO semantics with an array-and-index design instead (see table below).

| Property | Treiber Stack | MS Queue | rte_ring (DPDK) |
| --- | --- | --- | --- |
| Ordering | LIFO | FIFO | FIFO |
| Producers | MPSC or MPMC | MPMC | SPSC / MPMC modes |
| Consumers | MPMC | MPMC | SPSC / MPMC modes |
| ABA risk | Yes — use tagged ptr | Yes — use tagged ptr | Index-based (no ptr) — immune |
| Memory reclamation | Hazard pointers / RCU | Hazard pointers / RCU | N/A — ring slots pre-allocated |
| Kernel usage | Lock-free lists (rculist) | Rare — RCU preferred | DPDK data plane |

12. §6.12 — Per-CPU Data

Per-CPU variables give each CPU its own private copy of a variable. Accesses require no synchronization at all — no lock, no atomic, no cache-line bounce. The kernel uses them for run-queues, statistics counters, softirq work queues, and slab allocator magazines.

API at a Glance

| Macro / Function | What it does | Preemption safe? |
| --- | --- | --- |
| DEFINE_PER_CPU(type, name) | Declare a per-CPU variable in the per-CPU section | — |
| DEFINE_PER_CPU_ALIGNED(type, name) | Same, but aligned to cache line — prevents false sharing | — |
| get_cpu_var(name) | Disable preemption + return reference to this CPU's copy | Yes — must pair with put_cpu_var |
| put_cpu_var(name) | Re-enable preemption after get_cpu_var | Yes |
| this_cpu_read(name) | Read own CPU's copy as a single preemption-safe operation | Yes (single op) |
| this_cpu_inc(name) | Increment own copy as a single preemption-safe operation; the raw __this_cpu_inc requires preemption already disabled | Yes (single op) |
| per_cpu(name, cpu) | Access another CPU's copy — needs lock or irqsave | No — use carefully |
| alloc_percpu(type) | Dynamic per-CPU allocation (returns __percpu pointer) | — |
| free_percpu(ptr) | Free dynamic per-CPU allocation | — |

Why Per-CPU Beats atomic_t at Scale

Every atomic_inc() generates a bus-locked read-modify-write (a LOCK-prefixed instruction on x86). With 64 CPUs all hitting the same counter, the cache line bounces between them — throughput degrades as O(N). With per-CPU: each CPU writes its own cache line, zero coherence traffic — throughput scales as O(1). The read cost is O(NR_CPUS) (summing all slots), but reads are rare for statistics counters.

Minimal C Demo — Per-CPU Counter Simulation

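A benchmark sketch contrasting one shared atomic counter with per-thread padded slots; on a multicore machine the shared version is expected to be several times slower at equal work. Thread count, iteration count, and timing method are arbitrary choices.

```c
/* Shared atomic_t analogue vs per-thread (per-"CPU") slots.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define ITERS    5000000L

struct slot { atomic_long v; char pad[64 - sizeof(atomic_long)]; };

static atomic_long shared_ctr;
static _Alignas(64) struct slot slots[NTHREADS];

static void *shared_worker(void *arg)
{
    for (long i = 0; i < ITERS; i++)   /* every inc bounces the line */
        atomic_fetch_add_explicit(&shared_ctr, 1, memory_order_relaxed);
    return NULL;
}

static void *percpu_worker(void *arg)
{
    atomic_long *mine = &slots[(long)arg].v;
    for (long i = 0; i < ITERS; i++)   /* private line: no bouncing */
        atomic_fetch_add_explicit(mine, 1, memory_order_relaxed);
    return NULL;
}

static double run(void *(*fn)(void *))
{
    pthread_t t[NTHREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fn, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    printf("shared atomic: %.2fs\n", run(shared_worker));
    printf("per-thread   : %.2fs\n", run(percpu_worker));
    return 0;
}
```
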
Rule: Use get_cpu_var / put_cpu_var — they disable preemption, preventing the scheduler from migrating you to another CPU between the get and the access. The raw __this_cpu_* variants assume preemption is already disabled — the CPU can change between computing the per-CPU address and the access; the this_cpu_* forms are single preemption-safe operations.

13. §6.13 — Synchronization Design Patterns

Choosing the wrong primitive costs performance or correctness. The patterns below encode the kernel community's conventional wisdom: start with per-CPU data, escalate to RCU for read-heavy sharing, and only reach for a sleeping lock when the critical section is long or must block.

Producer-Consumer: Ring Buffer Pattern

The most common real-world pattern in network drivers and audio subsystems.

| Variant | Sync needed | Kernel example |
| --- | --- | --- |
| SPSC (1 producer, 1 consumer) | smp_wmb / smp_rmb only — no lock | pipe, kfifo (SPSC mode) |
| MPSC (N producers, 1 consumer) | Atomic CAS on head index | io_uring submission ring |
| MPMC (N producers, N consumers) | qspinlock or lock-free (MS queue) | rte_ring DPDK |

SPSC Ring — Barrier-Only Synchronization

Background: Two threads share a ring buffer. The producer writes data then advances head; the consumer reads data then advances tail. No lock needed because only one thread touches each index.

Plan:
  1. Producer: write data into buf[head % N]
  2. Producer: smp_wmb() — ensure data visible before index
  3. Producer: WRITE_ONCE(head, head + 1)
  4. Consumer: h = READ_ONCE(head)
  5. Consumer: smp_rmb() — ensure index seen before data
  6. Consumer: read buf[tail % N]

Why barriers: Without smp_wmb() the CPU or compiler may publish the index before the data — the consumer would see the new index and read stale data.
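A sketch of the six-step plan above, using C11 release/acquire on the indices in place of smp_wmb()/smp_rmb(). The ring size must be a power of two for the unsigned index arithmetic to wrap correctly; all names are illustrative.

```c
/* SPSC ring: producer owns head, consumer owns tail, no lock.
 * Build: cc -O2 -pthread demo.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 256                        /* power of two */

static int buf[N];
static atomic_uint head, tail;       /* producer / consumer indices */

static void *producer(void *arg)
{
    for (unsigned v = 0; v < 1000000; ) {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
        if (h - t == N)
            continue;                             /* full */
        buf[h % N] = v++;                         /* 1. write data      */
        atomic_store_explicit(&head, h + 1,       /* 2+3. publish index */
                              memory_order_release);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    long sum = 0;
    for (unsigned n = 0; n < 1000000; n++) {
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        unsigned h;
        do {                                      /* 4. wait for data */
            h = atomic_load_explicit(&head, memory_order_acquire);
        } while (h == t);
        sum += buf[t % N];                        /* 5+6. read after index */
        atomic_store_explicit(&tail, t + 1, memory_order_release);
    }
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```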

Hazard Pointers — Pointer-Safe Memory Reclamation

An alternative to RCU for lock-free structures. Each thread announces the pointer it is currently accessing in a public hazard pointer slot. The reclamer scans all hazard pointer slots before freeing — if any slot holds the address, it defers the free. Advantages over RCU: no grace period concept, works without preemption semantics. Disadvantages: O(N) scan per free, hazard pointers must be carefully ordered with the access (store-load barrier).

Common Anti-Patterns to Avoid

| Anti-pattern | Problem | Fix |
| --- | --- | --- |
| Lock in IRQ handler for data shared with process context | If process context holds the lock, IRQ handler deadlocks on same CPU | Use spin_lock_irqsave in process context |
| mutex_lock inside spinlock critical section | mutex_lock may sleep → BUG: scheduling while atomic | Use spinlock for both, or restructure to avoid nesting |
| Bare pointer read under RCU without rcu_dereference() | Compiler may split the load; Alpha CPUs may speculate through pointer | Always use rcu_dereference() inside rcu_read_lock() |
| WRITE_ONCE missing on published flag | Compiler may cache flag in register; reader never sees update | Pair WRITE_ONCE (publisher) with READ_ONCE (subscriber) |
| Taking two locks in different orders in different code paths | Classic AB-BA deadlock; lockdep catches it at runtime | Document and enforce global lock ordering |

14. §6.14 — Debugging Concurrency Issues

Concurrency bugs (data races, deadlocks, priority inversion, false sharing) are the hardest class of kernel bugs to reproduce. The kernel ships with four major dynamic analysis tools — enable them during development, not production.

Tool Overview

| Tool | Detects | Config | Overhead |
| --- | --- | --- | --- |
| lockdep | Lock order cycles, missing unlock, irq-safe violations | CONFIG_LOCKDEP | ~10% CPU, ~1 MB memory |
| KCSAN | Data races (concurrent conflicting accesses) | CONFIG_KCSAN | ~10× slowdown — dev only |
| KASAN | Use-after-free, out-of-bounds (also useful for UAF in lock-free code) | CONFIG_KASAN | ~2× slowdown |
| perf c2c | False sharing (cache-to-cache transfers on shared lines) | perf c2c record/report | ~5% with sampling |
| ThreadSanitizer (user) | Data races in user-space lock-free code | gcc/clang -fsanitize=thread | ~5–15× slowdown |

lockdep — Lock Dependency Graph

lockdep builds a directed graph of lock acquisitions at runtime. An edge A → B means "lock A was held when lock B was acquired." If a cycle is ever detected (A → B and B → A from different code paths), lockdep fires WARN_ON with a full stack trace before the deadlock actually occurs — it detects the potential for deadlock, not just when it happens.

KCSAN — Kernel Concurrency Sanitizer

KCSAN instruments every load and store at compile time. At runtime it records concurrent accesses to the same address and reports a data race if at least one is a write. It uses a watchpoint-based approach — each CPU can hold a small number of watchpoints at a time, providing statistical coverage without full shadow memory.

KCSAN annotations:
  • data_race(x) — suppress a known benign race (e.g., statistical sample)
  • READ_ONCE(x) / WRITE_ONCE(x) — suppress races that are intentionally race-free at the C level
  • kcsan_nestable_atomic_begin/end() — mark atomic regions KCSAN should not instrument

perf c2c — False Sharing Detection

perf c2c record -- ./workload captures cache-to-cache transfer events (HITM — Hit In Modified state from another core). The report shows which cache lines are hot and which source lines produced the conflicting accesses. Typical fix: pad the hot variables to separate cache lines with __cacheline_aligned_in_smp or restructure structs so writer and reader fields do not share a line.

Minimal C Demo — lockdep Cycle Detection

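A toy, single-threaded model of lockdep's idea: record an edge held → acquired on every acquisition and search for a path in the opposite direction, reporting the potential cycle before any deadlock happens. The lock IDs and DFS are simplifications of the real class-based graph.

```c
/* Simulated lock-order cycle detection. */
#include <stdio.h>

#define NLOCKS 3

static int edge[NLOCKS][NLOCKS];     /* edge[a][b]: b taken under a */
static int held[NLOCKS], top = -1;

static int find_path(int from, int to)
{
    static int seen[NLOCKS];
    if (from == to)
        return 1;
    if (seen[from])
        return 0;
    seen[from] = 1;
    int hit = 0;
    for (int b = 0; b < NLOCKS && !hit; b++)
        if (edge[from][b])
            hit = find_path(b, to);
    seen[from] = 0;
    return hit;
}

static void toy_acquire(int l)
{
    for (int i = 0; i <= top; i++) {
        if (find_path(l, held[i]))   /* would this close a cycle? */
            printf("lockdep: potential deadlock! %d -> %d path exists\n",
                   l, held[i]);
        edge[held[i]][l] = 1;        /* record held -> l */
    }
    held[++top] = l;
}

static void toy_release(void) { top--; }

int main(void)
{
    toy_acquire(0); toy_acquire(1);  /* path 1: 0 then 1          */
    toy_release(); toy_release();
    toy_acquire(1); toy_acquire(0);  /* path 2: 1 then 0 — fires  */
    toy_release(); toy_release();
    return 0;
}
```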

Debugging Workflow

Step 1 — Enable lockdep (CONFIG_LOCKDEP) and run your workload. All deadlock-order violations are caught here without reproducing the race.

Step 2 — Enable KCSAN (CONFIG_KCSAN). Run with multiple CPUs and high load. KCSAN statistically samples — run for minutes, not seconds.

Step 3 — perf c2c if throughput regressed. Profile, identify hot shared lines, add padding.

Step 4 — KASAN if UAF suspected in lock-free code (RCU callback frees before last reader, ABA pointer recycle).

15. Kernel Source Pointers

| File | Symbol / Macro | What it does |
| --- | --- | --- |
| include/linux/compiler.h | barrier() | Compiler-only fence — stops compiler reordering; no CPU instruction |
| include/linux/compiler.h | READ_ONCE(), WRITE_ONCE() | Force single volatile-width load/store; prevent compiler optimizations |
| include/asm-generic/barrier.h | smp_mb(), smp_rmb(), smp_wmb() | Full / read / write memory barriers; arch-specific implementations |
| arch/x86/include/asm/barrier.h | smp_mb() → mfence | x86 TSO: smp_rmb/wmb are compiler barriers only (no fence instruction) |
| arch/arm64/include/asm/barrier.h | smp_mb() → dmb ish | ARM64 full inner-shareable domain barrier |
| include/linux/atomic.h | smp_load_acquire(), smp_store_release() | One-directional barriers for publish-subscribe patterns |
| tools/memory-model/ | LKMM litmus tests | Formal memory model; run with herd7 to verify ordering assumptions |
| include/linux/atomic.h | atomic_t / atomic_inc / atomic_cmpxchg / atomic_dec_and_test | Core atomic API; arch-specific impls in arch/*/include/asm/atomic.h |
| include/linux/refcount.h | refcount_t / refcount_inc / refcount_dec_and_test | Saturating refcount; CONFIG_REFCOUNT_FULL adds overflow detection |
| include/linux/kref.h | kref / kref_init / kref_get / kref_put | Reference-counted object lifecycle; wraps refcount_t internally |
| lib/percpu_counter.c | percpu_counter_add / percpu_counter_sum | Per-CPU batched counter; batch size = percpu_counter_batch |
| include/linux/spinlock.h | spin_lock / spin_unlock / spin_lock_irqsave | Spinlock API; arch-specific ticket/queued spinlock implementations |
| include/linux/mutex.h | mutex_lock / mutex_unlock / mutex_trylock | Sleeping mutex; waiters queue on wait_list and sleep; not usable in IRQ context |
| kernel/locking/lockdep.c | lockdep — lock dependency tracker | Detects lock order cycles and missing unlock at runtime (CONFIG_LOCKDEP) |
| kernel/locking/rtmutex.c | rt_mutex — priority inheritance mutex | Prevents priority inversion; used by PREEMPT_RT and futex PI |
| kernel/locking/qspinlock.c | queued_spin_lock() — fast / medium / slow paths | Fast: single CAS; medium: pending bit; slow: MCS queue. Default spinlock since kernel 4.2 |
| kernel/locking/mcs_spinlock.h | mcs_spinlock / mcs_lock / mcs_unlock | Per-CPU queue node; each waiter spins on own locked field — O(1) bus traffic per unlock |
| include/linux/spinlock.h | spin_lock_irqsave / spin_lock_bh / spin_trylock | irqsave: disables local IRQs + acquires lock; bh: disables softirqs + lock |
| kernel/locking/mutex.c | mutex_lock / mutex_lock_interruptible / mutex_trylock | Adaptive spinning: optimistic spin if owner on CPU; sleep if owner blocked; osq queues the spinners |
| include/linux/mutex.h | struct mutex — owner, wait_lock, wait_list, osq | owner field stores task pointer + flag bits (HANDOFF, PICKUP, WAITERS) |
| include/linux/seqlock.h | seqlock_t / read_seqbegin / read_seqretry / write_seqlock | Seq counter: even=stable, odd=write active; seqcount_t is the lockless variant |
| include/linux/rcupdate.h | rcu_read_lock / rcu_read_unlock / rcu_dereference / rcu_assign_pointer | Core RCU API; read side never acquires a lock or writes shared state |
| include/linux/rcupdate.h | synchronize_rcu() / call_rcu() / kfree_rcu() | synchronize_rcu: blocking grace period; call_rcu: async callback; kfree_rcu: call_rcu + kfree |
| kernel/rcu/tree.c | RCU tree hierarchy — rcu_node / rcu_data | Organizes CPUs into a fanout tree; grace period completes when all leaf nodes report quiescent |
| include/linux/rculist.h | list_add_rcu / list_del_rcu / list_for_each_entry_rcu | RCU-safe list ops; del unlinks but does not free — free only after synchronize_rcu / call_rcu |
| include/linux/srcu.h | srcu_struct / srcu_read_lock / synchronize_srcu | Sleepable RCU: allows sleeping in read-side critical section; per-domain grace period tracking |
| include/linux/percpu.h | DEFINE_PER_CPU / get_cpu_var / put_cpu_var / this_cpu_inc | Per-CPU variable API; get_cpu_var disables preemption; this_cpu_* ops are single preemption-safe operations |
| include/linux/percpu.h | alloc_percpu() / free_percpu() | Dynamic per-CPU allocation; returns __percpu-annotated pointer; use per_cpu_ptr() to dereference |
| lib/percpu-refcount.c | percpu_ref — per-CPU reference count | Scalable refcount: writers use per-CPU slots; percpu_ref_kill() switches to atomic for teardown |
| include/linux/kfifo.h | kfifo — kernel lock-free SPSC ring buffer | Single-producer single-consumer ring; uses only memory barriers (no lock) for synchronization |
| kernel/locking/lockdep.c | lock_acquire / lock_release / lockdep_init_map | lockdep core: records lock class graph, detects cycles. CONFIG_LOCKDEP enables at boot |
| kernel/kcsan/ | kcsan_setup_watchpoint / kcsan_check_access | KCSAN runtime: instruments each access; CONFIG_KCSAN; data_race() macro suppresses benign races |
| tools/perf/builtin-c2c.c | perf c2c — cache-to-cache false sharing profiler | Reports HITM counts per cache line; identifies false sharing between CPUs |

16. Interview Prep

Q1: What is false sharing and how do you detect and fix it?

False sharing occurs when two CPUs write to different variables that share the same 64-byte cache line. Every write causes an MESI invalidation, forcing the other CPU to refetch the line. Detection: perf c2c shows cache-to-cache transfers. Fix: pad each hot variable to a full cache line, use __cacheline_aligned_in_smp, or use per-CPU data (DEFINE_PER_CPU).

Q2: Walk through the MESI states. What transition causes a write-back to DRAM?

Four states: Modified (dirty, exclusive), Exclusive (clean, exclusive), Shared (clean, multiple copies), Invalid. A write-back occurs when a Modified line transitions to Shared (another CPU requests a read — the modified CPU writes back and both go to Shared) or to Invalid (another CPU requests a write — the modified CPU writes back then gives up the line).

Q3: What does smp_mb() do and when must you use it vs smp_wmb()?

smp_mb() is a full barrier — no load or store on either side may cross it. smp_wmb() orders only stores. Use smp_wmb() when a producer writes data then a flag, paired with smp_rmb() on the consumer. Use smp_mb() when you need to prevent both load and store reordering — e.g., in lock implementations that must flush the store buffer and drain the invalidation queue.

Q4: Why do smp_wmb() and smp_rmb() compile to nothing on x86 (yet still work)?

x86 has TSO (Total Store Order): stores are observed in program order by other CPUs — no store-store reordering is visible. smp_wmb() still contains a compiler barrier (asm volatile), preventing the compiler from reordering. The CPU guarantee handles the hardware side. On ARM/POWER these become real fence instructions (dmb ishst / dmb ishld).

Q5: What is the difference between READ_ONCE() and smp_rmb()?

READ_ONCE() prevents the compiler from caching a variable in a register, tearing a read, or speculating it — it forces a single volatile-width load. smp_rmb() is a CPU-level read barrier that prevents the CPU's out-of-order engine from completing loads below the barrier before loads above it. READ_ONCE is about the compiler; smp_rmb is about the hardware. Both are usually needed together in a publish-subscribe pattern.

Q6: When should you use memory_order_seq_cst vs memory_order_acquire/release?

Use acquire/release (one-directional barriers) for most publish-subscribe synchronization — they are cheaper (no full fence on x86, just ldar/stlr on ARM). Use seq_cst only when you need a total global order across multiple atomic variables — e.g., Dekker-style mutual exclusion where both sides write and both sides read. seq_cst is significantly more expensive on ARM and POWER.

Q7: What does cmpxchg() return, and how do you use it in a CAS loop?

cmpxchg(&v, old, new) returns the value that was in v at the time of the operation. If the returned value equals old, the swap succeeded. If not, another CPU changed v before us — we reload, recompute, and retry. The pattern is: old = READ_ONCE(v); new = f(old); while (cmpxchg(&v, old, new) != old) { old = READ_ONCE(v); new = f(old); }. This is the basis for all lock-free data structures.

Q8: Why does the kernel use refcount_t instead of atomic_t for reference counts?

refcount_t adds saturation semantics: an increment on a zero-count object fires a warning (kref_get on a dead object is a use-after-free bug), and on overflow the counter saturates — it sticks at a sentinel value and warns rather than wrapping — preventing an attacker from wrapping a counter to manufacture a fake reference. With CONFIG_REFCOUNT_FULL, every operation is checked. atomic_t has none of these protections.

Q9: When would you use a per-CPU counter instead of atomic_t? What are the trade-offs?

Use per-CPU counters when the write rate is high and readers can tolerate approximate values (filesystem usage, network statistics). Each CPU increments its local slot — zero cross-CPU coherence traffic — so throughput scales linearly with core count. Trade-offs: reads must sum all per-CPU slots (O(NR_CPUS) cost) and the total may be off by up to NR_CPUS × batch. Not suitable for reference counts or anything requiring exact, immediately consistent values.

Q10: How does the kernel prevent deadlocks in code that acquires multiple locks?

Lock ordering: every code path that needs locks A and B must always acquire them in the same global order (e.g., by address, by inode number, or by documented hierarchy). The lockdep subsystem tracks the directed graph of 'lock A was held when lock B was acquired' at runtime and fires a warning with stack traces if a cycle is detected. Alternative: trylock with release-on-fail — attempt to take the second lock; on failure release the first and retry with a backoff.

Q11: What is priority inversion and how does the kernel solve it?

Priority inversion: a high-priority task blocks on a mutex held by a low-priority task; a medium-priority task then preempts the low-priority holder, indirectly blocking the high task for an unbounded time. Solution: priority inheritance (rt_mutex). While a task holds an rt_mutex, its scheduling priority is temporarily boosted to the maximum priority of any waiter. This prevents medium-priority tasks from cutting in. Configured with PTHREAD_PRIO_INHERIT in userspace; built into rt_mutex in the kernel.

Q12: Walk through qspinlock fast, medium, and slow paths. Why does the slow path need MCS?

Fast path: a single CAS on the 32-bit lock word acquires the lock when uncontended. Medium path: if the lock is held and no one is waiting, the second waiter sets a pending bit and spins reading the locked byte — no node allocation, no queue. Slow path: the third and subsequent waiters join an MCS queue, each spinning on their own node's locked field. MCS is needed because without it, all waiters would spin on the same cache line, generating O(N) invalidations on every unlock — a bus storm that degrades with core count.

Q13: Why must you use spin_lock_irqsave() when the lock is also acquired in an IRQ handler?

If you call spin_lock() in process context and an IRQ fires on the same CPU while you hold the lock, the IRQ handler will try to acquire the same lock — deadlock, because the interrupted code cannot release it while the handler is running. spin_lock_irqsave() disables local IRQs before acquiring, preventing the IRQ from firing until unlock. The irqsave variant saves the previous IRQ flags so nested calls restore the correct state.

Q14: Why can you never sleep inside a spinlock critical section?

A spinlock disables preemption on the current CPU. If the holder calls schedule() (directly or via a blocking call like kmalloc(GFP_KERNEL), copy_from_user(), mutex_lock()), the scheduler will try to context-switch away — but it cannot, because preemption is disabled. On PREEMPT_RT kernels this causes a BUG; on non-RT kernels it causes a lockup. Additionally, sleeping on a spinlock-held CPU blocks any other spinlock waiter on that CPU forever.

Q15: What does mutex_lock() do when it finds the lock held but the owner is running on another CPU?

It enters the optimistic spinning phase. The waiter spins in a tight loop (cpu_relax()), re-checking the lock owner. If the owner releases the lock during this spin, the waiter acquires without ever going to sleep — saving the ~5–10 µs cost of a context switch. The waiter gives up spinning if: (1) the owner leaves the CPU (blocks or gets preempted), (2) a higher-priority task becomes runnable, or (3) an internal spin limit is hit. At that point it adds itself to the mutex wait_list and calls schedule().

Q16: What is seqlock? When should you use it instead of rwlock?

A seqlock uses a sequence counter (even = stable, odd = write in progress). Writers increment the counter at the start and end of the update; readers snapshot it before and after — if the snapshots differ or the first was odd, they retry. Use seqlock when: (1) reads are far more frequent than writes (e.g., jiffies, clock_gettime), (2) the protected data is small plain numeric values, and (3) retrying is cheap. Prefer it to rwlock for such read-mostly data: rwlock performs a bus-locked atomic on every read acquisition, while seqlock readers pay almost nothing. But seqlock cannot protect pointers — a reader may dereference a freed pointer between seqbegin and seqretry.

Q17: Explain rcu_read_lock(). What does it actually do at the hardware level?

On kernels without CONFIG_PREEMPT_RCU, rcu_read_lock() expands to preempt_disable() — it increments the per-CPU preempt_count field. No lock is acquired, no atomic operation is performed, no cache line is transferred. This is why the read side is essentially free. With CONFIG_PREEMPT_RCU (used by PREEMPT and PREEMPT_RT kernels), rcu_read_lock() merely increments a per-task nesting counter, and readers may even be preempted — the grace-period machinery tracks preempted readers explicitly. In the non-preemptible case, disabling preemption works because context switches are the quiescent states RCU detects — the CPU cannot report a quiescent state while a reader is active.

Q18: What is a grace period in RCU? Name three examples of quiescent states.

A grace period is the time after a writer calls synchronize_rcu() until every CPU has passed through at least one quiescent state. A quiescent state is a point at which a CPU is guaranteed to not be inside any RCU read-side critical section. Three examples: (1) a context switch — schedule() runs, so preempt_count must be 0; (2) entry to userspace — the kernel is not running any RCU-protected code; (3) the idle loop — the CPU is doing nothing. Once every CPU has passed a quiescent state, any reader that was running before the writer's rcu_assign_pointer() must have finished, making it safe to free the old object.

Q19: Compare synchronize_rcu() and call_rcu(). When must you use call_rcu()?

synchronize_rcu() blocks the caller until the grace period ends — it may sleep for milliseconds. Use it when you can afford to block (process context, no locks held). call_rcu(head, fn) registers a callback fn to be called after the grace period and returns immediately. Use it when you cannot block: inside an IRQ handler, a softirq, a BH, or while holding a spinlock. call_rcu is also more efficient when freeing many objects: callbacks from multiple writers are batched into a single grace period. kfree_rcu(ptr, field) is a convenience wrapper that calls call_rcu then kfree, avoiding boilerplate callback functions.

Q20: Why does rcu_dereference() exist — can you just dereference the pointer directly?

No. rcu_dereference() expands to READ_ONCE(p) plus a dependency ordering barrier. READ_ONCE prevents the compiler from re-reading the pointer from memory twice (TOCTOU) or from caching it in a register across the read-side critical section. The dependency barrier ensures that on architectures with weak memory ordering (Alpha, theoretically), the CPU does not prefetch data through the pointer before the pointer load is observed. On x86 (TSO) the hardware part is a no-op, but the compiler barrier is always needed. Without rcu_dereference(), a compiler optimization or a weak-arch CPU can produce use-after-free bugs that are impossible to reproduce on x86.

Q21: What is the ABA problem? How does the Linux kernel avoid it?

ABA: a CAS loop reads pointer A, gets preempted. Another thread pops A, frees it, allocates a new object that gets the same address A, then pushes it. When the original thread resumes, CAS(&top, A, next) succeeds — it sees the same address — but 'next' is now a dangling pointer. The kernel avoids ABA in lock-free lists primarily through RCU: kfree_rcu and call_rcu defer the free until all concurrent readers have finished, so the old address cannot be reallocated and reused while any CAS loop is still running. DPDK's rte_ring avoids ABA entirely by using integer indices instead of raw pointers — indices can wrap but the ring size prevents a false match.

Q22: Why does a Treiber stack's push loop retry if CAS fails? Is it lock-free or wait-free?

CAS fails when another thread changed 'top' between our read and our compare. We retry with the new top value — our thread makes progress because at least one thread (the one that beat our CAS) successfully pushed. This is lock-free: the system as a whole always makes progress, no thread can be blocked indefinitely by the scheduler. It is not wait-free: a single thread can theoretically retry infinitely if it keeps losing races. Wait-free algorithms guarantee each thread completes in a bounded number of steps — they require more complex constructions (fetch-and-increment, combining trees).

Q23: When should you use per-CPU data instead of atomic_t? What are the correctness rules?

Use per-CPU when: write rate is high and the value is per-thread statistics (counters, allocator magazines, run-queue state). Correctness rules: (1) always use get_cpu_var/put_cpu_var or preempt_disable/enable around multi-step accesses — if the task is migrated between the read and the write, you corrupt two CPUs' data. (2) Never use the raw __this_cpu_* forms in preemptible context; the this_cpu_* forms are single preemption-safe operations. (3) Reading another CPU's slot (per_cpu(var, cpu)) requires that CPU to not be updating it simultaneously — use spin_lock or memory barriers. (4) Totals computed by summing all slots may be transiently negative or larger than the true value — never use per-CPU for reference counts.

Q24: What does lockdep detect and how? Give an example of a false positive.

lockdep tracks 'lock A was held when lock B was acquired' as directed edges in a graph. If it detects a cycle (A→B and B→A from different code paths), it fires WARN_ON before the deadlock occurs. False positive example: two different inode mutexes with the same lockdep class (same static lock_class_key). Code path 1 locks inode-A then inode-B; code path 2 locks inode-B then inode-A. lockdep sees A→B and B→A and reports a cycle — but in practice both A and B are always acquired in address order (lock_two_nondirectories), so no real deadlock. Fix: use lockdep_set_subclass() or nest_lock() to annotate the ordering guarantee.

Q25: How does KCSAN detect data races? What is a benign race and how do you suppress it?

KCSAN instruments every load/store with __tsan_read/__tsan_write calls. Each CPU registers a watchpoint (address, type) for a short window. If another CPU concurrently accesses the same address and at least one access is a write, KCSAN reports a data race with stack traces from both sides. A benign race is a concurrent read of a value that is not protected by a lock but where the torn or stale read is intentional and harmless — e.g., reading a statistics counter without holding the update lock. Suppress with data_race(x) macro, which tells KCSAN 'this concurrent access is intentional.' For single-load/single-store races that are intentionally unsynchronized, use READ_ONCE/WRITE_ONCE instead — they silence KCSAN and also prevent compiler optimizations.

Q26: How do you detect false sharing with perf c2c? What does HITM mean?

Run: perf c2c record -- ./workload; perf c2c report. HITM = Hit In Modified — a load from CPU A that hits a cache line in the Modified state on CPU B, forcing a cache-to-cache transfer. This is the signature of false sharing or true sharing under write-write contention. The report shows per-cache-line HITM counts and the source code lines responsible for the reader and writer accesses. Fix: pad the struct so the hot writer and hot reader fields land on separate 64-byte cache lines, using __cacheline_aligned_in_smp or ____cacheline_aligned between fields.