Part III — Process

§ 3.1 – 3.7 Process Management — Complete

From task_struct and fork() / exec() / Copy-on-Write, to the CFS red-black tree, kernel preemption, IRQ bottom halves, NAPI networking, timer wheel, hrtimer, and the core kernel data structures: linked list, hlist, rb-tree, and radix/XArray.

1. Overview

Every running entity — process or thread — is described by a struct task_struct in the kernel. It is the central hub that links together the process's address space, open files, signal handlers, and scheduling state. The CFS scheduler maintains one per-CPU cfs_rq red-black tree sorted by accumulated virtual runtime (vruntime); the task with the smallest vruntime is always next to run. When a task calls fork(), the kernel copies the task descriptor and marks all writable pages as read-only — physical pages are shared until one side writes, at which point a private copy is made (Copy-on-Write).

2. Key Data Structure — struct task_struct

task_struct lives in kernel memory (not on any stack) and stays alive until the process is reaped by its parent. On x86-64 it is roughly 6 KB. The kernel accesses the current task via the current macro, which reads a per-CPU pointer stored in a segment register (%gs) — O(1) with no locking.

| Field | Type | Purpose |
|---|---|---|
| state / __state | unsigned int | Current scheduling state — TASK_RUNNING (on a run queue or running), TASK_INTERRUPTIBLE (sleeping, can be woken by a signal), TASK_UNINTERRUPTIBLE (waiting for kernel I/O — cannot be killed), TASK_ZOMBIE (exited, not yet reaped) |
| pid | pid_t | Unique thread ID — what /proc shows as Pid. Each clone() call gets a fresh pid, even within a thread group. |
| tgid | pid_t | Thread group ID — equals the pid of the thread group leader (the main thread). getpid() returns tgid, so all threads in a process see the same PID. |
| mm | struct mm_struct * | Address space — page tables, VMA list, brk pointer. NULL for kernel threads (they borrow active_mm from the last user task that ran on the CPU). |
| files | struct files_struct * | Open file descriptor table. Shared between threads via CLONE_FILES; duplicated on fork without CLONE_FILES. |
| se (sched_entity) | struct sched_entity | Embedded CFS scheduling state: vruntime, load weight, runnable load average. Contains the rb-tree node that is inserted into the cfs_rq tree, keyed by vruntime. |
| prio / static_prio | int | Effective priority (0–139). RT tasks use 0–99; CFS tasks use 100–139 (mapped from nice −20..+19). Lower number = higher priority. |
| thread_info | struct thread_info | Small struct at the bottom of the kernel stack. Contains the TIF_NEED_RESCHED flag, checked on every kernel-to-user return and after every interrupt. |

3. Threads vs Processes — pid vs tgid

Linux has no separate "thread" kernel object — a thread is just a task_struct that shares its parent's mm_struct, files_struct, and signal_struct. The flags passed to clone() (which pthread_create uses) control exactly what is shared vs copied.

Why does getpid() return tgid? POSIX defines PID as the process identifier shared by all threads. The kernel exposes tgid via getpid() and the per-thread ID via gettid(). In /proc, /proc/<tgid>/task/<tid>/ shows per-thread entries.

4. Process Lifecycle

A Linux process moves through well-defined states from creation to reaping. The state machine determines whether the task is on a run queue, sleeping on a wait queue, or lingering as a zombie — holding only its task descriptor — while awaiting the parent's acknowledgement.

5. Core Mechanism — fork() → exec() → exit()

Background: A shell needs to run ls. It must create a new process, replace that process's image with the ls binary, let it run to completion, and collect its exit status — all without corrupting the shell's own address space.
Plan:
  1. fork() — duplicate task_struct; COW-share pages
  2. Child calls exec() — kernel replaces mm with the ELF image
  3. Child runs, then calls exit() — kernel frees resources, leaves ZOMBIE entry
  4. Parent calls wait4() — kernel copies exit code, frees task_struct

Walkthrough — COW page fault

| Step | Actor | State change |
|---|---|---|
| 1 | fork() completes | Child inherits the parent's page table. Physical page P at VA 0x4000 has refcnt=2. Both parent and child PTEs point to P with PTE.W=0. |
| 2 | Child writes to VA 0x4000 | CPU sees PTE.W=0 → raises a protection fault (#PF). Kernel fault handler called. |
| 3 | do_cow_fault() | Kernel allocates a new physical page P' from the buddy allocator. copy_user_highpage() copies 4 KB from P to P'. |
| 4 | PTE update | Child's PTE updated to P' with PTE.W=1. P's refcnt drops to 1, so the parent's PTE is re-armed writable (no future fault needed). |
| 5 | Child resumes | The child's store completes to P'. The parent still reads from P — and sees unmodified data. |

6. Minimal C Demo — task_struct Simulation

Simulates four tasks (init, bash, two worker threads sharing a tgid) and prints their scheduling state. The last line shows which task CFS would pick next — the one with the smallest vruntime.

task_struct — process table simulation — C Demo

7. Minimal C Demo — fork / exec / exit / wait4 Lifecycle

Simulates the shell's fork-exec-wait cycle entirely in user-space C. Prints each transition so you can trace the state machine without reading kernel source.

fork → exec → exit → wait4 lifecycle — C Demo

8. CFS — Completely Fair Scheduler

CFS replaced the O(1) scheduler in Linux 2.6.23. Its core idea: instead of fixed timeslices, every task accumulates virtual runtime (vruntime) at a rate inversely proportional to its priority weight. The task with the smallest vruntime is always picked next, so CPU time is distributed proportionally to weights — that is, fairly.

CFS Red-Black Tree

Each CPU maintains a struct cfs_rq containing an rb-tree keyed by se.vruntime. Insertion, deletion, and lookup are all O(log n). The leftmost node is cached in cfs_rq->rb_leftmost — so pick_next_task_fair() is O(1) in the common case.

vruntime Update Formula

On every scheduler tick, update_curr() advances the running task's clock as vruntime += delta_exec × NICE_0_LOAD / weight, where NICE_0_LOAD = 1024 and weight comes from the kernel's nice-to-weight table:

| nice | weight | Effect on vruntime |
|---|---|---|
| −20 | 88761 | vruntime grows ~87× slower than nice=0 — near-monopoly of the CPU |
| −5 | 3121 | vruntime grows ~3× slower — noticeably higher priority |
| 0 | 1024 | Baseline — vruntime grows 1:1 with wall time |
| +5 | 335 | vruntime grows ~3× faster — lower priority |
| +19 | 15 | vruntime grows ~68× faster — near-idle priority |

schedule() — the dispatcher

schedule() is the single entry point for every context switch: it deactivates the current task, asks pick_next_task() for the most deserving runnable task (the RT classes are checked before CFS), and hands off to context_switch().

Real-Time Schedulers

| Policy | Data structure | Behavior |
|---|---|---|
| SCHED_FIFO | Per-prio FIFO queue | RT task runs until it blocks or yields. No timeslice. A higher-prio task preempts immediately. Starvation possible. |
| SCHED_RR | Per-prio round-robin queue | Same as SCHED_FIFO but with a timeslice. When the timeslice expires, the task moves to the tail of its priority queue. |
| SCHED_NORMAL (SCHED_OTHER) | CFS rb-tree | Default for all user processes. Weighted fair sharing via vruntime. No starvation. |
| SCHED_DEADLINE | EDF (earliest deadline first) queue | Sporadic real-time tasks. Kernel enforces bandwidth admission control — the task declares runtime/period/deadline. |

9. Load Balancing — sched_domain

Linux organises CPUs into a hierarchy of scheduling domains, each describing a set of CPUs that share a resource (HT siblings, L3 cache, NUMA node). The load balancer uses this hierarchy to steal tasks from overloaded CPUs while preferring migrations that stay within the cheapest domain (SMT > MC > NUMA).

DPDK tie-in: DPDK binds poll-mode threads to dedicated cores using sched_setaffinity() and sets isolcpus= in the kernel boot parameters to prevent the scheduler from ever migrating other tasks onto those cores. This eliminates context-switch latency entirely — the polling thread runs forever without interruption.

10. Minimal C Demo — CFS vruntime Calculation

Shows how vruntime accumulates at different rates for tasks with different nice values. Uses the actual kernel priority weight table from kernel/sched/core.c.

CFS vruntime — nice value weight table — C Demo

11. §3.4 — Preemption & Context Switch

Linux supports both user preemption (at every kernel-to-user return) and kernel preemption (at every IRQ return, if preempt_count == 0). The preempt_count field in thread_info is a nest counter: spin locks, RCU read-sides, and interrupt handlers all increment it to defer preemption.

Context Switch — What Exactly Is Saved?

context_switch() calls two things in sequence: switch_mm() to swap address spaces (CR3 reload), then switch_to() to swap CPU register state. The FPU is saved lazily — only when the next task actually uses floating-point, to avoid ~512-byte saves on every switch.

DPDK tie-in: A single context switch costs 1–5 µs directly plus many cache/TLB misses indirectly. At 10 Mpps per core, that budget is 100 ns/packet. DPDK eliminates context switches entirely by running a polling loop on isolated CPUs (isolcpus=) with SCHED_FIFO priority — the core never leaves user space.

Minimal C Demo — Preemption Counter

Preemption counter — spin_lock / scheduler_tick / spin_unlock — C Demo

12. §3.5 — Interrupts & Bottom Halves

Linux splits interrupt work into two halves. The hard IRQ handler runs with interrupts disabled — it must return in microseconds. Heavy work is deferred to a bottom half: softirq, tasklet, or workqueue.

| Bottom half | Context | Can sleep? | Use case |
|---|---|---|---|
| softirq | Softirq (atomic) | No | NET_RX/TX, TIMER, BLOCK — statically allocated, can run on every CPU simultaneously |
| tasklet | Softirq (atomic) | No | Legacy single-threaded deferred work — deprecated in 5.x; prefer workqueues |
| workqueue (kworker) | Process context | Yes | Firmware loading, NFS callbacks, anything that needs to sleep or take a mutex |

NAPI — Interrupt Coalescing for High-Throughput Networking

Without NAPI, every received packet fires an IRQ — at 10 Mpps that is 10M IRQs/sec, spending more time in interrupt handlers than forwarding. NAPI solves this: the first packet fires an IRQ which disables further NIC interrupts and schedules a poll loop in NET_RX_SOFTIRQ. The poll loop drains up to napi->weight (default 64) packets per round, then re-enables IRQs only when the ring is empty.

Minimal C Demo — NAPI Poll Loop

NAPI poll — batch RX ring drain with budget — C Demo

13. §3.6 — Kernel Timers & Time Management

| Mechanism | Precision | Data structure | Typical use |
|---|---|---|---|
| jiffies | 1/HZ (~1–10 ms) | Global atomic counter | Coarse timeouts, boot time, uptime |
| struct timer_list | 1 jiffie | Hierarchical timer wheel (TV1–TV5) | TCP keepalive, ARP aging, session expiry |
| struct hrtimer | ~1 ns (TSC/HPET/APIC) | Red-black tree per clock base | nanosleep, itimer, POSIX interval timers |
| NO_HZ_FULL | N/A — skips ticks | Kernel config option | DPDK isolated cores — zero timer interrupts |

Timer Wheel — O(1) Insert, Lazy Cascade

Background: A TCP server may have 100 000 active connections each with a keepalive timer. Inserting all of them into a sorted list would be O(log n) per insert and O(n) to scan for expirations. The timer wheel hashes by expiry jiffy into a slot — O(1) insert and O(1) expiry scan per tick.

hrtimer — Nanosecond Red-Black Tree

hrtimers live in a per-CPU red-black tree ordered by absolute expiry time. hrtimer_start() inserts the timer and, if the leftmost node changed, reprograms the hardware one-shot (TSC deadline / HPET / APIC); hrtimer_interrupt() then fires every expired timer. Insertion is O(log n), but precision is nanoseconds rather than jiffies.

Minimal C Demo — Timer Wheel Simulation

Timer wheel — O(1) insert + expiry cascade — C Demo

14. §3.7 — Kernel Data Structures

list_head — Circular Doubly-Linked List

The kernel embeds struct list_head directly inside the data struct — there is no separate node allocation. The container_of() macro recovers the outer struct from any list_head * pointer using compile-time offset math. The list is circular: the sentinel head node links back to itself when empty, so every traversal is uniform — no NULL checks needed.

| Macro / Function | Purpose |
|---|---|
| list_add(new, head) | Insert after head (stack-like — LIFO) |
| list_add_tail(new, head) | Insert before head (queue-like — FIFO) |
| list_del(entry) | Unlink entry — O(1); does not free |
| list_for_each_entry(pos, head, member) | Iterate — NOT safe if deleting during iteration |
| list_for_each_entry_safe(pos, n, head, member) | Iterate — safe to delete the current entry (saves the next pointer first) |
| container_of(ptr, type, member) | Recover the containing struct pointer from an embedded member pointer |

hlist_head — Single-Pointer Head for Hash Tables

hlist_head uses only one pointer (first) instead of two, halving the memory cost of hash-table buckets. Deletion is still O(1) thanks to the pprev back-pointer: each hlist_node stores a pointer to the slot that holds its own address (the head's first, or the previous node's next), so a node can unlink itself without traversing the list.

Red-Black Tree — rb_root + rb_node

The kernel's generic struct rb_node is embedded in the data struct (same pattern as list_head). Color is stored in the lowest bit of the parent pointer — no extra memory. rb_insert_color() and rb_erase() perform rotations to maintain invariants.

Radix Tree / XArray & Bitmap

| Structure | Key insight | Kernel use |
|---|---|---|
| Radix tree / XArray | Trie indexed by integer — O(log₆₄ n) ≈ O(1) for 64-bit keys. XArray is the modern API (xa_store, xa_load) with better locking semantics. | Page cache (page index → struct page), IDR (integer ID allocator) |
| Bitmap (unsigned long[]) | DECLARE_BITMAP(name, bits). set_bit/clear_bit are atomic. find_first_bit / find_next_bit scan 64 bits per iteration. | CPU masks, IRQ allocation, page frame tracking, network packet flags |

Minimal C Demo — list_head + container_of

list_head embedded list + container_of recovery — C Demo

15. Kernel Source Pointers — §3.1–3.7

| File / Function | What it does |
|---|---|
| include/linux/sched.h :: struct task_struct | The process descriptor — ~750 fields, ~6 KB. Central hub linking mm, files, signals, scheduling, and thread state. |
| kernel/fork.c :: copy_process() | Core of fork(): dup_task_struct, copy_mm (COW), copy_files, copy_signal, alloc_pid. Returns the new task_struct. |
| kernel/fork.c :: wake_up_new_task() | Enqueues the freshly forked child on a run queue and triggers load balancing. |
| fs/exec.c :: do_execve() | Loads the ELF binary: open_exec, bprm_mm_init (new mm), load_elf_binary; start_thread sets %rip/%rsp. |
| kernel/exit.c :: do_exit() | Frees mm, files, fs, and IPC resources. Notifies the parent with SIGCHLD. Task becomes a ZOMBIE. |
| kernel/sched/core.c :: schedule() | Main scheduler entry point: deactivates current, calls pick_next_task(), calls context_switch(). |
| kernel/sched/fair.c :: pick_next_task_fair() | CFS picker: returns rb_leftmost from the cfs_rq. Updates vruntime, sets the new task's exec_start. |
| kernel/sched/fair.c :: update_curr() | Called every tick: computes delta_exec, updates se.vruntime, calls update_min_vruntime. |
| arch/x86/kernel/process_64.c :: __switch_to() | x86 context switch: saves/restores FS/GS base, FPU state, kernel stack pointer (via TSS.sp0). |
| arch/x86/entry/entry_64.S :: switch_to_asm | Assembly that swaps %rsp between prev and next kernel stacks — the actual CPU register swap. |
| kernel/sched/topology.c :: build_sched_domains() | Builds the sched_domain hierarchy from CPU topology (cpumasks, NUMA distances). |
| kernel/sched/core.c :: preempt_schedule_irq() | Kernel preemption entry: checks preempt_count==0 before calling schedule(); triggered on IRQ return to kernel. |
| arch/x86/kernel/fpu/core.c :: fpu__save() | Lazy FPU save: called only on context switch if TIF_NEED_FPU_LOAD is set — avoids a ~512-byte XSAVE on every switch. |
| kernel/softirq.c :: __do_softirq() | Softirq dispatcher: loops through the softirq_pending bitmap, calls registered handlers (e.g. net_rx_action). Runs up to 10 rounds before waking ksoftirqd. |
| net/core/dev.c :: net_rx_action() | NET_RX_SOFTIRQ handler: iterates napi_list, calls poll(), enforces the budget across all NAPI instances. |
| net/core/dev.c :: napi_schedule() / napi_complete_done() | napi_schedule: adds the napi to the softirq list, disables the NIC IRQ. napi_complete_done: re-enables the IRQ when the ring is drained. |
| kernel/time/timer.c :: add_timer() / mod_timer() | Timer wheel insertion: computes the TV1–TV5 slot, links into the per-CPU wheel. mod_timer reschedules atomically. |
| kernel/time/hrtimer.c :: hrtimer_start() / hrtimer_interrupt() | hrtimer_start: inserts into the per-CPU rb-tree; programs the hardware one-shot if the leftmost changes. hrtimer_interrupt: fires expired timers. |
| include/linux/list.h :: list_add / list_del / list_for_each_entry_safe | Core doubly-linked list macros. list_for_each_entry_safe saves ->next before each iteration — safe for in-loop deletion. |
| include/linux/rbtree.h :: rb_insert_color / rb_erase | Generic rb-tree rotation/recolor after insert or erase. The caller provides the sort key via the embedded rb_node. |
| lib/xarray.c :: xa_store / xa_load | XArray (modern radix tree): lock-free reads, fine-grained locking on writes. Replaces radix_tree_insert in the page cache. |

16. Interview Prep — §3.1–3.7

Q1 — What is struct task_struct and what are its most important fields?
task_struct is the process descriptor — the kernel's representation of every thread. Key fields: state (scheduling state), pid/tgid (thread vs process ID), mm (address space), files (open FDs), signal/sighand (signal handling), se.vruntime (CFS scheduling position), and thread_info.flags (TIF_NEED_RESCHED). It is allocated from a slab cache and stays alive until the parent reaps the zombie via wait4().

Q2 — What is the difference between pid and tgid? How do fork() and clone() differ?
pid is the unique kernel thread ID; tgid is the thread group leader's pid — all threads in a process share the same tgid. getpid() returns tgid (the POSIX PID); gettid() returns pid. fork() creates a new process by calling clone() without sharing flags — new mm, files, and signal_struct. pthread_create() calls clone() with CLONE_VM | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD, sharing all key structures; only the kernel stack and pid differ.

Q3 — Explain CFS virtual runtime and the red-black-tree scheduling.
CFS computes vruntime += delta_exec × (NICE_0_LOAD / task_weight) every tick. Tasks with higher priority (lower nice) have higher weight, so their vruntime accumulates more slowly — they stay at the rb-tree's leftmost position longer and receive more CPU time. pick_next_task_fair() always returns rb_leftmost (cached, O(1)). When a task is preempted, it is re-inserted at its new vruntime. On wakeup, the task's vruntime is clamped to min_vruntime − latency_target so it cannot monopolize the CPU after a long sleep.

Q4 — What is Copy-on-Write and when does it trigger after fork()?
After fork(), parent and child share physical pages. copy_process() marks all writable PTEs as read-only and increments each page's _refcount. The first write to a shared page raises a protection fault. do_cow_fault() allocates a new page, copies the 4 KB content, updates the faulter's PTE to the new page with write permission, and decrements the original page's refcount. If the refcount drops to 1, the remaining task's PTE is re-armed writable — future accesses need no fault. This means fork() costs O(size_of_page_table), not O(address_space_size).

Q5 — What is the difference between SCHED_FIFO, SCHED_RR, and SCHED_NORMAL?
SCHED_NORMAL (SCHED_OTHER): the default CFS policy — weighted fair sharing via vruntime, no starvation. SCHED_FIFO: real-time; the task runs until it blocks or yields, with no timeslice. A higher-priority RT task preempts immediately; lower-priority tasks can starve. SCHED_RR: same as SCHED_FIFO but with a configurable timeslice — after expiry the task moves to the tail of its priority queue. RT tasks (FIFO/RR) always preempt CFS tasks because the RT scheduling classes are checked first in pick_next_task().

Q6 — Walk through switch_to() — what exactly is saved and restored?
context_switch() does two steps. (1) switch_mm(): if next->mm != prev->mm, reload CR3 with next's page global directory — flushing the TLB for user mappings. (2) switch_to() assembly: push the callee-saved registers (%rbx, %rbp, %r12–%r15) onto prev's kernel stack; save %rsp to prev->thread.sp; load %rsp from next->thread.sp; pop next's callee-saved registers; ret — the CPU now executes at next's saved %rip. Caller-saved registers (%rax, %rcx, %rdx, %rsi, %rdi, %r8–%r11) are NOT saved here — they were already spilled when the task voluntarily called schedule(). FPU state is saved lazily, only when the next task first uses floating-point.

Q7 — What is NAPI and why does it improve network throughput?
NAPI (New API) is interrupt coalescing for the kernel's network receive path. Without NAPI, each received packet fires a hard IRQ — millions of interrupts per second at high packet rates, where interrupt overhead can exceed useful work. With NAPI: the first packet fires an IRQ; the driver calls napi_schedule(), which disables further NIC interrupts and adds a napi_struct to the softirq work list; NET_RX_SOFTIRQ calls the driver's poll() function, which drains up to napi->weight (default 64) packets in a tight loop without interrupts; when the ring is empty (done < budget), napi_complete_done() re-enables NIC interrupts. This amortizes interrupt overhead across many packets and keeps the driver in a hot loop that exploits CPU caches efficiently.

Q8 — Explain the timer wheel — what is its time complexity for insert vs. expire?
The timer wheel (struct timer_list) uses a hierarchical hash wheel: TV1 has 256 slots (each = 1 jiffie), TV2 has 64 slots (each = 256 jiffies), and so on up to TV5. add_timer() computes the slot as (expires & 0xff) into TV1 — O(1) insertion. Every jiffie, the kernel advances the TV1 index and fires all timers in the current slot — O(1) scan (firing is O(k), where k is the number of timers in the slot). When TV1 wraps (every 256 jiffies), the corresponding TV2 slot is "cascaded" into TV1 — O(n) in the number of timers in that TV2 slot. Amortized, the cost stays near O(1) per timer because cascades are rare. Compare: a sorted linked list is O(log n) insert and O(n) scan; a min-heap is O(log n) insert and O(log n) pop.

Q9 — What is NO_HZ_FULL and why does DPDK need it?
NO_HZ_FULL (full tickless) is a kernel config option that suppresses the periodic scheduler tick (normally HZ times per second) on CPUs that have exactly one runnable task. Normally even a busy polling thread receives timer interrupts that force context switches into the kernel — interrupting the tight polling loop, flushing TLB entries and pipeline state, and adding latency jitter of ~10–100 µs. With NO_HZ_FULL on a DPDK-isolated CPU (combined with isolcpus= and rcu_nocbs=), the CPU runs the poll loop without timer interrupts — zero latency jitter. The CPU is still woken for critical events (RCU, scheduler clock), but at microsecond intervals, not every 1–4 ms.

Q10 — How does list_for_each_entry_safe differ from list_for_each_entry, and why does the kernel use hlist_head for hash tables instead of list_head?
list_for_each_entry(pos, head, member) expands to a for-loop that reads pos->member.next at each iteration. If the loop body deletes pos, its next pointer is freed or poisoned — the next iteration reads garbage. list_for_each_entry_safe(pos, n, head, member) saves pos->member.next into n before the loop body runs, so deleting pos is safe. Regarding hlist vs. list: list_head has both next and prev pointers — 16 bytes per head node. For a hash table with 65536 buckets, that is 1 MB of head nodes alone. hlist_head has only a single first pointer — 8 bytes per head, saving 512 KB. Deletion is still O(1) via the pprev trick: each hlist_node stores a pointer-to-pointer (pprev) to the slot that holds this node's address, allowing unlink without traversal. The kernel uses hlist for the PID hash, connection tracking tables, the routing cache, and any other large hash table where the number of buckets dominates memory.