§ 3.1 – 3.7 Process Management — Complete
From task_struct and fork() / exec() / Copy-on-Write, to the CFS red-black tree, kernel preemption, IRQ bottom halves, NAPI networking, timer wheel, hrtimer, and the core kernel data structures: linked list, hlist, rb-tree, and radix/XArray.
1. Overview
Every running entity — process or thread — is described by a struct task_struct in the kernel. It is the central hub that links together the process's address space, open files, signal handlers, and scheduling state. The CFS scheduler maintains one per-CPU cfs_rq red-black tree sorted by accumulated virtual runtime (vruntime); the task with the smallest vruntime is always next to run. When a task calls fork(), the kernel copies the task descriptor and marks all writable pages as read-only — physical pages are shared until one side writes, at which point a private copy is made (Copy-on-Write).
2. Key Data Structure — struct task_struct
task_struct lives in kernel memory (not on any stack) and stays alive until the process is reaped by its parent. On x86-64 it is roughly 6 KB. The kernel accesses the current task via the current macro, which reads a per-CPU pointer stored in a segment register (%gs) — O(1) with no locking.
| Field | Type | Purpose |
|---|---|---|
| state / __state | unsigned int | Current scheduling state — TASK_RUNNING (on a run queue or executing), TASK_INTERRUPTIBLE (sleeping, can be woken by a signal), TASK_UNINTERRUPTIBLE (waiting on kernel I/O — cannot be killed), EXIT_ZOMBIE (exited, not yet reaped — kept in exit_state on modern kernels) |
| pid | pid_t | Unique thread ID — what /proc shows as Pid. Each clone() call gets a fresh pid, even within a thread group. |
| tgid | pid_t | Thread group ID — equals the pid of the thread group leader (the main thread). getpid() returns tgid, so all threads in a process see the same PID. |
| mm | struct mm_struct * | Address space — page tables, VMA list, brk pointer. NULL for kernel threads (they borrow active_mm from the last user task that ran on the CPU). |
| files | struct files_struct * | Open file descriptor table. Shared between threads via CLONE_FILES; duplicated on fork() without CLONE_FILES. |
| se (sched_entity) | struct sched_entity | Embedded CFS scheduling state: vruntime, load weight, runnable load average. Its run_node is the rb-tree node inserted into the cfs_rq. |
| prio / static_prio | int | Effective priority (0–139). RT tasks use 0–99; CFS tasks use 100–139 (mapped from nice -20..+19). Lower number = higher priority. |
| thread_info | struct thread_info | Small struct historically at the bottom of the kernel stack (embedded in task_struct itself with THREAD_INFO_IN_TASK on modern x86). Holds the TIF_NEED_RESCHED flag checked on every kernel-to-user return and after every interrupt. |
3. Threads vs Processes — pid vs tgid
Linux has no separate "thread" kernel object — a thread is just a task_struct that shares its parent's mm_struct, files_struct, and signal_struct. The flags passed to clone() (which pthread_create uses) control exactly what is shared vs copied.
Why does getpid() return tgid? POSIX defines the PID as the process identifier shared by all threads. The kernel exposes tgid via getpid() and the per-thread ID via gettid(). In /proc, per-thread entries appear under /proc/<tgid>/task/<tid>/.
4. Process Lifecycle
A Linux process moves through well-defined states from creation to reaping. The state machine determines whether the task is on a run queue, sleeping in a wait queue, or consuming a task descriptor entry while awaiting parent acknowledgement.
5. Core Mechanism — fork() → exec() → exit()
Consider what a shell must do to run ls. It must create a new process, replace that process's image with the ls binary, let it run to completion, and collect its exit status — all without corrupting the shell's own address space.
- Parent calls fork() — duplicate task_struct; COW-share pages
- Child calls exec() — kernel replaces mm with the ELF image
- Child runs, then calls exit() — kernel frees resources, leaves a ZOMBIE entry
- Parent calls wait4() — kernel copies the exit code, frees the task_struct
Walkthrough — COW page fault
| Step | Actor | State change |
|---|---|---|
| 1 | fork() completes | Child inherits parent page table. Physical page P at VA 0x4000 has refcnt=2. Both parent and child PTEs point to P with PTE.W=0. |
| 2 | Child writes to VA 0x4000 | CPU checks PTE.W=0 → raises protection fault (#PF). Kernel fault handler called. |
| 3 | do_cow_fault() | Kernel allocates new physical page P' from buddy. copy_user_highpage() copies 4 KB from P to P'. |
| 4 | PTE update | Child's PTE is updated to point at P' with PTE.W=1, and P's refcount drops to 1. With a single owner left, P's PTE can be re-armed writable on the owner's next write fault — no further copy needed. |
| 5 | Child resumes | Child's store completes to P'. Parent still reads from P — sees unmodified data. |
6. Minimal C Demo — task_struct Simulation
Simulates four tasks (init, bash, two worker threads sharing a tgid) and prints their scheduling state. The last line shows which task CFS would pick next — the one with the smallest vruntime.
7. Minimal C Demo — fork / exec / exit / wait4 Lifecycle
Simulates the shell's fork-exec-wait cycle entirely in user-space C. Prints each transition so you can trace the state machine without reading kernel source.
8. CFS — Completely Fair Scheduler
CFS replaced the O(1) scheduler in Linux 2.6.23. Its core idea: instead of fixed timeslices, every task accumulates virtual runtime (vruntime) at a rate inversely proportional to its priority weight. The task with the smallest vruntime is always picked next, so CPU time is distributed proportionally to weights — that is, fairly.
CFS Red-Black Tree
Each CPU maintains a struct cfs_rq containing an rb-tree keyed by se.vruntime. Insertion, deletion, and lookup are all O(log n). The leftmost node is cached in cfs_rq->rb_leftmost — so pick_next_task_fair() is O(1) in the common case.
vruntime Update Formula
On every scheduler tick, update_curr() advances the running task's virtual clock: vruntime += delta_exec × NICE_0_LOAD / weight, where NICE_0_LOAD = 1024 is the weight of a nice-0 task. The table shows how nice levels map to weights:
| nice | weight | Effect on vruntime |
|---|---|---|
| -20 | 88761 | vruntime grows ~87× slower than nice=0 — near-monopoly of CPU |
| -5 | 3121 | vruntime grows ~3× slower — noticeably higher priority |
| 0 | 1024 | Baseline — vruntime grows at 1:1 with wall time |
| +5 | 335 | vruntime grows ~3× faster — lower priority |
| +19 | 15 | vruntime grows ~68× faster — near-idle priority |
schedule() — the dispatcher
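A simplified control-flow sketch (pseudo-C, not compilable) of the dispatcher; the real kernel/sched/core.c version adds run-queue locking, scheduling-class iteration, and load-balancing hooks.

```c
/* Pseudo-C sketch of __schedule(); names follow kernel/sched/core.c */
static void __schedule_sketch(void)
{
    struct task_struct *prev = current;
    struct rq *rq = this_rq();

    if (prev->state != TASK_RUNNING)         /* voluntary sleep?        */
        deactivate_task(rq, prev);           /* remove from run queue   */

    /* RT classes are consulted first; CFS returns the rb-tree leftmost */
    struct task_struct *next = pick_next_task(rq, prev);

    if (next != prev)
        context_switch(rq, prev, next);      /* switch_mm + switch_to   */
}
```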
Real-Time Schedulers
| Policy | Data structure | Behavior |
|---|---|---|
| SCHED_FIFO | Per-prio FIFO queue | RT task runs until it blocks or yields. No timeslice. A higher-prio task preempts immediately. Starvation possible. |
| SCHED_RR | Per-prio round-robin queue | Same as SCHED_FIFO but with a timeslice. After the timeslice expires, the task moves to the tail of its priority queue. |
| SCHED_NORMAL (SCHED_OTHER) | CFS rb-tree | Default for all user processes. Weighted fair sharing via vruntime. No starvation. |
| SCHED_DEADLINE | EDF (earliest-deadline-first) queue | Sporadic real-time tasks. Kernel enforces bandwidth admission control — each task declares runtime/period/deadline. |
9. Load Balancing — sched_domain
Linux organises CPUs into a hierarchy of scheduling domains, each describing a set of CPUs that share a resource (HT siblings, L3 cache, NUMA node). The load balancer uses this hierarchy to steal tasks from overloaded CPUs while preferring migrations that stay within the cheapest domain (SMT > MC > NUMA).
A latency-critical application such as a DPDK packet forwarder pins its polling thread to a dedicated core with sched_setaffinity() and sets isolcpus= in the kernel boot parameters to prevent the scheduler from ever migrating other tasks onto those cores. This eliminates context-switch latency entirely — the polling thread runs without interruption.
10. Minimal C Demo — CFS vruntime Calculation
Shows how vruntime accumulates at different rates for tasks with different nice values. Uses the actual kernel priority weight table from kernel/sched/core.c.
11. §3.4 — Preemption & Context Switch
Linux supports both user preemption (at every kernel-to-user return) and kernel preemption (at every IRQ return, if preempt_count == 0). The preempt_count field in thread_info is a nest counter: spin locks, RCU read-sides, and interrupt handlers all increment it to defer preemption.
Context Switch — What Exactly Is Saved?
context_switch() calls two things in sequence: switch_mm() to swap address spaces (CR3 reload), then switch_to() to swap CPU register state. The FPU is saved lazily — only when the next task actually uses floating-point, to avoid ~512-byte saves on every switch.
A DPDK-style application sidesteps preemption entirely by pinning its polling thread to an isolated core (isolcpus=) with SCHED_FIFO priority — the core never leaves user space.
Minimal C Demo — Preemption Counter
12. §3.5 — Interrupts & Bottom Halves
Linux splits interrupt work into two halves. The hard IRQ handler runs with interrupts disabled — it must return in microseconds. Heavy work is deferred to a bottom half: softirq, tasklet, or workqueue.
| Bottom Half | Context | Can sleep? | Use case |
|---|---|---|---|
| softirq | Softirq (atomic) | No | NET_RX/TX, TIMER, BLOCK — statically allocated, can run on every CPU simultaneously |
| tasklet | Softirq (atomic) | No | Legacy single-threaded deferred work — deprecated in 5.x; prefer workqueues |
| workqueue (kworker) | Process context | Yes | Firmware load, NFS callbacks, anything needing sleep or a mutex |
NAPI — Interrupt Coalescing for High-Throughput Networking
Without NAPI, every received packet fires an IRQ — at 10 Mpps that is 10M IRQs/sec, spending more time in interrupt handlers than forwarding. NAPI solves this: the first packet fires an IRQ which disables further NIC interrupts and schedules a poll loop in NET_RX_SOFTIRQ. The poll loop drains up to napi->weight (default 64) packets per round, then re-enables IRQs only when the ring is empty.
Minimal C Demo — NAPI Poll Loop
13. §3.6 — Kernel Timers & Time Management
| Mechanism | Precision | Data structure | Typical use |
|---|---|---|---|
| jiffies | 1/HZ (~1–10 ms) | Global tick counter (jiffies_64) | Coarse timeouts, boot time, uptime |
| struct timer_list | 1 jiffy | Hierarchical timer wheel (TV1–TV5) | TCP keepalive, ARP aging, session expiry |
| struct hrtimer | ~1 ns (TSC/HPET/APIC) | Red-black tree per clock base | nanosleep, itimer, POSIX interval timers |
| NO_HZ_FULL | N/A — suppresses ticks | Kernel config option | DPDK isolated cores — near-zero timer interrupts |
Timer Wheel — O(1) Insert, Lazy Cascade
hrtimer — Nanosecond Red-Black Tree
Minimal C Demo — Timer Wheel Simulation
14. §3.7 — Kernel Data Structures
list_head — Circular Doubly-Linked List
The kernel embeds struct list_head directly inside the data struct — there is no separate node allocation. The container_of() macro recovers the outer struct from any list_head * pointer using compile-time offset math. The list is circular: the sentinel head node links back to itself when empty, so every traversal is uniform — no NULL checks needed.
| Macro / Function | Purpose |
|---|---|
| list_add(new, head) | Insert after head (stack-like — LIFO) |
| list_add_tail(new, head) | Insert before head (queue-like — FIFO) |
| list_del(entry) | Unlink entry — O(1); does not free |
| list_for_each_entry(pos, head, member) | Iterate — NOT safe if deleting during iteration |
| list_for_each_entry_safe(pos, n, head, member) | Iterate — safe to delete the current entry (saves the next pointer first) |
| container_of(ptr, type, member) | Recover the containing struct pointer from an embedded member pointer |
hlist_head — Single-Pointer Head for Hash Tables
hlist_head uses only one pointer (first) instead of two, halving memory for hash buckets. Deletion is still O(1) thanks to the pprev back-pointer trick — each node stores a pointer to the slot (the head's first or the previous node's next) that points at it, so unlinking never requires traversing the list.
Red-Black Tree — rb_root + rb_node
The kernel's generic struct rb_node is embedded in the data struct (same pattern as list_head). Color is stored in the lowest bit of the parent pointer — no extra memory. rb_insert_color() and rb_erase() perform rotations to maintain invariants.
Radix Tree / XArray & Bitmap
| Structure | Key insight | Kernel use |
|---|---|---|
| Radix tree / XArray | Trie indexed by integer — O(log₆₄ n) ≈ O(1) for 64-bit keys. XArray is the modern API (xa_store, xa_load) with better locking semantics. | Page cache (page index → struct page), IDR (integer ID allocator) |
| Bitmap (unsigned long[]) | DECLARE_BITMAP(name, bits). set_bit/clear_bit are atomic. find_first_bit/find_next_bit scan 64 bits per iteration. | CPU masks, IRQ allocation, page frame tracking, network packet flags |
Minimal C Demo — list_head + container_of
15. Kernel Source Pointers — §3.1–3.7
| File / Function | What it does |
|---|---|
include/linux/sched.h :: struct task_struct | The process descriptor — ~750 fields, ~6 KB. Central hub linking mm, files, signals, scheduling, and thread state. |
kernel/fork.c :: copy_process() | Core of fork(): dup_task_struct, copy_mm (COW), copy_files, copy_signal, alloc_pid. Returns new task_struct. |
kernel/fork.c :: wake_up_new_task() | Enqueues the freshly forked child on a run queue and triggers load balancing. |
fs/exec.c :: do_execve() | Loads ELF binary: open_exec, bprm_mm_init (new mm), load_elf_binary, start_thread sets %rip/%rsp. |
kernel/exit.c :: do_exit() | Frees mm, files, fs, IPC resources. Notifies parent with SIGCHLD. Task becomes ZOMBIE. |
kernel/sched/core.c :: schedule() | Main scheduler entry point: deactivates current, calls pick_next_task(), calls context_switch(). |
kernel/sched/fair.c :: pick_next_task_fair() | CFS picker: returns rb_leftmost from cfs_rq. Updates vruntime, sets new task's exec_start. |
kernel/sched/fair.c :: update_curr() | Called every tick: computes delta_exec, updates se.vruntime, calls update_min_vruntime. |
arch/x86/kernel/process_64.c :: __switch_to() | x86 context switch: saves/restores FS/GS base, FPU state, kernel stack pointer (via TSS.sp0). |
arch/x86/entry/entry_64.S :: switch_to_asm | Assembly that swaps %rsp between prev and next kernel stacks — the actual CPU register swap. |
kernel/sched/topology.c :: build_sched_domains() | Builds the sched_domain hierarchy from CPU topology (cpumasks, NUMA distances). |
kernel/sched/core.c :: preempt_schedule_irq() | Kernel preemption entry: checks preempt_count==0 before calling schedule(); triggered on IRQ return to kernel. |
arch/x86/kernel/fpu/core.c :: fpu__save() | Lazy FPU save: called only on context switch if TIF_NEED_FPU_LOAD is set — avoids ~512-byte XSAVE on every switch. |
kernel/softirq.c :: __do_softirq() | Softirq dispatcher: loops through softirq_pending bitmap, calls registered handler (e.g. net_rx_action). Runs up to 10 rounds before waking ksoftirqd. |
net/core/dev.c :: net_rx_action() | NET_RX_SOFTIRQ handler: iterates napi_list, calls poll(), enforces budget across all NAPI instances. |
net/core/dev.c :: napi_schedule() / napi_complete_done() | napi_schedule: adds napi to softirq list, disables NIC IRQ. napi_complete_done: re-enables IRQ when ring drained. |
kernel/time/timer.c :: add_timer() / mod_timer() | Timer wheel insertion: computes TV1–TV5 slot, links into per-CPU wheel. mod_timer reschedules atomically. |
kernel/time/hrtimer.c :: hrtimer_start() / hrtimer_interrupt() | hrtimer_start: inserts into per-CPU rb-tree; programs hardware one-shot if leftmost changes. hrtimer_interrupt: fires expired timers. |
include/linux/list.h :: list_add / list_del / list_for_each_entry_safe | Core doubly-linked list macros. list_for_each_entry_safe saves ->next before iteration, safe for deletion in-loop. |
include/linux/rbtree.h :: rb_insert_color / rb_erase | Generic rb-tree rotation/recolor after insert or erase. Caller provides the sort key via embedded rb_node. |
lib/xarray.c :: xa_store / xa_load | XArray (modern radix tree): lock-free reads, fine-grained locking on writes. Replaces radix_tree_insert in page cache. |
16. Interview Prep — §3.1–3.7
| # | Question | Concise Answer |
|---|---|---|
| Q1 | What is struct task_struct and what are its most important fields? | task_struct is the process descriptor — the kernel's representation of every thread. Key fields: state (scheduling state), pid/tgid (thread vs process ID), mm (address space), files (open FDs), signal/sighand (signals), se.vruntime (CFS scheduling position), thread_info.flags (TIF_NEED_RESCHED). It is allocated from a slab cache and stays alive until the parent reaps the zombie via wait4(). |
| Q2 | What is the difference between pid and tgid? How do fork() and clone() differ? | pid is the unique kernel thread ID; tgid is the thread group leader's pid — all threads in a process share the same tgid. getpid() returns tgid (POSIX PID), gettid() returns pid. fork() creates a new process by calling clone() without sharing flags — new mm, files, signal_struct. pthread_create() calls clone() with CLONE_VM | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD — sharing all key structures. Only the kernel stack and pid differ. |
| Q3 | Explain CFS virtual runtime and the red-black tree scheduling. | CFS computes vruntime += delta_exec × (NICE_0_LOAD / task_weight) every tick. Tasks with higher priority (lower nice) have higher weight — their vruntime accumulates slower — so they stay at the rb-tree leftmost position longer and get more CPU time. pick_next_task_fair() always returns rb_leftmost (cached, O(1)). When a task is preempted, it is re-inserted at its new vruntime. On wakeup, the task's vruntime is clamped to min_vruntime - latency_target to prevent it from monopolizing after a sleep. |
| Q4 | What is Copy-on-Write and when does it trigger after fork()? | After fork(), parent and child share physical pages. copy_process() marks all writable PTEs as read-only and increments each page's _refcount. The first write to a shared page raises a protection fault. do_cow_fault() allocates a new page, copies the 4 KB content, updates the faulter's PTE to the new page with write permission, and decrements the original page's refcount. If refcount drops to 1, the remaining task's PTE is re-armed writable — future accesses need no fault. This means fork() is O(size_of_page_table), not O(address_space_size). |
| Q5 | What is the difference between SCHED_FIFO, SCHED_RR, and SCHED_NORMAL? | SCHED_NORMAL (SCHED_OTHER): default CFS policy. Weighted fair sharing via vruntime. No starvation. SCHED_FIFO: real-time; task runs until it blocks or yields. No timeslice. Higher-priority RT task preempts immediately. Lower-priority tasks can starve. SCHED_RR: same as SCHED_FIFO but with a configurable timeslice — after expiry the task moves to the tail of its priority queue. RT tasks (FIFO/RR) always preempt CFS tasks because RT schedulers are checked first in pick_next_task(). |
| Q6 | Walk through switch_to() — what exactly is saved and restored? | context_switch() does two steps. (1) switch_mm(): if next->mm != prev->mm, reload CR3 with next's page-global directory — flushes TLB for user mappings. (2) switch_to() assembly: push callee-saved registers (%rbx, %rbp, %r12–r15) onto prev's kernel stack; save %rsp to prev->thread.sp; load %rsp from next->thread.sp; pop next's callee-saved regs; ret — CPU now executes at next's saved %rip. Caller-saved registers (%rax, %rcx, %rdx, %rsi, %rdi, %r8–r11) are NOT saved here — they were already on the stack when the task voluntarily called schedule(). FPU state is saved lazily only when the next task first uses floating-point. |
| Q7 | What is NAPI and why does it improve network throughput? | NAPI (New API) is interrupt coalescing for the kernel network receive path. Without NAPI, each received packet fires a hard IRQ, resulting in millions of interrupts per second at high packet rates — the interrupt overhead can exceed useful work. With NAPI: the first packet fires an IRQ; the driver calls napi_schedule(), which disables further NIC interrupts and adds a napi_struct to the softirq work list; NET_RX_SOFTIRQ calls the driver's poll() function which drains up to napi->weight (default 64) packets in a tight loop without interrupts; when the ring is empty (done < budget), napi_complete_done() re-enables NIC interrupts. This amortizes interrupt overhead across many packets and enables the driver to stay in a hot loop that exploits CPU caches efficiently. |
| Q8 | Explain the timer wheel — what is its time complexity for insert vs. expire? | The timer wheel (struct timer_list) uses a hierarchical hash wheel: TV1 has 256 slots (each = 1 jiffie), TV2 has 64 slots (each = 256 jiffies), and so on up to TV5. add_timer() computes the slot as (expires & 0xff) into TV1 — O(1) insertion. Every jiffie, the kernel advances the TV1 index and fires all timers in the current slot — O(1) scan (firing is O(k) where k is the number of timers in the slot). When TV1 wraps (every 256 jiffies), the corresponding TV2 slot is 'cascaded' into TV1 — O(cascade) which is O(n_timers_in_that_tv2_slot). Over time, the amortized cost remains near O(1) per timer because cascade is rare. Compare: a sorted linked list is O(log n) insert, O(n) scan; a min-heap is O(log n) insert, O(log n) pop. |
| Q9 | What is NO_HZ_FULL and why does DPDK need it? | NO_HZ_FULL (tickless full) is a kernel config that suppresses the periodic scheduler tick (normally HZ times/second) on CPUs that have exactly one runnable task. Normally, even a busy polling thread receives timer interrupts that cause context switches into the kernel. These interrupt the tight polling loop, flush TLB entries and pipeline state, and add latency jitter of ~10–100 µs. With NO_HZ_FULL on a DPDK-isolated CPU (combined with isolcpus= and rcu_nocbs=), the CPU runs the poll loop without any timer interrupts — zero latency jitter. The CPU is still woken for critical events (RCU, scheduler clock), but at microsecond intervals, not every 1–4 ms. |
| Q10 | How does list_for_each_entry_safe differ from list_for_each_entry, and why does the kernel use hlist_head for hash tables instead of list_head? | list_for_each_entry(pos, head, member) expands to a for-loop that reads pos->member.next at each iteration. If the loop body deletes pos, its next pointer is freed/poisoned — the next iteration reads garbage. list_for_each_entry_safe(pos, n, head, member) saves pos->member.next into 'n' before the loop body runs, so deletion of pos is safe. Regarding hlist vs. list: list_head has both next and prev pointers — 16 bytes per head node. For a hash table with 65536 buckets, that is 1 MB of head nodes. hlist_head has only a single 'first' pointer — 8 bytes per head, saving 512 KB. Deletion is still O(1) via the pprev trick: hlist_node stores a pointer-to-pointer (pprev) pointing to the slot that holds this node's address, allowing unlink without traversal. The kernel uses hlist for the PID hash, connection tracking tables, routing cache, and any other large hash table where the number of buckets dominates memory. |