§ 3.1 – 3.7 Process Management — Complete
From task_struct and fork() / exec() / Copy-on-Write, to the CFS red-black tree, kernel preemption, IRQ bottom halves, NAPI networking, timer wheel, hrtimer, and the core kernel data structures: linked list, hlist, rb-tree, and radix/XArray.
1. Overview
Every running entity — process or thread — is described by a struct task_struct in the kernel. It is the central hub that links together the process's address space, open files, signal handlers, and scheduling state. The CFS scheduler maintains one per-CPU cfs_rq red-black tree sorted by accumulated virtual runtime (vruntime); the task with the smallest vruntime is always next to run. When a task calls fork(), the kernel copies the task descriptor and marks all writable pages as read-only — physical pages are shared until one side writes, at which point a private copy is made (Copy-on-Write).
2. Key Data Structure — struct task_struct
task_struct lives in kernel memory (not on any stack) and stays alive until the process is reaped by its parent. On x86-64 it is roughly 6 KB. The kernel accesses the current task via the current macro, which reads a per-CPU pointer stored in a segment register (%gs) — O(1) with no locking.
| Field | Type | Purpose |
|---|---|---|
| state / __state | unsigned int | Current scheduling state — TASK_RUNNING (on a run queue or executing), TASK_INTERRUPTIBLE (sleeping, can be woken by a signal), TASK_UNINTERRUPTIBLE (waiting on kernel I/O — cannot be killed), EXIT_ZOMBIE (exited, not yet reaped — kept in exit_state on modern kernels) |
| pid | pid_t | Unique thread ID — what /proc shows as Pid. Each clone() call gets a fresh pid, even within a thread group. |
| tgid | pid_t | Thread group ID — equals the pid of the thread group leader (the main thread). getpid() returns tgid, so all threads in a process see the same PID. |
| mm | struct mm_struct * | Address space — page tables, VMA list, brk pointer. NULL for kernel threads (they borrow active_mm from the last user task that ran on the CPU). |
| files | struct files_struct * | Open file descriptor table. Shared between threads via CLONE_FILES; duplicated on fork() without CLONE_FILES. |
| se (sched_entity) | struct sched_entity | Embedded CFS scheduling state: vruntime, load weight, runnable load average. Its run_node is the rb-tree node inserted into the cfs_rq. |
| prio / static_prio | int | Effective priority (0–139). RT tasks use 0–99; CFS tasks use 100–139 (mapped from nice -20..+19). Lower number = higher priority. |
| thread_info | struct thread_info | Small struct historically at the bottom of the kernel stack (embedded in task_struct itself with THREAD_INFO_IN_TASK on modern x86). Holds the TIF_NEED_RESCHED flag checked on every kernel-to-user return and after every interrupt. |
3. Threads vs Processes — pid vs tgid
Linux has no separate "thread" kernel object — a thread is just a task_struct that shares its parent's mm_struct, files_struct, and signal_struct. The flags passed to clone() (which pthread_create uses) control exactly what is shared vs copied.
Why does getpid() return tgid? POSIX defines the PID as the process identifier shared by all threads. The kernel exposes tgid via getpid() and the per-thread ID via gettid(). In /proc, per-thread entries appear under /proc/<tgid>/task/<tid>/.
4. Process Lifecycle
A Linux process moves through well-defined states from creation to reaping. The state machine determines whether the task is on a run queue, sleeping in a wait queue, or consuming a task descriptor entry while awaiting parent acknowledgement.
5. Core Mechanism — fork() → exec() → exit()
Consider what a shell must do to run ls. It must create a new process, replace that process's image with the ls binary, let it run to completion, and collect its exit status — all without corrupting the shell's own address space.
- Parent calls fork() — duplicate task_struct; COW-share pages
- Child calls exec() — kernel replaces mm with the ELF image
- Child runs, then calls exit() — kernel frees resources, leaves a ZOMBIE entry
- Parent calls wait4() — kernel copies the exit code, frees the task_struct
Walkthrough — COW page fault
| Step | Actor | State change |
|---|---|---|
| 1 | fork() completes | Child inherits parent page table. Physical page P at VA 0x4000 has refcnt=2. Both parent and child PTEs point to P with PTE.W=0. |
| 2 | Child writes to VA 0x4000 | CPU checks PTE.W=0 → raises protection fault (#PF). Kernel fault handler called. |
| 3 | do_cow_fault() | Kernel allocates new physical page P' from buddy. copy_user_highpage() copies 4 KB from P to P'. |
| 4 | PTE update | Child's PTE is updated to point at P' with PTE.W=1, and P's refcount drops to 1. With a single owner left, P's PTE can be re-armed writable on the owner's next write fault — no further copy needed. |
| 5 | Child resumes | Child's store completes to P'. Parent still reads from P — sees unmodified data. |
6. Minimal C Demo — task_struct Simulation
Simulates four tasks (init, bash, two worker threads sharing a tgid) and prints their scheduling state. The last line shows which task CFS would pick next — the one with the smallest vruntime.
7. Minimal C Demo — fork / exec / exit / wait4 Lifecycle
Simulates the shell's fork-exec-wait cycle entirely in user-space C. Prints each transition so you can trace the state machine without reading kernel source.
8. CFS — Completely Fair Scheduler
CFS replaced the O(1) scheduler in Linux 2.6.23. Its core idea: instead of fixed timeslices, every task accumulates virtual runtime (vruntime) at a rate inversely proportional to its priority weight. The task with the smallest vruntime is always picked next, so CPU time is distributed proportionally to weights — that is, fairly.
CFS Red-Black Tree
Each CPU maintains a struct cfs_rq containing an rb-tree keyed by se.vruntime. Insertion, deletion, and lookup are all O(log n). The leftmost node is cached in cfs_rq->rb_leftmost — so pick_next_task_fair() is O(1) in the common case.
vruntime Update Formula
On every scheduler tick, update_curr() advances the running task's virtual clock: vruntime += delta_exec × NICE_0_LOAD / weight, where NICE_0_LOAD = 1024 is the weight of a nice-0 task. The table shows how nice levels map to weights:
| nice | weight | Effect on vruntime |
|---|---|---|
| -20 | 88761 | vruntime grows ~87× slower than nice=0 — near-monopoly of CPU |
| -5 | 3121 | vruntime grows ~3× slower — noticeably higher priority |
| 0 | 1024 | Baseline — vruntime grows at 1:1 with wall time |
| +5 | 335 | vruntime grows ~3× faster — lower priority |
| +19 | 15 | vruntime grows ~68× faster — near-idle priority |
schedule() — the dispatcher
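A simplified control-flow sketch (pseudo-C, not compilable) of the dispatcher; the real kernel/sched/core.c version adds run-queue locking, scheduling-class iteration, and load-balancing hooks.

```c
/* Pseudo-C sketch of __schedule(); names follow kernel/sched/core.c */
static void __schedule_sketch(void)
{
    struct task_struct *prev = current;
    struct rq *rq = this_rq();

    if (prev->state != TASK_RUNNING)         /* voluntary sleep?        */
        deactivate_task(rq, prev);           /* remove from run queue   */

    /* RT classes are consulted first; CFS returns the rb-tree leftmost */
    struct task_struct *next = pick_next_task(rq, prev);

    if (next != prev)
        context_switch(rq, prev, next);      /* switch_mm + switch_to   */
}
```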
Real-Time Schedulers
| Policy | Data structure | Behavior |
|---|---|---|
| SCHED_FIFO | Per-prio FIFO queue | RT task runs until it blocks or yields. No timeslice. A higher-prio task preempts immediately. Starvation possible. |
| SCHED_RR | Per-prio round-robin queue | Same as SCHED_FIFO but with a timeslice. After the timeslice expires, the task moves to the tail of its priority queue. |
| SCHED_NORMAL (SCHED_OTHER) | CFS rb-tree | Default for all user processes. Weighted fair sharing via vruntime. No starvation. |
| SCHED_DEADLINE | EDF (earliest-deadline-first) queue | Sporadic real-time tasks. Kernel enforces bandwidth admission control — each task declares runtime/period/deadline. |
9. Load Balancing — sched_domain
Linux organises CPUs into a hierarchy of scheduling domains, each describing a set of CPUs that share a resource (HT siblings, L3 cache, NUMA node). The load balancer uses this hierarchy to steal tasks from overloaded CPUs while preferring migrations that stay within the cheapest domain (SMT > MC > NUMA).
A latency-critical application such as a DPDK packet forwarder pins its polling thread to a dedicated core with sched_setaffinity() and sets isolcpus= in the kernel boot parameters to prevent the scheduler from ever migrating other tasks onto those cores. This eliminates context-switch latency entirely — the polling thread runs without interruption.
10. Minimal C Demo — CFS vruntime Calculation
Shows how vruntime accumulates at different rates for tasks with different nice values. Uses the actual kernel priority weight table from kernel/sched/core.c.
11. §3.4 — Preemption & Context Switch
Linux supports both user preemption (at every kernel-to-user return) and kernel preemption (at every IRQ return, if preempt_count == 0). The preempt_count field in thread_info is a nest counter: spin locks, RCU read-sides, and interrupt handlers all increment it to defer preemption.
Context Switch — What Exactly Is Saved?
context_switch() calls two things in sequence: switch_mm() to swap address spaces (CR3 reload), then switch_to() to swap CPU register state. The FPU is saved lazily — only when the next task actually uses floating-point, to avoid ~512-byte saves on every switch.
A DPDK-style application sidesteps preemption entirely by pinning its polling thread to an isolated core (isolcpus=) with SCHED_FIFO priority — the core never leaves user space.
Minimal C Demo — Preemption Counter
12. §3.5 — Interrupts & Bottom Halves
Linux splits interrupt work into two halves. The hard IRQ handler runs with interrupts disabled — it must return in microseconds. Heavy work is deferred to a bottom half: softirq, tasklet, or workqueue.
| Bottom Half | Context | Can sleep? | Use case |
|---|---|---|---|
| softirq | Softirq (atomic) | No | NET_RX/TX, TIMER, BLOCK — statically allocated, can run on every CPU simultaneously |
| tasklet | Softirq (atomic) | No | Legacy single-threaded deferred work — deprecated in 5.x; prefer workqueues |
| workqueue (kworker) | Process context | Yes | Firmware load, NFS callbacks, anything needing sleep or a mutex |
NAPI — Interrupt Coalescing for High-Throughput Networking
Without NAPI, every received packet fires an IRQ — at 10 Mpps that is 10M IRQs/sec, spending more time in interrupt handlers than forwarding. NAPI solves this: the first packet fires an IRQ which disables further NIC interrupts and schedules a poll loop in NET_RX_SOFTIRQ. The poll loop drains up to napi->weight (default 64) packets per round, then re-enables IRQs only when the ring is empty.
Minimal C Demo — NAPI Poll Loop
13. §3.6 — Kernel Timers & Time Management
| Mechanism | Precision | Data structure | Typical use |
|---|---|---|---|
| jiffies | 1/HZ (~1–10 ms) | Global tick counter (jiffies_64) | Coarse timeouts, boot time, uptime |
| struct timer_list | 1 jiffy | Hierarchical timer wheel (TV1–TV5) | TCP keepalive, ARP aging, session expiry |
| struct hrtimer | ~1 ns (TSC/HPET/APIC) | Red-black tree per clock base | nanosleep, itimer, POSIX interval timers |
| NO_HZ_FULL | N/A — suppresses ticks | Kernel config option | DPDK isolated cores — near-zero timer interrupts |
Timer Wheel — O(1) Insert, Lazy Cascade
hrtimer — Nanosecond Red-Black Tree
Minimal C Demo — Timer Wheel Simulation
14. §3.7 — Kernel Data Structures
list_head — Circular Doubly-Linked List
The kernel embeds struct list_head directly inside the data struct — there is no separate node allocation. The container_of() macro recovers the outer struct from any list_head * pointer using compile-time offset math. The list is circular: the sentinel head node links back to itself when empty, so every traversal is uniform — no NULL checks needed.
| Macro / Function | Purpose |
|---|---|
| list_add(new, head) | Insert after head (stack-like — LIFO) |
| list_add_tail(new, head) | Insert before head (queue-like — FIFO) |
| list_del(entry) | Unlink entry — O(1); does not free |
| list_for_each_entry(pos, head, member) | Iterate — NOT safe if deleting during iteration |
| list_for_each_entry_safe(pos, n, head, member) | Iterate — safe to delete the current entry (saves the next pointer first) |
| container_of(ptr, type, member) | Recover the containing struct pointer from an embedded member pointer |
hlist_head — Single-Pointer Head for Hash Tables
hlist_head uses only one pointer (first) instead of two, halving memory for hash buckets. Deletion is still O(1) thanks to the pprev back-pointer trick — each node stores a pointer to the slot (the head's first or the previous node's next) that points at it, so unlinking never requires traversing the list.
Red-Black Tree — rb_root + rb_node
The kernel's generic struct rb_node is embedded in the data struct (same pattern as list_head). Color is stored in the lowest bit of the parent pointer — no extra memory. rb_insert_color() and rb_erase() perform rotations to maintain invariants.
Radix Tree / XArray & Bitmap
| Structure | Key insight | Kernel use |
|---|---|---|
| Radix tree / XArray | Trie indexed by integer — O(log₆₄ n) ≈ O(1) for 64-bit keys. XArray is the modern API (xa_store, xa_load) with better locking semantics. | Page cache (page index → struct page), IDR (integer ID allocator) |
| Bitmap (unsigned long[]) | DECLARE_BITMAP(name, bits). set_bit/clear_bit are atomic. find_first_bit/find_next_bit scan 64 bits per iteration. | CPU masks, IRQ allocation, page frame tracking, network packet flags |
Minimal C Demo — list_head + container_of
15. Kernel Source Pointers — §3.1–3.7
| File / Function | What it does |
|---|---|
include/linux/sched.h :: struct task_struct | The process descriptor — ~750 fields, ~6 KB. Central hub linking mm, files, signals, scheduling, and thread state. |
kernel/fork.c :: copy_process() | Core of fork(): dup_task_struct, copy_mm (COW), copy_files, copy_signal, alloc_pid. Returns new task_struct. |
kernel/fork.c :: wake_up_new_task() | Enqueues the freshly forked child on a run queue and triggers load balancing. |
fs/exec.c :: do_execve() | Loads ELF binary: open_exec, bprm_mm_init (new mm), load_elf_binary, start_thread sets %rip/%rsp. |
kernel/exit.c :: do_exit() | Frees mm, files, fs, IPC resources. Notifies parent with SIGCHLD. Task becomes ZOMBIE. |
kernel/sched/core.c :: schedule() | Main scheduler entry point: deactivates current, calls pick_next_task(), calls context_switch(). |
kernel/sched/fair.c :: pick_next_task_fair() | CFS picker: returns rb_leftmost from cfs_rq. Updates vruntime, sets new task's exec_start. |
kernel/sched/fair.c :: update_curr() | Called every tick: computes delta_exec, updates se.vruntime, calls update_min_vruntime. |
arch/x86/kernel/process_64.c :: __switch_to() | x86 context switch: saves/restores FS/GS base, FPU state, kernel stack pointer (via TSS.sp0). |
arch/x86/entry/entry_64.S :: switch_to_asm | Assembly that swaps %rsp between prev and next kernel stacks — the actual CPU register swap. |
kernel/sched/topology.c :: build_sched_domains() | Builds the sched_domain hierarchy from CPU topology (cpumasks, NUMA distances). |
kernel/sched/core.c :: preempt_schedule_irq() | Kernel preemption entry: checks preempt_count==0 before calling schedule(); triggered on IRQ return to kernel. |
arch/x86/kernel/fpu/core.c :: fpu__save() | Lazy FPU save: called only on context switch if TIF_NEED_FPU_LOAD is set — avoids ~512-byte XSAVE on every switch. |
kernel/softirq.c :: __do_softirq() | Softirq dispatcher: loops through softirq_pending bitmap, calls registered handler (e.g. net_rx_action). Runs up to 10 rounds before waking ksoftirqd. |
net/core/dev.c :: net_rx_action() | NET_RX_SOFTIRQ handler: iterates napi_list, calls poll(), enforces budget across all NAPI instances. |
net/core/dev.c :: napi_schedule() / napi_complete_done() | napi_schedule: adds napi to softirq list, disables NIC IRQ. napi_complete_done: re-enables IRQ when ring drained. |
kernel/time/timer.c :: add_timer() / mod_timer() | Timer wheel insertion: computes TV1–TV5 slot, links into per-CPU wheel. mod_timer reschedules atomically. |
kernel/time/hrtimer.c :: hrtimer_start() / hrtimer_interrupt() | hrtimer_start: inserts into per-CPU rb-tree; programs hardware one-shot if leftmost changes. hrtimer_interrupt: fires expired timers. |
include/linux/list.h :: list_add / list_del / list_for_each_entry_safe | Core doubly-linked list macros. list_for_each_entry_safe saves ->next before iteration, safe for deletion in-loop. |
include/linux/rbtree.h :: rb_insert_color / rb_erase | Generic rb-tree rotation/recolor after insert or erase. Caller provides the sort key via embedded rb_node. |
lib/xarray.c :: xa_store / xa_load | XArray (modern radix tree): lock-free reads, fine-grained locking on writes. Replaces radix_tree_insert in page cache. |
16. Interview Prep — §3.1–3.7
| # | Question | Concise Answer |
|---|---|---|
| Q1 | What is struct task_struct and what are its most important fields? | task_struct is the process descriptor — the kernel's representation of every thread. Key fields: state (scheduling state), pid/tgid (thread vs process ID), mm (address space), files (open FDs), signal/sighand (signals), se.vruntime (CFS scheduling position), thread_info.flags (TIF_NEED_RESCHED). It is allocated from a slab cache and stays alive until the parent reaps the zombie via wait4(). |
| Q2 | What is the difference between pid and tgid? How do fork() and clone() differ? | pid is the unique kernel thread ID; tgid is the thread group leader's pid — all threads in a process share the same tgid. getpid() returns tgid (POSIX PID), gettid() returns pid. fork() creates a new process by calling clone() without sharing flags — new mm, files, signal_struct. pthread_create() calls clone() with CLONE_VM | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD — sharing all key structures. Only the kernel stack and pid differ. |
| Q3 | Explain CFS virtual runtime and the red-black tree scheduling. | CFS computes vruntime += delta_exec × (NICE_0_LOAD / task_weight) every tick. Tasks with higher priority (lower nice) have higher weight — their vruntime accumulates slower — so they stay at the rb-tree leftmost position longer and get more CPU time. pick_next_task_fair() always returns rb_leftmost (cached, O(1)). When a task is preempted, it is re-inserted at its new vruntime. On wakeup, the task's vruntime is clamped to min_vruntime - latency_target to prevent it from monopolizing after a sleep. |
| Q4 | What is Copy-on-Write and when does it trigger after fork()? | After fork(), parent and child share physical pages. copy_process() marks all writable PTEs as read-only and increments each page's _refcount. The first write to a shared page raises a protection fault. do_cow_fault() allocates a new page, copies the 4 KB content, updates the faulter's PTE to the new page with write permission, and decrements the original page's refcount. If refcount drops to 1, the remaining task's PTE is re-armed writable — future accesses need no fault. This means fork() is O(size_of_page_table), not O(address_space_size). |
| Q5 | What is the difference between SCHED_FIFO, SCHED_RR, and SCHED_NORMAL? | SCHED_NORMAL (SCHED_OTHER): default CFS policy. Weighted fair sharing via vruntime. No starvation. SCHED_FIFO: real-time; task runs until it blocks or yields. No timeslice. Higher-priority RT task preempts immediately. Lower-priority tasks can starve. SCHED_RR: same as SCHED_FIFO but with a configurable timeslice — after expiry the task moves to the tail of its priority queue. RT tasks (FIFO/RR) always preempt CFS tasks because RT schedulers are checked first in pick_next_task(). |
| Q6 | Walk through switch_to() — what exactly is saved and restored? | context_switch() does two steps. (1) switch_mm(): if next->mm != prev->mm, reload CR3 with next's page-global directory — flushes TLB for user mappings. (2) switch_to() assembly: push callee-saved registers (%rbx, %rbp, %r12–r15) onto prev's kernel stack; save %rsp to prev->thread.sp; load %rsp from next->thread.sp; pop next's callee-saved regs; ret — CPU now executes at next's saved %rip. Caller-saved registers (%rax, %rcx, %rdx, %rsi, %rdi, %r8–r11) are NOT saved here — they were already on the stack when the task voluntarily called schedule(). FPU state is saved lazily only when the next task first uses floating-point. |
| Q7 | What is NAPI and why does it improve network throughput? | NAPI (New API) is interrupt coalescing for the kernel network receive path. Without NAPI, each received packet fires a hard IRQ, resulting in millions of interrupts per second at high packet rates — the interrupt overhead can exceed useful work. With NAPI: the first packet fires an IRQ; the driver calls napi_schedule(), which disables further NIC interrupts and adds a napi_struct to the softirq work list; NET_RX_SOFTIRQ calls the driver's poll() function which drains up to napi->weight (default 64) packets in a tight loop without interrupts; when the ring is empty (done < budget), napi_complete_done() re-enables NIC interrupts. This amortizes interrupt overhead across many packets and enables the driver to stay in a hot loop that exploits CPU caches efficiently. |
| Q8 | Explain the timer wheel — what is its time complexity for insert vs. expire? | The timer wheel (struct timer_list) uses a hierarchical hash wheel: TV1 has 256 slots (each = 1 jiffie), TV2 has 64 slots (each = 256 jiffies), and so on up to TV5. add_timer() computes the slot as (expires & 0xff) into TV1 — O(1) insertion. Every jiffie, the kernel advances the TV1 index and fires all timers in the current slot — O(1) scan (firing is O(k) where k is the number of timers in the slot). When TV1 wraps (every 256 jiffies), the corresponding TV2 slot is 'cascaded' into TV1 — O(cascade) which is O(n_timers_in_that_tv2_slot). Over time, the amortized cost remains near O(1) per timer because cascade is rare. Compare: a sorted linked list is O(log n) insert, O(n) scan; a min-heap is O(log n) insert, O(log n) pop. |
| Q9 | What is NO_HZ_FULL and why does DPDK need it? | NO_HZ_FULL (tickless full) is a kernel config that suppresses the periodic scheduler tick (normally HZ times/second) on CPUs that have exactly one runnable task. Normally, even a busy polling thread receives timer interrupts that cause context switches into the kernel. These interrupt the tight polling loop, flush TLB entries and pipeline state, and add latency jitter of ~10–100 µs. With NO_HZ_FULL on a DPDK-isolated CPU (combined with isolcpus= and rcu_nocbs=), the CPU runs the poll loop without any timer interrupts — zero latency jitter. The CPU is still woken for critical events (RCU, scheduler clock), but at microsecond intervals, not every 1–4 ms. |
| Q10 | How does list_for_each_entry_safe differ from list_for_each_entry, and why does the kernel use hlist_head for hash tables instead of list_head? | list_for_each_entry(pos, head, member) expands to a for-loop that reads pos->member.next at each iteration. If the loop body deletes pos, its next pointer is freed/poisoned — the next iteration reads garbage. list_for_each_entry_safe(pos, n, head, member) saves pos->member.next into 'n' before the loop body runs, so deletion of pos is safe. Regarding hlist vs. list: list_head has both next and prev pointers — 16 bytes per head node. For a hash table with 65536 buckets, that is 1 MB of head nodes. hlist_head has only a single 'first' pointer — 8 bytes per head, saving 512 KB. Deletion is still O(1) via the pprev trick: hlist_node stores a pointer-to-pointer (pprev) pointing to the slot that holds this node's address, allowing unlink without traversal. The kernel uses hlist for the PID hash, connection tracking tables, routing cache, and any other large hash table where the number of buckets dominates memory. |