1. Overview
User programs cannot directly access hardware — they request kernel services via system calls. On x86-64 a single syscall instruction switches to ring 0, loads the kernel entry point from MSR_LSTAR, and dispatches through sys_call_table[rax]. For latency-critical reads like gettimeofday, the kernel maps a vDSO page into every process — the function runs in user space with direct TSC reads, avoiding any ring transition.
2. Key Data Structure — sys_call_table
The kernel maintains a static array of function pointers indexed by syscall number. On x86-64, rax carries the number; rdi, rsi, rdx, r10, r8, r9 carry up to six arguments. The SYSCALL_DEFINEN macro wraps each handler, adding auditing, seccomp, and ptrace hooks automatically.
| Register | Role | Example (read) |
|---|---|---|
| rax | Syscall number (in); return value (out) | __NR_read = 0 |
| rdi | Arg 1 | fd |
| rsi | Arg 2 | buf pointer |
| rdx | Arg 3 | count |
| r10 | Arg 4 (replaces rcx — clobbered by syscall) | — |
| r8, r9 | Args 5–6 | — |
vDSO — Zero-Cost Clock Reads
Why vDSO works: The kernel updates a shared memory page every tick with the current TSC value and scale factor. User code reads it directly — no syscall instruction, no ring switch, no TLB flush. Latency drops from ~100 ns (syscall) to ~5 ns (plain memory read + TSC calculation).
3. Minimal C Demo — Syscall Dispatch Table
4. Core Mechanism — epoll
Background: A server handles 100k concurrent connections. select() scans all FDs on every call — O(n). As connections grow, CPU time on polling grows linearly. epoll moves the interest set into the kernel; only ready FDs are returned — O(1) per event regardless of total count.
Plan:
- epoll_create1() — allocate struct eventpoll with an empty red-black tree (interest set) and an empty ready list (rdllist)
- epoll_ctl(ADD) — allocate an epitem, insert it into the rbtree keyed by fd, register a wait-queue callback on the fd's file
- When the fd becomes readable, its wait-queue callback fires → the epitem is linked onto rdllist (it remains in the rbtree)
- epoll_wait() — if rdllist is empty: sleep; else copy ready epitems to the caller's events[]
Walkthrough — LT vs ET with 3 KB data on a 4 KB socket buffer
| Step | Level-Triggered | Edge-Triggered |
|---|---|---|
| 1. Data arrives | epoll_wait returns fd | epoll_wait returns fd |
| 2. App reads 1 KB | returns, 2 KB left | returns, 2 KB left |
| 3. Next epoll_wait | returns fd again (still readable) | DOES NOT return (no new edge) |
| 4. Result | App eventually reads all data | App misses 2 KB until more data arrives |
| Fix for ET | — | Must loop read() until EAGAIN after every event |
5. Core Mechanism — io_uring
Background: Even with epoll, every I/O operation costs a syscall (read(), write(), accept()…). At 10 million I/Os/s that's ~10M ring transitions per second. io_uring moves all I/O submission and completion through a shared-memory ring — userspace and kernel communicate without any syscall in the hot path.
SQPOLL — Zero-Syscall Mode
How it works: With IORING_SETUP_SQPOLL, the kernel spawns a dedicated kthread that busy-polls the SQ tail. The app writes SQEs and advances sq_tail — the kernel thread picks them up immediately. No io_uring_enter() needed. CQEs appear in the CQ ring; the app polls cq_head != cq_tail. Result: a pure userspace event loop with kernel I/O — zero syscalls per I/O operation.
Walkthrough — Submit READ + reap CQE
| Step | Actor | State |
|---|---|---|
| 1 | App | Fills SQE: op=READ, fd=3, buf=0x1000, len=4096, user_data=42 |
| 2 | App | Atomically increments sq_tail (ring producer index) |
| 3 | Kernel / sq_thread | Sees sq_tail advanced; dequeues SQE; submits async read to block layer |
| 4 | Block layer | DMA completes; io_uring completion callback fires |
| 5 | Kernel | Writes CQE: user_data=42, res=4096 (bytes read); advances cq_tail |
| 6 | App | Observes cq_head != cq_tail; reads CQE; advances cq_head |
6. Network Syscalls — send() Path & sendfile()
Why sendfile() is "zero-copy": The file data moves from the page cache directly to the NIC's DMA buffer — the CPU never touches the payload. Compare to read()+write(): two CPU copies (page cache → user buffer → socket buffer) plus two syscalls instead of one. HTTP static file servers (nginx, etc.) use sendfile() for static assets.
| API | Mechanism | CPU Copies | Use Case |
|---|---|---|---|
| read()+write() | user buffer as intermediary | 2 | General purpose |
| sendfile() | page cache → NIC DMA | 0 | Static file serving |
| splice() | pipe buffer as kernel intermediary | 0 | Pipe-to-socket transfer |
| mmap()+write() | user maps page cache; write from it | 1 | Random-access file I/O |
7. Signals — Asynchronous Process Notification
Signals are the UNIX mechanism for asynchronous notification. The kernel represents pending signals as a bitmap (sigset_t) in task_struct.pending. On every return-to-user the kernel checks TIF_SIGPENDING; if set, it calls do_signal() to set up the handler frame.
| Signal | Default Action | Catchable? | Common Use |
|---|---|---|---|
| SIGTERM (15) | Terminate | Yes | Graceful shutdown request |
| SIGKILL (9) | Terminate immediately | No | Force kill — cannot be caught or ignored |
| SIGSEGV (11) | Core dump | Yes (limited) | Invalid memory access |
| SIGINT (2) | Terminate | Yes | Ctrl+C from terminal |
| SIGCHLD (17) | Ignore | Yes | Child process state change |
| SIGUSR1/2 (10/12) | Terminate | Yes | User-defined application events |
| SIGALRM (14) | Terminate | Yes | alarm() timer expiry |
8. Kernel Source Pointers
| File | Symbol | What it does |
|---|---|---|
| arch/x86/entry/entry_64.S | entry_SYSCALL_64 | Assembly entry: swapgs, save pt_regs, call do_syscall_64 |
| arch/x86/entry/common.c | do_syscall_64() | Dispatches via sys_call_table[nr] |
| arch/x86/entry/syscalls/syscall_64.tbl | — | Syscall number → name mapping |
| include/linux/syscalls.h | SYSCALL_DEFINE* | Macros that generate handler prototypes + tracing hooks |
| fs/eventpoll.c | do_epoll_wait(), ep_poll() | epoll wait loop; rdllist harvest |
| fs/eventpoll.c | ep_insert() | epoll_ctl ADD: allocate epitem, insert into rbtree |
| io_uring/io_uring.c | io_uring_setup(), io_submit_sqes() | Ring setup and SQE dispatch |
| io_uring/sqpoll.c | io_sq_thread() | SQPOLL kernel thread — busy-polls the SQ |
| net/socket.c | sock_sendmsg() | Top-level send path entry |
| fs/read_write.c | do_sendfile() | Zero-copy file-to-socket transfer |
| kernel/signal.c | get_signal(), send_signal() | Signal delivery and pending management |
| arch/x86/kernel/signal.c | setup_rt_frame() | Builds signal frame on user stack |
9. Interview Prep
| # | Question | Key Answer |
|---|---|---|
| 1 | Walk through what happens when a process calls read() on a blocking socket. | syscall → entry_SYSCALL_64 → sys_call_table[0] → sys_read → sock_recvmsg → if no data: task_struct.state = TASK_INTERRUPTIBLE, add to socket wait queue, schedule(). When data arrives: NIC IRQ → NAPI → skb → socket recv queue → wake_up() → task re-queued → returns from schedule() → copy_to_user → return count. |
| 2 | How does epoll differ from select/poll internally? Why is it O(1)? | select/poll scan the entire FD set on every call — O(n). epoll registers interest once; when a fd becomes ready, the kernel appends its epitem to rdllist via a wait-queue callback. epoll_wait just drains rdllist — O(ready events), not O(all FDs). |
| 3 | Level-triggered vs edge-triggered epoll? | LT: epoll_wait returns as long as fd is readable. ET: returns only once on the not-readable→readable transition. ET requires O_NONBLOCK and read-until-EAGAIN loop; misses events if you don't drain the fd completely. |
| 4 | Explain io_uring: SQ, CQ, and zero-syscall polling. | SQ and CQ are shared memory rings (mmap'd). User writes SQEs to SQ tail; kernel reads and posts CQEs to CQ tail. With IORING_SETUP_SQPOLL, a kernel sq_thread busypolls SQ — no io_uring_enter() needed. App polls CQ head. Full async I/O with zero syscalls per operation. |
| 5 | What is sendfile() and why is it called zero-copy? | sendfile(out_fd, in_fd, offset, count) transfers data from a file (or page cache) directly to a socket's DMA buffer without passing through user space. No copy_to_user + copy_from_user. The CPU issues DMA descriptors; data moves memory→NIC without CPU involvement. |
| 6 | What is vDSO and how does it avoid a kernel ring transition? | The kernel maps a small executable page into every process. gettimeofday() in libc resolves to a function inside this vDSO page. It reads TSC directly and uses a kernel-maintained seqlock'd time structure in the same page. No syscall instruction — stays in ring 3. |
| 7 | What is SA_RESTART and when do you need it? | If a signal interrupts a slow syscall (read, accept…), the syscall returns -EINTR. SA_RESTART tells the kernel to automatically restart the syscall after the signal handler returns. Without it, you must check for EINTR and retry manually. Use it for long-lived servers that install signal handlers for SIGTERM/SIGCHLD. |