1. Overview
User programs cannot directly access hardware — they request kernel services via system calls. On x86-64 a single syscall instruction switches to ring 0, loads the kernel entry point from MSR_LSTAR, and dispatches through sys_call_table[rax]. For latency-critical reads like gettimeofday, the kernel maps a vDSO page into every process — the function runs in user space with direct TSC reads, avoiding any ring transition.
2. Key Data Structure — sys_call_table
The kernel maintains a static array of function pointers indexed by syscall number. On x86-64, rax carries the number; rdi, rsi, rdx, r10, r8, r9 carry up to six arguments. The SYSCALL_DEFINEN macro wraps each handler, adding auditing, seccomp, and ptrace hooks automatically.
| Register | Role | Example (read) |
|---|---|---|
| rax | Syscall number (in); return value (out) | __NR_read = 0 |
| rdi | Arg 1 | fd |
| rsi | Arg 2 | buf pointer |
| rdx | Arg 3 | count |
| r10 | Arg 4 (replaces rcx — clobbered by syscall) | — |
| r8, r9 | Args 5–6 | — |
vDSO — Zero-Cost Clock Reads
Why vDSO works: The kernel updates a shared memory page every tick with the current TSC value and scale factor. User code reads it directly — no syscall instruction, no ring switch, no TLB flush. Latency drops from ~100 ns (syscall) to ~5 ns (plain memory read + TSC calculation).
3. Minimal C Demo — Syscall Dispatch Table
4. Core Mechanism — epoll
Background: A server handles 100k concurrent connections. select() scans all FDs on every call — O(n). As connections grow, CPU time on polling grows linearly. epoll moves the interest set into the kernel; only ready FDs are returned — O(1) per event regardless of total count.
Plan:
- epoll_create1() — allocate struct eventpoll with an empty red-black tree (interest set) and an empty ready list (rdllist)
- epoll_ctl(ADD) — allocate an epitem, insert it into the rbtree keyed by fd, register a wait-queue callback on the fd's file
- When the fd becomes readable, its wait-queue callback fires → the epitem is linked onto rdllist (it remains in the rbtree)
- epoll_wait() — if rdllist is empty: sleep; else copy ready epitems to the caller's events[]
Walkthrough — LT vs ET with 3 KB data on a 4 KB socket buffer
| Step | Level-Triggered | Edge-Triggered |
|---|---|---|
| 1. Data arrives | epoll_wait returns fd | epoll_wait returns fd |
| 2. App reads 1 KB | returns, 2 KB left | returns, 2 KB left |
| 3. Next epoll_wait | returns fd again (still readable) | DOES NOT return (no new edge) |
| 4. Result | App eventually reads all data | App misses 2 KB until more data arrives |
| Fix for ET | — | Must loop read() until EAGAIN after every event |
5. Core Mechanism — io_uring
Background: Even with epoll, every I/O operation costs a syscall (read(), write(), accept()…). At 10 million I/Os/s that's ~10M ring transitions per second. io_uring moves all I/O submission and completion through a shared-memory ring — userspace and kernel communicate without any syscall in the hot path.
SQPOLL — Zero-Syscall Mode
How it works: With IORING_SETUP_SQPOLL, the kernel spawns a dedicated kthread that busy-polls the SQ tail. The app writes SQEs and advances sq_tail — the kernel thread picks them up immediately. No io_uring_enter() needed. CQEs appear in the CQ ring; the app polls cq_head != cq_tail. Result: a pure userspace event loop with kernel I/O — zero syscalls per I/O operation.
Walkthrough — Submit READ + reap CQE
| Step | Actor | State |
|---|---|---|
| 1 | App | Fills SQE: op=READ, fd=3, buf=0x1000, len=4096, user_data=42 |
| 2 | App | Atomically increments sq_tail (ring producer index) |
| 3 | Kernel / sq_thread | Sees sq_tail advanced; dequeues SQE; submits async read to block layer |
| 4 | Block layer | DMA completes; io_uring completion callback fires |
| 5 | Kernel | Writes CQE: user_data=42, res=4096 (bytes read); advances cq_tail |
| 6 | App | Observes cq_head != cq_tail; reads CQE; advances cq_head |
6. Network Syscalls — send() Path & sendfile()
Why sendfile() is "zero-copy": The file data moves from the page cache directly to the NIC's DMA buffer — the CPU never touches the payload. Compare to read()+write(): two CPU copies (page cache → user buffer → socket buffer) plus two syscalls instead of one. HTTP static file servers (nginx, etc.) use sendfile() for static assets.
| API | Mechanism | CPU Copies | Use Case |
|---|---|---|---|
| read()+write() | user buffer as intermediary | 2 | General purpose |
| sendfile() | page cache → NIC DMA | 0 | Static file serving |
| splice() | pipe buffer as kernel intermediary | 0 | Pipe-to-socket transfer |
| mmap()+write() | user maps page cache; write from it | 1 | Random-access file I/O |
7. Signals — Asynchronous Process Notification
Signals are the UNIX mechanism for asynchronous notification. The kernel represents pending signals as a bitmap (sigset_t) in task_struct.pending. On every return-to-user the kernel checks TIF_SIGPENDING; if set, it calls do_signal() to set up the handler frame.
| Signal | Default Action | Catchable? | Common Use |
|---|---|---|---|
| SIGTERM (15) | Terminate | Yes | Graceful shutdown request |
| SIGKILL (9) | Terminate immediately | No | Force kill — cannot be caught or ignored |
| SIGSEGV (11) | Core dump | Yes (limited) | Invalid memory access |
| SIGINT (2) | Terminate | Yes | Ctrl+C from terminal |
| SIGCHLD (17) | Ignore | Yes | Child process state change |
| SIGUSR1/2 (10/12) | Terminate | Yes | User-defined application events |
| SIGALRM (14) | Terminate | Yes | alarm() timer expiry |
8. Kernel Source Pointers
| File | Symbol | What it does |
|---|---|---|
| arch/x86/entry/entry_64.S | entry_SYSCALL_64 | Assembly entry: swapgs, save pt_regs, call do_syscall_64 |
| arch/x86/entry/common.c | do_syscall_64() | Dispatches via sys_call_table[nr] |
| arch/x86/entry/syscalls/syscall_64.tbl | — | Syscall number → name mapping |
| include/linux/syscalls.h | SYSCALL_DEFINE* | Macros that generate handler prototypes + tracing hooks |
| fs/eventpoll.c | do_epoll_wait(), ep_poll() | epoll wait loop; rdllist harvest |
| fs/eventpoll.c | ep_insert() | epoll_ctl ADD: allocate epitem, insert into rbtree |
| io_uring/io_uring.c | io_uring_setup(), io_submit_sqes() | Ring setup and SQE dispatch |
| io_uring/sqpoll.c | io_sq_thread() | SQPOLL kernel thread — busy-polls the SQ |
| net/socket.c | sock_sendmsg() | Top-level send path entry |
| fs/read_write.c | do_sendfile() | Zero-copy file-to-socket transfer |
| kernel/signal.c | get_signal(), send_signal() | Signal delivery and pending management |
| arch/x86/kernel/signal.c | setup_rt_frame() | Builds signal frame on user stack |
9. Interview Prep
| # | Question | Key Answer |
|---|---|---|
| 1 | Walk through what happens when a process calls read() on a blocking socket. | syscall → entry_SYSCALL_64 → sys_call_table[0] → sys_read → sock_recvmsg → if no data: task_struct.state = TASK_INTERRUPTIBLE, add to socket wait queue, schedule(). When data arrives: NIC IRQ → NAPI → skb → socket recv queue → wake_up() → task re-queued → returns from schedule() → copy_to_user → return count. |
| 2 | How does epoll differ from select/poll internally? Why is it O(1)? | select/poll scan the entire FD set on every call — O(n). epoll registers interest once; when a fd becomes ready, the kernel appends its epitem to rdllist via a wait-queue callback. epoll_wait just drains rdllist — O(ready events), not O(all FDs). |
| 3 | Level-triggered vs edge-triggered epoll? | LT: epoll_wait returns as long as fd is readable. ET: returns only once on the not-readable→readable transition. ET requires O_NONBLOCK and read-until-EAGAIN loop; misses events if you don't drain the fd completely. |
| 4 | Explain io_uring: SQ, CQ, and zero-syscall polling. | SQ and CQ are shared memory rings (mmap'd). User writes SQEs to SQ tail; kernel reads and posts CQEs to CQ tail. With IORING_SETUP_SQPOLL, a kernel sq_thread busypolls SQ — no io_uring_enter() needed. App polls CQ head. Full async I/O with zero syscalls per operation. |
| 5 | What is sendfile() and why is it called zero-copy? | sendfile(out_fd, in_fd, offset, count) transfers data from a file (or page cache) directly to a socket's DMA buffer without passing through user space. No copy_to_user + copy_from_user. The CPU issues DMA descriptors; data moves memory→NIC without CPU involvement. |
| 6 | What is vDSO and how does it avoid a kernel ring transition? | The kernel maps a small executable page into every process. gettimeofday() in libc resolves to a function inside this vDSO page. It reads TSC directly and uses a kernel-maintained seqlock'd time structure in the same page. No syscall instruction — stays in ring 3. |
| 7 | What is SA_RESTART and when do you need it? | If a signal interrupts a slow syscall (read, accept…), the syscall returns -EINTR. SA_RESTART tells the kernel to automatically restart the syscall after the signal handler returns. Without it, you must check for EINTR and retry manually. Use it for long-lived servers that install signal handlers for SIGTERM/SIGCHLD. |