Part IV — System Calls

§ 4.1 – 4.5 System Calls, epoll, io_uring & Signals

How user space crosses into the kernel via the syscall instruction, how epoll scales to millions of connections, how io_uring drives I/O through shared rings with little to no syscall overhead, and how Linux delivers signals safely across the user/kernel boundary.

1. Overview

User programs cannot directly access hardware — they request kernel services via system calls. On x86-64 a single syscall instruction switches to ring 0 and jumps to the kernel entry point stored in MSR_LSTAR; the entry code then dispatches through sys_call_table[] using the number in rax. For latency-critical reads like gettimeofday(), the kernel maps a vDSO page into every process — the function runs in user space with direct TSC reads, avoiding any ring transition.
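
A minimal sketch of the user-space side of this convention: it issues write(2) through a raw syscall instruction with the number in rax and arguments in rdi/rsi/rdx. It assumes x86-64 Linux with GCC/Clang inline assembly, and the raw_syscall3() helper is invented for illustration — real code would simply call write().

```c
/* raw_write.c — issue write(2) with a raw `syscall` instruction,
 * placing arguments in the registers described above.
 * x86-64 Linux + GCC/Clang only. */
#include <unistd.h>   /* STDOUT_FILENO */

static long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    /* rax = syscall number, rdi/rsi/rdx = args 1-3;
     * the CPU clobbers rcx (return RIP) and r11 (saved RFLAGS). */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;   /* a value in [-4095, -1] encodes -errno */
}

int main(void)
{
    const char msg[] = "hello from a raw syscall\n";
    raw_syscall3(1 /* __NR_write */, STDOUT_FILENO, (long)msg, sizeof(msg) - 1);
    return 0;
}
```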

2. Key Data Structure — sys_call_table

The kernel maintains a static array of function pointers indexed by syscall number. On x86-64, rax carries the number; rdi, rsi, rdx, r10, r8, r9 carry up to six arguments. Each handler is declared with a SYSCALL_DEFINEn macro (n = argument count), which generates the sys_* stub and its tracing metadata; the auditing, seccomp, and ptrace hooks run in the common entry path before dispatch.

Register | Role | Example (read)
rax | Syscall number (in); return value (out) | __NR_read = 0
rdi | Arg 1 | fd
rsi | Arg 2 | buf pointer
rdx | Arg 3 | count
r10 | Arg 4 (replaces rcx, which syscall clobbers) |
r8, r9 | Args 5–6 |

vDSO — Zero-Cost Clock Reads

Why vDSO works: The kernel keeps a shared, read-only data page up to date with a recent time snapshot plus the TSC multiplier and shift. User code reads that page, executes rdtsc itself, and extrapolates — no syscall instruction, no ring switch, no kernel entry at all. Latency drops from ~100 ns (syscall) to ~5 ns (plain memory read + TSC calculation).
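
A small sketch of the effect, assuming glibc routes clock_gettime() through the vDSO on the machine at hand: the same call is timed once through the normal (vDSO) path and once forced through the real syscall via syscall(SYS_clock_gettime, ...). Absolute numbers vary by machine; only the gap matters.

```c
/* vdso_clock.c — compare the vDSO fast path with a forced syscall. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ITERS 1000000

static long long ns_between(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000000LL + (b->tv_nsec - a->tv_nsec);
}

int main(void)
{
    struct timespec t0, t1, ts;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);              /* vDSO: no ring switch */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("vDSO path   : %lld ns/call\n", ns_between(&t0, &t1) / ITERS);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts); /* forced real syscall */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("syscall path: %lld ns/call\n", ns_between(&t0, &t1) / ITERS);
    return 0;
}
```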

3. Minimal C Demo — Syscall Dispatch Table

Syscall Dispatch — sys_call_table simulation — C Demo
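
A toy user-space model of the dispatch step, not kernel code: a static array of handler pointers indexed by syscall number, with -ENOSYS for unknown numbers. The my_sys_* handlers and do_syscall() wrapper are invented stand-ins; only slots 0 and 1 mirror the real __NR_read/__NR_write.

```c
/* dispatch.c — toy model of sys_call_table dispatch. */
#include <stdio.h>
#include <errno.h>

typedef long (*sys_handler_t)(long a1, long a2, long a3);

static long my_sys_read(long fd, long buf, long count)
{
    printf("read(fd=%ld, count=%ld)\n", fd, count);
    return count;
}

static long my_sys_write(long fd, long buf, long count)
{
    printf("write(fd=%ld, count=%ld)\n", fd, count);
    return count;
}

static sys_handler_t table[] = { my_sys_read, my_sys_write }; /* [0]=read, [1]=write */
#define NR_MAX (sizeof(table) / sizeof(table[0]))

/* Stand-in for do_syscall_64(): validate the number, then dispatch. */
static long do_syscall(long nr, long a1, long a2, long a3)
{
    if (nr < 0 || (unsigned long)nr >= NR_MAX)
        return -ENOSYS;                     /* unknown syscall number */
    return table[nr](a1, a2, a3);
}

int main(void)
{
    do_syscall(0, 3, 0x1000, 4096);         /* "read"  */
    do_syscall(1, 1, 0x2000, 13);           /* "write" */
    printf("bad nr -> %ld (-ENOSYS)\n", do_syscall(99, 0, 0, 0));
    return 0;
}
```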

4. Core Mechanism — epoll

Background: A server handles 100k concurrent connections. select() scans all FDs on every call — O(n). As connections grow, CPU time on polling grows linearly. epoll moves the interest set into the kernel; only ready FDs are returned — O(1) per event regardless of total count.
Plan:
  1. epoll_create1() — allocate struct eventpoll with an empty red-black tree (interest set) and an empty ready list
  2. epoll_ctl(ADD) — allocate epitem, insert into rbtree keyed by fd, register a wait-queue callback on the fd's file
  3. When fd becomes readable, its wait-queue callback fires → the epitem is linked onto rdllist (it remains in the rbtree)
  4. epoll_wait() — if rdllist empty: sleep; else copy ready epitems to caller's events[]

Walkthrough — LT vs ET with 3 KB data on a 4 KB socket buffer

Step | Level-Triggered | Edge-Triggered
1. Data arrives | epoll_wait returns fd | epoll_wait returns fd
2. App reads 1 KB | returns, 2 KB left | returns, 2 KB left
3. Next epoll_wait | returns fd again (still readable) | does NOT return (no new edge)
4. Result | app eventually reads all data | app misses 2 KB until more data arrives
Fix for ET | not needed | must loop read() until EAGAIN after every event
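
A sketch of that fix, assuming the fd is O_NONBLOCK and was registered with EPOLLIN | EPOLLET; the drain_fd() helper name is invented, and real code would actually process the bytes it reads.

```c
/* Edge-triggered drain loop: after epoll_wait() reports the fd, keep
 * reading until read() returns -1/EAGAIN; otherwise data left in the
 * socket buffer produces no further events. */
#include <errno.h>
#include <unistd.h>

static void drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                         /* process n bytes, keep reading */
        if (n == 0)
            break;                            /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                            /* buffer fully drained */
        break;                                /* real error: handle/close fd */
    }
}
```
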
epoll Interest Set + Ready List simulation — C Demo
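
A toy user-space model of the interest set and ready list, not the kernel implementation: plain arrays stand in for the red-black tree and rdllist, and fd_becomes_ready() plays the role of the wait-queue callback ep_poll_callback(). All function names here are invented for the sketch.

```c
/* ep_sim.c — toy model of struct eventpoll: interest set + ready list. */
#include <stdio.h>

#define MAX_FDS 16

static int interest[MAX_FDS];          /* 1 = fd registered (rbtree stand-in) */
static int ready[MAX_FDS], nready;     /* rdllist stand-in */

static void ep_ctl_add(int fd)         { interest[fd] = 1; }

/* Stand-in for the wait-queue callback: when the fd becomes readable,
 * link it onto the ready list. */
static void fd_becomes_ready(int fd)
{
    if (interest[fd])
        ready[nready++] = fd;
}

/* Stand-in for epoll_wait(): drain the ready list into the caller's array. */
static int ep_wait(int *events, int max)
{
    int n = 0;
    while (nready > 0 && n < max)
        events[n++] = ready[--nready];
    return n;
}

int main(void)
{
    int events[MAX_FDS];

    ep_ctl_add(4);
    ep_ctl_add(7);
    fd_becomes_ready(7);                /* e.g. data arrived on fd 7 */
    fd_becomes_ready(9);                /* not registered: ignored */

    int n = ep_wait(events, MAX_FDS);
    for (int i = 0; i < n; i++)
        printf("ready fd: %d\n", events[i]);   /* prints fd 7 only */
    return 0;
}
```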

5. Core Mechanism — io_uring

Background: Even with epoll, every I/O operation costs a syscall (read(), write(), accept()…). At 10 million I/Os/s that's ~10M ring transitions per second. io_uring moves all I/O submission and completion through a shared-memory ring — userspace and kernel communicate without any syscall in the hot path.

SQPOLL — Zero-Syscall Mode

How it works: With IORING_SETUP_SQPOLL, the kernel spawns a dedicated kthread that busy-polls the SQ tail. The app writes SQEs and advances sq_tail — the kernel thread picks them up immediately, so no io_uring_enter() is needed while the thread is awake (if it idles out after sq_thread_idle, the app wakes it with a single io_uring_enter(IORING_ENTER_SQ_WAKEUP) call). CQEs appear in the CQ ring; the app polls cq_head != cq_tail. Result: a pure userspace event loop with kernel-driven I/O — zero syscalls per I/O operation in steady state.
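
A minimal sketch using liburing (assumed installed; link with -luring): one read submitted on a ring created with IORING_SETUP_SQPOLL. On kernels older than roughly 5.11, SQPOLL requires elevated privileges and registered files, so treat this as illustrative rather than portable; the file path and user_data value are arbitrary.

```c
/* sqpoll_read.c — one read through an SQPOLL io_uring. */
#include <liburing.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_params params = { 0 };
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;            /* ms before the sq_thread sleeps */

    if (io_uring_queue_init_params(8, &ring, &params) < 0) {
        fprintf(stderr, "io_uring_queue_init_params failed (privileges? kernel?)\n");
        return 1;
    }

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_data(sqe, (void *)(uintptr_t)42); /* user_data, echoed in CQE */
    io_uring_submit(&ring);                  /* updates sq_tail; with SQPOLL this
                                                only wakes the thread if it slept */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("user_data=%llu res=%d\n",
           (unsigned long long)cqe->user_data, cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```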

Walkthrough — Submit READ + reap CQE

Step | Actor | State
1 | App | Fills SQE: op=READ, fd=3, buf=0x1000, len=4096, user_data=42
2 | App | Atomically increments sq_tail (ring producer index)
3 | Kernel / sq_thread | Sees sq_tail advanced; dequeues SQE; submits async read to block layer
4 | Block layer | DMA completes; io_uring completion callback fires
5 | Kernel | Writes CQE: user_data=42, res=4096 (bytes read); advances cq_tail
6 | App | Observes cq_head != cq_tail; reads CQE; advances cq_head
io_uring SQ/CQ Ring Simulation — C Demo
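
A toy single-threaded model of the rings in the walkthrough above: the "app" side produces SQEs and bumps sq_tail, a fake "kernel" side consumes them and posts CQEs, and the app reaps by advancing cq_head. The real rings live in shared memory and use atomics and barriers; the struct layouts here are invented for the sketch.

```c
/* ring_sim.c — toy SQ/CQ rings mirroring the six steps above. */
#include <stdio.h>

#define RING 8                                   /* entries; power of two */

struct sqe { int op, fd; unsigned long long user_data; };
struct cqe { unsigned long long user_data; int res; };

static struct sqe sq[RING]; static unsigned sq_head, sq_tail;
static struct cqe cq[RING]; static unsigned cq_head, cq_tail;

static void app_submit(int op, int fd, unsigned long long ud)
{
    sq[sq_tail % RING] = (struct sqe){ op, fd, ud };
    sq_tail++;                                   /* step 2: producer index */
}

static void kernel_poll(void)                    /* steps 3-5 */
{
    while (sq_head != sq_tail) {
        struct sqe *e = &sq[sq_head++ % RING];
        /* pretend the async read completed with 4096 bytes */
        cq[cq_tail % RING] = (struct cqe){ e->user_data, 4096 };
        cq_tail++;
    }
}

static void app_reap(void)                       /* step 6 */
{
    while (cq_head != cq_tail) {
        struct cqe *c = &cq[cq_head++ % RING];
        printf("CQE: user_data=%llu res=%d\n", c->user_data, c->res);
    }
}

int main(void)
{
    app_submit(/*op=READ*/0, /*fd=*/3, /*user_data=*/42);   /* step 1 */
    kernel_poll();
    app_reap();
    return 0;
}
```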

6. Network Syscalls — send() Path & sendfile()

Why sendfile() is "zero-copy": The file data moves from the page cache directly to the NIC's DMA buffer — the CPU never touches the payload. Compare to read()+write(): two CPU copies (page cache → user buffer → socket buffer) plus an extra pair of user/kernel transitions (two syscalls instead of one). HTTP static file servers (nginx, etc.) use sendfile() for all static assets.
API | Mechanism | Copies (CPU) | Use Case
read()+write() | user buffer as intermediary | 2 | General purpose
sendfile() | page cache → NIC via DMA | 0 | Static file serving
splice() | pipe buffer as kernel intermediary | 0 | Pipe-to-socket transfer
mmap()+write() | user maps page cache; write from it | 1 | Random-access file I/O
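
A sketch of the core send loop such a server might use, assuming client_fd is an already-connected TCP socket. Error handling is trimmed and serve_file() is an invented helper, so this is a building block rather than a complete program.

```c
/* send_file.c — static-file send path: fstat for the size, then
 * sendfile() loops until the whole file has been handed to the socket. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int serve_file(int client_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        /* kernel copies page cache -> socket; the payload never enters
         * user space, and sendfile() advances `offset` for us */
        ssize_t n = sendfile(client_fd, file_fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;
    }
    close(file_fd);
    return offset == st.st_size ? 0 : -1;
}
```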

7. Signals — Asynchronous Process Notification

Signals are the UNIX mechanism for asynchronous notification. The kernel represents pending signals as a bitmap (sigset_t) in task_struct.pending. On every return-to-user the kernel checks TIF_SIGPENDING; if set, it calls do_signal() to set up the handler frame.

Signal | Default Action | Catchable? | Common Use
SIGTERM (15) | Terminate | Yes | Graceful shutdown request
SIGKILL (9) | Terminate immediately | No | Force kill — cannot be caught or ignored
SIGSEGV (11) | Core dump | Yes (limited) | Invalid memory access
SIGINT (2) | Terminate | Yes | Ctrl+C from terminal
SIGCHLD (17) | Ignore | Yes | Child process state change
SIGUSR1/2 (10/12) | Terminate | Yes | User-defined application events
SIGALRM (14) | Terminate | Yes | alarm() timer expiry
Signal Delivery — pending bitmap + do_signal() — C Demo
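
A toy user-space model of that flow, not kernel code: send_signal() sets a bit in a 64-bit pending mask and raises a TIF_SIGPENDING-style flag, and do_signal() later walks the mask and runs the registered handler. The names mirror the kernel's, but the logic is a simplified stand-in (real delivery also builds a frame on the user stack via setup_rt_frame()).

```c
/* sig_sim.c — toy model of signal delivery: pending bitmap + do_signal(). */
#include <stdio.h>

#define SIGINT_NR  2
#define SIGTERM_NR 15

typedef void (*handler_t)(int);

static unsigned long long pending;       /* sigset_t stand-in */
static int tif_sigpending;               /* TIF_SIGPENDING stand-in */
static handler_t handlers[64];

static void send_signal(int sig)
{
    pending |= 1ULL << sig;
    tif_sigpending = 1;
}

static void do_signal(void)
{
    for (int sig = 0; sig < 64; sig++) {
        if (pending & (1ULL << sig)) {
            pending &= ~(1ULL << sig);   /* dequeue the signal */
            if (handlers[sig])
                handlers[sig](sig);      /* "run" the user handler */
            else
                printf("sig %d: default action\n", sig);
        }
    }
    tif_sigpending = 0;
}

static void on_term(int sig) { printf("handler: got signal %d, shutting down\n", sig); }

int main(void)
{
    handlers[SIGTERM_NR] = on_term;      /* like sigaction(SIGTERM, ...) */

    send_signal(SIGTERM_NR);
    send_signal(SIGINT_NR);

    if (tif_sigpending)                  /* checked on every return to user mode */
        do_signal();
    return 0;
}
```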

8. Kernel Source Pointers

File | Symbol | What it does
arch/x86/entry/entry_64.S | entry_SYSCALL_64 | Assembly entry: swapgs, save pt_regs, call do_syscall_64
arch/x86/entry/common.c | do_syscall_64() | Dispatches via sys_call_table[nr]
arch/x86/entry/syscalls/syscall_64.tbl | | Syscall number → name mapping
include/linux/syscalls.h | SYSCALL_DEFINE* | Macros that generate handler prototypes + tracing hooks
fs/eventpoll.c | do_epoll_wait(), ep_poll() | epoll wait loop; rdllist harvest
fs/eventpoll.c | ep_insert() | epoll_ctl ADD: allocate epitem, insert into rbtree
io_uring/io_uring.c | io_uring_setup(), io_submit_sqes() | Ring setup and SQE dispatch
io_uring/sqpoll.c | io_sq_thread() | SQPOLL kernel thread — busy-polls the SQ
net/socket.c | sock_sendmsg() | Top-level send path entry
fs/read_write.c | do_sendfile() | Zero-copy file-to-socket transfer
kernel/signal.c | get_signal(), send_signal() | Signal delivery and pending management
arch/x86/kernel/signal.c | setup_rt_frame() | Builds signal frame on user stack

9. Interview Prep

1. Walk through what happens when a process calls read() on a blocking socket.
   Key answer: syscall → entry_SYSCALL_64 → sys_call_table[0] → sys_read → sock_recvmsg → if no data: task_struct.state = TASK_INTERRUPTIBLE, add to socket wait queue, schedule(). When data arrives: NIC IRQ → NAPI → skb → socket recv queue → wake_up() → task re-queued → returns from schedule() → copy_to_user → return count.
2. How does epoll differ from select/poll internally? Why is it O(1)?
   Key answer: select/poll scan the entire FD set on every call — O(n). epoll registers interest once; when a fd becomes ready, the kernel appends its epitem to rdllist via a wait-queue callback. epoll_wait just drains rdllist — O(ready events), not O(all FDs).
3. Level-triggered vs edge-triggered epoll?
   Key answer: LT: epoll_wait returns as long as the fd is readable. ET: returns only once, on the not-readable → readable transition. ET requires O_NONBLOCK and a read-until-EAGAIN loop; it misses events if you don't drain the fd completely.
4. Explain io_uring: SQ, CQ, and zero-syscall polling.
   Key answer: SQ and CQ are shared memory rings (mmap'd). User writes SQEs at the SQ tail; the kernel reads them and posts CQEs at the CQ tail. With IORING_SETUP_SQPOLL, a kernel sq_thread busy-polls the SQ — no io_uring_enter() needed. The app polls the CQ head. Full async I/O with zero syscalls per operation.
5. What is sendfile() and why is it called zero-copy?
   Key answer: sendfile(out_fd, in_fd, offset, count) transfers data from a file (or page cache) directly to a socket without passing through user space. No copy_to_user + copy_from_user. The CPU sets up DMA descriptors; the payload moves memory → NIC without the CPU copying it.
6. What is vDSO and how does it avoid a kernel ring transition?
   Key answer: The kernel maps a small executable page into every process. gettimeofday() in libc resolves to a function inside this vDSO page. It reads the TSC directly and uses a kernel-maintained, seqlock-protected time structure in an adjacent data page. No syscall instruction — it stays in ring 3.
7. What is SA_RESTART and when do you need it?
   Key answer: If a signal interrupts a slow syscall (read, accept…), the syscall returns -EINTR. SA_RESTART tells the kernel to automatically restart the syscall after the signal handler returns. Without it, you must check for EINTR and retry manually. Use it in long-lived servers that install signal handlers for SIGTERM/SIGCHLD.