1. Overview
Linux provides five families of IPC, each with different latency, ordering, and persistence characteristics. Pipes and Unix sockets route data through kernel buffers; shared memory maps the same physical page into both processes — zero copies. Futex straddles user and kernel space: the common (uncontended) case never enters the kernel at all.
2. § 5.1 — Pipes & FIFOs
Key Data Structure — pipe_inode_info
A pipe is backed by a pipe_inode_info struct in the kernel. It holds a ring of 16 pipe_buffer slots, each pointing to one 4 KB page — giving a default capacity of 64 KB. Both file descriptors returned by pipe() reach the same pipe_inode_info via struct file → f_inode → i_pipe.
| Field | Type | Purpose |
|---|---|---|
| bufs[16] | pipe_buffer[] | Ring of page references — the actual data lives here |
| head | unsigned int | Next slot for the writer to fill (producer advances) |
| tail | unsigned int | Next slot for the reader to consume (consumer advances) |
| ring_size | unsigned int | Total slots (default 16 = 64 KB capacity) |
| rd_wait / wr_wait | wait_queue_head_t | Wait queues for blocked readers / writers |
| readers / writers | int | Reference counts — last writer close() → EOF for the reader |
Core Mechanism — splice() Zero-Copy Transfer
Background: nginx serves a static file by reading it and writing it to a socket. With plain read()+write() the kernel copies file data into a user buffer, then copies it again into the socket buffer — two copies. splice() moves page references through the pipe ring without touching the payload: zero CPU copies.
Plan:
- Open the file (page-cache pages are already resident or faulted in)
- splice(file_fd, NULL, pipe_fd[1], NULL, len, 0) — kernel moves page pointers into the pipe ring; no copy
- splice(pipe_fd[0], NULL, sock_fd, NULL, len, 0) — kernel maps the pipe's pages into the NIC DMA ring; no copy
- NIC DMA reads directly from the file's page-cache pages
Walkthrough — send 1 MB file with splice
| Step | Operation | CPU Copies | Notes |
|---|---|---|---|
| 1 | File read-ahead fills page cache (256 pages × 4KB) | 0 | DMA from disk → page cache |
| 2 | splice(file_fd → pipe_fd[1], 1 MB) | 0 | Pipe bufs[] now hold pointers to page-cache pages |
| 3 | splice(pipe_fd[0] → sock_fd, 1 MB) | 0 | sk_buff points to same pages; NIC DMA ring updated |
| 4 | NIC DMA sends pages to wire | 0 | CPU never touched payload bytes |
| — | Baseline: plain read()+write() for comparison | 2 | page cache → user buf, then user buf → socket buf |
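A minimal sketch of the two-splice chain described above. It assumes sock_fd is an already-connected stream socket supplied by the caller; the helper name send_file_zero_copy is illustrative and error handling is kept to perror.
```c
/* Sketch: send a file to an already-connected socket using the
 * file -> pipe -> socket splice chain. `sock_fd` is assumed to be a
 * connected TCP or AF_UNIX stream socket (hypothetical caller-supplied fd). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int send_file_zero_copy(const char *path, int sock_fd)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) { perror("open"); return -1; }

    struct stat st;
    fstat(file_fd, &st);

    int pipefd[2];
    if (pipe(pipefd) < 0) { perror("pipe"); close(file_fd); return -1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* Step 2: move page references from the page cache into the pipe ring. */
        ssize_t in = splice(file_fd, NULL, pipefd[1], NULL,
                            remaining, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0) { perror("splice file->pipe"); break; }

        /* Step 3: hand the same pages to the socket; NIC DMA reads them later. */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(pipefd[0], NULL, sock_fd, NULL,
                                 left, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0) { perror("splice pipe->socket"); goto done; }
            left -= out;
        }
        remaining -= in;
    }
done:
    close(pipefd[0]); close(pipefd[1]); close(file_fd);
    return remaining == 0 ? 0 : -1;
}
```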
Minimal C Demo — Pipe Circular Buffer
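A minimal sketch, assuming a Linux system (F_GETPIPE_SZ is Linux-specific): the parent produces into the pipe ring, the forked child consumes, and closing the last write end delivers EOF to the reader.
```c
/* Parent writes through the pipe ring, child reads; fcntl(F_GETPIPE_SZ)
 * reports the 64 KB default capacity described above. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                        /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    printf("pipe capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    if (fork() == 0) {                 /* child: consumer */
        close(fds[1]);
        char buf[128];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof buf)) > 0)
            write(STDOUT_FILENO, buf, n);
        /* read() returns 0 (EOF) once the last writer has closed fds[1] */
        _exit(0);
    }

    close(fds[0]);                     /* parent: producer */
    const char *msg = "hello through the pipe ring\n";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);                     /* EOF for the child */
    wait(NULL);
    return 0;
}
```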
3. § 5.2 — POSIX IPC (Shared Memory, Message Queues, Semaphores)
Key Data Structure — Shared Memory Physical Page Mapping
Why shared memory is the fastest IPC: both processes' PTEs point to the same physical page. A producer writes 8 bytes; the consumer reads them at a different virtual address — the data never moves. Bandwidth is limited only by cache and CPU speed, not by any kernel copy or scheduling latency.
| Mechanism | Kernel Data Structure | Latency | Ordering | Persistence |
|---|---|---|---|---|
| Pipe | pipe_inode_info (64 KB ring) | ~1 µs | FIFO, strict | Lost on close |
| Message Queue (mq) | struct msg_queue (linked list) | ~2–5 µs | FIFO or priority | Survives process exit |
| Shared Memory (shm) | physical pages via mmap | < 100 ns | None (user must sync) | Survives process exit |
| Unix Socket (AF_UNIX) | unix_sock + socket buffer | ~1–2 µs | FIFO | Lost on close |
| TCP loopback | sk_buff + TCP state machine | ~10–30 µs | FIFO + reliability | Lost on close |
Minimal C Demo — Shared Memory Producer-Consumer Ring
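A minimal sketch of a single-producer/single-consumer ring over POSIX shared memory. The object name /demo_ring and the ring layout are illustrative, not a standard API; error checks on shm_open/mmap are omitted, and older glibc needs -lrt at link time.
```c
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLOTS 64                       /* power of two so head/tail wrap cleanly */

struct ring {
    _Atomic unsigned head;             /* producer writes data[head], then advances */
    _Atomic unsigned tail;             /* consumer reads data[tail], then advances */
    int data[SLOTS];
};

int main(void)
{
    int fd = shm_open("/demo_ring", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct ring));
    struct ring *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);   /* both processes' PTEs -> same pages */

    if (fork() == 0) {                 /* consumer */
        for (int i = 0; i < 1000; i++) {
            while (atomic_load(&r->tail) == atomic_load(&r->head))
                ;                      /* spin: ring empty */
            unsigned t = atomic_load(&r->tail);
            int v = r->data[t % SLOTS];
            atomic_store(&r->tail, t + 1);
            if (v != i) { fprintf(stderr, "out of order!\n"); _exit(1); }
        }
        _exit(0);
    }

    for (int i = 0; i < 1000; i++) {   /* producer */
        unsigned h = atomic_load(&r->head);
        while (h - atomic_load(&r->tail) == SLOTS)
            ;                          /* spin: ring full */
        r->data[h % SLOTS] = i;
        atomic_store(&r->head, h + 1);
    }
    wait(NULL);
    shm_unlink("/demo_ring");
    return 0;
}
```
The data never passes through the kernel after mmap(): the only "transfer" is a store by one process and a load by the other on the same physical page.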
4. § 5.3 — Futex: Fast Userspace Mutex
Background: A mutex protects a shared counter accessed by 16 threads. Each thread locks and unlocks 100 000 times per second. If every lock and unlock required a syscall, that would be 3.2 million ring transitions per second (16 × 100 000 × 2). The futex design cuts that to zero syscalls when no thread is waiting — the common case.
Plan:
- The lock stores its state in a userspace int (the futex word): 0 = free, 1 = held, 2 = held + waiters
- Lock attempt: a single LOCK CMPXCHG — if it succeeds, done (no syscall)
- Lock contended: call futex(FUTEX_WAIT); the kernel hashes the physical address of the futex word into a global table and parks the thread
- Unlock: set the word → 0; if the word was 2, call futex(FUTEX_WAKE, 1) to wake one waiter
Physical address as hash key — why it matters: Two processes can share a futex word in shared memory (e.g. a pthread_mutex_t with PTHREAD_PROCESS_SHARED). Each process maps the shared page at a different virtual address, but the physical address is identical. Hashing on physical address lets the kernel find waiters from any process without needing a global name or file descriptor.
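As a concrete illustration of that cross-process case, here is a minimal sketch of a PTHREAD_PROCESS_SHARED mutex placed in a MAP_SHARED anonymous mapping; the protected counter is only there to give the lock something to do. Compile with -pthread.
```c
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One shared page holds the mutex (futex word inside) and a counter. */
    struct { pthread_mutex_t lock; long counter; } *shared =
        mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shared->lock, &attr);

    pid_t pid = fork();                /* parent and child map the same physical page */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&shared->lock);   /* uncontended case: CAS only, no syscall */
        shared->counter++;
        pthread_mutex_unlock(&shared->lock);
    }
    if (pid == 0) _exit(0);
    wait(NULL);
    printf("counter = %ld (expect 200000)\n", shared->counter);
    return 0;
}
```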
Minimal C Demo — Futex-Based Mutex
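A minimal sketch of the 0/1/2 scheme from the plan above, in the style of Drepper's "Futexes Are Tricky". It uses process-private futexes (FUTEX_*_PRIVATE) and omits error handling; compile with -pthread.
```c
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic int futex_word = 0;     /* 0 = free, 1 = held, 2 = held + waiters */
static long counter = 0;               /* protected by the futex mutex */

static long sys_futex(_Atomic int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* cmpxchg(p, old, new): attempt old->new, return the value actually observed. */
static int cmpxchg(_Atomic int *p, int old, int new_val)
{
    atomic_compare_exchange_strong(p, &old, new_val);
    return old;
}

static void lock(void)
{
    int c = cmpxchg(&futex_word, 0, 1);
    if (c != 0) {                      /* fast path failed: lock was held */
        do {
            /* Mark the lock contended (1 -> 2) and sleep until woken. */
            if (c == 2 || cmpxchg(&futex_word, 1, 2) != 0)
                sys_futex(&futex_word, FUTEX_WAIT_PRIVATE, 2);
        } while ((c = cmpxchg(&futex_word, 0, 2)) != 0);
    }
}

static void unlock(void)
{
    /* Release; if the word was 2, someone is parked in the kernel. */
    if (atomic_exchange(&futex_word, 0) == 2)
        sys_futex(&futex_word, FUTEX_WAKE_PRIVATE, 1);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) { lock(); counter++; unlock(); }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expect 400000)\n", counter);
    return 0;
}
```
With low contention the CMPXCHG in lock() succeeds almost every time, so the futex() syscalls on the slow path are rarely executed.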
5. § 5.4 — Unix Domain Sockets (AF_UNIX)
AF_UNIX sockets provide reliable, ordered, connection-oriented IPC within a single host. Unlike TCP loopback, they bypass the entire IP/TCP stack — data is copied directly between socket buffers in the kernel. They also uniquely support credential passing (SCM_CREDENTIALS) and file-descriptor passing (SCM_RIGHTS), which is how vhost-user bootstraps shared memory between DPDK and QEMU without any memory copies.
vhost-user: DPDK ↔ QEMU via Unix Socket
vhost-user is the protocol that allows a DPDK vSwitch to serve as the virtio backend for a QEMU VM. The control plane uses a Unix socket to exchange capability messages and — critically — to pass the guest RAM memfd file descriptor via SCM_RIGHTS. Once setup is done the data plane runs on shared virtio rings with no socket involvement at all.
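A minimal sketch of the SCM_RIGHTS transfer that this control plane relies on: sending one open file descriptor across a connected AF_UNIX socket. The helper name send_fd and its parameters are illustrative; sock_fd is assumed to be a connected AF_UNIX socket.
```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock_fd, int fd_to_send)
{
    char byte = 0;                              /* at least one data byte is required */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    union {                                     /* properly aligned ancillary buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof u.buf,
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;              /* kernel installs a dup of the fd in the peer */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock_fd, &msg, 0) == 1 ? 0 : -1;
}
```
The receiver gets its own descriptor number for the same open file description, which is exactly how a guest-RAM memfd ends up mappable in the DPDK process.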
| AF_UNIX Feature | Syscall / Ancillary | Use Case |
|---|---|---|
| Stream socket | socket + bind + listen + connect + accept | RPC between processes (Docker daemon, systemd, DBus) |
| Datagram socket | socket + bind + sendto + recvfrom | Log shipping, metrics (no connection overhead) |
| Credential passing | sendmsg with SCM_CREDENTIALS cmsg | Peer authentication without root — used by dbus |
| FD passing | sendmsg with SCM_RIGHTS cmsg | vhost-user memfd, container rootfs handoff |
| Abstract namespace | sun_path[0] = '\0' (no filesystem entry) | Auto-cleanup on close — used by DBus, Chrome sandboxes |
Minimal C Demo — AF_UNIX Request/Response
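A minimal sketch of the stream request/response pattern, using a forked child as the client. The path /tmp/demo.sock is illustrative and error handling is omitted.
```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

#define SOCK_PATH "/tmp/demo.sock"

int main(void)
{
    unlink(SOCK_PATH);                         /* remove a stale socket file, if any */

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 1);

    if (fork() == 0) {                         /* child = client */
        int cli = socket(AF_UNIX, SOCK_STREAM, 0);
        connect(cli, (struct sockaddr *)&addr, sizeof addr);
        write(cli, "ping", 4);                 /* request */
        char reply[16] = {0};
        read(cli, reply, sizeof reply);        /* response */
        printf("client got: %s\n", reply);
        _exit(0);
    }

    int conn = accept(srv, NULL, NULL);        /* parent = server */
    char req[16] = {0};
    read(conn, req, sizeof req);
    write(conn, "pong", 4);
    wait(NULL);
    unlink(SOCK_PATH);
    return 0;
}
```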
6. Kernel Source Pointers
| File | Symbol | What it does |
|---|---|---|
| fs/pipe.c | pipe_write(), pipe_read() | Core pipe circular buffer — manages bufs[], head, tail, wait queues |
| include/linux/pipe_fs_i.h | struct pipe_inode_info | Pipe state: bufs, ring_size, head, tail, rd_wait, wr_wait |
| fs/splice.c | do_splice(), splice_direct_to_actor() | Zero-copy pipe ↔ file ↔ socket page-pointer transfer |
| ipc/shm.c | do_shmat(), ksys_shmget() | Shared memory attach — creates a VMA pointing to shm pages |
| mm/mmap.c | do_mmap() with MAP_SHARED | Maps file/shm pages into the process VMA, sets up PTEs |
| ipc/mqueue.c | do_mq_timedsend(), do_mq_timedreceive() | POSIX message queue enqueue / dequeue |
| kernel/futex/waitwake.c | futex_wait(), futex_wake() | Kernel slow path: hash-table lookup, wait queue, wake |
| kernel/futex/core.c | futex_queues[], struct futex_hash_bucket | Global futex hash table: per-bucket wait list keyed by the futex word's address, protected by a spinlock |
| net/unix/af_unix.c | unix_stream_connect(), unix_stream_sendmsg() | AF_UNIX connect and send — bypasses the TCP/IP stack |
| net/unix/scm.c | unix_attach_fds() | SCM_RIGHTS: pass file descriptors across a process boundary |
7. Interview Prep
| # | Question | Key Answer |
|---|---|---|
| 1 | What is the kernel data structure backing a pipe? What is its buffer size? | pipe_inode_info: a ring of 16 pipe_buffer structs, each holding a pointer to one 4 KB page → 64 KB total capacity. Both fd[0] and fd[1] share the same pipe_inode_info via f_inode. Writers sleep on wr_wait when full; readers on rd_wait when empty. |
| 2 | What is splice() and how does it avoid a copy? | splice(in_fd, ..., out_fd, ...) moves data between a file and a pipe (or between two pipes) by transferring page pointers in the pipe_buffer ring, never copying payload bytes. Combined with sendfile()-like logic, a file page can go from page cache → NIC DMA without any CPU copy — only page-reference manipulation inside the kernel. |
| 3 | How does a futex work? Describe the fast path and the slow path. | Fast path: LOCK CMPXCHG on the futex word (a userspace int). If CAS succeeds (0→1), lock is acquired — no syscall. Slow path: CAS fails (lock held); thread calls futex(FUTEX_WAIT). Kernel hashes the physical address of the futex word into futex_queues[], adds the task to the wait list, and calls schedule(). Unlock: set word→0; if word was 2 (waiters present), call futex(FUTEX_WAKE,1) to wake one waiter. |
| 4 | Why does pthread_mutex_lock() not always invoke a syscall? | glibc implements pthread_mutex using futex. The lock word lives in the mutex struct in userspace. Uncontended lock is a single LOCK CMPXCHG — ring-3 only. Only when the CAS fails (another thread holds it) does glibc call futex(FUTEX_WAIT) to enter the kernel. In a program with low contention the vast majority of lock operations never leave ring 3. |
| 5 | What is a Unix domain socket and when would you prefer it over a TCP socket? | AF_UNIX socket for local IPC: no IP header, no TCP state machine, no loopback NIC driver. Data goes directly between kernel socket buffers — latency ~1 µs vs ~10–30 µs for TCP loopback. Prefer AF_UNIX for same-host IPC (Docker daemon, DBus, nginx worker ↔ master). Additional features unavailable in TCP: SCM_CREDENTIALS (peer auth) and SCM_RIGHTS (fd passing). |
| 6 | How does vhost-user use Unix domain sockets for DPDK virtio? | vhost-user control plane: DPDK binds an AF_UNIX socket (server mode); QEMU connects. QEMU acts as the vhost-user front-end (master), DPDK as the back-end (slave). They exchange vhost_user_msg structs: negotiate features, and — critically — QEMU sends VHOST_USER_SET_MEM_TABLE with guest-RAM fds (e.g. a memfd) passed via SCM_RIGHTS, which DPDK mmaps to get guest RAM into its address space. Once setup completes, the data plane runs on shared virtio TX/RX rings in that memory — the Unix socket is only used for setup and rare control events. |