1. Overview
Linux provides five families of IPC, each with different latency, ordering, and persistence characteristics. Pipes and Unix sockets route data through kernel buffers; shared memory maps the same physical page into both processes — zero copies. Futex straddles user and kernel space: the common (uncontended) case never enters the kernel at all.
2. § 5.1 — Pipes & FIFOs
Key Data Structure — pipe_inode_info
A pipe is backed by a pipe_inode_info struct in the kernel. It holds a ring of 16 pipe_buffer slots, each pointing to one 4 KB page — giving a default capacity of 64 KB. Both file descriptors returned by pipe() reach the same pipe_inode_info via struct file → f_inode → i_pipe.
| Field | Type | Purpose |
|---|---|---|
| bufs[16] | pipe_buffer[] | Ring of page references — the actual data lives here |
| head | unsigned int | Next slot for the writer to fill (producer advances) |
| tail | unsigned int | Next slot for the reader to consume (consumer advances) |
| ring_size | unsigned int | Total slots (default 16 = 64 KB capacity) |
| rd_wait / wr_wait | wait_queue_head_t | Wait queues for blocked readers / writers |
| readers / writers | int | Reference counts — last writer close() → EOF for the reader |
Core Mechanism — splice() Zero-Copy Transfer
Background: nginx serves a static file by reading it and writing it to a socket. With plain read()+write() the kernel copies file data into a user buffer, then copies it again into the socket buffer — two copies. splice() moves page references through the pipe ring without touching the payload: zero CPU copies.
Plan:
- Open the file (page-cache pages are already resident or faulted in)
- splice(file_fd, NULL, pipe_fd[1], NULL, len, 0) — kernel moves page pointers into the pipe ring; no copy
- splice(pipe_fd[0], NULL, sock_fd, NULL, len, 0) — kernel maps the pipe's pages into the NIC DMA ring; no copy
- NIC DMA reads directly from the file's page-cache pages
Walkthrough — send 1 MB file with splice
| Step | Operation | CPU Copies | Notes |
|---|---|---|---|
| 1 | File read-ahead fills page cache (256 pages × 4KB) | 0 | DMA from disk → page cache |
| 2 | splice(file_fd → pipe_fd[1], 1 MB) | 0 | Pipe bufs[] now hold pointers to page-cache pages |
| 3 | splice(pipe_fd[0] → sock_fd, 1 MB) | 0 | sk_buff points to same pages; NIC DMA ring updated |
| 4 | NIC DMA sends pages to wire | 0 | CPU never touched payload bytes |
| — | Baseline: plain read()+write() for comparison | 2 | page cache → user buf, then user buf → socket buf |
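A minimal sketch of the two-splice chain described above. It assumes sock_fd is an already-connected stream socket supplied by the caller; the helper name send_file_zero_copy is illustrative and error handling is kept to perror.
```c
/* Sketch: send a file to an already-connected socket using the
 * file -> pipe -> socket splice chain. `sock_fd` is assumed to be a
 * connected TCP or AF_UNIX stream socket (hypothetical caller-supplied fd). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int send_file_zero_copy(const char *path, int sock_fd)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0) { perror("open"); return -1; }

    struct stat st;
    fstat(file_fd, &st);

    int pipefd[2];
    if (pipe(pipefd) < 0) { perror("pipe"); close(file_fd); return -1; }

    off_t remaining = st.st_size;
    while (remaining > 0) {
        /* Step 2: move page references from the page cache into the pipe ring. */
        ssize_t in = splice(file_fd, NULL, pipefd[1], NULL,
                            remaining, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0) { perror("splice file->pipe"); break; }

        /* Step 3: hand the same pages to the socket; NIC DMA reads them later. */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(pipefd[0], NULL, sock_fd, NULL,
                                 left, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0) { perror("splice pipe->socket"); goto done; }
            left -= out;
        }
        remaining -= in;
    }
done:
    close(pipefd[0]); close(pipefd[1]); close(file_fd);
    return remaining == 0 ? 0 : -1;
}
```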
Minimal C Demo — Pipe Circular Buffer
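A minimal sketch, assuming a Linux system (F_GETPIPE_SZ is Linux-specific): the parent produces into the pipe ring, the forked child consumes, and closing the last write end delivers EOF to the reader.
```c
/* Parent writes through the pipe ring, child reads; fcntl(F_GETPIPE_SZ)
 * reports the 64 KB default capacity described above. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                        /* fds[0] = read end, fds[1] = write end */
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    printf("pipe capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

    if (fork() == 0) {                 /* child: consumer */
        close(fds[1]);
        char buf[128];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof buf)) > 0)
            write(STDOUT_FILENO, buf, n);
        /* read() returns 0 (EOF) once the last writer has closed fds[1] */
        _exit(0);
    }

    close(fds[0]);                     /* parent: producer */
    const char *msg = "hello through the pipe ring\n";
    write(fds[1], msg, strlen(msg));
    close(fds[1]);                     /* EOF for the child */
    wait(NULL);
    return 0;
}
```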
3. § 5.2 — POSIX IPC (Shared Memory, Message Queues, Semaphores)
Key Data Structure — Shared Memory Physical Page Mapping
Why shared memory is the fastest IPC: both processes' PTEs point to the same physical page. A producer writes 8 bytes; the consumer reads them at a different virtual address — the data never moves. Bandwidth is limited only by cache and CPU speed, not by any kernel copy or scheduling latency.
| Mechanism | Kernel Data Structure | Latency | Ordering | Persistence |
|---|---|---|---|---|
| Pipe | pipe_inode_info (64 KB ring) | ~1 µs | FIFO, strict | Lost on close |
| Message Queue (mq) | struct msg_queue (linked list) | ~2–5 µs | FIFO or priority | Survives process exit |
| Shared Memory (shm) | physical pages via mmap | < 100 ns | None (user must sync) | Survives process exit |
| Unix Socket (AF_UNIX) | unix_sock + socket buffer | ~1–2 µs | FIFO | Lost on close |
| TCP loopback | sk_buff + TCP state machine | ~10–30 µs | FIFO + reliability | Lost on close |
Minimal C Demo — Shared Memory Producer-Consumer Ring
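A minimal sketch of a single-producer/single-consumer ring over POSIX shared memory. The object name /demo_ring and the ring layout are illustrative, not a standard API; error checks on shm_open/mmap are omitted, and older glibc needs -lrt at link time.
```c
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SLOTS 64                       /* power of two so head/tail wrap cleanly */

struct ring {
    _Atomic unsigned head;             /* producer writes data[head], then advances */
    _Atomic unsigned tail;             /* consumer reads data[tail], then advances */
    int data[SLOTS];
};

int main(void)
{
    int fd = shm_open("/demo_ring", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct ring));
    struct ring *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);   /* both processes' PTEs -> same pages */

    if (fork() == 0) {                 /* consumer */
        for (int i = 0; i < 1000; i++) {
            while (atomic_load(&r->tail) == atomic_load(&r->head))
                ;                      /* spin: ring empty */
            unsigned t = atomic_load(&r->tail);
            int v = r->data[t % SLOTS];
            atomic_store(&r->tail, t + 1);
            if (v != i) { fprintf(stderr, "out of order!\n"); _exit(1); }
        }
        _exit(0);
    }

    for (int i = 0; i < 1000; i++) {   /* producer */
        unsigned h = atomic_load(&r->head);
        while (h - atomic_load(&r->tail) == SLOTS)
            ;                          /* spin: ring full */
        r->data[h % SLOTS] = i;
        atomic_store(&r->head, h + 1);
    }
    wait(NULL);
    shm_unlink("/demo_ring");
    return 0;
}
```
The data never passes through the kernel after mmap(): the only "transfer" is a store by one process and a load by the other on the same physical page.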
4. § 5.3 — Futex: Fast Userspace Mutex
Background: A mutex protects a shared counter accessed by 16 threads. Each thread locks and unlocks 100 000 times per second. If every lock and unlock required a syscall, that would be 3.2 million ring transitions per second (16 × 100 000 × 2). The futex design cuts that to zero syscalls when no thread is waiting — the common case.
Plan:
- The lock stores its state in a userspace int (the futex word): 0 = free, 1 = held, 2 = held + waiters
- Lock attempt: a single LOCK CMPXCHG — if it succeeds, done (no syscall)
- Lock contended: call futex(FUTEX_WAIT); the kernel hashes the physical address of the futex word into a global table and parks the thread
- Unlock: set the word → 0; if the word was 2, call futex(FUTEX_WAKE, 1) to wake one waiter
Physical address as hash key — why it matters: Two processes can share a futex word in shared memory (e.g. a pthread_mutex_t with PTHREAD_PROCESS_SHARED). Each process maps the shared page at a different virtual address, but the physical address is identical. Hashing on physical address lets the kernel find waiters from any process without needing a global name or file descriptor.
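As a concrete illustration of that cross-process case, here is a minimal sketch of a PTHREAD_PROCESS_SHARED mutex placed in a MAP_SHARED anonymous mapping; the protected counter is only there to give the lock something to do. Compile with -pthread.
```c
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One shared page holds the mutex (futex word inside) and a counter. */
    struct { pthread_mutex_t lock; long counter; } *shared =
        mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&shared->lock, &attr);

    pid_t pid = fork();                /* parent and child map the same physical page */
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&shared->lock);   /* uncontended case: CAS only, no syscall */
        shared->counter++;
        pthread_mutex_unlock(&shared->lock);
    }
    if (pid == 0) _exit(0);
    wait(NULL);
    printf("counter = %ld (expect 200000)\n", shared->counter);
    return 0;
}
```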
Minimal C Demo — Futex-Based Mutex
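A minimal sketch of the 0/1/2 scheme from the plan above, in the style of Drepper's "Futexes Are Tricky". It uses process-private futexes (FUTEX_*_PRIVATE) and omits error handling; compile with -pthread.
```c
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic int futex_word = 0;     /* 0 = free, 1 = held, 2 = held + waiters */
static long counter = 0;               /* protected by the futex mutex */

static long sys_futex(_Atomic int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* cmpxchg(p, old, new): attempt old->new, return the value actually observed. */
static int cmpxchg(_Atomic int *p, int old, int new_val)
{
    atomic_compare_exchange_strong(p, &old, new_val);
    return old;
}

static void lock(void)
{
    int c = cmpxchg(&futex_word, 0, 1);
    if (c != 0) {                      /* fast path failed: lock was held */
        do {
            /* Mark the lock contended (1 -> 2) and sleep until woken. */
            if (c == 2 || cmpxchg(&futex_word, 1, 2) != 0)
                sys_futex(&futex_word, FUTEX_WAIT_PRIVATE, 2);
        } while ((c = cmpxchg(&futex_word, 0, 2)) != 0);
    }
}

static void unlock(void)
{
    /* Release; if the word was 2, someone is parked in the kernel. */
    if (atomic_exchange(&futex_word, 0) == 2)
        sys_futex(&futex_word, FUTEX_WAKE_PRIVATE, 1);
}

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) { lock(); counter++; unlock(); }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expect 400000)\n", counter);
    return 0;
}
```
With low contention the CMPXCHG in lock() succeeds almost every time, so the futex() syscalls on the slow path are rarely executed.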
5. § 5.4 — Unix Domain Sockets (AF_UNIX)
AF_UNIX sockets provide reliable, ordered, connection-oriented IPC within a single host. Unlike TCP loopback, they bypass the entire IP/TCP stack — data is copied directly between socket buffers in the kernel. They also uniquely support credential passing (SCM_CREDENTIALS) and file-descriptor passing (SCM_RIGHTS), which is how vhost-user bootstraps shared memory between DPDK and QEMU without any memory copies.
vhost-user: DPDK ↔ QEMU via Unix Socket
vhost-user is the protocol that allows a DPDK vSwitch to serve as the virtio backend for a QEMU VM. The control plane uses a Unix socket to exchange capability messages and — critically — to pass the guest RAM memfd file descriptor via SCM_RIGHTS. Once setup is done the data plane runs on shared virtio rings with no socket involvement at all.
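A minimal sketch of the SCM_RIGHTS transfer that this control plane relies on: sending one open file descriptor across a connected AF_UNIX socket. The helper name send_fd and its parameters are illustrative; sock_fd is assumed to be a connected AF_UNIX socket.
```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock_fd, int fd_to_send)
{
    char byte = 0;                              /* at least one data byte is required */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    union {                                     /* properly aligned ancillary buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof u.buf,
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;              /* kernel installs a dup of the fd in the peer */
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock_fd, &msg, 0) == 1 ? 0 : -1;
}
```
The receiver gets its own descriptor number for the same open file description, which is exactly how a guest-RAM memfd ends up mappable in the DPDK process.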
| AF_UNIX Feature | Syscall / Ancillary | Use Case |
|---|---|---|
| Stream socket | socket + bind + listen + connect + accept | RPC between processes (Docker daemon, systemd, DBus) |
| Datagram socket | socket + bind + sendto + recvfrom | Log shipping, metrics (no connection overhead) |
| Credential passing | sendmsg with SCM_CREDENTIALS cmsg | Peer authentication without root — used by dbus |
| FD passing | sendmsg with SCM_RIGHTS cmsg | vhost-user memfd, container rootfs handoff |
| Abstract namespace | sun_path[0] = '\0' (no filesystem entry) | Auto-cleanup on close — used by DBus, Chrome sandboxes |
Minimal C Demo — AF_UNIX Request/Response
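A minimal sketch of the stream request/response pattern, using a forked child as the client. The path /tmp/demo.sock is illustrative and error handling is omitted.
```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

#define SOCK_PATH "/tmp/demo.sock"

int main(void)
{
    unlink(SOCK_PATH);                         /* remove a stale socket file, if any */

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 1);

    if (fork() == 0) {                         /* child = client */
        int cli = socket(AF_UNIX, SOCK_STREAM, 0);
        connect(cli, (struct sockaddr *)&addr, sizeof addr);
        write(cli, "ping", 4);                 /* request */
        char reply[16] = {0};
        read(cli, reply, sizeof reply);        /* response */
        printf("client got: %s\n", reply);
        _exit(0);
    }

    int conn = accept(srv, NULL, NULL);        /* parent = server */
    char req[16] = {0};
    read(conn, req, sizeof req);
    write(conn, "pong", 4);
    wait(NULL);
    unlink(SOCK_PATH);
    return 0;
}
```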
6. Kernel Source Pointers
| File | Symbol | What it does |
|---|---|---|
| fs/pipe.c | pipe_write(), pipe_read() | Core pipe circular buffer — manages bufs[], head, tail, wait queues |
| include/linux/pipe_fs_i.h | struct pipe_inode_info | Pipe state: bufs, ring_size, head, tail, rd_wait, wr_wait |
| fs/splice.c | do_splice(), splice_direct_to_actor() | Zero-copy pipe ↔ file ↔ socket page-pointer transfer |
| ipc/shm.c | do_shmat(), ksys_shmget() | Shared memory attach — creates a VMA pointing to shm pages |
| mm/mmap.c | do_mmap() with MAP_SHARED | Maps file/shm pages into the process VMA, sets up PTEs |
| ipc/mqueue.c | do_mq_timedsend(), do_mq_timedreceive() | POSIX message queue enqueue / dequeue |
| kernel/futex/waitwake.c | futex_wait(), futex_wake() | Kernel slow path: hash-table lookup, wait queue, wake |
| kernel/futex/core.c | futex_queues[], struct futex_hash_bucket | Global futex hash table: per-bucket wait list keyed by the futex word's address, protected by a spinlock |
| net/unix/af_unix.c | unix_stream_connect(), unix_stream_sendmsg() | AF_UNIX connect and send — bypasses the TCP/IP stack |
| net/unix/scm.c | unix_attach_fds() | SCM_RIGHTS: pass file descriptors across a process boundary |
7. Interview Prep
| # | Question | Key Answer |
|---|---|---|
| 1 | What is the kernel data structure backing a pipe? What is its buffer size? | pipe_inode_info: a ring of 16 pipe_buffer structs, each holding a pointer to one 4 KB page → 64 KB total capacity. Both fd[0] and fd[1] share the same pipe_inode_info via f_inode. Writers sleep on wr_wait when full; readers on rd_wait when empty. |
| 2 | What is splice() and how does it avoid a copy? | splice(in_fd, ..., out_fd, ...) moves data between a file and a pipe (or between two pipes) by transferring page pointers in the pipe_buffer ring, never copying payload bytes. Combined with sendfile()-like logic, a file page can go from page cache → NIC DMA without any CPU copy — only page-reference manipulation inside the kernel. |
| 3 | How does a futex work? Describe the fast path and the slow path. | Fast path: LOCK CMPXCHG on the futex word (a userspace int). If CAS succeeds (0→1), lock is acquired — no syscall. Slow path: CAS fails (lock held); thread calls futex(FUTEX_WAIT). Kernel hashes the physical address of the futex word into futex_queues[], adds the task to the wait list, and calls schedule(). Unlock: set word→0; if word was 2 (waiters present), call futex(FUTEX_WAKE,1) to wake one waiter. |
| 4 | Why does pthread_mutex_lock() not always invoke a syscall? | glibc implements pthread_mutex using futex. The lock word lives in the mutex struct in userspace. Uncontended lock is a single LOCK CMPXCHG — ring-3 only. Only when the CAS fails (another thread holds it) does glibc call futex(FUTEX_WAIT) to enter the kernel. In a program with low contention the vast majority of lock operations never leave ring 3. |
| 5 | What is a Unix domain socket and when would you prefer it over a TCP socket? | AF_UNIX socket for local IPC: no IP header, no TCP state machine, no loopback NIC driver. Data goes directly between kernel socket buffers — latency ~1 µs vs ~10–30 µs for TCP loopback. Prefer AF_UNIX for same-host IPC (Docker daemon, DBus, nginx worker ↔ master). Additional features unavailable in TCP: SCM_CREDENTIALS (peer auth) and SCM_RIGHTS (fd passing). |
| 6 | How does vhost-user use Unix domain sockets for DPDK virtio? | vhost-user control plane: DPDK binds an AF_UNIX socket (server mode); QEMU connects. QEMU acts as the vhost-user front-end (master), DPDK as the back-end (slave). They exchange vhost_user_msg structs: negotiate features, and — critically — QEMU sends VHOST_USER_SET_MEM_TABLE with guest-RAM fds (e.g. a memfd) passed via SCM_RIGHTS, which DPDK mmaps to get guest RAM into its address space. Once setup completes, the data plane runs on shared virtio TX/RX rings in that memory — the Unix socket is only used for setup and rare control events. |