Part V — IPC

§ 5.1 – 5.4 Pipes, POSIX IPC, Futex & Unix Sockets

How processes communicate — from the humble pipe's circular buffer, through shared-memory zero-copy rings, to the futex fast path that keeps pthread_mutex_lock() out of the kernel in the common, uncontended case, and the AF_UNIX socket that carries vhost-user control messages between DPDK and QEMU.

1. Overview

Linux provides five families of IPC, each with different latency, ordering, and persistence characteristics. Pipes and Unix sockets route data through kernel buffers; shared memory maps the same physical page into both processes — zero copies. Futex straddles user and kernel space: the common (uncontended) case never enters the kernel at all.

2. § 5.1 — Pipes & FIFOs

Key Data Structure — pipe_inode_info

A pipe is backed by a pipe_inode_info struct in the kernel. It holds a ring of 16 pipe_buffer slots, each pointing to one 4 KB page — giving a default capacity of 64 KB. Both file descriptors returned by pipe() share the same pipe_inode_info via their struct file → f_inode pointer.

Field | Type | Purpose
bufs[16] | pipe_buffer[] | Ring of page references — the actual data lives here
head | unsigned int | Next slot for the writer to fill (producer advances)
tail | unsigned int | Next slot for the reader to consume (consumer advances)
ring_size | unsigned int | Total slots (default 16 → 64 KB capacity)
rd_wait / wr_wait | wait_queue_head_t | Wait queues for blocked readers / writers
readers / writers | int | Reference counts — last writer close() → EOF for the reader
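
The 16-slot × 4 KB geometry is visible from user space: fcntl(F_GETPIPE_SZ) reports the ring's capacity in bytes and F_SETPIPE_SZ asks the kernel to resize it. A small check (Linux-specific fcntl commands; on a stock kernel the first line prints 65536):

    /* Check the default pipe capacity (16 slots x 4 KB = 64 KB) and resize it.
       F_GETPIPE_SZ / F_SETPIPE_SZ are Linux-specific fcntl commands. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) < 0) { perror("pipe"); return 1; }

        printf("default pipe capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

        /* Ask the kernel to grow the ring; the size is rounded up to a
           power-of-two number of pages and, for unprivileged processes,
           capped by /proc/sys/fs/pipe-max-size. */
        if (fcntl(fds[1], F_SETPIPE_SZ, 1 << 20) > 0)
            printf("resized pipe capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));

        close(fds[0]);
        close(fds[1]);
        return 0;
    }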

Core Mechanism — splice() Zero-Copy Transfer

Background: nginx serves a static file: read file → write to socket. With plain read()+write() the kernel copies file data to a user buffer, then copies it again into the socket buffer — two copies. splice() moves page references through the pipe ring without touching the payload: zero CPU copies.
Plan:
  1. Open the file (page cache pages are already resident or faulted in)
  2. splice(file_fd, NULL, pipe_fd[1], NULL, len, 0) — kernel moves page pointers into the pipe ring; no copy
  3. splice(pipe_fd[0], NULL, sock_fd, NULL, len, 0) — kernel attaches the pipe's pages to the socket's sk_buffs; no copy
  4. NIC DMA reads directly from the file's page cache pages
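
A minimal sketch of steps 2 and 3 chained in a loop — error handling is trimmed, and file_fd / sock_fd are assumed to be an open regular file and an already-connected socket:

    /* Zero-copy file -> socket via an intermediate pipe (sketch, minimal error
       handling).  The payload is never copied by the CPU: splice() only moves
       page references through the pipe ring and into the socket's sk_buffs. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int send_file_zero_copy(int file_fd, int sock_fd, size_t len)
    {
        int p[2];
        if (pipe(p) < 0)
            return -1;

        int rc = 0;
        while (len > 0 && rc == 0) {
            /* Step 2: move page-cache page references into the pipe ring. */
            ssize_t in = splice(file_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
            if (in <= 0) { rc = -1; break; }

            /* Step 3: hand the same pages to the socket — still no payload copy. */
            for (ssize_t queued = in; queued > 0; ) {
                ssize_t out = splice(p[0], NULL, sock_fd, NULL, queued, SPLICE_F_MOVE);
                if (out <= 0) { rc = -1; break; }
                queued -= out;
                len    -= out;
            }
        }
        close(p[0]);
        close(p[1]);
        return rc;
    }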

Walkthrough — send 1 MB file with splice

Step | Operation | CPU Copies | Notes
1 | File read-ahead fills page cache (256 pages × 4 KB) | 0 | DMA from disk → page cache
2 | splice(file_fd → pipe_fd[1], 1 MB) | 0 | Pipe bufs[] now hold pointers to page-cache pages
3 | splice(pipe_fd[0] → sock_fd, 1 MB) | 0 | sk_buff points to the same pages; NIC DMA ring updated
4 | NIC DMA sends pages to wire | 0 | CPU never touched payload bytes
Total (splice) | — | 0 | CPU never copies the payload
Baseline | read() + write() | 2 | page cache → user buf, then user buf → socket buf

Minimal C Demo — Pipe Circular Buffer

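In place of the original interactive demo, here is a minimal user-space simulation of the pipe_inode_info ring: a 16-slot buffer with free-running head (producer) and tail (consumer) indices. Where a real pipe sleeps on wr_wait / rd_wait, this sketch simply reports full or empty:

    /* User-space simulation of the pipe_inode_info ring: 16 slots, head is the
       producer index, tail the consumer index.  A real pipe blocks on
       wr_wait / rd_wait where this demo just returns -1. */
    #include <stdio.h>
    #include <string.h>

    #define RING_SIZE  16
    #define SLOT_BYTES 4096                  /* one page per slot */

    struct slot { char data[SLOT_BYTES]; unsigned int len; };

    static struct slot  bufs[RING_SIZE];
    static unsigned int head, tail;          /* free-running, masked on use */

    static int pipe_write_sim(const char *msg)
    {
        if (head - tail >= RING_SIZE)        /* ring full: real writer sleeps on wr_wait */
            return -1;
        struct slot *s = &bufs[head++ % RING_SIZE];
        s->len = (unsigned int)strlen(msg);
        memcpy(s->data, msg, s->len);
        return 0;
    }

    static int pipe_read_sim(char *out, unsigned int outsz)
    {
        if (head == tail)                    /* ring empty: real reader sleeps on rd_wait */
            return -1;
        struct slot *s = &bufs[tail++ % RING_SIZE];
        unsigned int n = s->len < outsz - 1 ? s->len : outsz - 1;
        memcpy(out, s->data, n);
        out[n] = '\0';
        return (int)n;
    }

    int main(void)
    {
        pipe_write_sim("hello");
        pipe_write_sim("world");

        char buf[SLOT_BYTES];
        while (pipe_read_sim(buf, sizeof buf) >= 0)
            printf("read: %s (head=%u tail=%u)\n", buf, head, tail);
        return 0;
    }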

3. § 5.2 — POSIX IPC (Shared Memory, Message Queues, Semaphores)

Key Data Structure — Shared Memory Physical Page Mapping

Why shared memory is the fastest IPC: Both processes' PTEs point to the same physical page. A producer writes 8 bytes; the consumer reads them at a different virtual address — the data never moves. Bandwidth is limited only by cache and CPU speed, not by any kernel copy or scheduling latency.
Mechanism | Kernel Data Structure | Latency | Ordering | Persistence
Pipe | pipe_inode_info (64 KB ring) | ~1 µs | FIFO, strict | Lost on close
Message Queue (mq) | mqueue_inode_info (priority-sorted message lists) | ~2–5 µs | FIFO or priority | Survives process exit
Shared Memory (shm) | physical pages via mmap | < 100 ns | None (user must sync) | Survives process exit
Unix Socket (AF_UNIX) | unix_sock + socket buffer | ~1–2 µs | FIFO | Lost on close
TCP loopback | sk_buff + TCP state machine | ~10–30 µs | FIFO + reliability | Lost on close

Minimal C Demo — Shared Memory Producer-Consumer Ring

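In place of the original interactive demo, a minimal producer-consumer ring in a MAP_SHARED | MAP_ANONYMOUS mapping shared between a parent and a forked child. C11 atomics and busy-waiting keep the sketch short; a real ring would block on a futex or semaphore instead of spinning:

    /* Producer-consumer ring in shared memory: parent and child map the SAME
       physical pages, so values are exchanged with no kernel copy. */
    #define _DEFAULT_SOURCE
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SLOTS 8

    struct ring {
        _Atomic unsigned int head, tail;     /* producer / consumer indices */
        int                  data[SLOTS];
    };

    int main(void)
    {
        struct ring *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (r == MAP_FAILED) { perror("mmap"); return 1; }

        if (fork() == 0) {                   /* child: consumer */
            for (int n = 0; n < 16; ) {
                unsigned int t = atomic_load(&r->tail);
                if (t == atomic_load(&r->head))
                    continue;                /* ring empty: spin (real code would block) */
                printf("consumer got %d\n", r->data[t % SLOTS]);
                atomic_store(&r->tail, t + 1);
                n++;
            }
            _exit(0);
        }

        for (int n = 0; n < 16; ) {          /* parent: producer */
            unsigned int h = atomic_load(&r->head);
            if (h - atomic_load(&r->tail) >= SLOTS)
                continue;                    /* ring full: spin */
            r->data[h % SLOTS] = n * n;      /* write payload, then publish head */
            atomic_store(&r->head, h + 1);
            n++;
        }
        wait(NULL);
        munmap(r, sizeof *r);
        return 0;
    }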

4. § 5.3 — Futex: Fast Userspace Mutex

Background: A mutex protects a shared counter accessed by 16 threads. Each thread locks and unlocks 100 000 times per second. If every lock and unlock required a syscall, that would be 3.2 million ring transitions per second on this server. The futex design cuts that to zero syscalls when no thread is waiting — the common case.
Plan:
  1. Lock stores its state in a userspace int (the futex word): 0 = free, 1 = held, 2 = held + waiters
  2. Lock attempt: single LOCK CMPXCHG — if succeeds, done (no syscall)
  3. Lock contended: call futex(FUTEX_WAIT); kernel hashes the physical address of the futex word into a global table and parks the thread
  4. Unlock: set word → 0; if word was 2, call futex(FUTEX_WAKE, 1) to wake one waiter
Physical address as hash key — why it matters: Two processes can share a futex word in shared memory (e.g. a pthread_mutex_t with PTHREAD_PROCESS_SHARED). Each process maps the shared page at a different virtual address, but the physical address is identical. Hashing on physical address lets the kernel find waiters from any process without needing a global name or file descriptor.

Minimal C Demo — Futex-Based Mutex

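In place of the original interactive demo, a compact sketch of the three-state futex mutex described in the plan above (compile with -pthread). The fast path is a single CAS; only contended operations enter the kernel via syscall(SYS_futex, ...), since glibc provides no futex() wrapper:

    /* Futex-backed mutex (sketch): fast path is one compare-and-swap in user
       space; the kernel is entered only under contention.
       Word states: 0 = free, 1 = held, 2 = held + waiters. */
    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static _Atomic int lock_word;            /* the futex word (0 / 1 / 2) */
    static long counter;                     /* protected by the lock */

    static long sys_futex(_Atomic int *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    static void futex_lock(void)
    {
        int c = 0;
        /* Fast path: 0 -> 1 with a single CAS — no syscall when uncontended. */
        if (atomic_compare_exchange_strong(&lock_word, &c, 1))
            return;
        /* Slow path: advertise waiters (state 2), then sleep until woken. */
        if (c != 2)
            c = atomic_exchange(&lock_word, 2);
        while (c != 0) {
            sys_futex(&lock_word, FUTEX_WAIT, 2);   /* kernel parks this thread */
            c = atomic_exchange(&lock_word, 2);
        }
    }

    static void futex_unlock(void)
    {
        /* If the old value was 2, someone may be sleeping: wake one waiter. */
        if (atomic_exchange(&lock_word, 0) == 2)
            sys_futex(&lock_word, FUTEX_WAKE, 1);
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            futex_lock();
            counter++;                       /* protected critical section */
            futex_unlock();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
        return 0;
    }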

5. § 5.4 — Unix Domain Sockets (AF_UNIX)

AF_UNIX sockets provide reliable, ordered, connection-oriented IPC within a single host. Unlike TCP loopback, they bypass the entire IP/TCP stack — data is copied directly between socket buffers in the kernel. They also uniquely support credential passing (SCM_CREDENTIALS) and file-descriptor passing (SCM_RIGHTS), which is how vhost-user bootstraps shared memory between DPDK and QEMU without any memory copies.

vhost-user: DPDK ↔ QEMU via Unix Socket

vhost-user is the protocol that allows a DPDK vSwitch to serve as the virtio backend for a QEMU VM. The control plane uses a Unix socket to exchange capability messages and — critically — to pass the guest RAM memfd file descriptor via SCM_RIGHTS. Once setup is done the data plane runs on shared virtio rings with no socket involvement at all.

AF_UNIX Feature | Syscall / Ancillary | Use Case
Stream socket | socket + bind + listen + connect + accept | RPC between processes (Docker daemon, systemd, D-Bus)
Datagram socket | socket + bind + sendto + recvfrom | Log shipping, metrics (no connection overhead)
Credential passing | sendmsg with SCM_CREDENTIALS cmsg | Peer authentication without root — used by D-Bus
FD passing | sendmsg with SCM_RIGHTS cmsg | vhost-user memfd, container rootfs handoff
Abstract namespace | sun_path[0] = '\0' (no filesystem entry) | Auto-cleanup on close — used by D-Bus, Chrome sandboxes
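
FD passing is the feature vhost-user leans on, so a sketch of the SCM_RIGHTS send/receive pair is worth spelling out. send_fd() and recv_fd() are illustrative helper names, not a library API, and error handling is minimal:

    /* Passing an open file descriptor across an AF_UNIX socket with SCM_RIGHTS.
       The kernel installs a NEW descriptor in the receiving process that refers
       to the same open file description — this is how vhost-user hands the
       guest-RAM memfd from QEMU to the DPDK process. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    int send_fd(int sock, int fd_to_pass)
    {
        char dummy = '*';                        /* at least 1 byte of real data */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

        union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
        memset(&u, 0, sizeof u);

        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = u.buf, .msg_controllen = sizeof u.buf };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type  = SCM_RIGHTS;             /* ancillary payload: one fd */
        cm->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd_to_pass, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    int recv_fd(int sock)
    {
        char dummy;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;

        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = u.buf, .msg_controllen = sizeof u.buf };
        if (recvmsg(sock, &msg, 0) <= 0)
            return -1;

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        if (!cm || cm->cmsg_type != SCM_RIGHTS)
            return -1;
        int fd;
        memcpy(&fd, CMSG_DATA(cm), sizeof(int)); /* valid fd in THIS process */
        return fd;
    }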

Minimal C Demo — AF_UNIX Request/Response

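In place of the original interactive demo, a self-contained server/client request-response example: the parent binds and accepts, the forked child connects and performs one PING/PONG round trip. It uses the abstract namespace from the table above; the name "ipc-demo" is arbitrary, and error handling is omitted for brevity:

    /* AF_UNIX request/response: the parent is the server (bind/listen/accept),
       a forked child is the client (connect/write/read).  The socket lives in
       the abstract namespace (sun_path[0] == '\0'), so no filesystem entry is
       created and cleanup is automatic. */
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        memcpy(addr.sun_path, "\0ipc-demo", 9);          /* abstract name */
        socklen_t alen = offsetof(struct sockaddr_un, sun_path) + 9;

        int srv = socket(AF_UNIX, SOCK_STREAM, 0);
        bind(srv, (struct sockaddr *)&addr, alen);
        listen(srv, 1);

        if (fork() == 0) {                               /* child: client */
            int c = socket(AF_UNIX, SOCK_STREAM, 0);
            connect(c, (struct sockaddr *)&addr, alen);
            write(c, "PING", 4);                         /* request */
            char reply[16] = {0};
            read(c, reply, sizeof reply - 1);            /* response */
            printf("client got: %s\n", reply);
            close(c);
            _exit(0);
        }

        int conn = accept(srv, NULL, NULL);              /* parent: server */
        char req[16] = {0};
        read(conn, req, sizeof req - 1);
        printf("server got: %s\n", req);
        write(conn, "PONG", 4);
        close(conn);
        close(srv);
        wait(NULL);
        return 0;
    }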

6. Kernel Source Pointers

File | Symbol | What it does
fs/pipe.c | pipe_write(), pipe_read() | Core pipe circular buffer — manages bufs[], head, tail, wait queues
fs/pipe.c | struct pipe_inode_info | Pipe state: bufs[16], ring_size, head, tail, rd_wait, wr_wait
fs/splice.c | do_splice(), splice_direct_to_actor() | Zero-copy pipe ↔ file ↔ socket page-pointer transfer
ipc/shm.c | do_shmat(), ksys_shmget() | Shared memory attach — creates a VMA pointing to shm pages
mm/mmap.c | do_mmap() with MAP_SHARED | Maps file/shm pages into the process VMA, sets up PTEs
ipc/mqueue.c | do_mq_timedsend(), do_mq_timedreceive() | POSIX message queue enqueue / dequeue
kernel/futex/waitwake.c | futex_wait(), futex_wake() | Kernel slow path: hash-table lookup, wait queue, wake
kernel/futex/core.c | futex_hash(), futex_queues[] | Global hash table of futex_hash_bucket wait lists, each protected by a spinlock
net/unix/af_unix.c | unix_stream_connect(), unix_stream_sendmsg() | AF_UNIX connect and send — bypasses the TCP/IP stack
net/unix/scm.c | unix_attach_fds() | SCM_RIGHTS: pass file descriptors across a process boundary

7. Interview Prep

Q1. What is the kernel data structure backing a pipe? What is its buffer size?
pipe_inode_info: a ring of 16 pipe_buffer structs, each holding a pointer to one 4 KB page → 64 KB total capacity. Both fd[0] and fd[1] share the same pipe_inode_info via f_inode. Writers sleep on wr_wait when the ring is full; readers sleep on rd_wait when it is empty.

Q2. What is splice() and how does it avoid a copy?
splice(in_fd, ..., out_fd, ...) moves data between a file and a pipe (or between two pipes) by transferring page pointers in the pipe_buffer ring, never copying payload bytes. Combined with sendfile()-like logic, a file page can go from page cache → NIC DMA without any CPU copy — only page-reference bookkeeping.

Q3. How does a futex work? Describe the fast path and the slow path.
Fast path: LOCK CMPXCHG on the futex word (a userspace int). If the CAS succeeds (0→1), the lock is acquired — no syscall. Slow path: the CAS fails (lock held); the thread calls futex(FUTEX_WAIT). The kernel hashes the futex word's key into futex_queues[] (the shared physical page for process-shared futexes, mm + virtual address for process-private ones), adds the task to the wait list, and calls schedule(). Unlock: set the word → 0; if the word was 2 (waiters present), call futex(FUTEX_WAKE, 1) to wake one waiter.

Q4. Why does pthread_mutex_lock() not always invoke a syscall?
glibc implements pthread_mutex using futex. The lock word lives in the mutex struct in userspace. An uncontended lock is a single LOCK CMPXCHG — ring 3 only. Only when the CAS fails (another thread holds it) does glibc call futex(FUTEX_WAIT) to enter the kernel. In a program with low contention the vast majority of lock operations never leave ring 3.

Q5. What is a Unix domain socket and when would you prefer it over a TCP socket?
An AF_UNIX socket for local IPC: no IP header, no TCP state machine, no loopback NIC driver. Data goes directly between kernel socket buffers — latency ~1 µs vs ~10–30 µs for TCP loopback. Prefer AF_UNIX for same-host IPC (Docker daemon, D-Bus, nginx worker ↔ master). Additional features unavailable in TCP: SCM_CREDENTIALS (peer auth) and SCM_RIGHTS (fd passing).

Q6. How does vhost-user use Unix domain sockets for DPDK virtio?
vhost-user control plane: the DPDK back-end typically binds the AF_UNIX socket and QEMU — the vhost-user front-end (master) — connects. They exchange vhost_user_msg structs to negotiate features, and — critically — QEMU sends VHOST_USER_SET_MEM_TABLE with the guest-RAM memfd passed via SCM_RIGHTS, letting DPDK map guest RAM into its own address space. Once setup completes, the data plane runs on shared virtio TX/RX rings in that memory — the Unix socket is only used for one-time setup.
6How does vhost-user use Unix domain sockets for DPDK virtio?vhost-user control plane: DPDK (master) binds an AF_UNIX socket; QEMU (slave) connects. They exchange vhost_user_msg structs: negotiate features, and — critically — QEMU sends VHOST_USER_SET_MEM_TABLE with a memfd passed via SCM_RIGHTS, mapping guest RAM into the DPDK process. Once setup completes, the data plane runs on shared virtio TX/RX rings in that memory — the Unix socket is only used for one-time setup.