Part VII — Network

§ 7.1 – 7.16 Linux Network Stack — Sockets to BGP

Socket layer, sk_buff, NAPI (§7.1–7.4) · IP, routing, ARP, TCP/UDP (§7.5–7.9) · Netfilter/NAT, bridge, VXLAN, IPv6, multicast (§7.10–7.14) · DHCP proxy (§7.15) · RIP, OSPF, BGP (§7.16)

1. Overview

The Linux network stack is layered: the socket layer abstracts transport protocols (TCP/UDP) behind a VFS file interface; the sk_buff is the universal packet container that flows between every layer; the net_device abstraction decouples protocol code from hardware drivers; and NAPI (New API) switches from interrupt-driven to polling-driven processing under load, dramatically improving throughput at high PPS.

2. § 7.1 — Socket Layer

Abstraction Layers — fd → file → socket → sock → tcp_sock

A socket descriptor is just a Linux file descriptor. The VFS layer maps it through struct file → struct socket → struct sock. Protocol-specific state (TCP congestion window, sequence numbers) lives in struct tcp_sock, which extends struct sock by embedding it as its first member — so a simple pointer cast promotes between the two types.

Struct | Key Fields | Purpose
struct socket | sock, ops, type, state | VFS-visible wrapper; holds proto_ops (inet_stream_ops) and pointer to sock
struct sock | sk_state, sk_sndbuf, sk_rcvbuf, sk_write_queue, sk_receive_queue, sk_prot | Protocol-independent socket state; send/receive buffers; wait queues
struct inet_sock | inet_saddr, inet_daddr, inet_sport, inet_dport | IPv4-specific addresses and ports; embedded in tcp_sock
struct tcp_sock | rcv_nxt, snd_nxt, snd_una, cwnd, ssthresh, retransmit_timer | Full TCP state machine and congestion control variables
proto_ops | bind, connect, accept, sendmsg, recvmsg, poll | VFS-level socket operations — dispatch table per address family
struct proto | sendmsg, recvmsg, connect, close, hash | Transport-level operations — tcp_prot / udp_prot

Send/Receive Buffer Watermarks

Each socket has a bounded send buffer (sk_sndbuf) and receive buffer (sk_rcvbuf). When the write queue fills, tcp_sendmsg() blocks the caller (or returns EAGAIN on a non-blocking socket). When the receive queue exceeds sk_rcvbuf, new incoming segments are silently dropped — TCP flow control (receiver window advertisement) prevents this in steady state.
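A minimal user-space sketch of the watermark behaviour: a Unix-domain socketpair stands in for a TCP connection, the 8 KiB SO_SNDBUF value is an illustrative assumption, and a non-blocking sender loops until the kernel reports EAGAIN.

```c
/* Sketch: fill a socket send buffer until the kernel returns EAGAIN.
 * A Unix-domain socketpair stands in for a TCP socket; the small
 * SO_SNDBUF value is an illustrative assumption. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }

    int sndbuf = 8192;                          /* shrink the send buffer */
    setsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);
    fcntl(sv[0], F_SETFL, O_NONBLOCK);          /* EAGAIN instead of blocking */

    char chunk[1024];
    memset(chunk, 'x', sizeof chunk);
    long total = 0;
    for (;;) {
        ssize_t n = send(sv[0], chunk, sizeof chunk, 0);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                printf("send buffer full after %ld bytes -> EAGAIN\n", total);
                break;
            }
            perror("send"); return 1;
        }
        total += n;                             /* nobody reads, so the queue fills */
    }
    close(sv[0]); close(sv[1]);
    return 0;
}
```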

3. § 7.2 — sk_buff: The Core Packet Container

Memory Layout

Every network packet in the kernel is represented by an sk_buff. A single contiguous buffer is allocated (via kmalloc), and four pointers (head, data, tail, end) divide it into zones: headroom for headers prepended during TX (Ethernet → IP → TCP), data for the current payload, tailroom for trailers, and an skb_shared_info struct at the end of the buffer for paged data (scatter-gather I/O) and GSO metadata.

Pointer Operations

Function | What moves | When used
skb_reserve(skb, n) | data += n, tail += n | Create headroom before any data is written (TX path, before filling payload)
skb_put(skb, n) | tail += n → return old tail | Append n bytes of payload to the end (caller fills returned pointer)
skb_push(skb, n) | data -= n → return new data | Prepend n-byte header into headroom (TCP → IP → Ethernet on TX)
skb_pull(skb, n) | data += n | Strip n-byte header from front (Ethernet header consumed on RX, then IP, then TCP)

Clone vs Copy

skb_clone() creates a new header struct pointing to the same data buffer — zero cost for forwarding a packet to multiple consumers (e.g., packet sniffer + routing). The clone is read-only for the data area. skb_copy() allocates a fresh buffer and copies everything — required before modifying packet bytes (e.g., NAT rewrite).

GSO & GRO

Feature | Direction | What it does
TSO (TCP Segmentation Offload) | TX | NIC splits one large skb into MTU-sized segments — CPU never touches per-segment headers
GSO (Generic Segmentation Offload) | TX | Software fallback when NIC lacks TSO — kernel splits at the last moment before driver
GRO (Generic Receive Offload) | RX | Coalesce many small TCP segments into one large skb in NAPI poll — reduces per-packet overhead
LRO (Large Receive Offload) | RX | Hardware coalescing — deprecated; GRO is preferred (GRO is protocol-aware)

Minimal C Demo — sk_buff Pointer Manipulation

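The kernel headers are not usable from user space, so the sketch below models the four zone pointers with a plain struct. The fake_skb type, buffer size, and header sizes are illustrative assumptions; the pointer arithmetic mirrors the table above.

```c
/* Sketch of the four sk_buff zone pointers over one flat buffer. */
#include <stdio.h>
#include <string.h>

#define BUF_SZ 256

struct fake_skb {
    unsigned char buf[BUF_SZ];
    unsigned char *head, *data, *tail, *end;
};

static void skb_init(struct fake_skb *s) {
    s->head = s->data = s->tail = s->buf;
    s->end = s->buf + BUF_SZ;
}
/* reserve headroom: move data and tail forward together */
static void skb_reserve(struct fake_skb *s, int n) { s->data += n; s->tail += n; }
/* append payload: advance tail, return old tail for the caller to fill */
static unsigned char *skb_put(struct fake_skb *s, int n) { unsigned char *p = s->tail; s->tail += n; return p; }
/* prepend a header into the headroom */
static unsigned char *skb_push(struct fake_skb *s, int n) { s->data -= n; return s->data; }
/* strip a header from the front */
static unsigned char *skb_pull(struct fake_skb *s, int n) { s->data += n; return s->data; }

static void dump(const struct fake_skb *s, const char *step) {
    printf("%-22s headroom=%3ld len=%3ld tailroom=%3ld\n", step,
           (long)(s->data - s->head), (long)(s->tail - s->data), (long)(s->end - s->tail));
}

int main(void) {
    struct fake_skb skb;
    skb_init(&skb);
    skb_reserve(&skb, 64);              dump(&skb, "reserve(64)");  /* room for eth+ip+tcp */
    memcpy(skb_put(&skb, 100), "payload", 7); dump(&skb, "put(100) payload");
    skb_push(&skb, 20);                 dump(&skb, "push(20) TCP hdr");
    skb_push(&skb, 20);                 dump(&skb, "push(20) IP hdr");
    skb_push(&skb, 14);                 dump(&skb, "push(14) eth hdr");
    /* RX side: strip the headers back off */
    skb_pull(&skb, 14); skb_pull(&skb, 20); skb_pull(&skb, 20);
    dump(&skb, "pull eth+ip+tcp");
    return 0;
}
```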

4. § 7.3 — Network Device Layer & NAPI

TX Path — write() to NIC Ring

On the TX side the kernel walks a layered dispatch chain: the socket send path calls tcp_sendmsg(), which segments the data and calls ip_queue_xmit() to add the IP header and perform a route lookup. dev_queue_xmit() then hands the skb to the queueing discipline (qdisc) for rate control and shaping. The qdisc dequeues and calls the driver's ndo_start_xmit(), which fills a TX ring descriptor and signals the NIC.

RX Path — NIC Interrupt to Socket Buffer

On receive, the NIC DMA-writes the frame into a pre-allocated ring buffer and raises a hard IRQ. The IRQ handler does the minimum: disable NIC RX interrupt, call napi_schedule() to register a poll callback, and return. The actual packet processing happens in a softirq (NET_RX_SOFTIRQ) outside interrupt context, up to a configurable budget — avoiding the per-packet interrupt overhead that collapses throughput at high PPS.

Core Mechanism — NAPI Budget & Interrupt Coalescing

Background: A 10 Gbps NIC at minimum frame size (64 B) delivers ~14.8 Mpps. If each packet triggers a hard IRQ, the CPU spends 100% of its time in interrupt context — no user code runs. NAPI solves this by switching to polling when the arrival rate is high.
Plan:
  1. First packet: NIC fires hard IRQ → IRQ handler disables NIC RX interrupt + calls napi_schedule()
  2. Softirq net_rx_action() runs on the same CPU, calls napi.poll(quota)
  3. Driver processes up to quota descriptors per poll — builds one sk_buff per packet
  4. If work < quota → ring drained → call napi_complete_done() → re-enable NIC interrupt → back to interrupt-driven
  5. If work == quota → ring may still have more → softirq re-schedules poll (no interrupt re-arm)
Walkthrough — 40 packets arrive, budget = 16 per poll:
Round | napi_poll(16) returns | Ring left | Action
1 | 16 | 24 | work == quota → reschedule poll
2 | 16 | 8 | work == quota → reschedule poll
3 | 8 | 0 | work < quota → napi_complete_done() → re-arm IRQ

net_device field | Type | Purpose
name[IFNAMSIZ] | char[] | Interface name (eth0, ens3, lo)
dev_addr[MAX_ADDR_LEN] | unsigned char[] | MAC address
mtu | unsigned int | Maximum transmission unit (default 1500)
features | netdev_features_t | Offload capability flags: NETIF_F_TSO, NETIF_F_GRO, NETIF_F_HW_CSUM
netdev_ops | struct net_device_ops * | Driver operations: ndo_open, ndo_stop, ndo_start_xmit, ndo_get_stats64
napi_list | struct list_head | List of napi_struct — one per queue (multi-queue NICs have one per TX/RX queue pair)
tx_queue_len | unsigned long | TX qdisc queue depth limit
num_tx_queues / num_rx_queues | unsigned int | Multi-queue NIC: RSS spreads RX across multiple queues, one per CPU

Minimal C Demo — NAPI Budget Simulation

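A minimal simulation of the walkthrough above. The ring depth and budget are illustrative, and the loop only models the work-versus-budget decision, not real softirq scheduling.

```c
/* Sketch of the NAPI poll loop: 40 packets in the RX ring, budget 16. */
#include <stdio.h>

int main(void) {
    int ring = 40;           /* packets waiting in the RX ring */
    const int budget = 16;   /* quota per napi->poll() call */
    int round = 0, irq_armed = 0;

    while (!irq_armed) {
        int work = ring < budget ? ring : budget;  /* process up to budget */
        ring -= work;
        round++;
        if (work < budget) {
            /* ring drained: napi_complete_done() + re-enable NIC IRQ */
            irq_armed = 1;
            printf("round %d: work=%d < budget -> napi_complete_done, re-arm IRQ\n",
                   round, work);
        } else {
            /* still packets left: stay in polling mode, reschedule */
            printf("round %d: work=%d == budget, ring left=%d -> reschedule poll\n",
                   round, work, ring);
        }
    }
    return 0;
}
```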

5. § 7.4 — Traffic Control (TC) & QoS

Linux Traffic Control (TC) inserts a queueing discipline (qdisc) between the network layer and the NIC driver. Every packet leaving via dev_queue_xmit() passes through the root qdisc. TC enables per-VM bandwidth limiting in cloud hypervisors, ingress policing, and fair queueing between flows — all without modifying application code.

TC Pipeline

Key Qdiscs

Qdisc | Algorithm | Best for
pfifo_fast | Three-band FIFO, TOS-based priority | Default; lowest overhead; no rate control
tbf (Token Bucket Filter) | Token bucket — tokens accumulate at rate R; burst up to B bytes | Simple rate limiting (e.g., cap a VM to 1 Gbps)
htb (Hierarchical Token Bucket) | Class hierarchy with rate + ceil; excess lent to children | Per-VM bandwidth guarantee + burst borrowing in cloud hypervisors
fq_codel | Fair Queue + CoDel AQM; per-flow FIFO + delay-based drop | ISP edge; eliminates bufferbloat; default on many Linux routers
fq (Fair Queue) | Per-flow pacing using TCP pacing rate; reduces RTT variance | High-throughput servers with many TCP flows
netem | Adds delay / jitter / loss / reorder / corruption | Network emulation in test environments

HTB — Per-VM Bandwidth Limiting

HTB is the standard qdisc for per-tenant bandwidth limiting in cloud hypervisors. Each VM gets an HTB class with a guaranteed rate and an optional burst ceil. When a VM is below its rate it may borrow unused tokens from sibling classes up to ceil — ensuring work-conserving behaviour without permanent starvation.

Background: A hypervisor hosts 20 VMs on a 10 Gbps uplink. Without TC one noisy VM can saturate the NIC. With HTB: assign each VM a guaranteed rate = 400 Mbps and allow burst to ceil = 1 Gbps when the link is idle.
Plan:
  1. Create root htb qdisc on eth0: tc qdisc add dev eth0 root handle 1: htb default 10
  2. Add per-VM class: tc class add dev eth0 parent 1: classid 1:1 htb rate 400mbit ceil 1000mbit
  3. Attach leaf qdisc (fq_codel for AQM): tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel
  4. Attach filter to steer VM packets to the class: tc filter add dev eth0 ... match ip src 10.0.0.5 classid 1:1
struct Qdisc field | Type | Purpose
ops | struct Qdisc_ops * | enqueue, dequeue, peek, init, reset, destroy callbacks
q | struct qdisc_skb_head | Internal packet queue (or class structure for classful qdiscs)
rate_est | struct net_rate_estimator * | Rate estimation for tc stats
handle | u32 | TC handle — 16-bit major, 16-bit minor (e.g. 1:0 = root, 1:1 = first class)
parent | u32 | Parent handle (0 for root)
limit | u32 | Max queue depth (for pfifo-like qdiscs)
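Since tbf and htb are both built on token buckets, a minimal sketch of that core algorithm may help. The rate, burst, packet size, and 1 ms tick are illustrative assumptions, not kernel parameters.

```c
/* Sketch of the token-bucket algorithm behind tbf/htb: tokens accrue at
 * `rate` bytes/s up to `burst`; a packet is sent only when enough tokens
 * exist. Time is simulated in 1 ms ticks. */
#include <stdio.h>

int main(void) {
    const double rate = 125000.0;   /* 1 Mbit/s in bytes per second */
    const double burst = 3000.0;    /* bucket depth in bytes */
    const int pkt = 1500;           /* packet size */
    double tokens = burst;          /* bucket starts full */
    int sent = 0;

    for (int ms = 0; ms < 100; ms++) {
        tokens += rate / 1000.0;            /* refill per 1 ms tick */
        if (tokens > burst) tokens = burst; /* cap at bucket depth */
        while (tokens >= pkt) {             /* send while tokens last */
            tokens -= pkt;
            sent++;
        }
    }
    /* 100 ms at 1 Mbit/s = 12500 bytes, plus the initial 3000-byte burst */
    printf("sent %d x %d B in 100 ms (~%.1f KB, burst included)\n",
           sent, pkt, sent * pkt / 1000.0);
    return 0;
}
```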

6. § 7.5 — IP Layer

The IP layer sits between transport protocols (TCP/UDP) and the device layer. Every packet — locally generated or forwarded — passes through ip_rcv() on the receive side. The routing subsystem attaches a dst_entry to each sk_buff that decides whether the packet goes to a local socket (ip_local_deliver) or is forwarded to another host (ip_forward).

IP Header — Key Fields

Field | Size | Purpose
version / IHL | 4b / 4b | Version=4; IHL = header length in 32-bit words (min 5 = 20 bytes)
DSCP / ECN | 6b / 2b | Differentiated services (QoS); ECN = explicit congestion notification
total length | 16b | Header + payload in bytes (max 65535)
identification | 16b | Fragment group ID — all fragments of same datagram share this value
flags / frag offset | 3b / 13b | DF (don't fragment), MF (more fragments); offset in 8-byte units
TTL | 8b | Decremented each hop; reaches 0 → ICMP TTL Exceeded + drop
protocol | 8b | Next header: 6=TCP, 17=UDP, 1=ICMP, 89=OSPF
header checksum | 16b | One's complement of IP header; recomputed each hop (TTL changed)
src / dst address | 32b each | Source and destination IPv4 addresses

IP Fragmentation & Reassembly

When a packet exceeds the path MTU, the kernel either fragments it (DF clear) or drops it and sends ICMP Fragmentation Needed (DF set). Modern stacks prefer PMTU discovery — set DF=1, let the network signal the bottleneck MTU via ICMP Frag Needed, and resend smaller. Fragmentation is expensive: it stresses stateful firewalls and can exhaust the reassembly queue. The kernel reassembles at the destination using per-flow fragment queues (struct ipq) with a 30-second GC timeout. The sketch below shows the offset arithmetic.
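A small worked example: fragment payloads must be multiples of 8 bytes (except the last) because the offset field counts 8-byte units. The 4000-byte payload and 1500-byte MTU are illustrative.

```c
/* Worked example: fragmenting a 4000-byte IPv4 payload for a 1500-byte MTU. */
#include <stdio.h>

int main(void) {
    const int payload = 4000;                 /* bytes of L4 data */
    const int mtu = 1500, ihl = 20;
    int max_frag = ((mtu - ihl) / 8) * 8;     /* 1480: largest 8-aligned chunk */

    for (int off = 0; off < payload; off += max_frag) {
        int len = payload - off < max_frag ? payload - off : max_frag;
        int mf = (off + len < payload);       /* MF=1 on all but the last */
        printf("frag: offset-field=%4d (byte %4d)  len=%4d  MF=%d\n",
               off / 8, off, len, mf);
    }
    return 0;
}
```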

7. § 7.6 — Routing Subsystem

Linux stores its routing table in a FIB (Forwarding Information Base) implemented as a level-compressed trie (fib_trie). Every route lookup calls fib_lookup(), which traverses the trie from the most-significant bit, always choosing the longest matching prefix.

FIB Trie — Longest Prefix Match

Route Lookup Path

Struct / Symbol | Purpose
struct fib_table | One routing table; Linux has main + local by default (255 max)
struct fib_info | Nexthop: gateway IP, output device, priority, scope
struct rtable | Per-packet route cache: dst_entry + rt_gateway + rt_iif
struct dst_entry | Embedded in rtable; holds .output() and .input() function pointers
fib_lookup() | Traverse fib_trie → return fib_result (route type + nexthop)
ip rule (policy routing) | Match src/dst/tos → select which FIB table to consult

Core Mechanism — LPM Walkthrough

Background: A router with 800K BGP routes must make a forwarding decision per packet in nanoseconds. A sorted array costs 20 binary-search steps; a full 32-level trie wastes memory. The LC-trie compresses single-child subtrees into multi-bit steps, typically reaching any leaf in 4–6 comparisons.
Example — lookup 10.1.2.99:
Step | Node | Prefix check | Action
1 | root | 0.0.0.0/0 | match — record candidate
2 | tnode /8 | 10.0.0.0/8 | match — better candidate
3 | tnode /16 | 10.1.0.0/16 | match — better candidate
4 | leaf /24 | 10.1.2.0/24 | match — longest, return nexthop 10.1.2.1

Minimal C Demo — LPM Route Lookup

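A minimal sketch of LPM over the table from the walkthrough. A linear scan with explicit masks stands in for the LC-trie, and the nexthop strings are illustrative.

```c
/* Sketch of longest-prefix-match over a tiny routing table. */
#include <stdio.h>
#include <stdint.h>

struct route { uint32_t prefix; int len; const char *nexthop; };

static uint32_t ip(int a, int b, int c, int d) {
    return ((uint32_t)a << 24) | (b << 16) | (c << 8) | d;
}

int main(void) {
    struct route tbl[] = {
        { 0,            0,  "default gw" },
        { ip(10,0,0,0), 8,  "10.0.0.1"   },
        { ip(10,1,0,0), 16, "10.1.0.1"   },
        { ip(10,1,2,0), 24, "10.1.2.1"   },
    };
    uint32_t dst = ip(10,1,2,99);

    const struct route *best = NULL;
    for (unsigned i = 0; i < sizeof tbl / sizeof *tbl; i++) {
        /* mask of tbl[i].len leading ones; len==0 means match-all */
        uint32_t mask = tbl[i].len ? ~0u << (32 - tbl[i].len) : 0;
        if ((dst & mask) == tbl[i].prefix && (!best || tbl[i].len > best->len))
            best = &tbl[i];
    }
    printf("10.1.2.99 -> /%d via %s\n", best->len, best->nexthop); /* /24 via 10.1.2.1 */
    return 0;
}
```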

8. § 7.7 — Neighbor Subsystem (ARP/NDP)

Before a packet can leave via the NIC, the kernel must know the next-hop's MAC address. The neighbour subsystem caches IP→MAC mappings in a per-device hash table. ARP (IPv4) and NDP (IPv6) are the discovery protocols. Their lifecycle is managed by the NUD state machine (Neighbour Unreachability Detection) which periodically confirms that cached entries are still valid.

NUD State Machine

ARP Request / Reply Flow

Concept | Detail
struct neighbour | One entry per (IP, dev); holds MAC, NUD state, timer, queue of skbs pending resolution
neigh_table | Per-protocol hash table (arp_tbl / nd_tbl); LRU eviction when gc_thresh3 reached
Gratuitous ARP (GARP) | ARP request where sender IP = target IP; announces ownership; used for failover + IP conflict detection
Proxy ARP | Kernel answers ARP on behalf of another host; used in virtual switches (VM behind NAT gateway)
NDP Neighbor Solicitation | IPv6 ARP equivalent; multicast to solicited-node address ff02::1:ff<last 24b of target IP>
NDP Router Advertisement | Router advertises prefix + gateway; hosts use SLAAC to derive IPv6 address (EUI-64 + DAD)

9. § 7.8 — TCP/IP Protocol Deep Dive

TCP provides reliable, ordered, byte-stream delivery over unreliable IP. Its complexity comes from three interlocking mechanisms: a state machine tracking connection lifecycle, a sliding window providing flow control, and congestion control preventing the sender from overwhelming the network.

TCP State Machine

3-Way Handshake

Kernel Function | Role in Handshake
tcp_v4_connect() | Client: build SYN, set SYN_SENT, start connect timer
tcp_rcv_state_process() | Server: receive SYN in LISTEN → allocate request_sock → send SYN-ACK → SYN_RECEIVED
tcp_v4_syn_recv_sock() | Server: receive ACK → tcp_create_openreq_child() → ESTABLISHED → wake accept()
tcp_fastopen_create_child() | TFO: send data with SYN to skip 1 RTT on repeat connections

Congestion Control

Algorithm | Key idea | Default?
CUBIC | cwnd grows as cubic function of time since last loss; fast recovery on high-BDP links | Yes (since 2.6.19)
BBR | Model-based: estimate BtlBw + min RTT; pace at BtlBw rate — doesn't wait for loss | No — opt-in
Reno | Classic: halve cwnd on loss, linear growth; basis for all others | Fallback

SYN Cookie — Defending Against SYN Floods

Background: A SYN flood sends millions of SYNs with spoofed IPs. Each SYN allocates a struct request_sock in the SYN backlog. When the backlog fills, legitimate SYNs are dropped. SYN cookies encode all connection state in the ISN, eliminating the backlog entry.
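A toy sketch of the cookie idea follows. The mixing function, field layout, and 2-bit MSS index are illustrative stand-ins for the kernel's cryptographic cookie, but the encode-then-verify flow is the same.

```c
/* Toy sketch of a SYN cookie: fold the 4-tuple, a coarse timestamp, and
 * an MSS index into the ISN, then verify it when the final ACK arrives. */
#include <stdio.h>
#include <stdint.h>

static uint32_t mix(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t h = a * 2654435761u ^ b * 40503u ^ c;   /* toy hash, not SipHash */
    return h ^ (h >> 16);
}

/* encode: top bits of hash + 2-bit MSS index in the low bits */
static uint32_t cookie_make(uint32_t saddr, uint32_t daddr,
                            uint32_t ports, uint32_t t, int mss_idx) {
    return (mix(saddr ^ t, daddr, ports) & ~3u) | (mss_idx & 3);
}

int main(void) {
    uint32_t saddr = 0x0a000005, daddr = 0xc0a80001;  /* 10.0.0.5 -> 192.168.0.1 */
    uint32_t ports = (43211u << 16) | 80;
    uint32_t t = 12345;                               /* coarse time counter */

    uint32_t isn = cookie_make(saddr, daddr, ports, t, 2);
    /* ACK arrives: server recomputes from the packet headers + recent t */
    uint32_t check = cookie_make(saddr, daddr, ports, t, isn & 3);
    printf("cookie=%08x  valid=%s  mss_idx=%u\n",
           isn, check == isn ? "yes" : "no", isn & 3);
    return 0;
}
```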

Key TCP Mechanisms

Mechanism | How it works
Nagle Algorithm | Delay small segments until ACK arrives or MSS-sized segment ready. Reduces chatty sends. Disable with TCP_NODELAY for latency-sensitive apps.
TCP_CORK | Hold data until uncorked or MSS full — application-controlled coalescing. Used by HTTP servers to bundle headers + body.
TCP Keepalive | After tcp_keepalive_time (default 2 h) of idle, send keepalive probes. Close if no response. Detects dead peers behind stateful firewalls.
Zero Window Probe | When receiver advertises window=0, sender periodically probes to detect when the window reopens, avoiding deadlock.
PMTU Discovery | Set DF=1; on ICMP Frag Needed, reduce MSS to the path MTU. Avoids fragmentation along the path.
RTT Estimation (Karn) | SRTT = 7/8*SRTT + 1/8*sample. Karn: never sample RTT from retransmitted segments — ambiguous which copy was ACKed.

Minimal C Demo — TCP Congestion Control Simulation

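A minimal Reno-style simulation. The scripted loss events at RTT 10 and 16 and the initial ssthresh are illustrative; fast recovery's temporary cwnd inflation is omitted for brevity.

```c
/* Sketch of Reno-style congestion control: slow start doubles cwnd per
 * RTT, congestion avoidance adds 1 MSS per RTT, 3 dup-ACKs halve it,
 * and an RTO resets to 1. */
#include <stdio.h>

int main(void) {
    double cwnd = 1, ssthresh = 64;   /* in MSS units */

    for (int rtt = 1; rtt <= 20; rtt++) {
        const char *phase;
        if (rtt == 10) {              /* scripted: 3 duplicate ACKs */
            ssthresh = cwnd / 2;
            cwnd = ssthresh;          /* fast retransmit + fast recovery */
            phase = "3 dup ACKs -> fast retransmit";
        } else if (rtt == 16) {       /* scripted: retransmission timeout */
            ssthresh = cwnd / 2;
            cwnd = 1;                 /* back to slow start */
            phase = "RTO -> slow start from 1";
        } else if (cwnd < ssthresh) {
            cwnd *= 2;                /* slow start: exponential */
            phase = "slow start";
        } else {
            cwnd += 1;                /* congestion avoidance: linear */
            phase = "congestion avoidance";
        }
        printf("RTT %2d  cwnd=%6.1f  ssthresh=%5.1f  %s\n", rtt, cwnd, ssthresh, phase);
    }
    return 0;
}
```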

10. § 7.9 — UDP

UDP adds port numbers and an optional checksum to IP, then delivers datagrams directly to the socket. No connection setup, no retransmission, no ordering. The receive path is primarily a 4-tuple hash lookup to find the right socket, followed by an skb enqueue.

Feature | Detail
4-tuple demux | udp4_lib_lookup(): hash(src IP, src port, dst IP, dst port) → linked list of matching sockets
Multicast delivery | ip_check_mc_rcu(): fan-out to all sockets joined to the group; IGMP controls group membership at the router
UDP checksum | Optional in IPv4 (0 = disabled); mandatory in IPv6. Offloaded via NETIF_F_HW_CSUM on capable NICs
RCVBUF overflow | sk_rcvbuf limit: if receive queue full, skb silently dropped — no retransmission, unlike TCP
SO_REUSEPORT | Multiple sockets bind same port; kernel distributes by hash — enables multi-threaded UDP servers without lock contention
UDP GRO | Coalesce UDP packets from same flow into one skb — reduces per-packet overhead for QUIC/WireGuard tunnels

11. § 7.10 — Netfilter & NAT

Netfilter is Linux's packet filtering framework. It inserts five hook points on the IP packet path where kernel modules (iptables, nftables, conntrack) can inspect or modify packets. Connection tracking (nf_conntrack) maintains a state table so that stateful firewalls and NAT can match reply packets to their original flows.

Hook Points on the Packet Path

Connection Tracking States

nf_conn field | Type | Purpose
tuplehash[IP_CT_DIR_ORIGINAL] | struct nf_conntrack_tuple_hash | Original direction tuple: src/dst IP + port + proto
tuplehash[IP_CT_DIR_REPLY] | struct nf_conntrack_tuple_hash | Reply direction tuple — reverse of original; used to match reply packets
status | unsigned long | Bitfield: IPS_CONFIRMED, IPS_NAT_MASQ, IPS_SEEN_REPLY, IPS_ASSURED
timeout | struct timer_list | Per-state timeout: TCP ESTABLISHED=432000 s, UDP=30 s, ICMP=30 s
nat.info | struct nf_nat_conn_info | Stores allocated NAT port and IP (for SNAT/DNAT)
proto | union nf_conntrack_proto | Protocol-specific state (TCP: window scale, sequence numbers)

Core Mechanism — SNAT Port Selection

Background: Multiple VMs share one public IP. Each outgoing TCP connection must be given a unique (src_ip, src_port) on the public side so that return traffic can be demultiplexed back to the correct VM.
Plan:
  1. First packet hits POSTROUTING hook; nf_nat module sees no existing entry
  2. nf_nat_get_unique_tuple() picks an ephemeral port (1024–65535) not already used by another conn
  3. Kernel rewrites skb src IP + src port; updates IP/TCP checksums
  4. Stores original ↔ NATted tuple pair in nf_conn
  5. Reply arrives with dst = NATted port; conntrack lookup finds the entry; reverse rewrite restores original dst

Minimal C Demo — SNAT Connection Tracker

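A minimal sketch of the port-selection logic from the plan above. The fixed-size flow table and linear probing stand in for conntrack's hash tables, and the IPs and ports are illustrative.

```c
/* Sketch of SNAT port selection: keep a table of NATted flows and pick
 * the first free public-side port for each new connection. */
#include <stdio.h>
#include <stdint.h>

#define MAX_CONN 64

struct flow { uint32_t vm_ip; uint16_t vm_port; uint16_t nat_port; int used; };
static struct flow table[MAX_CONN];

static int nat_port_in_use(uint16_t port) {
    for (int i = 0; i < MAX_CONN; i++)
        if (table[i].used && table[i].nat_port == port) return 1;
    return 0;
}

/* first packet of a flow: allocate a unique public port */
static uint16_t snat_alloc(uint32_t vm_ip, uint16_t vm_port) {
    uint16_t p = vm_port >= 1024 ? vm_port : 1024;  /* try the original first */
    while (nat_port_in_use(p)) p++;                 /* walk until free */
    for (int i = 0; i < MAX_CONN; i++) {
        if (!table[i].used) {
            table[i] = (struct flow){ vm_ip, vm_port, p, 1 };
            break;
        }
    }
    return p;
}

int main(void) {
    /* two VMs pick the same ephemeral source port */
    uint16_t a = snat_alloc(0x0a000005, 40000);  /* VM 10.0.0.5 */
    uint16_t b = snat_alloc(0x0a000006, 40000);  /* VM 10.0.0.6 collides */
    printf("10.0.0.5:40000 -> public:%u\n", a);  /* keeps 40000 */
    printf("10.0.0.6:40000 -> public:%u\n", b);  /* bumped to 40001 */
    return 0;
}
```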

12. § 7.11 — Network Bridge

The Linux kernel bridge (br_*) implements an IEEE 802.1D Ethernet bridge entirely in software. Each port is a net_device enslaved to a bridge device. Frames are forwarded using a MAC learning table (FDB — Forwarding Database). Unknown destination MACs are flooded to all ports; known MACs are forwarded unicast.

FDB Learning & Forwarding

Struct / Function | Purpose
struct net_bridge | Master bridge device — holds port list, FDB hash table, STP state
struct net_bridge_port | One per enslaved interface — port state (forwarding/blocking/learning), STP timers
struct net_bridge_fdb_entry | One FDB entry: MAC → port, ageing timer (default 300 s)
br_fdb_update() | Called on every received frame to learn src MAC → ingress port mapping
br_forward() / br_flood() | Unicast forward to known port or flood to all ports (minus ingress)
ebtables | L2 filtering framework — analogous to iptables but operates on Ethernet frames

Minimal C Demo — Bridge MAC Learning Table

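A minimal sketch of FDB learning and forwarding. The fixed-size array, port numbers, and MACs are illustrative, and ageing and STP are omitted.

```c
/* Sketch of bridge MAC learning: learn src MAC -> ingress port on every
 * frame, then forward by dst MAC lookup or flood on a miss. */
#include <stdio.h>
#include <string.h>

#define FDB_SZ 16

struct fdb_entry { unsigned char mac[6]; int port; int used; };
static struct fdb_entry fdb[FDB_SZ];

static void fdb_update(const unsigned char *mac, int port) {
    for (int i = 0; i < FDB_SZ; i++)
        if (fdb[i].used && !memcmp(fdb[i].mac, mac, 6)) { fdb[i].port = port; return; }
    for (int i = 0; i < FDB_SZ; i++)
        if (!fdb[i].used) { memcpy(fdb[i].mac, mac, 6); fdb[i].port = port; fdb[i].used = 1; return; }
}

static int fdb_lookup(const unsigned char *mac) {
    for (int i = 0; i < FDB_SZ; i++)
        if (fdb[i].used && !memcmp(fdb[i].mac, mac, 6)) return fdb[i].port;
    return -1;  /* unknown: caller floods */
}

static void rx_frame(const unsigned char *src, const unsigned char *dst, int in_port) {
    fdb_update(src, in_port);           /* learn on every received frame */
    int out = fdb_lookup(dst);
    if (out < 0)             printf("port %d: dst unknown -> flood all but %d\n", in_port, in_port);
    else if (out == in_port) printf("port %d: dst on same port -> filter\n", in_port);
    else                     printf("port %d: dst known -> forward to port %d\n", in_port, out);
}

int main(void) {
    unsigned char A[6] = {0,0,0,0,0,0xA}, B[6] = {0,0,0,0,0,0xB};
    rx_frame(A, B, 1);   /* B unknown -> flood; A learned on port 1 */
    rx_frame(B, A, 2);   /* A known -> forward to port 1; B learned on 2 */
    rx_frame(A, B, 1);   /* B known -> forward to port 2 */
    return 0;
}
```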

13. § 7.12 — GRE & VXLAN Tunnels

Overlay tunnels encapsulate one packet inside another so that tenant traffic can traverse an underlay network without leaking tenant MAC/IP addresses. VXLAN (Virtual eXtensible LAN) wraps an Ethernet frame in UDP/IP, adding a 24-bit VNI (VXLAN Network Identifier) to separate up to 16 million tenant networks over the same physical fabric.

VXLAN Encapsulation Format

VXLAN RX Path

Concept | Detail
VTEP (VXLAN Tunnel End Point) | The host NIC + vxlan driver that adds/strips the outer UDP/IP + VXLAN header
VNI (VXLAN Network Identifier) | 24-bit tenant ID — analogous to VLAN ID but 16M namespaces vs 4096
VXLAN port 4789 | IANA-assigned UDP destination port for all VXLAN traffic
Outer UDP src port | Hash of inner 5-tuple (for ECMP load balancing across LAG / ECMP paths)
Underlay requirement | MTU ≥ inner MTU + 50 bytes overhead — typically set to 1550 or jumbo frames
GRE vs VXLAN | GRE: IP proto 47, no src port → no ECMP. VXLAN: UDP → ECMP-friendly. GENEVE extends VXLAN with TLV options
Checksum offload | vxlan driver sets NETIF_F_GSO_UDP_TUNNEL so NIC can offload inner TCP segmentation

Minimal C Demo — VXLAN Header Encode / Decode

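A sketch of the 8-byte VXLAN header per RFC 7348. The struct and field names are illustrative, but the I-flag and 24-bit VNI placement follow the RFC.

```c
/* Sketch of VXLAN header encode/decode (RFC 7348): 8 bytes of flags
 * (I bit = valid VNI), reserved fields, and the 24-bit VNI shifted
 * left by 8. Byte order is network (big-endian) on the wire. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

struct vxlanhdr {
    uint32_t flags_resv;   /* bit pattern 0x08 in first byte = I flag */
    uint32_t vni_resv;     /* VNI in the top 24 bits */
};

static void vxlan_encode(struct vxlanhdr *h, uint32_t vni) {
    h->flags_resv = htonl(0x08000000);   /* I flag set, rest reserved */
    h->vni_resv   = htonl(vni << 8);     /* 24-bit VNI, low byte reserved */
}

static int vxlan_decode(const struct vxlanhdr *h, uint32_t *vni) {
    if (!(ntohl(h->flags_resv) & 0x08000000)) return -1;  /* I flag missing */
    *vni = ntohl(h->vni_resv) >> 8;
    return 0;
}

int main(void) {
    struct vxlanhdr h;
    uint32_t vni;

    vxlan_encode(&h, 5001);              /* tenant network 5001 */
    if (vxlan_decode(&h, &vni) == 0)
        printf("decoded VNI = %u (max %u tenants)\n", vni, (1u << 24) - 1);
    return 0;
}
```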

14. § 7.13 — IPv6

IPv6 uses 128-bit addresses and eliminates broadcast in favor of multicast. The Neighbor Discovery Protocol (NDP) replaces ARP; Router Advertisements (RA) distribute prefixes so hosts can self-configure via SLAAC without a DHCP server.

IPv6 Address Types & Scope

SLAAC — Stateless Address Autoconfiguration

NDP Message | ICMPv6 Type | Purpose
Router Solicitation (RS) | 133 | Host → ff02::2 asking for prefix + gateway info
Router Advertisement (RA) | 134 | Router → ff02::1 advertising prefix, MTU, default route; triggers SLAAC
Neighbor Solicitation (NS) | 135 | "Who has this IP?" (like ARP request); also used for DAD
Neighbor Advertisement (NA) | 136 | "I have this IP" (like ARP reply); Target Link-Layer Address (TLLA) option carries the MAC
Redirect | 137 | Router tells host about a better next-hop for a destination

Minimal C Demo — EUI-64 Interface ID from MAC

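A small sketch of the EUI-64 derivation described above; the MAC address is illustrative.

```c
/* Sketch of EUI-64 interface-ID generation: split the MAC-48, insert
 * FF:FE in the middle, and flip the universal/local bit. */
#include <stdio.h>

int main(void) {
    unsigned char mac[6] = {0x52, 0x54, 0x00, 0x12, 0x34, 0x56};
    unsigned char iid[8];

    iid[0] = mac[0] ^ 0x02;   /* flip U/L bit (bit 1 of first octet) */
    iid[1] = mac[1];
    iid[2] = mac[2];
    iid[3] = 0xFF;            /* FF:FE inserted in the middle of the MAC */
    iid[4] = 0xFE;
    iid[5] = mac[3];
    iid[6] = mac[4];
    iid[7] = mac[5];

    /* with prefix 2001:db8:1::/64 this IID would form the full address */
    printf("MAC 52:54:00:12:34:56 -> IID %02x%02x:%02xff:fe%02x:%02x%02x\n",
           iid[0], iid[1], iid[2], iid[5], iid[6], iid[7]);
    return 0;
}
```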

15. § 7.14 — Multicast (IGMP & MLD)

Multicast delivers one stream to many receivers without per-receiver replication at the source. IGMP (IPv4) and MLD (IPv6) are the group management protocols that hosts use to join/leave groups. PIM-SM builds the distribution tree. The RPF check prevents routing loops by verifying that multicast traffic arrives on the correct upstream interface.

IGMP Join Flow (SSM — Source-Specific Multicast)

RPF Check — Loop Prevention

Concept | Detail
IGMP v1/v2/v3 | v1: join only. v2: + leave (prune faster). v3: SSM — report specific source(s) to receive from
MLD v1/v2 | IPv6 multicast listener discovery — MLDv2 adds SSM like IGMPv3; uses ICMPv6 types 130–132
PIM-SM (Sparse Mode) | Builds shared tree via Rendezvous Point (RP), then optionally switches to source tree (SPT)
Shared vs source tree: (*,G) vs (S,G) | (*,G) = any source for group G — shared RP tree. (S,G) = specific source S — lowest latency, shortest path
RPF interface | The interface that unicast routing would use to reach the multicast source — packet must arrive here
mfc_cache | Multicast Forwarding Cache entry: (src, group) → OIF list (output interfaces)
IGMP snooping | Switch learns group membership from IGMP frames — prevents flooding multicast to all ports

Minimal C Demo — RPF Check

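A minimal sketch of the RPF decision. The two-entry RIB and interface numbers are illustrative assumptions.

```c
/* Sketch of the RPF check: look up the unicast route back to the
 * multicast source and require that the packet arrived on that interface. */
#include <stdio.h>
#include <stdint.h>

struct route { uint32_t prefix, mask; int oif; };

static const struct route rib[] = {
    { 0x0a010000, 0xffff0000, 1 },  /* 10.1.0.0/16 via if1 */
    { 0x0a020000, 0xffff0000, 2 },  /* 10.2.0.0/16 via if2 */
};

/* which interface would unicast routing use to reach addr? */
static int unicast_oif(uint32_t addr) {
    for (unsigned i = 0; i < sizeof rib / sizeof *rib; i++)
        if ((addr & rib[i].mask) == rib[i].prefix) return rib[i].oif;
    return -1;
}

static void rpf_check(uint32_t src, int in_if) {
    int expect = unicast_oif(src);
    if (expect == in_if)
        printf("src 0x%08x on if%d: RPF pass -> forward to OIF list\n", src, in_if);
    else
        printf("src 0x%08x on if%d: RPF fail (expected if%d) -> drop\n", src, in_if, expect);
}

int main(void) {
    rpf_check(0x0a010105, 1);   /* 10.1.1.5 arriving on if1: pass */
    rpf_check(0x0a010105, 2);   /* same source looped back via if2: drop */
    return 0;
}
```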

16. § 7.15 — DHCP Protocol & Proxy DHCP

DHCP (Dynamic Host Configuration Protocol) automates IP address assignment via a four-message exchange called DORA. In cloud VPC environments, virtual switches implement a DHCP proxy (answering on the server's behalf) that intercepts DISCOVER packets and synthesizes replies from a local config database — eliminating dependency on a real DHCP server and giving the hypervisor full control over tenant addressing.

DHCP DORA Sequence (with Relay)

Virtual Switch DHCP Proxy

DHCP Option | Code | Purpose
Subnet Mask | 1 | Netmask for assigned IP (e.g., 255.255.255.0)
Router (gateway) | 3 | Default gateway IP — must be set for off-subnet traffic
DNS Servers | 6 | Up to 8 DNS server IPs
Lease Time | 51 | Seconds until IP expires; use 0xFFFFFFFF for infinite (VPC)
DHCP Message Type | 53 | 1=DISCOVER 2=OFFER 3=REQUEST 5=ACK 6=NAK
Server Identifier | 54 | DHCP server IP — client uses this to select among multiple offers
Relay Agent Info | 82 | Sub-options: circuit-id (ingress port), remote-id (switch MAC) — added by relay

DHCPv6 — Stateful vs Stateless

Mode | Flow | Address Source
Stateful (IA_NA) | Solicit → Advertise → Request → Reply | DHCPv6 server assigns full /128
Stateless (SLAAC + options) | RA → SLAAC address + DHCPv6 for options only | Host self-configures via EUI-64; DHCPv6 supplies DNS/NTP

Minimal C Demo — DHCP DORA State Machine

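A minimal sketch of the client-side DORA state machine. States, events, and the scripted sequence are illustrative; timers and retransmissions are omitted.

```c
/* Sketch of the client-side DORA state machine with scripted events. */
#include <stdio.h>

enum state { INIT, SELECTING, REQUESTING, BOUND };
enum event { SEND_DISCOVER, RECV_OFFER, SEND_REQUEST, RECV_ACK };

static const char *names[] = { "INIT", "SELECTING", "REQUESTING", "BOUND" };

static enum state step(enum state s, enum event e) {
    switch (s) {
    case INIT:       if (e == SEND_DISCOVER) return SELECTING;  break;
    case SELECTING:  if (e == RECV_OFFER)    return REQUESTING; break;
    case REQUESTING: if (e == SEND_REQUEST)  return REQUESTING; /* wait for ACK */
                     if (e == RECV_ACK)      return BOUND;      break;
    case BOUND:      break;                  /* renew at T1 (not modeled) */
    }
    return s;
}

int main(void) {
    enum event script[] = { SEND_DISCOVER, RECV_OFFER, SEND_REQUEST, RECV_ACK };
    enum state s = INIT;
    printf("start: %s\n", names[s]);
    for (unsigned i = 0; i < 4; i++) {
        s = step(s, script[i]);
        printf("event %u -> %s\n", i, names[s]);
    }
    return 0;
}
```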

17. § 7.16 — Routing Protocols: RIP, OSPF, BGP

Dynamic routing protocols allow routers to exchange reachability information and converge on consistent forwarding tables. RIP is distance-vector (hop count); OSPF is link-state (Dijkstra SPF); BGP is path-vector (AS-path, policy-driven). In cloud infrastructure, BGP (via BIRD or FRR) is used for advertising VM IP prefixes during live migration, anycast VIP advertisement, and cross-datacenter routing.

Protocol | Type | Metric | Algorithm | Scope
RIP v2 | Distance-vector | Hop count (max 15) | Bellman-Ford; periodic full-table broadcast every 30 s | Small LANs; obsolete in production
OSPF v2 | Link-state | Cost (bandwidth-based) | Dijkstra SPF; LSA flooding within area | Enterprise / datacenter IGP
BGP-4 | Path-vector | AS_PATH + policy attributes | Policy-driven best-path selection | Inter-AS Internet routing; datacenter BGP-only fabric

OSPF Neighbor State Machine

OSPF Area Hierarchy

LSA Type | Name | Flooded to | Purpose
Type 1 | Router LSA | Within area | Each router's links and their costs
Type 2 | Network LSA | Within area | DR-generated: lists all routers on multi-access segment
Type 3 | Summary LSA | Other areas | ABR advertises inter-area prefixes into backbone/areas
Type 4 | ASBR Summary LSA | Other areas | ABR advertises path to ASBR
Type 5 | AS-External LSA | Entire OSPF domain | ASBR redistributes external routes (blocked by Stub areas)
Type 7 | NSSA External LSA | NSSA area only | Local ASBR in NSSA; ABR translates to Type 5 at border

BGP State Machine

BGP Best-Path Selection (in order)

BGP Attribute | Type | Meaning
AS_PATH | Well-known mandatory | Ordered list of ASes the route has traversed — loop prevention + path length metric
NEXT_HOP | Well-known mandatory | IP of the next-hop router; iBGP peers must resolve this via IGP
LOCAL_PREF | Well-known discretionary | Set within an AS to prefer one exit point over another; higher wins
MED | Optional non-transitive | Multi-Exit Discriminator — hint to neighboring AS which entry to prefer; lower wins
COMMUNITY | Optional transitive | 32-bit tag for grouping routes; used for policy (e.g., blackhole, no-export)
ORIGIN | Well-known mandatory | Route origin: IGP (i) < EGP (e) < Incomplete (?) — lower preferred

RIP — Count-to-Infinity Problem

When a route becomes unreachable, RIP routers may advertise stale hop counts back to each other, incrementing forever until max (16 = infinity). Mitigations: split horizon (never advertise a route back out the interface it was learned on), poison reverse (advertise with metric 16 back on that interface — faster convergence), and triggered updates (send immediately on topology change rather than waiting 30 s).

BIRD — BGP route publishing in VM live migration

During VM live migration, the destination hypervisor needs to attract traffic for the migrated VM's IP without waiting for ARP to age out on every ToR switch. The hypervisor runs a BIRD BGP daemon that announces the VM's /32 host route to the ToR via iBGP/eBGP. The ToR installs a more-specific /32 that overrides the /24 subnet route, redirecting traffic immediately. After migration completes, the source hypervisor withdraws its announcement.

Minimal C Demo — BGP Best-Path Selection

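A minimal sketch of the top of the best-path comparison (LOCAL_PREF, AS_PATH length, ORIGIN, MED). The candidate routes are illustrative, and the later tie-breaks from the list above are omitted.

```c
/* Sketch of the first steps of BGP best-path selection: LOCAL_PREF
 * (higher wins), then AS_PATH length (shorter wins), then ORIGIN
 * (lower wins), then MED (lower wins, same neighbor AS assumed). */
#include <stdio.h>

struct bgp_route {
    const char *peer;
    int local_pref;     /* higher wins */
    int as_path_len;    /* shorter wins */
    int origin;         /* 0=IGP 1=EGP 2=Incomplete, lower wins */
    int med;            /* lower wins */
};

/* return the preferred of two routes per the ordered tie-break */
static const struct bgp_route *better(const struct bgp_route *a,
                                      const struct bgp_route *b) {
    if (a->local_pref != b->local_pref) return a->local_pref > b->local_pref ? a : b;
    if (a->as_path_len != b->as_path_len) return a->as_path_len < b->as_path_len ? a : b;
    if (a->origin != b->origin) return a->origin < b->origin ? a : b;
    if (a->med != b->med) return a->med < b->med ? a : b;
    return a;  /* further tie-breaks (eBGP > iBGP, router-ID) omitted */
}

int main(void) {
    struct bgp_route r[] = {
        { "peer-A", 100, 3, 0, 50 },
        { "peer-B", 200, 5, 0, 10 },   /* highest LOCAL_PREF, longest path */
        { "peer-C", 200, 4, 0, 10 },
    };
    const struct bgp_route *best = &r[0];
    for (int i = 1; i < 3; i++) best = better(best, &r[i]);
    printf("best path via %s (LOCAL_PREF=%d, AS_PATH len=%d)\n",
           best->peer, best->local_pref, best->as_path_len);
    return 0;
}
```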

18. Kernel Source Pointers

Concept | File | Key Function / Symbol
struct socket / sock | include/linux/net.h, include/net/sock.h | sock_alloc(), sk_alloc()
struct tcp_sock | include/linux/tcp.h | tcp_sk(sk) cast macro
Socket creation | net/socket.c | sys_socket() → __sock_create() → inet_create()
struct sk_buff | include/linux/skbuff.h | alloc_skb(), skb_put(), skb_push(), skb_pull(), skb_clone()
skb_shared_info | include/linux/skbuff.h | skb_shinfo(skb) macro → (skb_shared_info*)skb->end
GSO | net/core/gso.c | __skb_gso_segment(), skb_gso_reset()
GRO | net/core/gro.c | napi_gro_receive(), dev_gro_receive()
TX path | net/core/dev.c | dev_queue_xmit() → __dev_queue_xmit() → sch_direct_xmit()
RX path | net/core/dev.c | netif_receive_skb() → __netif_receive_skb() → deliver_skb()
NAPI | net/core/dev.c | napi_schedule(), net_rx_action(), napi_complete_done()
struct net_device | include/linux/netdevice.h | alloc_etherdev(), register_netdev()
TC qdisc | net/sched/sch_generic.c | qdisc_enqueue(), qdisc_dequeue_head()
HTB | net/sched/sch_htb.c | htb_enqueue(), htb_dequeue(), htb_charge_class()
fq_codel | net/sched/sch_fq_codel.c | fq_codel_enqueue(), codel_should_drop()
IP receive | net/ipv4/ip_input.c | ip_rcv(), ip_rcv_finish(), ip_local_deliver()
IP forward / output | net/ipv4/ip_forward.c, ip_output.c | ip_forward(), ip_output(), ip_finish_output2()
IP fragmentation | net/ipv4/ip_output.c, ip_fragment.c | ip_fragment(), ip_defrag(), struct ipq
FIB trie / LPM | net/ipv4/fib_trie.c | fib_table_lookup(), tnode_get_child_rcu()
Route lookup | net/ipv4/route.c | ip_route_input_noref(), ip_route_output_key(), alloc_cache_rt()
Neighbour / ARP | net/core/neighbour.c, net/ipv4/arp.c | neigh_lookup(), neigh_update(), arp_send()
TCP state machine | net/ipv4/tcp_input.c | tcp_rcv_state_process(), tcp_v4_rcv()
TCP handshake | net/ipv4/tcp_ipv4.c | tcp_v4_connect(), tcp_v4_syn_recv_sock()
SYN cookie | net/ipv4/syncookies.c | cookie_v4_check(), __cookie_v4_init_sequence()
Congestion control | net/ipv4/tcp_cong.c, tcp_cubic.c | tcp_cong_ops, bictcp_cong_avoid()
UDP receive | net/ipv4/udp.c | udp_rcv(), udp4_lib_rcv(), udp_queue_rcv_skb()
Netfilter hooks | net/netfilter/core.c | nf_hook_slow(), nf_register_net_hook()
Connection tracking | net/netfilter/nf_conntrack_core.c | nf_conntrack_in(), init_conntrack(), __nf_ct_refresh_acct()
NAT | net/netfilter/nf_nat_core.c | nf_nat_packet(), nf_nat_get_unique_tuple(), nf_nat_setup_info()
iptables / nftables | net/ipv4/netfilter/ip_tables.c, net/netfilter/nf_tables_core.c | ipt_do_table(), nft_do_chain()
Network bridge | net/bridge/br.c, br_fdb.c, br_forward.c | br_handle_frame(), br_fdb_update(), br_forward()
STP | net/bridge/br_stp.c | br_stp_enable_port(), br_received_config_bpdu()
VXLAN | drivers/net/vxlan/vxlan_core.c | vxlan_rcv(), vxlan_xmit(), vxlan_fdb_find()
GRE tunnel | net/ipv4/ip_gre.c | ipgre_rcv(), ipgre_xmit()
IPv6 receive | net/ipv6/ip6_input.c | ipv6_rcv(), ip6_rcv_finish()
NDP / RA | net/ipv6/ndisc.c | ndisc_recv_ra(), ndisc_recv_ns(), addrconf_dad_start()
SLAAC / addrconf | net/ipv6/addrconf.c | addrconf_prefix_rcv(), ipv6_generate_eui64()
IGMP | net/ipv4/igmp.c | igmp_rcv(), ip_mc_join_group(), igmpv3_sendpack()
MLD | net/ipv6/mcast.c | mld_rcv(), ipv6_sock_mc_join()
Multicast forwarding | net/ipv4/ipmr.c | ip_mr_forward(), ipmr_cache_find(), mfc_cache_put()
OSPF (user-space) | FRR: ospfd/ospf_interface.c, BIRD: proto/ospf/ | ospf_hello_send(), ospf_spf_calculate(), ospf_lsa_install()
BGP (user-space) | FRR: bgpd/bgp_best.c, BIRD: proto/bgp/ | bgp_best_selection(), bgp_process(), bgp_update_main()
Netlink RIB install | net/core/rtnetlink.c | rtnl_newroute(), fib_new_table(), ip_rt_ioctl()
DHCP (user-space) | isc-dhcp client, dhcpcd, systemd-networkd | Sends DISCOVER via raw socket (AF_PACKET), listens on udp/68

19. Interview Prep

Q: Walk through the sk_buff memory layout — what do head, data, tail, end point to?

head/end bound the allocated buffer (never move). data points to the start of current packet data; tail to the end. Headroom (head→data) is reserved for TX header prepending via skb_push(). skb_shared_info lives at end for scatter-gather fragments and GSO metadata.

Q: What is NAPI and why does it improve throughput at high PPS?

NAPI switches from per-packet hard IRQs to a polling model under load. The first packet triggers a hard IRQ; the handler disables NIC RX interrupts and schedules a softirq poll. The poll processes up to budget packets per call. When the ring drains, the NIC interrupt is re-armed. This amortizes interrupt overhead: at 14 Mpps, per-packet IRQs would consume 100% CPU in interrupt context.

Q: Describe the full TX path from write() to NIC ring.

write()/sendmsg() → tcp_sendmsg() (segments, adds TCP header) → ip_queue_xmit() (IP header + route lookup) → dev_queue_xmit() (enqueue to qdisc) → qdisc dequeues → ndo_start_xmit() (driver fills TX ring descriptor) → NIC DMA reads descriptor + payload → TX complete interrupt frees skb.

Q: What is the difference between skb_clone() and skb_copy()?

skb_clone() creates a new sk_buff header pointing to the same data buffer (users++ on shared_info). Zero cost but cloned skb data is read-only. skb_copy() allocates a completely new buffer and copies all data — required before in-place modification (e.g., NAT IP rewrite). pskb_copy() copies the linear area but shares frags[].

Q: How does HTB implement per-VM bandwidth limiting?

HTB builds a class hierarchy. Each leaf class has a rate (guaranteed) and ceil (burst). The dequeue path walks the hierarchy, selects the class with the earliest eligible packet by virtual time (deficit round-robin at each level). A class below rate borrows tokens from its parent up to ceil when the parent has surplus. This ensures minimum guaranteed bandwidth while allowing burst into idle capacity.

Q: How does ip_rcv() decide to forward vs deliver locally?

ip_rcv_finish() calls ip_route_input_noref() which performs a FIB lookup. If the destination matches a local address the dst.input is set to ip_local_deliver(). If IP forwarding is enabled and the route is to a remote host, dst.input is ip_forward(). ip_forward() decrements TTL, checks MTU (fragments if needed), then calls ip_output() to re-enter the TX path.

Q: Explain LPM routing and the fib_trie data structure.

LPM (Longest Prefix Match) returns the routing entry whose prefix is the most specific match for a given destination IP. Linux uses an LC-trie (level-compressed trie) called fib_trie. Internal tnodes index into child arrays using multi-bit slices of the address; leaves hold fib_alias entries. Compression reduces the average lookup depth from 32 to 4–6 steps even with 800K BGP routes.

Q: What is the ARP NUD state machine? What is gratuitous ARP?

NUD (Neighbour Unreachability Detection) tracks the validity of cached IP→MAC entries: INCOMPLETE (waiting for reply) → REACHABLE (confirmed) → STALE (timer expired) → DELAY (wait for upper-layer confirmation) → PROBE (send unicast ARP) → FAILED (drop). Gratuitous ARP is an ARP request where sender IP = target IP — it announces IP ownership, updates neighbour caches on peers, and detects IP conflicts. Used heavily for VIP failover in virtual switches.

Q: Walk through the TCP 3-way handshake with kernel functions.

Client: tcp_v4_connect() sends SYN, state→SYN_SENT. Server: tcp_v4_rcv()→tcp_rcv_state_process() receives SYN in LISTEN, allocates request_sock, sends SYN-ACK, state→SYN_RECEIVED. Client receives SYN-ACK, sends ACK, state→ESTABLISHED. Server: receives ACK, calls tcp_v4_syn_recv_sock()→tcp_create_openreq_child(), state→ESTABLISHED, wakes accept().

Q: What is a SYN cookie and how does it defend against SYN floods?

SYN cookies encode all connection state (src/dst IP, ports, MSS index, timestamp) in the initial sequence number. The server sends SYN-ACK without allocating any backlog entry. When the client's ACK arrives, the server decodes ack_num-1 and verifies the hash — if valid, creates the socket. This eliminates the SYN backlog as an attack surface. Tradeoff: TCP options like SACK and window scaling may not be negotiated, slightly reducing throughput.

Q: Explain TCP slow start and fast retransmit.

Slow start: begin with cwnd=1 MSS, double cwnd each RTT (exponential growth) until cwnd reaches ssthresh, then switch to congestion avoidance (+1 MSS/RTT, linear). Fast retransmit: 3 duplicate ACKs indicate a lost segment without waiting for RTO; retransmit immediately, set ssthresh=cwnd/2, cwnd=ssthresh (fast recovery). RTO timeout is more severe: ssthresh=cwnd/2 and cwnd=1, restarting slow start.

Q: What are the five netfilter hook points and in what order are they traversed for a forwarded packet?

Forwarded path: PREROUTING (conntrack, DNAT) → routing decision → FORWARD (firewall rules) → POSTROUTING (SNAT/masquerade). For locally destined packets: PREROUTING → INPUT → local socket. For locally generated packets: OUTPUT → routing → POSTROUTING. nftables and iptables both register callbacks at these same NF_INET_* hook points.

Q: Explain SNAT port selection — how does nf_nat avoid port collisions?

nf_nat_get_unique_tuple() tries the original port first; if it is already used by another conntrack entry (same dst IP:port), it iterates through the configured port range (1024–65535 by default) hashing through the range until it finds one where no nf_conn with the same (proto, nat_ip, nat_port, dst_ip, dst_port) exists. The result is stored in nf_conn so every subsequent packet in the same flow reuses it.

Q: How does a Linux bridge learn MAC addresses and forward frames?

Every frame received on a bridge port triggers br_fdb_update() which looks up (or creates) an FDB entry keyed on src MAC, records the ingress port, and resets the ageing timer (default 300 s). For forwarding, br_forward() does an FDB lookup on dst MAC — if found, unicast; if not (unknown unicast, broadcast, or multicast), br_flood() copies the skb to all ports except the ingress. STP marks ports as forwarding/blocking to prevent loops.

Q: What is VXLAN and why is UDP used for the outer encapsulation?

VXLAN (RFC 7348) wraps an inner Ethernet frame in an 8-byte VXLAN header (flags + 24-bit VNI) then UDP + IP. UDP is used because: (1) the UDP src port is set to a hash of the inner 5-tuple, enabling ECMP load balancing across LAG/ECMP paths that hash on UDP ports — GRE lacks a src port and cannot ECMP. (2) UDP checksum can be disabled (inner layers handle their own). VTEP overhead is 50 bytes, reducing inner MTU from 1500 to 1450 bytes.

Q: Describe IPv6 SLAAC — how does a host get a global unicast address without DHCPv6?

SLAAC (RFC 4862): (1) Host sends Router Solicitation to ff02::2. (2) Router replies with RA containing prefix (e.g., 2001:db8:1::/64) + valid/preferred lifetimes. (3) Host generates interface ID using EUI-64 from MAC (flip U/L bit, insert FF:FE) or stable privacy extensions (RFC 8981). (4) Combine: prefix + IID = full /128 address. (5) DAD: send NS to the solicited-node multicast address and wait RetransTimer (1 s by default); if no NA reply arrives, the address is unique and assigned.

Q: What is the RPF check in multicast routing and what does it prevent?

Reverse Path Forwarding check: when a multicast packet arrives on interface I from source S, the kernel performs a unicast route lookup for S. If the route to S would exit via a different interface than I, the packet is dropped. This prevents routing loops: in a loop, a multicast packet could bounce between routers indefinitely; RPF ensures the packet arrived from the correct upstream (towards the source tree). Implemented in ipmr.c for IPv4, ip6mr.c for IPv6.

Q: Compare IGMPv2 and IGMPv3 — what does SSM add?

IGMPv2: hosts report group membership (join *,G — any source) and send Leave Group messages, triggering Last Member Query before removing state. IGMPv3 (RFC 3376) adds Source-Specific Multicast (SSM): hosts specify which sources they want to receive from (INCLUDE mode) or exclude (EXCLUDE mode). This allows the network to build source-rooted shortest-path trees (S,G) instead of shared RP trees, eliminating the Rendezvous Point bottleneck and enabling the kernel to drop traffic from unwanted sources at the router before it reaches the receiver.

Q: Explain the DHCP proxy pattern — how does a virtual switch answer DHCP requests on the server's behalf?

When a VM boots and broadcasts a DHCP DISCOVER, the vSwitch intercepts it on the VM's virtual port before forwarding. It looks up the VM's MAC (or port ID) in a config database pre-populated by the VPC control plane with the assigned IP, mask, gateway, DNS, and lease time. The vSwitch synthesizes a DHCP OFFER and sends it directly to the VM without involving a real DHCP server. This gives the hypervisor full control of addressing (critical for VPC isolation), eliminates the single-point-of-failure DHCP server, and allows infinite lease times so VMs never try to renew.

Q: Compare RIP, OSPF, and BGP — when would you use each?

RIP: distance-vector, hop-count metric (max 15), 30-second full-table broadcast, slow convergence, count-to-infinity risk. Only suitable for small flat networks. OSPF: link-state, Dijkstra SPF, fast convergence (sub-second with BFD), area hierarchy scales to large enterprise. The preferred IGP. BGP: path-vector, AS_PATH prevents loops, policy-driven via LOCAL_PREF/MED/COMMUNITY, carries the full Internet table (800K+ routes). Used between ISPs, within large DCs (BGP-only fabric), and for advertising VM host routes during live migration via BIRD/FRR.

Q: Walk through the OSPF neighbor state machine to Full.

Down → Init (Hello received, neighbor's Router-ID not yet in our Hello) → 2-Way (our Router-ID appears in neighbor's Hello — basic bidirectional) → ExStart (master/slave negotiated via DBD, higher Router-ID becomes master) → Exchange (exchange full DBD summaries) → Loading (request missing LSAs via LSR, receive LSU) → Full (databases synchronized — neighbor is now an SPF input). On multi-access Ethernet, only DR/BDR reach Full with all routers; DROthers stay in 2-Way with each other.

Q: Walk through BGP route selection attributes in order.

1. Highest WEIGHT (Cisco-local, not advertised). 2. Highest LOCAL_PREF (intra-AS policy — which exit AS to prefer). 3. Locally originated (network/aggregate > iBGP learned). 4. Shortest AS_PATH (fewer AS hops). 5. Lowest ORIGIN (IGP < EGP < Incomplete). 6. Lowest MED (only compared among routes from same neighbor AS). 7. eBGP over iBGP. 8. Lowest IGP metric to NEXT_HOP. 9. Oldest eBGP route (stability). 10. Lowest BGP Router-ID (deterministic tiebreak). The winning route is installed in RIB and, if changed, triggers UPDATE messages to peers.

Q: How does BIRD publish a VM's /32 host route during live migration?

The destination hypervisor configures BIRD with a static /32 route for the migrating VM's IP pointing to the local vSwitch. BIRD's BGP session to the ToR switch advertises this /32 via iBGP or eBGP (depending on fabric design). The ToR's longest-prefix-match selects /32 over the existing /24 subnet route, redirecting all traffic to the new hypervisor immediately — without waiting for gratuitous ARP to propagate or neighbor caches to expire. When migration completes, the source hypervisor's BIRD withdraws its /32, and the /24 handles the now-unified traffic.