§ 7.1 – 7.16 Linux Network Stack — Sockets to BGP
Socket layer, sk_buff, NAPI (§7.1–7.4) · IP, routing, ARP, TCP/UDP (§7.5–7.9) · Netfilter/NAT, bridge, VXLAN, IPv6, multicast (§7.10–7.14) · DHCP proxy (§7.15) · RIP, OSPF, BGP (§7.16)
1. Overview
The Linux network stack is layered: the socket layer abstracts transport protocols (TCP/UDP) behind a VFS file interface; the sk_buff is the universal packet container that flows between every layer; the net_device abstraction decouples protocol code from hardware drivers; and NAPI (New API) switches from interrupt-driven to polling-driven processing under load, dramatically improving throughput at high PPS.
2. § 7.1 — Socket Layer
Abstraction Layers — fd → file → socket → sock → tcp_sock
A socket descriptor is just a Linux file descriptor. The VFS layer maps it through struct file → struct socket → struct sock. Protocol-specific state (TCP congestion window, sequence numbers) lives in struct tcp_sock, which extends struct sock by embedding it as its first member — so a simple pointer cast promotes between the two types.
| Struct | Key Fields | Purpose |
|---|---|---|
| struct socket | sock, ops, type, state | VFS-visible wrapper; holds proto_ops (inet_stream_ops) and pointer to sock |
| struct sock | sk_state, sk_sndbuf, sk_rcvbuf, sk_write_queue, sk_receive_queue, sk_prot | Protocol-independent socket state; send/receive buffers; wait queues |
| struct inet_sock | inet_saddr, inet_daddr, inet_sport, inet_dport | IPv4-specific addresses and ports, embedded in tcp_sock |
| struct tcp_sock | rcv_nxt, snd_nxt, snd_una, cwnd, ssthresh, retransmit_timer | Full TCP state machine and congestion control variables |
| proto_ops | bind, connect, accept, sendmsg, recvmsg, poll | VFS-level socket operations — dispatch table per address family |
| struct proto | sendmsg, recvmsg, connect, close, hash | Transport-level operations — tcp_prot / udp_prot |
Send/Receive Buffer Watermarks
Each socket has a bounded send buffer (sk_sndbuf) and receive buffer (sk_rcvbuf). When the write queue fills, tcp_sendmsg() blocks the caller (or returns EAGAIN on a non-blocking socket). When the receive queue exceeds sk_rcvbuf, new incoming segments are silently dropped — TCP flow control (receiver window advertisement) prevents this in steady state.
3. § 7.2 — sk_buff: The Core Packet Container
Memory Layout
Every network packet in the kernel is represented by an sk_buff. A single contiguous data buffer is allocated (via kmalloc), and four pointers (head, data, tail, end) divide it into zones: headroom for headers prepended during TX (Ethernet → IP → TCP), data for the current payload, and tailroom for trailers; an skb_shared_info struct sits at the very end of the buffer for paged fragments (scatter-gather I/O) and GSO metadata.
Pointer Operations
| Function | What moves | When used |
|---|---|---|
| skb_reserve(skb, n) | data += n, tail += n | Create headroom before any data is written (TX path, before filling payload) |
| skb_put(skb, n) | tail += n → return old tail | Append n bytes of payload to the end (caller fills returned pointer) |
| skb_push(skb, n) | data -= n → return new data | Prepend n-byte header into headroom (TCP → IP → Ethernet on TX) |
| skb_pull(skb, n) | data += n | Strip n-byte header from front (Ethernet header consumed on RX, then IP, then TCP) |
Clone vs Copy
skb_clone() creates a new header struct pointing to the same data buffer — zero cost for forwarding a packet to multiple consumers (e.g., packet sniffer + routing). The clone is read-only for the data area. skb_copy() allocates a fresh buffer and copies everything — required before modifying packet bytes (e.g., NAT rewrite).
GSO & GRO
| Feature | Direction | What it does |
|---|---|---|
| TSO (TCP Segmentation Offload) | TX | NIC splits one large skb into MTU-sized segments — CPU never touches per-segment headers |
| GSO (Generic Segmentation Offload) | TX | Software fallback when NIC lacks TSO — kernel splits at the last moment before the driver |
| GRO (Generic Receive Offload) | RX | Coalesce many small TCP segments into one large skb in NAPI poll — reduces per-packet overhead |
| LRO (Large Receive Offload) | RX | Hardware coalescing — deprecated; GRO is preferred (GRO is protocol-aware) |
Minimal C Demo — sk_buff Pointer Manipulation
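A minimal user-space sketch of the four-pointer scheme: the struct and helpers below mimic the kernel's skb_reserve/skb_put/skb_push/skb_pull arithmetic, but they are toy re-implementations for illustration, not kernel code.

```c
/* Toy model of sk_buff pointer arithmetic — names mirror the kernel API,
 * but this is plain user-space C, not the real struct sk_buff. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct skb {
    unsigned char *head, *data, *tail, *end;
};

static struct skb *skb_alloc(size_t size) {
    struct skb *s = malloc(sizeof(*s));
    s->head = s->data = s->tail = malloc(size);
    s->end  = s->head + size;
    return s;
}
static void skb_reserve(struct skb *s, size_t n) { s->data += n; s->tail += n; }
static unsigned char *skb_put(struct skb *s, size_t n) {   /* append payload */
    unsigned char *old = s->tail; s->tail += n; assert(s->tail <= s->end); return old;
}
static unsigned char *skb_push(struct skb *s, size_t n) {  /* prepend header */
    s->data -= n; assert(s->data >= s->head); return s->data;
}
static unsigned char *skb_pull(struct skb *s, size_t n) {  /* strip header */
    s->data += n; assert(s->data <= s->tail); return s->data;
}

int main(void) {
    struct skb *s = skb_alloc(2048);
    skb_reserve(s, 64);                         /* headroom for Eth+IP+TCP */
    memcpy(skb_put(s, 100), "payload...", 10);  /* 100-byte payload area   */
    skb_push(s, 20);                            /* TCP header              */
    skb_push(s, 20);                            /* IP header               */
    skb_push(s, 14);                            /* Ethernet header         */
    printf("headroom=%ld len=%ld tailroom=%ld\n",
           (long)(s->data - s->head), (long)(s->tail - s->data),
           (long)(s->end - s->tail));
    skb_pull(s, 14);                            /* RX: consume Ethernet    */
    printf("after pull: len=%ld\n", (long)(s->tail - s->data));
    return 0;
}
```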
4. § 7.3 — Network Device Layer & NAPI
TX Path — write() to NIC Ring
On the TX side the kernel walks a layered dispatch chain: the socket send path calls tcp_sendmsg(), which segments the data and calls ip_queue_xmit() to add the IP header and perform a route lookup. dev_queue_xmit() then hands the skb to the queueing discipline (qdisc) for rate control and shaping. The qdisc dequeues and calls the driver's ndo_start_xmit(), which fills a TX ring descriptor and signals the NIC.
RX Path — NIC Interrupt to Socket Buffer
On receive, the NIC DMA-writes the frame into a pre-allocated ring buffer and raises a hard IRQ. The IRQ handler does the minimum: disable NIC RX interrupt, call napi_schedule() to register a poll callback, and return. The actual packet processing happens in a softirq (NET_RX_SOFTIRQ) outside interrupt context, up to a configurable budget — avoiding the per-packet interrupt overhead that collapses throughput at high PPS.
Core Mechanism — NAPI Budget & Interrupt Coalescing
- First packet: NIC fires hard IRQ → IRQ handler disables NIC RX interrupt + calls napi_schedule()
- Softirq net_rx_action() runs on the same CPU, calls napi.poll(quota)
- Driver processes up to quota descriptors per poll — builds one sk_buff per packet
- If work < quota → ring drained → call napi_complete_done() → re-enable NIC interrupt → back to interrupt-driven
- If work == quota → ring may still have more → softirq re-schedules poll (no interrupt re-arm)
| Round | napi_poll(16) returns | Ring left | Action |
|---|---|---|---|
| 1 | 16 | 24 | work == quota → reschedule poll |
| 2 | 16 | 8 | work == quota → reschedule poll |
| 3 | 8 | 0 | work < quota → napi_complete_done() → re-arm IRQ |
| net_device field | Type | Purpose |
|---|---|---|
| name[IFNAMSIZ] | char[] | Interface name (eth0, ens3, lo) |
| dev_addr[MAX_ADDR_LEN] | unsigned char[] | MAC address |
| mtu | unsigned int | Maximum transmission unit (default 1500) |
| features | netdev_features_t | Offload capability flags: NETIF_F_TSO, NETIF_F_GRO, NETIF_F_HW_CSUM |
| netdev_ops | struct net_device_ops * | Driver operations: ndo_open, ndo_stop, ndo_start_xmit, ndo_get_stats64 |
| napi_list | struct list_head | List of napi_struct — one per queue (multi-queue NICs have one per TX/RX queue pair) |
| tx_queue_len | unsigned long | TX qdisc queue depth limit |
| num_tx_queues / num_rx_queues | unsigned int | Multi-queue NIC: RSS spreads RX across multiple queues, one per CPU |
Minimal C Demo — NAPI Budget Simulation
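A toy simulation of the budget loop described above, using the same numbers as the table (quota of 16, a ring already holding 40 packets). The napi_poll() function here is a stand-in for the driver's poll callback, not real driver code.

```c
/* NAPI budget simulation: drain a ring in quota-sized rounds; only re-arm
 * the interrupt when a round does less work than its quota. */
#include <stdio.h>

#define QUOTA 16

static int napi_poll(int *ring, int quota) {
    int work = (*ring < quota) ? *ring : quota;   /* process up to quota packets */
    *ring -= work;
    return work;
}

int main(void) {
    int ring = 40;                 /* packets already DMA'd into the RX ring */
    int round = 1, irq_armed = 0;
    while (!irq_armed) {
        int work = napi_poll(&ring, QUOTA);
        printf("round %d: work=%d ring_left=%d\n", round++, work, ring);
        if (work < QUOTA) {
            irq_armed = 1;         /* napi_complete_done(): re-enable NIC IRQ */
            printf("ring drained -> re-arm interrupt\n");
        }                          /* else: softirq reschedules the poll */
    }
    return 0;
}
```

Running it reproduces the three rounds from the table: 16, 16, then 8 packets, after which the interrupt is re-armed.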
5. § 7.4 — Traffic Control (TC) & QoS
Linux Traffic Control (TC) inserts a queueing discipline (qdisc) between the network layer and the NIC driver. Every packet leaving via dev_queue_xmit() passes through the root qdisc. TC enables per-VM bandwidth limiting in cloud hypervisors, ingress policing, and fair queueing between flows — all without modifying application code.
TC Pipeline
Key Qdiscs
| Qdisc | Algorithm | Best for |
|---|---|---|
| pfifo_fast | Three-band FIFO, TOS-based priority | Default; lowest overhead; no rate control |
| tbf (Token Bucket Filter) | Token bucket — tokens accumulate at rate R; burst up to B bytes | Simple rate limiting (e.g., cap a VM to 1 Gbps) |
| htb (Hierarchical Token Bucket) | Class hierarchy with rate + ceil; excess lent to children | Per-VM bandwidth guarantee + burst borrowing in cloud hypervisors |
| fq_codel | Fair Queue + CoDel AQM; per-flow FIFO + delay-based drop | ISP edge; eliminates bufferbloat; default on many Linux routers |
| fq (Fair Queue) | Per-flow pacing using TCP pacing rate; reduces RTT variance | High-throughput servers with many TCP flows |
| netem | Adds delay / jitter / loss / reorder / corruption | Network emulation in test environments |
HTB — Per-VM Bandwidth Limiting
HTB is the standard qdisc for per-tenant bandwidth limiting in cloud hypervisors. Each VM gets an HTB class with a guaranteed rate and an optional burst ceil. When a VM is below its rate it may borrow unused tokens from sibling classes up to ceil — ensuring work-conserving behaviour without permanent starvation.
Scenario: guarantee a VM rate = 400 Mbps and allow burst to ceil = 1 Gbps when the link is idle.
- Create root htb qdisc on eth0: tc qdisc add dev eth0 root handle 1: htb default 10
- Add per-VM class: tc class add dev eth0 parent 1: classid 1:1 htb rate 400mbit ceil 1000mbit
- Attach leaf qdisc (fq_codel for AQM): tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel
- Attach filter to steer VM packets to the class: tc filter add dev eth0 ... match ip src 10.0.0.5 classid 1:1
| struct Qdisc field | Type | Purpose |
|---|---|---|
| ops | struct Qdisc_ops * | enqueue, dequeue, peek, init, reset, destroy callbacks |
| q | struct qdisc_skb_head | Internal packet queue (or class structure for classful qdiscs) |
| rate_est | struct net_rate_estimator * | Rate estimation for tc stats |
| handle | u32 | TC handle — major:minor, 16 bits each (e.g. 1:0 = root, 1:1 = first class) |
| parent | u32 | Parent handle (0 for root) |
| limit | u32 | Max queue depth (for pfifo-like qdiscs) |
6. § 7.5 — IP Layer
The IP layer sits between transport protocols (TCP/UDP) and the device layer. Every packet — locally generated or forwarded — passes through ip_rcv() on the receive side. The routing subsystem attaches a dst_entry to each sk_buff that decides whether the packet goes to a local socket (ip_local_deliver) or is forwarded to another host (ip_forward).
IP Header — Key Fields
| Field | Size | Purpose |
|---|---|---|
| version / IHL | 4b / 4b | Version=4; IHL = header length in 32-bit words (min 5 = 20 bytes) |
| DSCP / ECN | 6b / 2b | Differentiated services (QoS); ECN = explicit congestion notification |
| total length | 16b | Header + payload in bytes (max 65535) |
| identification | 16b | Fragment group ID — all fragments of same datagram share this value |
| flags / frag offset | 3b / 13b | DF (don't fragment), MF (more fragments); offset in 8-byte units |
| TTL | 8b | Decremented each hop; reaches 0 → ICMP TTL Exceeded + drop |
| protocol | 8b | Next header: 6=TCP, 17=UDP, 1=ICMP, 89=OSPF |
| header checksum | 16b | One's complement of IP header; recomputed each hop (TTL changed) |
| src / dst address | 32b each | Source and destination IPv4 addresses |
IP Fragmentation & Reassembly
When a packet exceeds path MTU, the kernel fragments it or drops and sends ICMP. Modern stacks prefer PMTU discovery — set DF=1, let the network signal the bottleneck MTU via ICMP Frag Needed, resend smaller. Fragmentation is expensive: it stresses stateful firewalls and can block the reassembly queue. The kernel reassembles at the destination using per-flow fragment queues (struct ipq) with a 30-second GC timeout.
7. § 7.6 — Routing Subsystem
Linux stores its routing table in a FIB (Forwarding Information Base) implemented as a level-compressed trie (fib_trie). Every route lookup calls fib_lookup(), which traverses the trie from the most-significant bit, always choosing the longest matching prefix.
FIB Trie — Longest Prefix Match
Route Lookup Path
| Struct / Symbol | Purpose |
|---|---|
| struct fib_table | One routing table; Linux has main + local by default (255 max) |
| struct fib_info | Nexthop: gateway IP, output device, priority, scope |
| struct rtable | Per-packet route cache: dst_entry + rt_gateway + rt_iif |
| struct dst_entry | Embedded in rtable; holds .output() and .input() function pointers |
| fib_lookup() | Traverse fib_trie → return fib_result (route type + nexthop) |
| ip rule (policy routing) | Match src/dst/tos → select which FIB table to consult |
Core Mechanism — LPM Walkthrough
| Step | Node | Prefix check | Action |
|---|---|---|---|
| 1 | root | 0.0.0.0/0 | match — record candidate |
| 2 | tnode /8 | 10.0.0.0/8 | match — better candidate |
| 3 | tnode /16 | 10.1.0.0/16 | match — better candidate |
| 4 | leaf /24 | 10.1.2.0/24 | match — longest, return nexthop 10.1.2.1 |
Minimal C Demo — LPM Route Lookup
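A sketch of longest-prefix match over the four routes from the walkthrough above. It uses a linear scan rather than the kernel's LC-trie; the prefixes, destination, and nexthop strings are the example values from the table.

```c
/* Longest-prefix-match over a small static table — the matching rule is the
 * same as fib_trie's, only the data structure differs (linear scan vs trie). */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

struct route { const char *prefix; int len; const char *nexthop; };

static uint32_t ip4(const char *s) {
    struct in_addr a; inet_pton(AF_INET, s, &a); return ntohl(a.s_addr);
}
static uint32_t netmask(int len) { return len ? 0xFFFFFFFFu << (32 - len) : 0; }

int main(void) {
    struct route tbl[] = {
        { "0.0.0.0",  0,  "default gw" },
        { "10.0.0.0", 8,  "10.0.0.1"   },
        { "10.1.0.0", 16, "10.1.0.1"   },
        { "10.1.2.0", 24, "10.1.2.1"   },
    };
    uint32_t dst = ip4("10.1.2.7");
    const struct route *best = NULL;
    for (size_t i = 0; i < sizeof tbl / sizeof tbl[0]; i++) {
        uint32_t m = netmask(tbl[i].len);
        if ((dst & m) == (ip4(tbl[i].prefix) & m))   /* prefix matches        */
            if (!best || tbl[i].len > best->len)     /* keep the longest one  */
                best = &tbl[i];
    }
    if (best)
        printf("10.1.2.7 -> %s/%d via %s\n", best->prefix, best->len, best->nexthop);
    return 0;
}
```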
8. § 7.7 — Neighbor Subsystem (ARP/NDP)
Before a packet can leave via the NIC, the kernel must know the next-hop's MAC address. The neighbour subsystem caches IP→MAC mappings in a per-device hash table. ARP (IPv4) and NDP (IPv6) are the discovery protocols. Their lifecycle is managed by the NUD state machine (Neighbour Unreachability Detection) which periodically confirms that cached entries are still valid.
NUD State Machine
ARP Request / Reply Flow
| Concept | Detail |
|---|---|
| struct neighbour | One entry per (IP, dev); holds MAC, NUD state, timer, queue of skbs pending resolution |
| neigh_table | Per-protocol hash table (arp_tbl / nd_tbl); LRU eviction when gc_thresh3 reached |
| Gratuitous ARP (GARP) | ARP request where sender IP = target IP; announces ownership; used for failover + IP conflict detection |
| Proxy ARP | Kernel answers ARP on behalf of another host; used in virtual switches (VM behind NAT gateway) |
| NDP Neighbor Solicitation | IPv6 ARP equivalent; multicast to solicited-node address ff02::1:ff<last 24b of target IP> |
| NDP Router Advertisement | Router advertises prefix + gateway; hosts use SLAAC to derive IPv6 address (EUI-64 + DAD) |
9. § 7.8 — TCP/IP Protocol Deep Dive
TCP provides reliable, ordered, byte-stream delivery over unreliable IP. Its complexity comes from three interlocking mechanisms: a state machine tracking connection lifecycle, a sliding window providing flow control, and congestion control preventing the sender from overwhelming the network.
TCP State Machine
3-Way Handshake
| Kernel Function | Role in Handshake |
|---|---|
| tcp_v4_connect() | Client: build SYN, set SYN_SENT, start connect timer |
| tcp_rcv_state_process() | Server: receive SYN in LISTEN → allocate request_sock → send SYN-ACK → SYN_RECEIVED |
| tcp_v4_syn_recv_sock() | Server: receive ACK → tcp_create_openreq_child() → ESTABLISHED → wake accept() |
| tcp_fastopen_create_child() | TFO: send data with SYN to skip 1 RTT on repeat connections |
Congestion Control
| Algorithm | Key idea | Default? |
|---|---|---|
| CUBIC | cwnd grows as cubic function of time since last loss; fast recovery on high-BDP links | Yes (since 2.6.19) |
| BBR | Model-based: estimate BtlBw + min RTT; pace at BtlBw rate — doesn't wait for loss | No — opt-in |
| RENO | Classic halve cwnd on loss, linear growth; basis for all others | Fallback |
SYN Cookie — Defending Against SYN Floods
Each half-open connection normally consumes a struct request_sock in the SYN backlog. When the backlog fills, legitimate SYNs are dropped. SYN cookies encode all connection state in the ISN, eliminating the backlog entry.
Key TCP Mechanisms
| Mechanism | How it works |
|---|---|
| Nagle Algorithm | Delay small segments until ACK arrives or MSS-sized segment ready. Reduces chatty sends. Disable with TCP_NODELAY for latency-sensitive apps. |
| TCP_CORK | Hold data until uncorked or MSS full — application-controlled coalescing. Used by HTTP servers to bundle headers + body. |
| TCP Keepalive | After tcp_keepalive_time (default 2h) of idle, send keepalive probes. Close if no response. Detects dead peers behind stateful firewalls. |
| Zero Window Probe | When receiver advertises window=0, sender periodically probes to detect when window reopens, avoiding deadlock. |
| PMTU Discovery | Set DF=1; on ICMP Frag Needed, reduce MSS to the path MTU. Avoids fragmentation along the path. |
| RTT Estimation (Karn) | SRTT = 7/8*SRTT + 1/8*sample. Karn: never sample RTT from retransmitted segments — ambiguous which copy was ACKed. |
Minimal C Demo — TCP Congestion Control Simulation
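A scripted Reno-style cwnd trace illustrating slow start, congestion avoidance, fast retransmit/recovery, and RTO as described above. The loss events are hard-coded rather than measured, and cwnd/ssthresh are expressed in MSS units.

```c
/* Toy congestion-window trace: exponential slow start up to ssthresh,
 * linear congestion avoidance, halve on 3 dup-ACKs, restart on RTO. */
#include <stdio.h>

int main(void) {
    double cwnd = 1, ssthresh = 32;                /* in MSS units */
    for (int rtt = 1; rtt <= 16; rtt++) {
        if (rtt == 8) {                            /* scripted: 3 duplicate ACKs */
            ssthresh = cwnd / 2;
            cwnd = ssthresh;                       /* fast retransmit + fast recovery */
            printf("rtt %2d: dup-ACKs   cwnd=%4.1f ssthresh=%4.1f\n", rtt, cwnd, ssthresh);
            continue;
        }
        if (rtt == 13) {                           /* scripted: RTO timeout */
            ssthresh = cwnd / 2;
            cwnd = 1;                              /* back to slow start */
            printf("rtt %2d: RTO        cwnd=%4.1f ssthresh=%4.1f\n", rtt, cwnd, ssthresh);
            continue;
        }
        const char *phase = (cwnd < ssthresh) ? "slow-start" : "avoidance ";
        if (cwnd < ssthresh) {
            cwnd *= 2;                             /* exponential growth per RTT */
            if (cwnd > ssthresh) cwnd = ssthresh;
        } else {
            cwnd += 1;                             /* linear growth per RTT */
        }
        printf("rtt %2d: %s cwnd=%4.1f ssthresh=%4.1f\n", rtt, phase, cwnd, ssthresh);
    }
    return 0;
}
```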
10. § 7.9 — UDP
UDP adds port numbers and an optional checksum to IP, then delivers datagrams directly to the socket. No connection setup, no retransmission, no ordering. The receive path is primarily a 4-tuple hash lookup to find the right socket, followed by an skb enqueue.
| Feature | Detail |
|---|---|
| 4-tuple demux | udp4_lib_lookup(): hash(src IP, src port, dst IP, dst port) → linked list of matching sockets |
| Multicast delivery | ip_check_mc_rcu(): fan-out to all sockets joined to the group; IGMP controls group membership at the router |
| UDP checksum | Optional in IPv4 (0 = disabled); mandatory in IPv6. Offloaded via NETIF_F_HW_CSUM on capable NICs |
| RCVBUF overflow | sk_rcvbuf limit: if receive queue full, skb silently dropped — no retransmission unlike TCP |
| SO_REUSEPORT | Multiple sockets bind same port; kernel distributes by hash — enables multi-threaded UDP servers without lock contention |
| UDP GRO | Coalesce UDP packets from same flow into one skb — reduces per-packet overhead for QUIC/WireGuard tunnels |
11. § 7.10 — Netfilter & NAT
Netfilter is Linux's packet filtering framework. It inserts five hook points on the IP packet path where kernel modules (iptables, nftables, conntrack) can inspect or modify packets. Connection tracking (nf_conntrack) maintains a state table so that stateful firewalls and NAT can match reply packets to their original flows.
Hook Points on the Packet Path
Connection Tracking States
| nf_conn field | Type | Purpose |
|---|---|---|
| tuplehash[IP_CT_DIR_ORIGINAL] | struct nf_conntrack_tuple_hash | Original direction tuple: src/dst IP + port + proto |
| tuplehash[IP_CT_DIR_REPLY] | struct nf_conntrack_tuple_hash | Reply direction tuple — reverse of original; used to match reply packets |
| status | unsigned long | Bitfield: IPS_CONFIRMED, IPS_NAT_MASQ, IPS_SEEN_REPLY, IPS_ASSURED |
| timeout | struct timer_list | Per-state timeout: TCP ESTABLISHED=432000s, UDP=30s, ICMP=30s |
| nat.info | struct nf_nat_conn_info | Stores allocated NAT port and IP (for SNAT/DNAT) |
| proto | union nf_conntrack_proto | Protocol-specific state (TCP: window scale, sequence numbers) |
Core Mechanism — SNAT Port Selection
- First packet hits POSTROUTING hook; nf_nat module sees no existing entry
- nf_nat_get_unique_tuple() picks an ephemeral port (1024–65535) not already used by another conn
- Kernel rewrites skb src IP + src port; updates IP/TCP checksums
- Stores original ↔ NATted tuple pair in nf_conn
- Reply arrives with dst = NATted port; conntrack lookup finds the entry; reverse rewrite restores original dst
Minimal C Demo — SNAT Connection Tracker
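A user-space sketch of the SNAT flow above: allocate a unique translated port on the first packet of a flow, remember the original ↔ NATted tuple pair, and reverse-translate replies. The struct names only loosely echo nf_conn; all addresses and ports are illustrative.

```c
/* Toy SNAT connection tracker: per-flow port allocation + reply lookup. */
#include <stdio.h>
#include <stdint.h>

struct tuple { uint32_t ip; uint16_t port; };
struct conn  { struct tuple orig, nat; int used; };

#define MAX_CONN 64
static struct conn table[MAX_CONN];
static uint16_t next_port = 1024;                 /* ephemeral SNAT port pool */

/* Outbound direction: reuse an existing mapping or allocate a new one */
static struct conn *snat(struct tuple orig, uint32_t public_ip) {
    for (int i = 0; i < MAX_CONN; i++)
        if (table[i].used && table[i].orig.ip == orig.ip &&
            table[i].orig.port == orig.port)
            return &table[i];                     /* existing flow: reuse mapping */
    for (int i = 0; i < MAX_CONN; i++)
        if (!table[i].used) {
            table[i].orig = orig;
            table[i].nat  = (struct tuple){ public_ip, next_port++ };
            table[i].used = 1;
            return &table[i];
        }
    return NULL;                                  /* table full */
}

/* Reply direction: look up by NATted port, restore the original tuple */
static struct conn *lookup_reply(uint16_t nat_port) {
    for (int i = 0; i < MAX_CONN; i++)
        if (table[i].used && table[i].nat.port == nat_port)
            return &table[i];
    return NULL;
}

int main(void) {
    uint32_t public_ip = 0xC6336401;              /* 198.51.100.1 */
    struct conn *c = snat((struct tuple){ 0x0A000005, 42000 }, public_ip);
    printf("SNAT: 10.0.0.5:42000 -> public port %u\n", c->nat.port);
    struct conn *r = lookup_reply(c->nat.port);
    printf("reply to port %u -> restore 10.0.0.%u:%u\n",
           c->nat.port, r->orig.ip & 0xFF, r->orig.port);
    return 0;
}
```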
12. § 7.11 — Network Bridge
The Linux kernel bridge (br_*) implements an IEEE 802.1D Ethernet bridge entirely in software. Each port is a net_device enslaved to a bridge device. Frames are forwarded using a MAC learning table (FDB — Forwarding Database). Unknown destination MACs are flooded to all ports; known MACs are forwarded unicast.
FDB Learning & Forwarding
| Struct / Function | Purpose |
|---|---|
| struct net_bridge | Master bridge device — holds port list, FDB hash table, STP state |
| struct net_bridge_port | One per enslaved interface — port state (forwarding/blocking/learning), STP timers |
| struct net_bridge_fdb_entry | One FDB entry: MAC → port, ageing timer (default 300 s) |
| br_fdb_update() | Called on every received frame to learn src MAC → ingress port mapping |
| br_forward() / br_flood() | Unicast forward to known port or flood to all ports (minus ingress) |
| ebtables | L2 filtering framework — analogous to iptables but operates on Ethernet frames |
Minimal C Demo — Bridge MAC Learning Table
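A toy FDB sketch of the learn/forward/flood logic: every received frame learns its source MAC on the ingress port, and the destination MAC decides between unicast forwarding and flooding. MAC addresses and port numbers are invented, and the ageing timer is omitted for brevity.

```c
/* Bridge MAC learning table: learn on RX, forward unicast, else flood. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct fdb_entry { uint8_t mac[6]; int port; int valid; };
#define FDB_SIZE 16
static struct fdb_entry fdb[FDB_SIZE];

static void fdb_update(const uint8_t *mac, int port) {       /* learn src MAC */
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].valid && !memcmp(fdb[i].mac, mac, 6)) { fdb[i].port = port; return; }
    for (int i = 0; i < FDB_SIZE; i++)
        if (!fdb[i].valid) {
            memcpy(fdb[i].mac, mac, 6); fdb[i].port = port; fdb[i].valid = 1; return;
        }
}

static int fdb_lookup(const uint8_t *mac) {                  /* -1 = unknown */
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].valid && !memcmp(fdb[i].mac, mac, 6)) return fdb[i].port;
    return -1;
}

static void bridge_rx(const uint8_t *src, const uint8_t *dst, int ingress) {
    fdb_update(src, ingress);
    int out = fdb_lookup(dst);
    if (out < 0)             printf("port %d: unknown dst -> flood\n", ingress);
    else if (out == ingress) printf("port %d: dst on same port -> drop\n", ingress);
    else                     printf("port %d: forward to port %d\n", ingress, out);
}

int main(void) {
    uint8_t a[6] = {0x52,0x54,0,0,0,0x01}, b[6] = {0x52,0x54,0,0,0,0x02};
    bridge_rx(a, b, 1);   /* b unknown -> flood; learn a on port 1          */
    bridge_rx(b, a, 2);   /* a known   -> forward to port 1; learn b on 2   */
    bridge_rx(a, b, 1);   /* b known   -> forward to port 2                 */
    return 0;
}
```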
13. § 7.12 — GRE & VXLAN Tunnels
Overlay tunnels encapsulate one packet inside another so that tenant traffic can traverse an underlay network without leaking tenant MAC/IP addresses. VXLAN (Virtual eXtensible LAN) wraps an Ethernet frame in UDP/IP, adding a 24-bit VNI (VXLAN Network Identifier) to separate up to 16 million tenant networks over the same physical fabric.
VXLAN Encapsulation Format
VXLAN RX Path
| Concept | Detail |
|---|---|
| VTEP (VXLAN Tunnel End Point) | The host NIC + vxlan driver that adds/strips the outer UDP/IP + VXLAN header |
| VNI (VXLAN Network Identifier) | 24-bit tenant ID — analogous to VLAN ID but 16M namespaces vs 4096 |
| VXLAN port 4789 | IANA-assigned UDP destination port for all VXLAN traffic |
| Outer UDP src port | Hash of inner 5-tuple (for ECMP load balancing across LAG / ECMP paths) |
| Underlay requirement | MTU ≥ inner MTU + 50 bytes overhead — typically set to 1550 or jumbo frames |
| GRE vs VXLAN | GRE: IP proto 47, no src port → no ECMP. VXLAN: UDP → ECMP-friendly. GENEVE extends VXLAN with TLV options |
| Checksum offload | vxlan driver sets NETIF_F_GSO_UDP_TUNNEL so NIC can offload inner TCP segmentation |
Minimal C Demo — VXLAN Header Encode / Decode
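A sketch of packing and unpacking the 8-byte VXLAN header (flags byte with the I bit set plus a 24-bit VNI, per RFC 7348). The VNI value 5001 is just an example tenant ID.

```c
/* VXLAN header encode/decode: I flag in byte 0, 24-bit VNI in bytes 4-6,
 * remaining bytes reserved (zero). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void vxlan_encode(uint8_t hdr[8], uint32_t vni) {
    memset(hdr, 0, 8);
    hdr[0] = 0x08;                  /* I flag: VNI field is valid */
    hdr[4] = (vni >> 16) & 0xFF;    /* 24-bit VNI, network byte order */
    hdr[5] = (vni >> 8)  & 0xFF;
    hdr[6] =  vni        & 0xFF;
}

static int vxlan_decode(const uint8_t hdr[8], uint32_t *vni) {
    if (!(hdr[0] & 0x08)) return -1;              /* VNI not marked valid */
    *vni = (hdr[4] << 16) | (hdr[5] << 8) | hdr[6];
    return 0;
}

int main(void) {
    uint8_t hdr[8];
    uint32_t vni;
    vxlan_encode(hdr, 5001);                      /* tenant network 5001 */
    if (vxlan_decode(hdr, &vni) == 0)
        printf("decoded VNI = %u (flags=0x%02x vni bytes=%02x %02x %02x)\n",
               vni, hdr[0], hdr[4], hdr[5], hdr[6]);
    return 0;
}
```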
14. § 7.13 — IPv6
IPv6 uses 128-bit addresses and eliminates broadcast in favor of multicast. The Neighbor Discovery Protocol (NDP) replaces ARP; Router Advertisements (RA) distribute prefixes so hosts can self-configure via SLAAC without a DHCP server.
IPv6 Address Types & Scope
SLAAC — Stateless Address Autoconfiguration
| NDP Message | ICMPv6 Type | Purpose |
|---|---|---|
| Router Solicitation (RS) | 133 | Host → ff02::2 asking for prefix + gateway info |
| Router Advertisement (RA) | 134 | Router → ff02::1 advertising prefix, MTU, default route; triggers SLAAC |
| Neighbor Solicitation (NS) | 135 | Who has this IP? (like ARP request); also used for DAD |
| Neighbor Advertisement (NA) | 136 | I have this IP (like ARP reply); Source Link-Layer Address (SLLA) option carries MAC |
| Redirect | 137 | Router tells host about a better next-hop for a destination |
Minimal C Demo — EUI-64 Interface ID from MAC
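A sketch of the EUI-64 derivation used by SLAAC: flip the universal/local bit of the first MAC octet and insert FF:FE in the middle to form the 64-bit interface ID. The MAC address below is an arbitrary example.

```c
/* EUI-64 interface ID from a 48-bit MAC (SLAAC step: flip U/L bit, insert FF:FE). */
#include <stdio.h>
#include <stdint.h>

static void eui64_from_mac(const uint8_t mac[6], uint8_t iid[8]) {
    iid[0] = mac[0] ^ 0x02;        /* flip the universal/local bit */
    iid[1] = mac[1];
    iid[2] = mac[2];
    iid[3] = 0xFF;                 /* inserted FF:FE in the middle */
    iid[4] = 0xFE;
    iid[5] = mac[3];
    iid[6] = mac[4];
    iid[7] = mac[5];
}

int main(void) {
    uint8_t mac[6] = {0x52, 0x54, 0x00, 0x12, 0x34, 0x56};   /* example MAC */
    uint8_t iid[8];
    eui64_from_mac(mac, iid);
    /* e.g. prefix 2001:db8:1::/64 + this IID = the host's SLAAC address */
    printf("IID: %02x%02x:%02xff:fe%02x:%02x%02x\n",
           iid[0], iid[1], iid[2], iid[5], iid[6], iid[7]);
    return 0;
}
```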
15. § 7.14 — Multicast (IGMP & MLD)
Multicast delivers one stream to many receivers without per-receiver replication at the source. IGMP (IPv4) and MLD (IPv6) are the group management protocols that hosts use to join/leave groups. PIM-SM builds the distribution tree. The RPF check prevents routing loops by verifying that multicast traffic arrives on the correct upstream interface.
IGMP Join Flow (SSM — Source-Specific Multicast)
RPF Check — Loop Prevention
| Concept | Detail |
|---|---|
| IGMP v1/v2/v3 | v1: join only. v2: + leave (prune faster). v3: SSM — report specific source(s) to receive from |
| MLD v1/v2 | IPv6 multicast listener discovery — MLDv2 adds SSM like IGMPv3; MLDv1 uses ICMPv6 types 130–132, the MLDv2 report is type 143 |
| PIM-SM (Sparse Mode) | Builds shared tree via Rendezvous Point (RP), then optionally switches to source tree (SPT) |
| Shared tree (*,G) vs source tree (S,G) | (*,G) = any source for group G — shared RP tree. (S,G) = specific source S — lowest latency, shortest path |
| RPF interface | The interface that unicast routing would use to reach the multicast source — packet must arrive here |
| mfc_cache | Multicast Forwarding Cache entry: (src, group) → OIF list (output interfaces) |
| IGMP snooping | Switch learns group membership from IGMP frames — prevents flooding multicast to all ports |
Minimal C Demo — RPF Check
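A toy RPF check against a small static unicast table: a multicast packet from source S arriving on interface I is accepted only if the unicast route back to S also points at I. Interface names and prefixes are invented for illustration.

```c
/* Reverse Path Forwarding check: route lookup on the *source* address,
 * compare the expected output interface with the actual ingress interface. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct route { uint32_t prefix, mask; const char *oif; };

static const struct route rt[] = {
    { 0x0A010000, 0xFFFF0000, "eth0" },   /* 10.1.0.0/16 via eth0 */
    { 0x0A020000, 0xFFFF0000, "eth1" },   /* 10.2.0.0/16 via eth1 */
    { 0x00000000, 0x00000000, "eth0" },   /* default via eth0 (last = least specific) */
};

static const char *unicast_lookup(uint32_t dst) {
    for (size_t i = 0; i < sizeof rt / sizeof rt[0]; i++)
        if ((dst & rt[i].mask) == rt[i].prefix) return rt[i].oif;
    return NULL;
}

static int rpf_check(uint32_t src, const char *iif) {
    const char *expect = unicast_lookup(src);
    return expect && strcmp(expect, iif) == 0;     /* 1 = accept, 0 = drop */
}

int main(void) {
    uint32_t src = 0x0A020005;                     /* source 10.2.0.5 */
    printf("src 10.2.0.5 in on eth1: %s\n",
           rpf_check(src, "eth1") ? "accept" : "drop");
    printf("src 10.2.0.5 in on eth0: %s\n",
           rpf_check(src, "eth0") ? "accept" : "drop (RPF fail)");
    return 0;
}
```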
16. § 7.15 — DHCP Protocol & Proxy DHCP
DHCP (Dynamic Host Configuration Protocol) automates IP address assignment via a four-message exchange called DORA. In cloud VPC environments, virtual switches implement a DHCP proxy (代答) that intercepts DISCOVER packets and synthesizes replies from a local config database — eliminating dependency on a real DHCP server and giving the hypervisor full control over tenant addressing.
DHCP DORA Sequence (with Relay)
Virtual Switch DHCP Proxy (代答)
| DHCP Option | Code | Purpose |
|---|---|---|
| Subnet Mask | 1 | Netmask for assigned IP (e.g., 255.255.255.0) |
| Router (gateway) | 3 | Default gateway IP — must be set for off-subnet traffic |
| DNS Servers | 6 | Up to 8 DNS server IPs |
| Lease Time | 51 | Seconds until IP expires; use 0xFFFFFFFF for infinite (VPC) |
| DHCP Message Type | 53 | 1=DISCOVER 2=OFFER 3=REQUEST 5=ACK 6=NAK |
| Server Identifier | 54 | DHCP server IP — client uses this to select among multiple offers |
| Relay Agent Info | 82 | Sub-options: circuit-id (ingress port), remote-id (switch MAC) — added by relay |
| Mode | Flow | Address Source |
|---|---|---|
| Stateful (IA_NA) | Solicit → Advertise → Request → Reply | DHCPv6 server assigns full /128 |
| Stateless (SLAAC + options) | RA → SLAAC address + DHCPv6 for options only | Host self-configures via EUI-64; DHCPv6 supplies DNS/NTP |
Minimal C Demo — DHCP DORA State Machine
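A sketch of the client-side DORA state machine with the proxy's OFFER and ACK replies simulated in-line, so the trace simply shows the state transitions described above. The message-type codes follow DHCP option 53; the state names are the conventional client states.

```c
/* DHCP DORA state machine: INIT -> SELECTING -> REQUESTING -> BOUND. */
#include <stdio.h>

enum dhcp_state { INIT, SELECTING, REQUESTING, BOUND };
enum dhcp_msg   { NONE = 0, DISCOVER = 1, OFFER = 2, REQUEST = 3, ACK = 5, NAK = 6 };

static const char *state_name[] = { "INIT", "SELECTING", "REQUESTING", "BOUND" };

static enum dhcp_state step(enum dhcp_state s, enum dhcp_msg in, enum dhcp_msg *out) {
    *out = NONE;
    switch (s) {
    case INIT:       *out = DISCOVER; return SELECTING;        /* broadcast DISCOVER */
    case SELECTING:  if (in == OFFER) { *out = REQUEST; return REQUESTING; } break;
    case REQUESTING: if (in == ACK) return BOUND;               /* lease committed   */
                     if (in == NAK) return INIT;  break;
    case BOUND:      break;
    }
    return s;
}

int main(void) {
    enum dhcp_state s = INIT;
    enum dhcp_msg proxy_replies[] = { NONE, OFFER, ACK };        /* simulated proxy */
    for (int i = 0; i < 3 && s != BOUND; i++) {
        enum dhcp_msg out;
        enum dhcp_state next = step(s, proxy_replies[i], &out);
        printf("%-10s --(rx type %d / tx type %d)--> %s\n",
               state_name[s], proxy_replies[i], out, state_name[next]);
        s = next;
    }
    return 0;
}
```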
17. § 7.16 — Routing Protocols: RIP, OSPF, BGP
Dynamic routing protocols allow routers to exchange reachability information and converge on consistent forwarding tables. RIP is distance-vector (hop count); OSPF is link-state (Dijkstra SPF); BGP is path-vector (AS-path, policy-driven). In cloud infrastructure, BGP (via BIRD or FRR) is used for advertising VM IP prefixes during live migration, anycast VIP advertisement, and cross-datacenter routing.
| Protocol | Type | Metric | Algorithm | Scope |
|---|---|---|---|---|
| RIP v2 | Distance-vector | Hop count (max 15) | Bellman-Ford, periodic full-table broadcast every 30 s | Small LANs; obsolete in production |
| OSPF v2 | Link-state | Cost (bandwidth-based) | Dijkstra SPF; LSA flooding within area | Enterprise / datacenter IGP |
| BGP-4 | Path-vector | AS_PATH + policy attributes | Policy-driven best-path selection | Inter-AS Internet routing; datacenter BGP-only fabric |
OSPF Neighbor State Machine
OSPF Area Hierarchy
| LSA Type | Name | Flooded to | Purpose |
|---|---|---|---|
| Type 1 | Router LSA | Within area | Each router's links and their costs |
| Type 2 | Network LSA | Within area | DR-generated: lists all routers on multi-access segment |
| Type 3 | Summary LSA | Other areas | ABR advertises inter-area prefixes into backbone/areas |
| Type 4 | ASBR Summary LSA | Other areas | ABR advertises path to ASBR |
| Type 5 | AS-External LSA | Entire OSPF domain | ASBR redistributes external routes (blocked by Stub areas) |
| Type 7 | NSSA External LSA | NSSA area only | Local ASBR in NSSA; ABR translates to Type 5 at border |
BGP State Machine
BGP Best-Path Selection (in order)
| BGP Attribute | Type | Meaning |
|---|---|---|
| AS_PATH | Well-known mandatory | Ordered list of ASes the route has traversed — loop prevention + path length metric |
| NEXT_HOP | Well-known mandatory | IP of the next-hop router; iBGP peers must resolve this via IGP |
| LOCAL_PREF | Well-known discretionary | Set within an AS to prefer one exit point over another; higher wins |
| MED | Optional non-transitive | Multi-Exit Discriminator — hint to neighboring AS which entry to prefer; lower wins |
| COMMUNITY | Optional transitive | 32-bit tag for grouping routes; used for policy (e.g., blackhole, no-export) |
| ORIGIN | Well-known mandatory | Route origin: IGP (i) < EGP (e) < Incomplete (?) — lower preferred |
When a route becomes unreachable, RIP routers may advertise stale hop counts back to each other, incrementing forever until max (16 = infinity). Mitigations: split horizon (never advertise a route back out the interface it was learned on), poison reverse (advertise with metric 16 back on that interface — faster convergence), and triggered updates (send immediately on topology change rather than waiting 30 s).
During VM live migration, the destination hypervisor needs to attract traffic for the migrated VM's IP without waiting for ARP to age out on every ToR switch. The hypervisor runs a BIRD BGP daemon that announces the VM's /32 host route to the ToR via iBGP/eBGP. The ToR installs a more-specific /32 that overrides the /24 subnet route, redirecting traffic immediately. After migration completes, the source hypervisor withdraws its announcement.
Minimal C Demo — BGP Best-Path Selection
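A sketch of best-path comparison over a subset of the attributes listed above (LOCAL_PREF, AS_PATH length, ORIGIN, MED). The real decision process has more tie-breakers (weight, eBGP vs iBGP, IGP metric, router ID), and the candidate routes below are fabricated examples.

```c
/* BGP best-path selection over a few attributes, in decision order. */
#include <stdio.h>

struct bgp_route {
    const char *via;
    int local_pref;     /* higher wins                               */
    int as_path_len;    /* shorter wins                              */
    int origin;         /* 0=IGP 1=EGP 2=Incomplete; lower wins      */
    int med;            /* lower wins (same neighbor AS assumed)     */
};

/* return <0 if a is better, >0 if b is better */
static int bgp_compare(const struct bgp_route *a, const struct bgp_route *b) {
    if (a->local_pref  != b->local_pref)  return b->local_pref - a->local_pref;
    if (a->as_path_len != b->as_path_len) return a->as_path_len - b->as_path_len;
    if (a->origin      != b->origin)      return a->origin - b->origin;
    return a->med - b->med;
}

int main(void) {
    struct bgp_route routes[] = {
        { "peer A", 100, 3, 0, 10 },
        { "peer B", 200, 5, 0, 50 },   /* higher LOCAL_PREF beats shorter AS_PATH */
        { "peer C", 200, 4, 2, 5  },   /* same LOCAL_PREF, shorter AS_PATH wins   */
    };
    const struct bgp_route *best = &routes[0];
    for (size_t i = 1; i < sizeof routes / sizeof routes[0]; i++)
        if (bgp_compare(&routes[i], best) < 0) best = &routes[i];
    printf("best path via %s (local_pref=%d, as_path_len=%d)\n",
           best->via, best->local_pref, best->as_path_len);
    return 0;
}
```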
18. Kernel Source Pointers
| Concept | File | Key Function / Symbol |
|---|---|---|
| struct socket / sock | include/linux/net.h, include/net/sock.h | sock_alloc(), sk_alloc() |
| struct tcp_sock | include/linux/tcp.h | tcp_sk(sk) cast macro |
| Socket creation | net/socket.c | sys_socket() → __sock_create() → inet_create() |
| struct sk_buff | include/linux/skbuff.h | alloc_skb(), skb_put(), skb_push(), skb_pull(), skb_clone() |
| skb_shared_info | include/linux/skbuff.h | skb_shinfo(skb) macro → (skb_shared_info*)skb->end |
| GSO | net/core/gso.c | __skb_gso_segment(), skb_gso_reset() |
| GRO | net/core/gro.c | napi_gro_receive(), dev_gro_receive() |
| TX path | net/core/dev.c | dev_queue_xmit() → __dev_queue_xmit() → sch_direct_xmit() |
| RX path | net/core/dev.c | netif_receive_skb() → __netif_receive_skb() → deliver_skb() |
| NAPI | net/core/dev.c | napi_schedule(), net_rx_action(), napi_complete_done() |
| struct net_device | include/linux/netdevice.h | alloc_etherdev(), register_netdev() |
| TC qdisc | net/sched/sch_generic.c | qdisc_enqueue(), qdisc_dequeue_head() |
| HTB | net/sched/sch_htb.c | htb_enqueue(), htb_dequeue(), htb_charge_class() |
| fq_codel | net/sched/sch_fq_codel.c | fq_codel_enqueue(), codel_should_drop() |
| IP receive | net/ipv4/ip_input.c | ip_rcv(), ip_rcv_finish(), ip_local_deliver() |
| IP forward / output | net/ipv4/ip_forward.c, ip_output.c | ip_forward(), ip_output(), ip_finish_output2() |
| IP fragmentation | net/ipv4/ip_output.c, ip_fragment.c | ip_fragment(), ip_defrag(), struct ipq |
| FIB trie / LPM | net/ipv4/fib_trie.c | fib_table_lookup(), tnode_get_child_rcu() |
| Route lookup | net/ipv4/route.c | ip_route_input_noref(), ip_route_output_key(), alloc_cache_rt() |
| Neighbour / ARP | net/core/neighbour.c, net/ipv4/arp.c | neigh_lookup(), neigh_update(), arp_send() |
| TCP state machine | net/ipv4/tcp_input.c | tcp_rcv_state_process(), tcp_v4_rcv() |
| TCP handshake | net/ipv4/tcp_ipv4.c | tcp_v4_connect(), tcp_v4_syn_recv_sock() |
| SYN cookie | net/ipv4/syncookies.c | cookie_v4_check(), __cookie_v4_init_sequence() |
| Congestion control | net/ipv4/tcp_cong.c, tcp_cubic.c | tcp_cong_ops, bictcp_cong_avoid() |
| UDP receive | net/ipv4/udp.c | udp_rcv(), udp4_lib_rcv(), udp_queue_rcv_skb() |
| Netfilter hooks | net/netfilter/core.c | nf_hook_slow(), nf_register_net_hook() |
| Connection tracking | net/netfilter/nf_conntrack_core.c | nf_conntrack_in(), init_conntrack(), __nf_ct_refresh_acct() |
| NAT | net/netfilter/nf_nat_core.c | nf_nat_packet(), nf_nat_get_unique_tuple(), nf_nat_setup_info() |
| iptables / nftables | net/ipv4/netfilter/ip_tables.c, net/netfilter/nf_tables_core.c | ipt_do_table(), nft_do_chain() |
| Network bridge | net/bridge/br.c, br_fdb.c, br_forward.c | br_handle_frame(), br_fdb_update(), br_forward() |
| STP | net/bridge/br_stp.c | br_stp_enable_port(), br_received_config_bpdu() |
| VXLAN | drivers/net/vxlan/vxlan_core.c | vxlan_rcv(), vxlan_xmit(), vxlan_fdb_find() |
| GRE tunnel | net/ipv4/ip_gre.c | ipgre_rcv(), ipgre_xmit() |
| IPv6 receive | net/ipv6/ip6_input.c | ipv6_rcv(), ip6_rcv_finish() |
| NDP / RA | net/ipv6/ndisc.c | ndisc_recv_ra(), ndisc_recv_ns(), addrconf_dad_start() |
| SLAAC / addrconf | net/ipv6/addrconf.c | addrconf_prefix_rcv(), ipv6_generate_eui64() |
| IGMP | net/ipv4/igmp.c | igmp_rcv(), ip_mc_join_group(), igmpv3_sendpack() |
| MLD | net/ipv6/mcast.c | mld_rcv(), ipv6_sock_mc_join() |
| Multicast forwarding | net/ipv4/ipmr.c | ip_mr_forward(), ipmr_cache_find(), mfc_cache_put() |
| OSPF (user-space) | FRR: ospfd/ospf_interface.c, BIRD: proto/ospf/ | ospf_hello_send(), ospf_spf_calculate(), ospf_lsa_install() |
| BGP (user-space) | FRR: bgpd/bgp_best.c, BIRD: proto/bgp/ | bgp_best_selection(), bgp_process(), bgp_update_main() |
| Netlink RIB install | net/core/rtnetlink.c | rtnl_newroute(), fib_new_table(), ip_rt_ioctl() |
| DHCP (user-space) | isc-dhcp client, dhcpcd, systemd-networkd | Sends DISCOVER via raw socket (AF_PACKET), listens on udp/68 |
19. Interview Prep
head/end bound the allocated buffer (never move). data points to the start of current packet data; tail to the end. Headroom (head→data) is reserved for TX header prepending via skb_push(). skb_shared_info lives at end for scatter-gather fragments and GSO metadata.
NAPI switches from per-packet hard IRQs to a polling model under load. The first packet triggers a hard IRQ; the handler disables NIC RX interrupts and schedules a softirq poll. The poll processes up to budget packets per call. When the ring drains, the NIC interrupt is re-armed. This amortizes interrupt overhead: at 14 Mpps, per-packet IRQs would consume 100% CPU in interrupt context.
write()/sendmsg() → tcp_sendmsg() (segments, adds TCP header) → ip_queue_xmit() (IP header + route lookup) → dev_queue_xmit() (enqueue to qdisc) → qdisc dequeues → ndo_start_xmit() (driver fills TX ring descriptor) → NIC DMA reads descriptor + payload → TX complete interrupt frees skb.
skb_clone() creates a new sk_buff header pointing to the same data buffer (users++ on shared_info). Zero cost but cloned skb data is read-only. skb_copy() allocates a completely new buffer and copies all data — required before in-place modification (e.g., NAT IP rewrite). pskb_copy() copies the linear area but shares frags[].
HTB builds a class hierarchy. Each leaf class has a rate (guaranteed) and ceil (burst). The dequeue path walks the hierarchy, selects the class with the earliest eligible packet by virtual time (deficit round-robin at each level). A class below rate borrows tokens from its parent up to ceil when the parent has surplus. This ensures minimum guaranteed bandwidth while allowing burst into idle capacity.
ip_rcv_finish() calls ip_route_input_noref() which performs a FIB lookup. If the destination matches a local address the dst.input is set to ip_local_deliver(). If IP forwarding is enabled and the route is to a remote host, dst.input is ip_forward(). ip_forward() decrements TTL, checks MTU (fragments if needed), then calls ip_output() to re-enter the TX path.
LPM (Longest Prefix Match) returns the routing entry whose prefix is the most specific match for a given destination IP. Linux uses an LC-trie (level-compressed trie) called fib_trie. Internal tnodes index into child arrays using multi-bit slices of the address; leaves hold fib_alias entries. Compression reduces the average lookup depth from 32 to 4–6 steps even with 800K BGP routes.
NUD (Neighbour Unreachability Detection) tracks IP→MAC cache validity: INCOMPLETE (waiting for reply) → REACHABLE (confirmed) → STALE (timer expired) → DELAY (wait for upper-layer confirmation) → PROBE (send unicast ARP) → FAILED (drop). Gratuitous ARP is an ARP request where sender IP = target IP — it announces IP ownership, updates neighbour caches on peers, and detects IP conflicts. Used heavily for VIP failover in virtual switches.
Client: tcp_v4_connect() sends SYN, state→SYN_SENT. Server: tcp_v4_rcv()→tcp_rcv_state_process() receives SYN in LISTEN, allocates request_sock, sends SYN-ACK, state→SYN_RECEIVED. Client receives SYN-ACK, sends ACK, state→ESTABLISHED. Server: receives ACK, calls tcp_v4_syn_recv_sock()→tcp_create_openreq_child(), state→ESTABLISHED, wakes accept().
SYN cookies encode all connection state (src/dst IP, ports, MSS index, timestamp) in the initial sequence number. The server sends SYN-ACK without allocating any backlog entry. When the client's ACK arrives, the server decodes ack_num-1 and verifies the hash — if valid, creates the socket. This eliminates the SYN backlog as an attack surface. Tradeoff: TCP options like SACK and window scaling may not be negotiated, slightly reducing throughput.
Slow start: begin with cwnd=1 MSS, double cwnd each RTT (exponential growth) until cwnd reaches ssthresh, then switch to congestion avoidance (+1 MSS/RTT, linear). Fast retransmit: 3 duplicate ACKs indicate a lost segment without waiting for RTO; retransmit immediately, set ssthresh=cwnd/2, cwnd=ssthresh (fast recovery). RTO timeout is more severe: ssthresh=cwnd/2 and cwnd=1, restarting slow start.
Forwarded path: PREROUTING (conntrack, DNAT) → routing decision → FORWARD (firewall rules) → POSTROUTING (SNAT/masquerade). For locally destined packets: PREROUTING → INPUT → local socket. For locally generated packets: OUTPUT → routing → POSTROUTING. nftables and iptables both register callbacks at these same NF_INET_* hook points.
nf_nat_get_unique_tuple() tries the original port first; if it is already used by another conntrack entry (same dst IP:port), it iterates through the configured port range (1024–65535 by default) hashing through the range until it finds one where no nf_conn with the same (proto, nat_ip, nat_port, dst_ip, dst_port) exists. The result is stored in nf_conn so every subsequent packet in the same flow reuses it.
Every frame received on a bridge port triggers br_fdb_update() which looks up (or creates) an FDB entry keyed on src MAC, records the ingress port, and resets the ageing timer (default 300 s). For forwarding, br_forward() does an FDB lookup on dst MAC — if found, unicast; if not (unknown unicast, broadcast, or multicast), br_flood() copies the skb to all ports except the ingress. STP marks ports as forwarding/blocking to prevent loops.
VXLAN (RFC 7348) wraps an inner Ethernet frame in an 8-byte VXLAN header (flags + 24-bit VNI) then UDP + IP. UDP is used because: (1) the UDP src port is set to a hash of the inner 5-tuple, enabling ECMP load balancing across LAG/ECMP paths that hash on UDP ports — GRE lacks a src port and cannot ECMP. (2) UDP checksum can be disabled (inner layers handle their own). VTEP overhead is 50 bytes, reducing inner MTU from 1500 to 1450 bytes.
SLAAC (RFC 4862): (1) Host sends Router Solicitation to ff02::2. (2) Router replies with RA containing prefix (e.g., 2001:db8:1::/64) + valid/preferred lifetimes. (3) Host generates interface ID using EUI-64 from MAC (flip U/L bit, insert FF:FE) or privacy extensions (RFC 8981). (4) Combine: prefix + IID = full /128 address. (5) DAD: send NS to the solicited-node multicast address; if no NA arrives within the retransmit interval (≈1 s), the address is unique and assigned.
Reverse Path Forwarding check: when a multicast packet arrives on interface I from source S, the kernel performs a unicast route lookup for S. If the route to S would exit via a different interface than I, the packet is dropped. This prevents routing loops: in a loop, a multicast packet could bounce between routers indefinitely; RPF ensures the packet arrived from the correct upstream (towards the source tree). Implemented in ipmr.c for IPv4, ip6mr.c for IPv6.
IGMPv2: hosts report group membership (join *,G — any source) and send Leave Group messages, triggering Last Member Query before removing state. IGMPv3 (RFC 3376) adds Source-Specific Multicast (SSM): hosts specify which sources they want to receive from (INCLUDE mode) or exclude (EXCLUDE mode). This allows the network to build source-rooted shortest-path trees (S,G) instead of shared RP trees, eliminating the Rendezvous Point bottleneck and enabling the kernel to drop traffic from unwanted sources at the router before it reaches the receiver.
When a VM boots and broadcasts a DHCP DISCOVER, the vSwitch intercepts it on the VM's virtual port before forwarding. It looks up the VM's MAC (or port ID) in a config database pre-populated by the VPC control plane with the assigned IP, mask, gateway, DNS, and lease time. The vSwitch synthesizes a DHCP OFFER and sends it directly to the VM without involving a real DHCP server. This gives the hypervisor full control of addressing (critical for VPC isolation), eliminates the single-point-of-failure DHCP server, and allows infinite lease times so VMs never try to renew.
RIP: distance-vector, hop-count metric (max 15), 30-second full-table broadcast, slow convergence, count-to-infinity risk. Only suitable for small flat networks. OSPF: link-state, Dijkstra SPF, fast convergence (sub-second with BFD), area hierarchy scales to large enterprise. The preferred IGP. BGP: path-vector, AS_PATH prevents loops, policy-driven via LOCAL_PREF/MED/COMMUNITY, carries the full Internet table (800K+ routes). Used between ISPs, within large DCs (BGP-only fabric), and for advertising VM host routes during live migration via BIRD/FRR.
Down → Init (Hello received, neighbor's Router-ID not yet in our Hello) → 2-Way (our Router-ID appears in neighbor's Hello — basic bidirectional) → ExStart (master/slave negotiated via DBD, higher Router-ID becomes master) → Exchange (exchange full DBD summaries) → Loading (request missing LSAs via LSR, receive LSU) → Full (databases synchronized — neighbor is now an SPF input). On multi-access Ethernet, only DR/BDR reach Full with all routers; DROthers stay in 2-Way with each other.
1. Highest WEIGHT (Cisco-local, not advertised). 2. Highest LOCAL_PREF (intra-AS policy — which exit AS to prefer). 3. Locally originated (network/aggregate > iBGP learned). 4. Shortest AS_PATH (fewer AS hops). 5. Lowest ORIGIN (IGP < EGP < Incomplete). 6. Lowest MED (only compared among routes from same neighbor AS). 7. eBGP over iBGP. 8. Lowest IGP metric to NEXT_HOP. 9. Oldest eBGP route (stability). 10. Lowest BGP Router-ID (deterministic tiebreak). The winning route is installed in RIB and, if changed, triggers UPDATE messages to peers.
The destination hypervisor configures BIRD with a static /32 route for the migrating VM's IP pointing to the local vSwitch. BIRD's BGP session to the ToR switch advertises this /32 via iBGP or eBGP (depending on fabric design). The ToR's longest-prefix-match selects /32 over the existing /24 subnet route, redirecting all traffic to the new hypervisor immediately — without waiting for gratuitous ARP to propagate or neighbor caches to expire. When migration completes, the source hypervisor's BIRD withdraws its /32, and the /24 handles the now-unified traffic.