§ 7.1 – 7.16 Linux Network Stack — Sockets to BGP
Socket layer, sk_buff, NAPI (§7.1–7.4) · IP, routing, ARP, TCP/UDP (§7.5–7.9) · Netfilter/NAT, bridge, VXLAN, IPv6, multicast (§7.10–7.14) · DHCP proxy (§7.15) · RIP, OSPF, BGP (§7.16)
1. Overview
The Linux network stack is layered: the socket layer abstracts transport protocols (TCP/UDP) behind a VFS file interface; the sk_buff is the universal packet container that flows between every layer; the net_device abstraction decouples protocol code from hardware drivers; and NAPI (New API) switches from interrupt-driven to polling-driven processing under load, dramatically improving throughput at high PPS.
2. § 7.1 — Socket Layer
Abstraction Layers — fd → file → socket → sock → tcp_sock
A socket descriptor is just a Linux file descriptor. The VFS layer maps it through struct file → struct socket → struct sock. Protocol-specific state (TCP congestion window, sequence numbers) lives in struct tcp_sock, which extends struct sock by embedding it as its first member — so a simple pointer cast promotes between the two types.
| Struct | Key Fields | Purpose |
|---|---|---|
| struct socket | sock, ops, type, state | VFS-visible wrapper; holds proto_ops (inet_stream_ops) and pointer to sock |
| struct sock | sk_state, sk_sndbuf, sk_rcvbuf, sk_write_queue, sk_receive_queue, sk_prot | Protocol-independent socket state; send/receive buffers; wait queues |
| struct inet_sock | inet_saddr, inet_daddr, inet_sport, inet_dport | IPv4-specific addresses and ports, embedded in tcp_sock |
| struct tcp_sock | rcv_nxt, snd_nxt, snd_una, cwnd, ssthresh, retransmit_timer | Full TCP state machine and congestion control variables |
| proto_ops | bind, connect, accept, sendmsg, recvmsg, poll | VFS-level socket operations — dispatch table per address family |
| struct proto | sendmsg, recvmsg, connect, close, hash | Transport-level operations — tcp_prot / udp_prot |
Send/Receive Buffer Watermarks
Each socket has a bounded send buffer (sk_sndbuf) and receive buffer (sk_rcvbuf). When the write queue fills, tcp_sendmsg() blocks the caller (or returns EAGAIN on a non-blocking socket). When the receive queue exceeds sk_rcvbuf, new incoming segments are silently dropped — TCP flow control (receiver window advertisement) prevents this in steady state.
3. § 7.2 — sk_buff: The Core Packet Container
Memory Layout
Every network packet in the kernel is represented by an sk_buff. A single contiguous data buffer is allocated (via kmalloc), and four pointers (head, data, tail, end) divide it into zones: headroom for headers prepended during TX (Ethernet → IP → TCP), data for the current payload, and tailroom for trailers; an skb_shared_info struct sits at the very end of the buffer for paged fragments (scatter-gather I/O) and GSO metadata.
Pointer Operations
| Function | What moves | When used |
|---|---|---|
| skb_reserve(skb, n) | data += n, tail += n | Create headroom before any data is written (TX path, before filling payload) |
| skb_put(skb, n) | tail += n → return old tail | Append n bytes of payload to the end (caller fills returned pointer) |
| skb_push(skb, n) | data -= n → return new data | Prepend n-byte header into headroom (TCP → IP → Ethernet on TX) |
| skb_pull(skb, n) | data += n | Strip n-byte header from front (Ethernet header consumed on RX, then IP, then TCP) |
Clone vs Copy
skb_clone() creates a new header struct pointing to the same data buffer — zero cost for forwarding a packet to multiple consumers (e.g., packet sniffer + routing). The clone is read-only for the data area. skb_copy() allocates a fresh buffer and copies everything — required before modifying packet bytes (e.g., NAT rewrite).
GSO & GRO
| Feature | Direction | What it does |
|---|---|---|
| TSO (TCP Segmentation Offload) | TX | NIC splits one large skb into MTU-sized segments — CPU never touches per-segment headers |
| GSO (Generic Segmentation Offload) | TX | Software fallback when NIC lacks TSO — kernel splits at the last moment before the driver |
| GRO (Generic Receive Offload) | RX | Coalesce many small TCP segments into one large skb in NAPI poll — reduces per-packet overhead |
| LRO (Large Receive Offload) | RX | Hardware coalescing — deprecated; GRO is preferred (GRO is protocol-aware) |
Minimal C Demo — sk_buff Pointer Manipulation
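A minimal user-space sketch of the four-pointer scheme: the struct and helpers below mimic the kernel's skb_reserve/skb_put/skb_push/skb_pull arithmetic, but they are toy re-implementations for illustration, not kernel code.

```c
/* Toy model of sk_buff pointer arithmetic — names mirror the kernel API,
 * but this is plain user-space C, not the real struct sk_buff. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct skb {
    unsigned char *head, *data, *tail, *end;
};

static struct skb *skb_alloc(size_t size) {
    struct skb *s = malloc(sizeof(*s));
    s->head = s->data = s->tail = malloc(size);
    s->end  = s->head + size;
    return s;
}
static void skb_reserve(struct skb *s, size_t n) { s->data += n; s->tail += n; }
static unsigned char *skb_put(struct skb *s, size_t n) {   /* append payload */
    unsigned char *old = s->tail; s->tail += n; assert(s->tail <= s->end); return old;
}
static unsigned char *skb_push(struct skb *s, size_t n) {  /* prepend header */
    s->data -= n; assert(s->data >= s->head); return s->data;
}
static unsigned char *skb_pull(struct skb *s, size_t n) {  /* strip header */
    s->data += n; assert(s->data <= s->tail); return s->data;
}

int main(void) {
    struct skb *s = skb_alloc(2048);
    skb_reserve(s, 64);                         /* headroom for Eth+IP+TCP */
    memcpy(skb_put(s, 100), "payload...", 10);  /* 100-byte payload area   */
    skb_push(s, 20);                            /* TCP header              */
    skb_push(s, 20);                            /* IP header               */
    skb_push(s, 14);                            /* Ethernet header         */
    printf("headroom=%ld len=%ld tailroom=%ld\n",
           (long)(s->data - s->head), (long)(s->tail - s->data),
           (long)(s->end - s->tail));
    skb_pull(s, 14);                            /* RX: consume Ethernet    */
    printf("after pull: len=%ld\n", (long)(s->tail - s->data));
    return 0;
}
```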
4. § 7.3 — Network Device Layer & NAPI
TX Path — write() to NIC Ring
On the TX side the kernel walks a layered dispatch chain: the socket send path calls tcp_sendmsg(), which segments the data and calls ip_queue_xmit() to add the IP header and perform a route lookup. dev_queue_xmit() then hands the skb to the queueing discipline (qdisc) for rate control and shaping. The qdisc dequeues and calls the driver's ndo_start_xmit(), which fills a TX ring descriptor and signals the NIC.
RX Path — NIC Interrupt to Socket Buffer
On receive, the NIC DMA-writes the frame into a pre-allocated ring buffer and raises a hard IRQ. The IRQ handler does the minimum: disable NIC RX interrupt, call napi_schedule() to register a poll callback, and return. The actual packet processing happens in a softirq (NET_RX_SOFTIRQ) outside interrupt context, up to a configurable budget — avoiding the per-packet interrupt overhead that collapses throughput at high PPS.
Core Mechanism — NAPI Budget & Interrupt Coalescing
- First packet: NIC fires hard IRQ → IRQ handler disables NIC RX interrupt + calls napi_schedule()
- Softirq net_rx_action() runs on the same CPU, calls napi.poll(quota)
- Driver processes up to quota descriptors per poll — builds one sk_buff per packet
- If work < quota → ring drained → call napi_complete_done() → re-enable NIC interrupt → back to interrupt-driven
- If work == quota → ring may still have more → softirq re-schedules poll (no interrupt re-arm)
| Round | napi_poll(16) returns | Ring left | Action |
|---|---|---|---|
| 1 | 16 | 24 | work == quota → reschedule poll |
| 2 | 16 | 8 | work == quota → reschedule poll |
| 3 | 8 | 0 | work < quota → napi_complete_done() → re-arm IRQ |
| net_device field | Type | Purpose |
|---|---|---|
| name[IFNAMSIZ] | char[] | Interface name (eth0, ens3, lo) |
| dev_addr[MAX_ADDR_LEN] | unsigned char[] | MAC address |
| mtu | unsigned int | Maximum transmission unit (default 1500) |
| features | netdev_features_t | Offload capability flags: NETIF_F_TSO, NETIF_F_GRO, NETIF_F_HW_CSUM |
| netdev_ops | struct net_device_ops * | Driver operations: ndo_open, ndo_stop, ndo_start_xmit, ndo_get_stats64 |
| napi_list | struct list_head | List of napi_struct — one per queue (multi-queue NICs have one per TX/RX queue pair) |
| tx_queue_len | unsigned long | TX qdisc queue depth limit |
| num_tx_queues / num_rx_queues | unsigned int | Multi-queue NIC: RSS spreads RX across multiple queues, one per CPU |
Minimal C Demo — NAPI Budget Simulation
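A toy simulation of the budget loop described above, using the same numbers as the table (quota of 16, a ring already holding 40 packets). The napi_poll() function here is a stand-in for the driver's poll callback, not real driver code.

```c
/* NAPI budget simulation: drain a ring in quota-sized rounds; only re-arm
 * the interrupt when a round does less work than its quota. */
#include <stdio.h>

#define QUOTA 16

static int napi_poll(int *ring, int quota) {
    int work = (*ring < quota) ? *ring : quota;   /* process up to quota packets */
    *ring -= work;
    return work;
}

int main(void) {
    int ring = 40;                 /* packets already DMA'd into the RX ring */
    int round = 1, irq_armed = 0;
    while (!irq_armed) {
        int work = napi_poll(&ring, QUOTA);
        printf("round %d: work=%d ring_left=%d\n", round++, work, ring);
        if (work < QUOTA) {
            irq_armed = 1;         /* napi_complete_done(): re-enable NIC IRQ */
            printf("ring drained -> re-arm interrupt\n");
        }                          /* else: softirq reschedules the poll */
    }
    return 0;
}
```

Running it reproduces the three rounds from the table: 16, 16, then 8 packets, after which the interrupt is re-armed.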
5. § 7.4 — Traffic Control (TC) & QoS
Linux Traffic Control (TC) inserts a queueing discipline (qdisc) between the network layer and the NIC driver. Every packet leaving via dev_queue_xmit() passes through the root qdisc. TC enables per-VM bandwidth limiting in cloud hypervisors, ingress policing, and fair queueing between flows — all without modifying application code.
TC Pipeline
Key Qdiscs
| Qdisc | Algorithm | Best for |
|---|---|---|
| pfifo_fast | Three-band FIFO, TOS-based priority | Default; lowest overhead; no rate control |
| tbf (Token Bucket Filter) | Token bucket — tokens accumulate at rate R; burst up to B bytes | Simple rate limiting (e.g., cap a VM to 1 Gbps) |
| htb (Hierarchical Token Bucket) | Class hierarchy with rate + ceil; excess lent to children | Per-VM bandwidth guarantee + burst borrowing in cloud hypervisors |
| fq_codel | Fair Queue + CoDel AQM; per-flow FIFO + delay-based drop | ISP edge; eliminates bufferbloat; default on many Linux routers |
| fq (Fair Queue) | Per-flow pacing using TCP pacing rate; reduces RTT variance | High-throughput servers with many TCP flows |
| netem | Adds delay / jitter / loss / reorder / corruption | Network emulation in test environments |
HTB — Per-VM Bandwidth Limiting
HTB is the standard qdisc for per-tenant bandwidth limiting in cloud hypervisors. Each VM gets an HTB class with a guaranteed rate and an optional burst ceil. When a VM is below its rate it may borrow unused tokens from sibling classes up to ceil — ensuring work-conserving behaviour without permanent starvation.
Scenario: guarantee a VM rate = 400 Mbps and allow burst to ceil = 1 Gbps when the link is idle.
- Create root htb qdisc on eth0: tc qdisc add dev eth0 root handle 1: htb default 10
- Add per-VM class: tc class add dev eth0 parent 1: classid 1:1 htb rate 400mbit ceil 1000mbit
- Attach leaf qdisc (fq_codel for AQM): tc qdisc add dev eth0 parent 1:1 handle 10: fq_codel
- Attach filter to steer VM packets to the class: tc filter add dev eth0 ... match ip src 10.0.0.5 classid 1:1
| struct Qdisc field | Type | Purpose |
|---|---|---|
| ops | struct Qdisc_ops * | enqueue, dequeue, peek, init, reset, destroy callbacks |
| q | struct qdisc_skb_head | Internal packet queue (or class structure for classful qdiscs) |
| rate_est | struct net_rate_estimator * | Rate estimation for tc stats |
| handle | u32 | TC handle — major:minor, 16 bits each (e.g. 1:0 = root, 1:1 = first class) |
| parent | u32 | Parent handle (0 for root) |
| limit | u32 | Max queue depth (for pfifo-like qdiscs) |
6. § 7.5 — IP Layer
The IP layer sits between transport protocols (TCP/UDP) and the device layer. Every packet — locally generated or forwarded — passes through ip_rcv() on the receive side. The routing subsystem attaches a dst_entry to each sk_buff that decides whether the packet goes to a local socket (ip_local_deliver) or is forwarded to another host (ip_forward).
IP Header — Key Fields
| Field | Size | Purpose |
|---|---|---|
| version / IHL | 4b / 4b | Version=4; IHL = header length in 32-bit words (min 5 = 20 bytes) |
| DSCP / ECN | 6b / 2b | Differentiated services (QoS); ECN = explicit congestion notification |
| total length | 16b | Header + payload in bytes (max 65535) |
| identification | 16b | Fragment group ID — all fragments of same datagram share this value |
| flags / frag offset | 3b / 13b | DF (don't fragment), MF (more fragments); offset in 8-byte units |
| TTL | 8b | Decremented each hop; reaches 0 → ICMP TTL Exceeded + drop |
| protocol | 8b | Next header: 6=TCP, 17=UDP, 1=ICMP, 89=OSPF |
| header checksum | 16b | One's complement of IP header; recomputed each hop (TTL changed) |
| src / dst address | 32b each | Source and destination IPv4 addresses |
IP Fragmentation & Reassembly
When a packet exceeds path MTU, the kernel fragments it or drops and sends ICMP. Modern stacks prefer PMTU discovery — set DF=1, let the network signal the bottleneck MTU via ICMP Frag Needed, resend smaller. Fragmentation is expensive: it stresses stateful firewalls and can block the reassembly queue. The kernel reassembles at the destination using per-flow fragment queues (struct ipq) with a 30-second GC timeout.
7. § 7.6 — Routing Subsystem
Linux stores its routing table in a FIB (Forwarding Information Base) implemented as a level-compressed trie (fib_trie). Every route lookup calls fib_lookup(), which traverses the trie from the most-significant bit, always choosing the longest matching prefix.
FIB Trie — Longest Prefix Match
Route Lookup Path
| Struct / Symbol | Purpose |
|---|---|
| struct fib_table | One routing table; Linux has main + local by default (255 max) |
| struct fib_info | Nexthop: gateway IP, output device, priority, scope |
| struct rtable | Per-packet route cache: dst_entry + rt_gateway + rt_iif |
| struct dst_entry | Embedded in rtable; holds .output() and .input() function pointers |
| fib_lookup() | Traverse fib_trie → return fib_result (route type + nexthop) |
| ip rule (policy routing) | Match src/dst/tos → select which FIB table to consult |
Core Mechanism — LPM Walkthrough
| Step | Node | Prefix check | Action |
|---|---|---|---|
| 1 | root | 0.0.0.0/0 | match — record candidate |
| 2 | tnode /8 | 10.0.0.0/8 | match — better candidate |
| 3 | tnode /16 | 10.1.0.0/16 | match — better candidate |
| 4 | leaf /24 | 10.1.2.0/24 | match — longest, return nexthop 10.1.2.1 |
Minimal C Demo — LPM Route Lookup
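A sketch of longest-prefix match over the four routes from the walkthrough above. It uses a linear scan rather than the kernel's LC-trie; the prefixes, destination, and nexthop strings are the example values from the table.

```c
/* Longest-prefix-match over a small static table — the matching rule is the
 * same as fib_trie's, only the data structure differs (linear scan vs trie). */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

struct route { const char *prefix; int len; const char *nexthop; };

static uint32_t ip4(const char *s) {
    struct in_addr a; inet_pton(AF_INET, s, &a); return ntohl(a.s_addr);
}
static uint32_t netmask(int len) { return len ? 0xFFFFFFFFu << (32 - len) : 0; }

int main(void) {
    struct route tbl[] = {
        { "0.0.0.0",  0,  "default gw" },
        { "10.0.0.0", 8,  "10.0.0.1"   },
        { "10.1.0.0", 16, "10.1.0.1"   },
        { "10.1.2.0", 24, "10.1.2.1"   },
    };
    uint32_t dst = ip4("10.1.2.7");
    const struct route *best = NULL;
    for (size_t i = 0; i < sizeof tbl / sizeof tbl[0]; i++) {
        uint32_t m = netmask(tbl[i].len);
        if ((dst & m) == (ip4(tbl[i].prefix) & m))   /* prefix matches        */
            if (!best || tbl[i].len > best->len)     /* keep the longest one  */
                best = &tbl[i];
    }
    if (best)
        printf("10.1.2.7 -> %s/%d via %s\n", best->prefix, best->len, best->nexthop);
    return 0;
}
```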
8. § 7.7 — Neighbor Subsystem (ARP/NDP)
Before a packet can leave via the NIC, the kernel must know the next-hop's MAC address. The neighbour subsystem caches IP→MAC mappings in a per-device hash table. ARP (IPv4) and NDP (IPv6) are the discovery protocols. Their lifecycle is managed by the NUD state machine (Neighbour Unreachability Detection) which periodically confirms that cached entries are still valid.
NUD State Machine
ARP Request / Reply Flow
| Concept | Detail |
|---|---|
| struct neighbour | One entry per (IP, dev); holds MAC, NUD state, timer, queue of skbs pending resolution |
| neigh_table | Per-protocol hash table (arp_tbl / nd_tbl); LRU eviction when gc_thresh3 reached |
| Gratuitous ARP (GARP) | ARP request where sender IP = target IP; announces ownership; used for failover + IP conflict detection |
| Proxy ARP | Kernel answers ARP on behalf of another host; used in virtual switches (VM behind NAT gateway) |
| NDP Neighbor Solicitation | IPv6 ARP equivalent; multicast to solicited-node address ff02::1:ff<last 24b of target IP> |
| NDP Router Advertisement | Router advertises prefix + gateway; hosts use SLAAC to derive IPv6 address (EUI-64 + DAD) |
9. § 7.8 — TCP/IP Protocol Deep Dive
TCP provides reliable, ordered, byte-stream delivery over unreliable IP. Its complexity comes from three interlocking mechanisms: a state machine tracking connection lifecycle, a sliding window providing flow control, and congestion control preventing the sender from overwhelming the network.
TCP State Machine
3-Way Handshake
| Kernel Function | Role in Handshake |
|---|---|
| tcp_v4_connect() | Client: build SYN, set SYN_SENT, start connect timer |
| tcp_rcv_state_process() | Server: receive SYN in LISTEN → allocate request_sock → send SYN-ACK → SYN_RECEIVED |
| tcp_v4_syn_recv_sock() | Server: receive ACK → tcp_create_openreq_child() → ESTABLISHED → wake accept() |
| tcp_fastopen_create_child() | TFO: send data with SYN to skip 1 RTT on repeat connections |
Congestion Control
| Algorithm | Key idea | Default? |
|---|---|---|
| CUBIC | cwnd grows as cubic function of time since last loss; fast recovery on high-BDP links | Yes (since 2.6.19) |
| BBR | Model-based: estimate BtlBw + min RTT; pace at BtlBw rate — doesn't wait for loss | No — opt-in |
| RENO | Classic halve cwnd on loss, linear growth; basis for all others | Fallback |
SYN Cookie — Defending Against SYN Floods
Each half-open connection normally consumes a struct request_sock in the SYN backlog. When the backlog fills, legitimate SYNs are dropped. SYN cookies encode all connection state in the ISN, eliminating the backlog entry.
Key TCP Mechanisms
| Mechanism | How it works |
|---|---|
| Nagle Algorithm | Delay small segments until ACK arrives or MSS-sized segment ready. Reduces chatty sends. Disable with TCP_NODELAY for latency-sensitive apps. |
| TCP_CORK | Hold data until uncorked or MSS full — application-controlled coalescing. Used by HTTP servers to bundle headers + body. |
| TCP Keepalive | After tcp_keepalive_time (default 2h) of idle, send keepalive probes. Close if no response. Detects dead peers behind stateful firewalls. |
| Zero Window Probe | When receiver advertises window=0, sender periodically probes to detect when window reopens, avoiding deadlock. |
| PMTU Discovery | Set DF=1; on ICMP Frag Needed, reduce MSS to the path MTU. Avoids fragmentation along the path. |
| RTT Estimation (Karn) | SRTT = 7/8*SRTT + 1/8*sample. Karn: never sample RTT from retransmitted segments — ambiguous which copy was ACKed. |
Minimal C Demo — TCP Congestion Control Simulation
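A scripted Reno-style cwnd trace illustrating slow start, congestion avoidance, fast retransmit/recovery, and RTO as described above. The loss events are hard-coded rather than measured, and cwnd/ssthresh are expressed in MSS units.

```c
/* Toy congestion-window trace: exponential slow start up to ssthresh,
 * linear congestion avoidance, halve on 3 dup-ACKs, restart on RTO. */
#include <stdio.h>

int main(void) {
    double cwnd = 1, ssthresh = 32;                /* in MSS units */
    for (int rtt = 1; rtt <= 16; rtt++) {
        if (rtt == 8) {                            /* scripted: 3 duplicate ACKs */
            ssthresh = cwnd / 2;
            cwnd = ssthresh;                       /* fast retransmit + fast recovery */
            printf("rtt %2d: dup-ACKs   cwnd=%4.1f ssthresh=%4.1f\n", rtt, cwnd, ssthresh);
            continue;
        }
        if (rtt == 13) {                           /* scripted: RTO timeout */
            ssthresh = cwnd / 2;
            cwnd = 1;                              /* back to slow start */
            printf("rtt %2d: RTO        cwnd=%4.1f ssthresh=%4.1f\n", rtt, cwnd, ssthresh);
            continue;
        }
        const char *phase = (cwnd < ssthresh) ? "slow-start" : "avoidance ";
        if (cwnd < ssthresh) {
            cwnd *= 2;                             /* exponential growth per RTT */
            if (cwnd > ssthresh) cwnd = ssthresh;
        } else {
            cwnd += 1;                             /* linear growth per RTT */
        }
        printf("rtt %2d: %s cwnd=%4.1f ssthresh=%4.1f\n", rtt, phase, cwnd, ssthresh);
    }
    return 0;
}
```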
10. § 7.9 — UDP
UDP adds port numbers and an optional checksum to IP, then delivers datagrams directly to the socket. No connection setup, no retransmission, no ordering. The receive path is primarily a 4-tuple hash lookup to find the right socket, followed by an skb enqueue.
| Feature | Detail |
|---|---|
| 4-tuple demux | udp4_lib_lookup(): hash(src IP, src port, dst IP, dst port) → linked list of matching sockets |
| Multicast delivery | ip_check_mc_rcu(): fan-out to all sockets joined to the group; IGMP controls group membership at the router |
| UDP checksum | Optional in IPv4 (0 = disabled); mandatory in IPv6. Offloaded via NETIF_F_HW_CSUM on capable NICs |
| RCVBUF overflow | sk_rcvbuf limit: if receive queue full, skb silently dropped — no retransmission unlike TCP |
| SO_REUSEPORT | Multiple sockets bind same port; kernel distributes by hash — enables multi-threaded UDP servers without lock contention |
| UDP GRO | Coalesce UDP packets from same flow into one skb — reduces per-packet overhead for QUIC/WireGuard tunnels |
11. § 7.10 — Netfilter & NAT
Netfilter is Linux's packet filtering framework. It inserts five hook points on the IP packet path where kernel modules (iptables, nftables, conntrack) can inspect or modify packets. Connection tracking (nf_conntrack) maintains a state table so that stateful firewalls and NAT can match reply packets to their original flows.
Hook Points on the Packet Path
Connection Tracking States
| nf_conn field | Type | Purpose |
|---|---|---|
| tuplehash[IP_CT_DIR_ORIGINAL] | struct nf_conntrack_tuple_hash | Original direction tuple: src/dst IP + port + proto |
| tuplehash[IP_CT_DIR_REPLY] | struct nf_conntrack_tuple_hash | Reply direction tuple — reverse of original; used to match reply packets |
| status | unsigned long | Bitfield: IPS_CONFIRMED, IPS_NAT_MASQ, IPS_SEEN_REPLY, IPS_ASSURED |
| timeout | struct timer_list | Per-state timeout: TCP ESTABLISHED=432000s, UDP=30s, ICMP=30s |
| nat.info | struct nf_nat_conn_info | Stores allocated NAT port and IP (for SNAT/DNAT) |
| proto | union nf_conntrack_proto | Protocol-specific state (TCP: window scale, sequence numbers) |
Core Mechanism — SNAT Port Selection
- First packet hits POSTROUTING hook; nf_nat module sees no existing entry
- nf_nat_get_unique_tuple() picks an ephemeral port (1024–65535) not already used by another conn
- Kernel rewrites skb src IP + src port; updates IP/TCP checksums
- Stores original ↔ NATted tuple pair in nf_conn
- Reply arrives with dst = NATted port; conntrack lookup finds the entry; reverse rewrite restores original dst
Minimal C Demo — SNAT Connection Tracker
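A user-space sketch of the SNAT flow above: allocate a unique translated port on the first packet of a flow, remember the original ↔ NATted tuple pair, and reverse-translate replies. The struct names only loosely echo nf_conn; all addresses and ports are illustrative.

```c
/* Toy SNAT connection tracker: per-flow port allocation + reply lookup. */
#include <stdio.h>
#include <stdint.h>

struct tuple { uint32_t ip; uint16_t port; };
struct conn  { struct tuple orig, nat; int used; };

#define MAX_CONN 64
static struct conn table[MAX_CONN];
static uint16_t next_port = 1024;                 /* ephemeral SNAT port pool */

/* Outbound direction: reuse an existing mapping or allocate a new one */
static struct conn *snat(struct tuple orig, uint32_t public_ip) {
    for (int i = 0; i < MAX_CONN; i++)
        if (table[i].used && table[i].orig.ip == orig.ip &&
            table[i].orig.port == orig.port)
            return &table[i];                     /* existing flow: reuse mapping */
    for (int i = 0; i < MAX_CONN; i++)
        if (!table[i].used) {
            table[i].orig = orig;
            table[i].nat  = (struct tuple){ public_ip, next_port++ };
            table[i].used = 1;
            return &table[i];
        }
    return NULL;                                  /* table full */
}

/* Reply direction: look up by NATted port, restore the original tuple */
static struct conn *lookup_reply(uint16_t nat_port) {
    for (int i = 0; i < MAX_CONN; i++)
        if (table[i].used && table[i].nat.port == nat_port)
            return &table[i];
    return NULL;
}

int main(void) {
    uint32_t public_ip = 0xC6336401;              /* 198.51.100.1 */
    struct conn *c = snat((struct tuple){ 0x0A000005, 42000 }, public_ip);
    printf("SNAT: 10.0.0.5:42000 -> public port %u\n", c->nat.port);
    struct conn *r = lookup_reply(c->nat.port);
    printf("reply to port %u -> restore 10.0.0.%u:%u\n",
           c->nat.port, r->orig.ip & 0xFF, r->orig.port);
    return 0;
}
```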
12. § 7.11 — Network Bridge
The Linux kernel bridge (br_*) implements an IEEE 802.1D Ethernet bridge entirely in software. Each port is a net_device enslaved to a bridge device. Frames are forwarded using a MAC learning table (FDB — Forwarding Database). Unknown destination MACs are flooded to all ports; known MACs are forwarded unicast.
FDB Learning & Forwarding
| Struct / Function | Purpose |
|---|---|
| struct net_bridge | Master bridge device — holds port list, FDB hash table, STP state |
| struct net_bridge_port | One per enslaved interface — port state (forwarding/blocking/learning), STP timers |
| struct net_bridge_fdb_entry | One FDB entry: MAC → port, ageing timer (default 300 s) |
| br_fdb_update() | Called on every received frame to learn src MAC → ingress port mapping |
| br_forward() / br_flood() | Unicast forward to known port or flood to all ports (minus ingress) |
| ebtables | L2 filtering framework — analogous to iptables but operates on Ethernet frames |
Minimal C Demo — Bridge MAC Learning Table
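A toy FDB sketch of the learn/forward/flood logic: every received frame learns its source MAC on the ingress port, and the destination MAC decides between unicast forwarding and flooding. MAC addresses and port numbers are invented, and the ageing timer is omitted for brevity.

```c
/* Bridge MAC learning table: learn on RX, forward unicast, else flood. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct fdb_entry { uint8_t mac[6]; int port; int valid; };
#define FDB_SIZE 16
static struct fdb_entry fdb[FDB_SIZE];

static void fdb_update(const uint8_t *mac, int port) {       /* learn src MAC */
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].valid && !memcmp(fdb[i].mac, mac, 6)) { fdb[i].port = port; return; }
    for (int i = 0; i < FDB_SIZE; i++)
        if (!fdb[i].valid) {
            memcpy(fdb[i].mac, mac, 6); fdb[i].port = port; fdb[i].valid = 1; return;
        }
}

static int fdb_lookup(const uint8_t *mac) {                  /* -1 = unknown */
    for (int i = 0; i < FDB_SIZE; i++)
        if (fdb[i].valid && !memcmp(fdb[i].mac, mac, 6)) return fdb[i].port;
    return -1;
}

static void bridge_rx(const uint8_t *src, const uint8_t *dst, int ingress) {
    fdb_update(src, ingress);
    int out = fdb_lookup(dst);
    if (out < 0)             printf("port %d: unknown dst -> flood\n", ingress);
    else if (out == ingress) printf("port %d: dst on same port -> drop\n", ingress);
    else                     printf("port %d: forward to port %d\n", ingress, out);
}

int main(void) {
    uint8_t a[6] = {0x52,0x54,0,0,0,0x01}, b[6] = {0x52,0x54,0,0,0,0x02};
    bridge_rx(a, b, 1);   /* b unknown -> flood; learn a on port 1          */
    bridge_rx(b, a, 2);   /* a known   -> forward to port 1; learn b on 2   */
    bridge_rx(a, b, 1);   /* b known   -> forward to port 2                 */
    return 0;
}
```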
13. § 7.12 — GRE & VXLAN Tunnels
Overlay tunnels encapsulate one packet inside another so that tenant traffic can traverse an underlay network without leaking tenant MAC/IP addresses. VXLAN (Virtual eXtensible LAN) wraps an Ethernet frame in UDP/IP, adding a 24-bit VNI (VXLAN Network Identifier) to separate up to 16 million tenant networks over the same physical fabric.
VXLAN Encapsulation Format
VXLAN RX Path
| Concept | Detail |
|---|---|
| VTEP (VXLAN Tunnel End Point) | The host NIC + vxlan driver that adds/strips the outer UDP/IP + VXLAN header |
| VNI (VXLAN Network Identifier) | 24-bit tenant ID — analogous to VLAN ID but 16M namespaces vs 4096 |
| VXLAN port 4789 | IANA-assigned UDP destination port for all VXLAN traffic |
| Outer UDP src port | Hash of inner 5-tuple (for ECMP load balancing across LAG / ECMP paths) |
| Underlay requirement | MTU ≥ inner MTU + 50 bytes overhead — typically set to 1550 or jumbo frames |
| GRE vs VXLAN | GRE: IP proto 47, no src port → no ECMP. VXLAN: UDP → ECMP-friendly. GENEVE extends VXLAN with TLV options |
| Checksum offload | vxlan driver sets NETIF_F_GSO_UDP_TUNNEL so NIC can offload inner TCP segmentation |
Minimal C Demo — VXLAN Header Encode / Decode
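A sketch of packing and unpacking the 8-byte VXLAN header (flags byte with the I bit set plus a 24-bit VNI, per RFC 7348). The VNI value 5001 is just an example tenant ID.

```c
/* VXLAN header encode/decode: I flag in byte 0, 24-bit VNI in bytes 4-6,
 * remaining bytes reserved (zero). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void vxlan_encode(uint8_t hdr[8], uint32_t vni) {
    memset(hdr, 0, 8);
    hdr[0] = 0x08;                  /* I flag: VNI field is valid */
    hdr[4] = (vni >> 16) & 0xFF;    /* 24-bit VNI, network byte order */
    hdr[5] = (vni >> 8)  & 0xFF;
    hdr[6] =  vni        & 0xFF;
}

static int vxlan_decode(const uint8_t hdr[8], uint32_t *vni) {
    if (!(hdr[0] & 0x08)) return -1;              /* VNI not marked valid */
    *vni = (hdr[4] << 16) | (hdr[5] << 8) | hdr[6];
    return 0;
}

int main(void) {
    uint8_t hdr[8];
    uint32_t vni;
    vxlan_encode(hdr, 5001);                      /* tenant network 5001 */
    if (vxlan_decode(hdr, &vni) == 0)
        printf("decoded VNI = %u (flags=0x%02x vni bytes=%02x %02x %02x)\n",
               vni, hdr[0], hdr[4], hdr[5], hdr[6]);
    return 0;
}
```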
14. § 7.13 — IPv6
IPv6 uses 128-bit addresses and eliminates broadcast in favor of multicast. The Neighbor Discovery Protocol (NDP) replaces ARP; Router Advertisements (RA) distribute prefixes so hosts can self-configure via SLAAC without a DHCP server.
IPv6 Address Types & Scope
SLAAC — Stateless Address Autoconfiguration
| NDP Message | ICMPv6 Type | Purpose |
|---|---|---|
| Router Solicitation (RS) | 133 | Host → ff02::2 asking for prefix + gateway info |
| Router Advertisement (RA) | 134 | Router → ff02::1 advertising prefix, MTU, default route; triggers SLAAC |
| Neighbor Solicitation (NS) | 135 | Who has this IP? (like ARP request); also used for DAD |
| Neighbor Advertisement (NA) | 136 | I have this IP (like ARP reply); Source Link-Layer Address (SLLA) option carries MAC |
| Redirect | 137 | Router tells host about a better next-hop for a destination |
Minimal C Demo — EUI-64 Interface ID from MAC
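A sketch of the EUI-64 derivation used by SLAAC: flip the universal/local bit of the first MAC octet and insert FF:FE in the middle to form the 64-bit interface ID. The MAC address below is an arbitrary example.

```c
/* EUI-64 interface ID from a 48-bit MAC (SLAAC step: flip U/L bit, insert FF:FE). */
#include <stdio.h>
#include <stdint.h>

static void eui64_from_mac(const uint8_t mac[6], uint8_t iid[8]) {
    iid[0] = mac[0] ^ 0x02;        /* flip the universal/local bit */
    iid[1] = mac[1];
    iid[2] = mac[2];
    iid[3] = 0xFF;                 /* inserted FF:FE in the middle */
    iid[4] = 0xFE;
    iid[5] = mac[3];
    iid[6] = mac[4];
    iid[7] = mac[5];
}

int main(void) {
    uint8_t mac[6] = {0x52, 0x54, 0x00, 0x12, 0x34, 0x56};   /* example MAC */
    uint8_t iid[8];
    eui64_from_mac(mac, iid);
    /* e.g. prefix 2001:db8:1::/64 + this IID = the host's SLAAC address */
    printf("IID: %02x%02x:%02xff:fe%02x:%02x%02x\n",
           iid[0], iid[1], iid[2], iid[5], iid[6], iid[7]);
    return 0;
}
```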
15. § 7.14 — Multicast (IGMP & MLD)
Multicast delivers one stream to many receivers without per-receiver replication at the source. IGMP (IPv4) and MLD (IPv6) are the group management protocols that hosts use to join/leave groups. PIM-SM builds the distribution tree. The RPF check prevents routing loops by verifying that multicast traffic arrives on the correct upstream interface.
IGMP Join Flow (SSM — Source-Specific Multicast)
RPF Check — Loop Prevention
| Concept | Detail |
|---|---|
| IGMP v1/v2/v3 | v1: join only. v2: + leave (prune faster). v3: SSM — report specific source(s) to receive from |
| MLD v1/v2 | IPv6 multicast listener discovery — MLDv2 adds SSM like IGMPv3; MLDv1 uses ICMPv6 types 130–132, the MLDv2 report is type 143 |
| PIM-SM (Sparse Mode) | Builds shared tree via Rendezvous Point (RP), then optionally switches to source tree (SPT) |
| Shared tree (*,G) vs source tree (S,G) | (*,G) = any source for group G — shared RP tree. (S,G) = specific source S — lowest latency, shortest path |
| RPF interface | The interface that unicast routing would use to reach the multicast source — packet must arrive here |
| mfc_cache | Multicast Forwarding Cache entry: (src, group) → OIF list (output interfaces) |
| IGMP snooping | Switch learns group membership from IGMP frames — prevents flooding multicast to all ports |
Minimal C Demo — RPF Check
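A toy RPF check against a small static unicast table: a multicast packet from source S arriving on interface I is accepted only if the unicast route back to S also points at I. Interface names and prefixes are invented for illustration.

```c
/* Reverse Path Forwarding check: route lookup on the *source* address,
 * compare the expected output interface with the actual ingress interface. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct route { uint32_t prefix, mask; const char *oif; };

static const struct route rt[] = {
    { 0x0A010000, 0xFFFF0000, "eth0" },   /* 10.1.0.0/16 via eth0 */
    { 0x0A020000, 0xFFFF0000, "eth1" },   /* 10.2.0.0/16 via eth1 */
    { 0x00000000, 0x00000000, "eth0" },   /* default via eth0 (last = least specific) */
};

static const char *unicast_lookup(uint32_t dst) {
    for (size_t i = 0; i < sizeof rt / sizeof rt[0]; i++)
        if ((dst & rt[i].mask) == rt[i].prefix) return rt[i].oif;
    return NULL;
}

static int rpf_check(uint32_t src, const char *iif) {
    const char *expect = unicast_lookup(src);
    return expect && strcmp(expect, iif) == 0;     /* 1 = accept, 0 = drop */
}

int main(void) {
    uint32_t src = 0x0A020005;                     /* source 10.2.0.5 */
    printf("src 10.2.0.5 in on eth1: %s\n",
           rpf_check(src, "eth1") ? "accept" : "drop");
    printf("src 10.2.0.5 in on eth0: %s\n",
           rpf_check(src, "eth0") ? "accept" : "drop (RPF fail)");
    return 0;
}
```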
16. § 7.15 — DHCP Protocol & Proxy DHCP
DHCP (Dynamic Host Configuration Protocol) automates IP address assignment via a four-message exchange called DORA. In cloud VPC environments, virtual switches implement a DHCP proxy (代答) that intercepts DISCOVER packets and synthesizes replies from a local config database — eliminating dependency on a real DHCP server and giving the hypervisor full control over tenant addressing.
DHCP DORA Sequence (with Relay)
Virtual Switch DHCP Proxy (代答)
| DHCP Option | Code | Purpose |
|---|---|---|
| Subnet Mask | 1 | Netmask for assigned IP (e.g., 255.255.255.0) |
| Router (gateway) | 3 | Default gateway IP — must be set for off-subnet traffic |
| DNS Servers | 6 | Up to 8 DNS server IPs |
| Lease Time | 51 | Seconds until IP expires; use 0xFFFFFFFF for infinite (VPC) |
| DHCP Message Type | 53 | 1=DISCOVER 2=OFFER 3=REQUEST 5=ACK 6=NAK |
| Server Identifier | 54 | DHCP server IP — client uses this to select among multiple offers |
| Relay Agent Info | 82 | Sub-options: circuit-id (ingress port), remote-id (switch MAC) — added by relay |
| Mode | Flow | Address Source |
|---|---|---|
| Stateful (IA_NA) | Solicit → Advertise → Request → Reply | DHCPv6 server assigns full /128 |
| Stateless (SLAAC + options) | RA → SLAAC address + DHCPv6 for options only | Host self-configures via EUI-64; DHCPv6 supplies DNS/NTP |
Minimal C Demo — DHCP DORA State Machine
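A sketch of the client-side DORA state machine with the proxy's OFFER and ACK replies simulated in-line, so the trace simply shows the state transitions described above. The message-type codes follow DHCP option 53; the state names are the conventional client states.

```c
/* DHCP DORA state machine: INIT -> SELECTING -> REQUESTING -> BOUND. */
#include <stdio.h>

enum dhcp_state { INIT, SELECTING, REQUESTING, BOUND };
enum dhcp_msg   { NONE = 0, DISCOVER = 1, OFFER = 2, REQUEST = 3, ACK = 5, NAK = 6 };

static const char *state_name[] = { "INIT", "SELECTING", "REQUESTING", "BOUND" };

static enum dhcp_state step(enum dhcp_state s, enum dhcp_msg in, enum dhcp_msg *out) {
    *out = NONE;
    switch (s) {
    case INIT:       *out = DISCOVER; return SELECTING;        /* broadcast DISCOVER */
    case SELECTING:  if (in == OFFER) { *out = REQUEST; return REQUESTING; } break;
    case REQUESTING: if (in == ACK) return BOUND;               /* lease committed   */
                     if (in == NAK) return INIT;  break;
    case BOUND:      break;
    }
    return s;
}

int main(void) {
    enum dhcp_state s = INIT;
    enum dhcp_msg proxy_replies[] = { NONE, OFFER, ACK };        /* simulated proxy */
    for (int i = 0; i < 3 && s != BOUND; i++) {
        enum dhcp_msg out;
        enum dhcp_state next = step(s, proxy_replies[i], &out);
        printf("%-10s --(rx type %d / tx type %d)--> %s\n",
               state_name[s], proxy_replies[i], out, state_name[next]);
        s = next;
    }
    return 0;
}
```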
17. § 7.16 — Routing Protocols: RIP, OSPF, BGP
Dynamic routing protocols allow routers to exchange reachability information and converge on consistent forwarding tables. RIP is distance-vector (hop count); OSPF is link-state (Dijkstra SPF); BGP is path-vector (AS-path, policy-driven). In cloud infrastructure, BGP (via BIRD or FRR) is used for advertising VM IP prefixes during live migration, anycast VIP advertisement, and cross-datacenter routing.
| Protocol | Type | Metric | Algorithm | Scope |
|---|---|---|---|---|
| RIP v2 | Distance-vector | Hop count (max 15) | Bellman-Ford, periodic full-table broadcast every 30 s | Small LANs; obsolete in production |
| OSPF v2 | Link-state | Cost (bandwidth-based) | Dijkstra SPF; LSA flooding within area | Enterprise / datacenter IGP |
| BGP-4 | Path-vector | AS_PATH + policy attributes | Policy-driven best-path selection | Inter-AS Internet routing; datacenter BGP-only fabric |
OSPF Neighbor State Machine
OSPF Area Hierarchy
| LSA Type | Name | Flooded to | Purpose |
|---|---|---|---|
| Type 1 | Router LSA | Within area | Each router's links and their costs |
| Type 2 | Network LSA | Within area | DR-generated: lists all routers on multi-access segment |
| Type 3 | Summary LSA | Other areas | ABR advertises inter-area prefixes into backbone/areas |
| Type 4 | ASBR Summary LSA | Other areas | ABR advertises path to ASBR |
| Type 5 | AS-External LSA | Entire OSPF domain | ASBR redistributes external routes (blocked by Stub areas) |
| Type 7 | NSSA External LSA | NSSA area only | Local ASBR in NSSA; ABR translates to Type 5 at border |
BGP State Machine
BGP Best-Path Selection (in order)
| BGP Attribute | Type | Meaning |
|---|---|---|
| AS_PATH | Well-known mandatory | Ordered list of ASes the route has traversed — loop prevention + path length metric |
| NEXT_HOP | Well-known mandatory | IP of the next-hop router; iBGP peers must resolve this via IGP |
| LOCAL_PREF | Well-known discretionary | Set within an AS to prefer one exit point over another; higher wins |
| MED | Optional non-transitive | Multi-Exit Discriminator — hint to neighboring AS which entry to prefer; lower wins |
| COMMUNITY | Optional transitive | 32-bit tag for grouping routes; used for policy (e.g., blackhole, no-export) |
| ORIGIN | Well-known mandatory | Route origin: IGP (i) < EGP (e) < Incomplete (?) — lower preferred |
When a route becomes unreachable, RIP routers may advertise stale hop counts back to each other, incrementing forever until max (16 = infinity). Mitigations: split horizon (never advertise a route back out the interface it was learned on), poison reverse (advertise with metric 16 back on that interface — faster convergence), and triggered updates (send immediately on topology change rather than waiting 30 s).
During VM live migration, the destination hypervisor needs to attract traffic for the migrated VM's IP without waiting for ARP to age out on every ToR switch. The hypervisor runs a BIRD BGP daemon that announces the VM's /32 host route to the ToR via iBGP/eBGP. The ToR installs a more-specific /32 that overrides the /24 subnet route, redirecting traffic immediately. After migration completes, the source hypervisor withdraws its announcement.
Minimal C Demo — BGP Best-Path Selection
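A sketch of best-path comparison over a subset of the attributes listed above (LOCAL_PREF, AS_PATH length, ORIGIN, MED). The real decision process has more tie-breakers (weight, eBGP vs iBGP, IGP metric, router ID), and the candidate routes below are fabricated examples.

```c
/* BGP best-path selection over a few attributes, in decision order. */
#include <stdio.h>

struct bgp_route {
    const char *via;
    int local_pref;     /* higher wins                               */
    int as_path_len;    /* shorter wins                              */
    int origin;         /* 0=IGP 1=EGP 2=Incomplete; lower wins      */
    int med;            /* lower wins (same neighbor AS assumed)     */
};

/* return <0 if a is better, >0 if b is better */
static int bgp_compare(const struct bgp_route *a, const struct bgp_route *b) {
    if (a->local_pref  != b->local_pref)  return b->local_pref - a->local_pref;
    if (a->as_path_len != b->as_path_len) return a->as_path_len - b->as_path_len;
    if (a->origin      != b->origin)      return a->origin - b->origin;
    return a->med - b->med;
}

int main(void) {
    struct bgp_route routes[] = {
        { "peer A", 100, 3, 0, 10 },
        { "peer B", 200, 5, 0, 50 },   /* higher LOCAL_PREF beats shorter AS_PATH */
        { "peer C", 200, 4, 2, 5  },   /* same LOCAL_PREF, shorter AS_PATH wins   */
    };
    const struct bgp_route *best = &routes[0];
    for (size_t i = 1; i < sizeof routes / sizeof routes[0]; i++)
        if (bgp_compare(&routes[i], best) < 0) best = &routes[i];
    printf("best path via %s (local_pref=%d, as_path_len=%d)\n",
           best->via, best->local_pref, best->as_path_len);
    return 0;
}
```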
18. Kernel Source Pointers
| Concept | File | Key Function / Symbol |
|---|---|---|
| struct socket / sock | include/linux/net.h, include/net/sock.h | sock_alloc(), sk_alloc() |
| struct tcp_sock | include/linux/tcp.h | tcp_sk(sk) cast macro |
| Socket creation | net/socket.c | sys_socket() → __sock_create() → inet_create() |
| struct sk_buff | include/linux/skbuff.h | alloc_skb(), skb_put(), skb_push(), skb_pull(), skb_clone() |
| skb_shared_info | include/linux/skbuff.h | skb_shinfo(skb) macro → (skb_shared_info*)skb->end |
| GSO | net/core/gso.c | __skb_gso_segment(), skb_gso_reset() |
| GRO | net/core/gro.c | napi_gro_receive(), dev_gro_receive() |
| TX path | net/core/dev.c | dev_queue_xmit() → __dev_queue_xmit() → sch_direct_xmit() |
| RX path | net/core/dev.c | netif_receive_skb() → __netif_receive_skb() → deliver_skb() |
| NAPI | net/core/dev.c | napi_schedule(), net_rx_action(), napi_complete_done() |
| struct net_device | include/linux/netdevice.h | alloc_etherdev(), register_netdev() |
| TC qdisc | net/sched/sch_generic.c | qdisc_enqueue(), qdisc_dequeue_head() |
| HTB | net/sched/sch_htb.c | htb_enqueue(), htb_dequeue(), htb_charge_class() |
| fq_codel | net/sched/sch_fq_codel.c | fq_codel_enqueue(), codel_should_drop() |
| IP receive | net/ipv4/ip_input.c | ip_rcv(), ip_rcv_finish(), ip_local_deliver() |
| IP forward / output | net/ipv4/ip_forward.c, ip_output.c | ip_forward(), ip_output(), ip_finish_output2() |
| IP fragmentation | net/ipv4/ip_output.c, ip_fragment.c | ip_fragment(), ip_defrag(), struct ipq |
| FIB trie / LPM | net/ipv4/fib_trie.c | fib_table_lookup(), tnode_get_child_rcu() |
| Route lookup | net/ipv4/route.c | ip_route_input_noref(), ip_route_output_key(), alloc_cache_rt() |
| Neighbour / ARP | net/core/neighbour.c, net/ipv4/arp.c | neigh_lookup(), neigh_update(), arp_send() |
| TCP state machine | net/ipv4/tcp_input.c | tcp_rcv_state_process(), tcp_v4_rcv() |
| TCP handshake | net/ipv4/tcp_ipv4.c | tcp_v4_connect(), tcp_v4_syn_recv_sock() |
| SYN cookie | net/ipv4/syncookies.c | cookie_v4_check(), __cookie_v4_init_sequence() |
| Congestion control | net/ipv4/tcp_cong.c, tcp_cubic.c | tcp_cong_ops, bictcp_cong_avoid() |
| UDP receive | net/ipv4/udp.c | udp_rcv(), udp4_lib_rcv(), udp_queue_rcv_skb() |
| Netfilter hooks | net/netfilter/core.c | nf_hook_slow(), nf_register_net_hook() |
| Connection tracking | net/netfilter/nf_conntrack_core.c | nf_conntrack_in(), init_conntrack(), __nf_ct_refresh_acct() |
| NAT | net/netfilter/nf_nat_core.c | nf_nat_packet(), nf_nat_get_unique_tuple(), nf_nat_setup_info() |
| iptables / nftables | net/ipv4/netfilter/ip_tables.c, net/netfilter/nf_tables_core.c | ipt_do_table(), nft_do_chain() |
| Network bridge | net/bridge/br.c, br_fdb.c, br_forward.c | br_handle_frame(), br_fdb_update(), br_forward() |
| STP | net/bridge/br_stp.c | br_stp_enable_port(), br_received_config_bpdu() |
| VXLAN | drivers/net/vxlan/vxlan_core.c | vxlan_rcv(), vxlan_xmit(), vxlan_fdb_find() |
| GRE tunnel | net/ipv4/ip_gre.c | ipgre_rcv(), ipgre_xmit() |
| IPv6 receive | net/ipv6/ip6_input.c | ipv6_rcv(), ip6_rcv_finish() |
| NDP / RA | net/ipv6/ndisc.c | ndisc_recv_ra(), ndisc_recv_ns(), addrconf_dad_start() |
| SLAAC / addrconf | net/ipv6/addrconf.c | addrconf_prefix_rcv(), ipv6_generate_eui64() |
| IGMP | net/ipv4/igmp.c | igmp_rcv(), ip_mc_join_group(), igmpv3_sendpack() |
| MLD | net/ipv6/mcast.c | mld_rcv(), ipv6_sock_mc_join() |
| Multicast forwarding | net/ipv4/ipmr.c | ip_mr_forward(), ipmr_cache_find(), mfc_cache_put() |
| OSPF (user-space) | FRR: ospfd/ospf_interface.c, BIRD: proto/ospf/ | ospf_hello_send(), ospf_spf_calculate(), ospf_lsa_install() |
| BGP (user-space) | FRR: bgpd/bgp_best.c, BIRD: proto/bgp/ | bgp_best_selection(), bgp_process(), bgp_update_main() |
| Netlink RIB install | net/core/rtnetlink.c | rtnl_newroute(), fib_new_table(), ip_rt_ioctl() |
| DHCP (user-space) | isc-dhcp client, dhcpcd, systemd-networkd | Sends DISCOVER via raw socket (AF_PACKET), listens on udp/68 |
19. Interview Prep
head/end bound the allocated buffer (never move). data points to the start of current packet data; tail to the end. Headroom (head→data) is reserved for TX header prepending via skb_push(). skb_shared_info lives at end for scatter-gather fragments and GSO metadata.
NAPI switches from per-packet hard IRQs to a polling model under load. The first packet triggers a hard IRQ; the handler disables NIC RX interrupts and schedules a softirq poll. The poll processes up to budget packets per call. When the ring drains, the NIC interrupt is re-armed. This amortizes interrupt overhead: at 14 Mpps, per-packet IRQs would consume 100% CPU in interrupt context.
write()/sendmsg() → tcp_sendmsg() (segments, adds TCP header) → ip_queue_xmit() (IP header + route lookup) → dev_queue_xmit() (enqueue to qdisc) → qdisc dequeues → ndo_start_xmit() (driver fills TX ring descriptor) → NIC DMA reads descriptor + payload → TX complete interrupt frees skb.
skb_clone() creates a new sk_buff header pointing to the same data buffer (users++ on shared_info). Zero cost but cloned skb data is read-only. skb_copy() allocates a completely new buffer and copies all data — required before in-place modification (e.g., NAT IP rewrite). pskb_copy() copies the linear area but shares frags[].
HTB builds a class hierarchy. Each leaf class has a rate (guaranteed) and ceil (burst). The dequeue path walks the hierarchy, selects the class with the earliest eligible packet by virtual time (deficit round-robin at each level). A class below rate borrows tokens from its parent up to ceil when the parent has surplus. This ensures minimum guaranteed bandwidth while allowing burst into idle capacity.
ip_rcv_finish() calls ip_route_input_noref() which performs a FIB lookup. If the destination matches a local address the dst.input is set to ip_local_deliver(). If IP forwarding is enabled and the route is to a remote host, dst.input is ip_forward(). ip_forward() decrements TTL, checks MTU (fragments if needed), then calls ip_output() to re-enter the TX path.
LPM (Longest Prefix Match) returns the routing entry whose prefix is the most specific match for a given destination IP. Linux uses an LC-trie (level-compressed trie) called fib_trie. Internal tnodes index into child arrays using multi-bit slices of the address; leaves hold fib_alias entries. Compression reduces the average lookup depth from 32 to 4–6 steps even with 800K BGP routes.
NUD (Neighbour Unreachability Detection) tracks IP→MAC cache validity: INCOMPLETE (waiting for reply) → REACHABLE (confirmed) → STALE (timer expired) → DELAY (wait for upper-layer confirmation) → PROBE (send unicast ARP) → FAILED (drop). Gratuitous ARP is an ARP request where sender IP = target IP — it announces IP ownership, updates neighbour caches on peers, and detects IP conflicts. Used heavily for VIP failover in virtual switches.
Client: tcp_v4_connect() sends SYN, state→SYN_SENT. Server: tcp_v4_rcv()→tcp_rcv_state_process() receives SYN in LISTEN, allocates request_sock, sends SYN-ACK, state→SYN_RECEIVED. Client receives SYN-ACK, sends ACK, state→ESTABLISHED. Server: receives ACK, calls tcp_v4_syn_recv_sock()→tcp_create_openreq_child(), state→ESTABLISHED, wakes accept().
SYN cookies encode all connection state (src/dst IP, ports, MSS index, timestamp) in the initial sequence number. The server sends SYN-ACK without allocating any backlog entry. When the client's ACK arrives, the server decodes ack_num-1 and verifies the hash — if valid, creates the socket. This eliminates the SYN backlog as an attack surface. Tradeoff: TCP options like SACK and window scaling may not be negotiated, slightly reducing throughput.
Slow start: begin with cwnd=1 MSS, double cwnd each RTT (exponential growth) until cwnd reaches ssthresh, then switch to congestion avoidance (+1 MSS/RTT, linear). Fast retransmit: 3 duplicate ACKs indicate a lost segment without waiting for RTO; retransmit immediately, set ssthresh=cwnd/2, cwnd=ssthresh (fast recovery). RTO timeout is more severe: ssthresh=cwnd/2 and cwnd=1, restarting slow start.
Forwarded path: PREROUTING (conntrack, DNAT) → routing decision → FORWARD (firewall rules) → POSTROUTING (SNAT/masquerade). For locally destined packets: PREROUTING → INPUT → local socket. For locally generated packets: OUTPUT → routing → POSTROUTING. nftables and iptables both register callbacks at these same NF_INET_* hook points.
nf_nat_get_unique_tuple() tries the original port first; if it is already used by another conntrack entry (same dst IP:port), it iterates through the configured port range (1024–65535 by default) hashing through the range until it finds one where no nf_conn with the same (proto, nat_ip, nat_port, dst_ip, dst_port) exists. The result is stored in nf_conn so every subsequent packet in the same flow reuses it.
Every frame received on a bridge port triggers br_fdb_update() which looks up (or creates) an FDB entry keyed on src MAC, records the ingress port, and resets the ageing timer (default 300 s). For forwarding, br_forward() does an FDB lookup on dst MAC — if found, unicast; if not (unknown unicast, broadcast, or multicast), br_flood() copies the skb to all ports except the ingress. STP marks ports as forwarding/blocking to prevent loops.
VXLAN (RFC 7348) wraps an inner Ethernet frame in an 8-byte VXLAN header (flags + 24-bit VNI) then UDP + IP. UDP is used because: (1) the UDP src port is set to a hash of the inner 5-tuple, enabling ECMP load balancing across LAG/ECMP paths that hash on UDP ports — GRE lacks a src port and cannot ECMP. (2) UDP checksum can be disabled (inner layers handle their own). VTEP overhead is 50 bytes, reducing inner MTU from 1500 to 1450 bytes.
SLAAC (RFC 4862): (1) Host sends Router Solicitation to ff02::2. (2) Router replies with RA containing prefix (e.g., 2001:db8:1::/64) + valid/preferred lifetimes. (3) Host generates interface ID using EUI-64 from MAC (flip U/L bit, insert FF:FE) or privacy extensions (RFC 8981). (4) Combine: prefix + IID = full /128 address. (5) DAD: send NS to the solicited-node multicast address; if no NA arrives within the retransmit interval (≈1 s), the address is unique and assigned.
Reverse Path Forwarding check: when a multicast packet arrives on interface I from source S, the kernel performs a unicast route lookup for S. If the route to S would exit via a different interface than I, the packet is dropped. This prevents routing loops: in a loop, a multicast packet could bounce between routers indefinitely; RPF ensures the packet arrived from the correct upstream (towards the source tree). Implemented in ipmr.c for IPv4, ip6mr.c for IPv6.
IGMPv2: hosts report group membership (join *,G — any source) and send Leave Group messages, triggering Last Member Query before removing state. IGMPv3 (RFC 3376) adds Source-Specific Multicast (SSM): hosts specify which sources they want to receive from (INCLUDE mode) or exclude (EXCLUDE mode). This allows the network to build source-rooted shortest-path trees (S,G) instead of shared RP trees, eliminating the Rendezvous Point bottleneck and enabling the kernel to drop traffic from unwanted sources at the router before it reaches the receiver.
When a VM boots and broadcasts a DHCP DISCOVER, the vSwitch intercepts it on the VM's virtual port before forwarding. It looks up the VM's MAC (or port ID) in a config database pre-populated by the VPC control plane with the assigned IP, mask, gateway, DNS, and lease time. The vSwitch synthesizes a DHCP OFFER and sends it directly to the VM without involving a real DHCP server. This gives the hypervisor full control of addressing (critical for VPC isolation), eliminates the single-point-of-failure DHCP server, and allows infinite lease times so VMs never try to renew.
RIP: distance-vector, hop-count metric (max 15), 30-second full-table broadcast, slow convergence, count-to-infinity risk. Only suitable for small flat networks. OSPF: link-state, Dijkstra SPF, fast convergence (sub-second with BFD), area hierarchy scales to large enterprise. The preferred IGP. BGP: path-vector, AS_PATH prevents loops, policy-driven via LOCAL_PREF/MED/COMMUNITY, carries the full Internet table (800K+ routes). Used between ISPs, within large DCs (BGP-only fabric), and for advertising VM host routes during live migration via BIRD/FRR.
Down → Init (Hello received, neighbor's Router-ID not yet in our Hello) → 2-Way (our Router-ID appears in neighbor's Hello — basic bidirectional) → ExStart (master/slave negotiated via DBD, higher Router-ID becomes master) → Exchange (exchange full DBD summaries) → Loading (request missing LSAs via LSR, receive LSU) → Full (databases synchronized — neighbor is now an SPF input). On multi-access Ethernet, only DR/BDR reach Full with all routers; DROthers stay in 2-Way with each other.
1. Highest WEIGHT (Cisco-local, not advertised). 2. Highest LOCAL_PREF (intra-AS policy — which exit AS to prefer). 3. Locally originated (network/aggregate > iBGP learned). 4. Shortest AS_PATH (fewer AS hops). 5. Lowest ORIGIN (IGP < EGP < Incomplete). 6. Lowest MED (only compared among routes from same neighbor AS). 7. eBGP over iBGP. 8. Lowest IGP metric to NEXT_HOP. 9. Oldest eBGP route (stability). 10. Lowest BGP Router-ID (deterministic tiebreak). The winning route is installed in RIB and, if changed, triggers UPDATE messages to peers.
The destination hypervisor configures BIRD with a static /32 route for the migrating VM's IP pointing to the local vSwitch. BIRD's BGP session to the ToR switch advertises this /32 via iBGP or eBGP (depending on fabric design). The ToR's longest-prefix-match selects /32 over the existing /24 subnet route, redirecting all traffic to the new hypervisor immediately — without waiting for gratuitous ARP to propagate or neighbor caches to expire. When migration completes, the source hypervisor's BIRD withdraws its /32, and the /24 handles the now-unified traffic.