I run a small OpenWrt network: two routers, a wireless mesh backbone across a hard-to-wire span, encrypted with WireGuard, bridging a few VLANs with VXLAN. The architecture is described in its own deep-dive. This post is the other half: the part that does not make it into architecture diagrams.
These are eleven failures from a real small-network deployment, roughly in descending order of how much grief they caused. A few are specific to this class of hardware; most are the kind of thing that bites any layered network. The pattern underneath nearly all of them is the same: the failure is silent. The interface comes up. The service reports running. The ping succeeds. And the thing still does not work.
I will call the two routers gw (gateway) and ap (access point). Addresses are illustrative: LAN is 10.0.10.0/24, the mesh transport is 169.254.100.0/30, and the WireGuard mesh tunnel is 10.255.0.0/30.
1. The route that ate its own tail
Symptom. The mesh radio link is healthy: the 802.11s peer shows ESTAB. But Layer 3 does not pass. Pinging the far router’s mesh address fails, and the WireGuard handshake goes stale within a minute.
Cause. OpenWrt’s WireGuard protocol handler, when it brings up a tunnel whose endpoint is not yet routable, inserts a host route to that endpoint via the default gateway:
169.254.100.1 via 10.0.10.1 dev br-lan proto static
That endpoint is the other router’s mesh address. The /32 is more specific than the connected /30 on the mesh interface, so the kernel uses it, and now ap reaches the mesh endpoint through the LAN bridge, which is itself carried over the mesh by VXLAN. Every packet to the endpoint loops back through the tunnel it is trying to establish.
Fix. Delete the route. But it is regenerated on every tunnel bring-up, so a one-shot ip route del lasts until the next ifup. The durable fix is a hotplug hook that deletes it whenever the WireGuard interface comes up.
ip route | grep 169.254.100.1 # if "via <gateway>" appears, it is the rogue route
ip route del 169.254.100.1 via 10.0.10.1
Warning
A WireGuard tunnel whose endpoint is a link-local mesh address, on a router with a default route, is a circular-dependency trap. The proto handler “helps” by routing the endpoint via the gateway, straight back into the tunnel. This was the single worst bug in the whole build because every layer below it looked perfectly healthy.
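A minimal sketch of the hotplug hook mentioned in the fix, under a few assumptions: the script sits in /etc/hotplug.d/iface/ (the filename is hypothetical), netifd passes ACTION and INTERFACE in the environment as it does for iface hotplug events, and the tunnel's logical interface is named wg_mesh. The addresses are the illustrative ones used throughout this post.

```shell
#!/bin/sh
# Sketch of /etc/hotplug.d/iface/99-wg-mesh-route (hypothetical filename).

MESH_PEER="169.254.100.1"   # far router's mesh address (illustrative)
LAN_GW="10.0.10.1"          # gateway the rogue host route points at (illustrative)

drop_rogue_route() {
    action="$1"
    iface="$2"
    # Only act when the WireGuard mesh tunnel has just come up.
    [ "$action" = "ifup" ] || return 0
    [ "$iface" = "wg_mesh" ] || return 0
    # Delete the host route only if it is actually present.
    if ip route show | grep -q "^${MESH_PEER} via ${LAN_GW} "; then
        ip route del "$MESH_PEER" via "$LAN_GW"
        logger -t wg-mesh-route "removed rogue host route to $MESH_PEER"
    fi
    return 0
}

# When run by hotplug, ACTION and INTERFACE come from the environment.
drop_rogue_route "${ACTION:-}" "${INTERFACE:-}"
```

Checking for the route before deleting it keeps the hook silent on bring-ups where the proto handler did not insert anything.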
Lesson. When a lower layer reports green and the layer above is still dead, suspect routing precedence before anything else. A more-specific route silently winning is invisible unless you go looking at the routing table itself.
2. The radios are hardware-locked to different bands
Symptom. Move an access point to a “better” 5 GHz channel and it simply does not come up. No client can see the SSID.
Cause. On this hardware the two 5 GHz radios cover different, non-overlapping slices of the band, enforced by the regulatory domain in firmware:
| Radio | Allowed bands (channels) | Not allowed |
|---|---|---|
| radio0 | UNII-3 (149–165) | UNII-1, UNII-2 |
| radio2 | UNII-1 / UNII-2A (36–64) | UNII-2C, UNII-3 |
Assign a channel outside a radio’s range and hostapd refuses to start:
Frequency XXXX (primary) not allowed for AP mode
No fallback, no warning in the web UI; the AP is just gone.
Lesson. “It worked before I changed the channel” is a complete diagnosis if you know the radios are band-locked. Read the regulatory constraints of your specific silicon before treating channel numbers as interchangeable. They are not.
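A small guard along these lines would have caught the mistake before hostapd did. It is only a sketch: the ranges are copied from the table above for this particular hardware, the function name is made up, and it checks the range only, not whether the number is a valid channel centre.

```shell
#!/bin/sh
# Succeeds if the channel falls inside the radio's allowed range.
# Ranges come from the table above; adjust them for your own hardware.
channel_allowed() {
    radio="$1"
    chan="$2"
    case "$radio" in
        radio0) lo=149; hi=165 ;;   # UNII-3
        radio2) lo=36;  hi=64  ;;   # UNII-1 / UNII-2A
        *) echo "unknown radio: $radio" >&2; return 2 ;;
    esac
    [ "$chan" -ge "$lo" ] && [ "$chan" -le "$hi" ]
}

# Usage:
# channel_allowed radio0 36 || echo "radio0 cannot use channel 36"
```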
3. Encrypted 802.11s silently does not work on this driver
Symptom. Configure SAE or PSK on the mesh interface. Stations associate. The mesh peer link never reaches ESTAB; it sits in LISTEN forever. No error anywhere.
Cause. The ath11k driver on this hardware class does not establish encrypted 802.11s peer links, with either SAE or WPA2-PSK. Both fail the same silent way.
Fix. Run the mesh unencrypted and move confidentiality up a layer: WireGuard rides on top, so the radio only ever carries ciphertext. (That is the whole thesis of the architecture deep-dive.)
Note
Do not re-test mesh encryption on this driver after every update on a hunch. Re-test only when a changelog specifically claims a mesh-encryption fix. Otherwise you are paying for someone else’s open bug with your evening.
Lesson. “Supported in the standard” and “works on your driver” are different claims. When a standard feature fails silently, check whether it is a known driver gap before assuming you misconfigured it.
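Because nothing logs the failure, the peer-link state is the only signal worth watching. The sketch below summarises it per peer. It assumes the usual `iw ... station dump` layout, where each peer starts with a `Station <mac>` line and carries a `mesh plink:` field; the function names and the mesh0 interface name are hypothetical.

```shell
#!/bin/sh
# Reads `iw dev <mesh-iface> station dump` on stdin and prints one
# "<mac> <plink-state>" line per mesh peer.
mesh_plink_states() {
    awk '/^Station/ { mac = $2 } /mesh plink:/ { print mac, $NF }'
}

# Succeeds only if at least one peer link has reached ESTAB.
mesh_has_estab_peer() {
    mesh_plink_states | grep -q ' ESTAB$'
}

# Usage:
# iw dev mesh0 station dump | mesh_has_estab_peer || echo "no established mesh peer"
```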
4. VXLANs vanish on a config reload
Symptom. Clients on the remote AP associate fine but get no DHCP and no internet. A VXLAN interface has disappeared from ip link show.
Cause. The VXLAN interfaces are created by a boot script, not by OpenWrt’s config system (UCI). Any operation that re-runs the network stack (network reload, network restart) tears them down and does not recreate them.
Fix. Re-run the boot script, or reboot.
ip -d link show type vxlan # confirm which are missing
/etc/rc.local # recreates them
Lesson. Imperative state living outside your config manager is invisible to it, and will be silently destroyed by it. If you must keep state in a boot script, make “do these interfaces still exist?” the first line of your verification checklist after any network change.
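That checklist item is easy to script. A sketch, assuming two VXLAN devices with hypothetical names (substitute whatever your boot script creates):

```shell
#!/bin/sh
# Hypothetical device names; list the interfaces your boot script creates.
EXPECTED_VXLANS="vx-guest vx-iot"

# Prints the name of every expected VXLAN device that no longer exists.
missing_vxlans() {
    for name in $EXPECTED_VXLANS; do
        # `ip link show <name>` exits non-zero when the device is absent.
        ip link show "$name" >/dev/null 2>&1 || echo "$name"
    done
}

# Usage: re-run the boot script only when something is actually missing.
# [ -z "$(missing_vxlans)" ] || sh /etc/rc.local
```

Running this after every network reload turns a silent teardown into a one-line report.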
5. The wrong wpad package, and a binary that would not let go
Symptom. Multi-SSID APs refuse to come up. hostapd logs an unknown configuration item and bails:
unknown configuration item 'bss_transition'
hostapd.add_iface failed
Cause. The router shipped with wpad-basic, which is compiled without the wireless-network-management features that roaming (802.11k/v) needs. hostapd rejects the config option outright.
Fix. Install the full wpad build. But there is a second trap: after swapping the package, wifi reload does not re-exec the running hostapd binary; the old one keeps serving, so the new features still appear missing.
killall -9 hostapd wpa_supplicant && wifi down && wifi up
Lesson. “I installed the fix and it still fails” is often “the old process is still running.” A config reload is not always a binary restart. When behavior does not match the installed version, check what is actually executing.
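The first half of that check, which wpad build is installed, can be sketched as below. It assumes an opkg-based release whose `opkg list-installed` prints `name - version` lines, and it flags only packages whose name starts with wpad-basic; any other variant still needs checking against its own feature list. The function name is made up.

```shell
#!/bin/sh
# Reads `opkg list-installed` on stdin and prints each wpad package,
# marking whether it is one of the wpad-basic builds.
wpad_variant() {
    awk '$1 ~ /^wpad/ { print $1, ($1 ~ /^wpad-basic/ ? "basic" : "not-basic") }'
}

# Usage:
# opkg list-installed | wpad_variant
```

This says nothing about which binary is currently running; after swapping packages the kill-and-restart above is still needed.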
6. The firewall I disabled kept turning itself back on
Symptom. SSH to the access-point router stops working. Ping succeeds, but TCP connections are refused and WireGuard will not handshake.
Cause. That router runs with its firewall intentionally disabled: all filtering happens on the gateway. But editing interfaces through the web UI quietly re-enables the firewall service and reloads a default-REJECT ruleset.
Fix. Flush it:
nft flush ruleset
The durable mitigation is to set that router’s firewall defaults to ACCEPT, so even if the service comes back it does not drop traffic.
Lesson. “Disabled” is not a stable state if a management tool can re-enable it as a side effect. Either remove the capability or make its default-on behavior harmless. Do not rely on a service staying off because you turned it off once.
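One way to express that mitigation through UCI is sketched here. It assumes the stock firewall config with a single `defaults` section; the function name is made up, and any zone-level policies left in the config would need the same treatment.

```shell
#!/bin/sh
# Set the firewall's default policies to ACCEPT so that a re-enabled
# firewall service no longer drops traffic on this router.
firewall_defaults_accept() {
    for chain in input output forward; do
        uci set "firewall.@defaults[0].${chain}=ACCEPT" || return 1
    done
    uci commit firewall
}

# Usage, on the access-point router only (the gateway keeps real filtering):
# firewall_defaults_accept
```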
7. The ISP rotates my IPv6 prefix without warning
Symptom. Something that hard-coded an IPv6 address derived from the delegated prefix suddenly stops working. Nothing on the internal network breaks.
Cause. The ISP re-delegates a new prefix periodically. Anything pinned to the old prefix is now pointing at addresses that no longer exist.
Fix / mitigation. A hotplug script re-applies the default IPv6 route when the WAN prefix changes, so internal routing self-heals. The deeper fix is architectural: internal addressing and the WireGuard layer use a stable ULA prefix that never rotates, so only externally-facing, prefix-derived addresses are ever affected.
Lesson. A delegated prefix is a lease, not an address. If you build anything on top of it, assume it will change and derive from a stable internal prefix instead.
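A quick audit for pinned addresses can lean on the fact that ULAs live in fc00::/7, so their first hextet always starts with fc or fd. The helper below is a sketch with a made-up name; the addresses in the usage note are examples only.

```shell
#!/bin/sh
# Succeeds when the given IPv6 address or prefix is a ULA (fc00::/7).
is_ula() {
    case "$1" in
        [fF][cCdD][0-9a-fA-F][0-9a-fA-F]:*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: flag anything pinned to a non-ULA address for a closer look.
# for addr in fd12:3456:789a::1 2001:db8:1::1; do
#     is_ula "$addr" || echo "$addr is not ULA; check whether it is prefix-derived"
# done
```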
8. Nearby metal cost me 5 dB of signal
Symptom. Marginal mesh signal. Mediocre throughput on the backbone for no obvious reason.
Cause. The gateway router sat too close to a large metal surface. Metal in the near field of the antennas detunes them and distorts the radiation pattern, so less of the transmitted energy ends up where the link needs it.
Fix. I put a few inches of non-conductive spacing between the router and the metal. Every band gained roughly 5 dB.
Tip
Before you chase a weak wireless link through driver settings and channel changes, look at what the antennas are physically sitting on and next to. Metal, mass, and proximity move RF more than most config knobs do.
Lesson. Some of your network’s behavior is governed by physics, not configuration. When the numbers are bad and the config is right, change the environment, not the settings.
9. Two reasonable features that combined into TCP stalls
Symptom. Intermittent connection timeouts (apps throwing “502”-style errors) on a device that is sitting still at home, every couple of minutes.
Cause. Two independently sensible things interacting:
- The device runs an always-on VPN to the home network and connects via a hairpin NAT path. WireGuard renegotiates session keys roughly every two minutes while traffic is flowing, and the brief loss during a re-key stalls in-flight TCP connections.
- The band-steering daemon sends the same device an 802.11v transition request every couple of minutes. Each request can make the Wi-Fi stack go off-channel briefly to evaluate the suggested AP, a 50–200 ms blackout.
Either alone is tolerable. Together, on the same device, they line up into repeated stalls.
Fix. Stop the steering daemon for stationary devices, or split-tunnel the always-on VPN so home traffic does not hairpin at all.
Lesson. The hardest failures are not single bugs: they are two correct behaviors that compose badly. When nothing is individually broken, ask what is interacting. Look for the beat frequency between two periodic events.
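If you can get timestamps for both periodic events, finding where they line up is mechanical. The sketch below assumes two files of epoch-second timestamps, one per line; how you collect them is left open, and the function and file names are hypothetical.

```shell
#!/bin/sh
# Usage: coinciding_events <file_a> <file_b> <window_seconds>
# Prints "a b" for every pair of timestamps, one from each file,
# that fall within the window of each other.
coinciding_events() {
    awk -v win="$3" '
        NR == FNR { a[++n] = $1; next }   # first file: remember every timestamp
        {
            for (i = 1; i <= n; i++) {
                d = $1 - a[i]
                if (d < 0) d = -d
                if (d <= win) print a[i], $1
            }
        }
    ' "$1" "$2"
}

# Example: re-keys and steering requests landing within 2 s of each other.
# coinciding_events rekeys.txt steers.txt 2
```

If the printed pairs match the times the stalls were observed, the interaction is confirmed without touching either feature.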
10. The DHCP server that listened on the wrong interfaces
Symptom. Confusion about which router answers DHCP and DNS, and why the access-point router’s own shell resolves names through the gateway rather than locally.
Cause. The access-point router’s dnsmasq is deliberately scoped to listen only on the guest and IoT bridges, not on the trusted LAN. So LAN devices always get DHCP and DNS from the gateway, while only the isolated VLANs are served locally to avoid a mesh round-trip on association.
Lesson. Not every surprise is a bug; some are an intentional scope you forgot you set. Document why a service is deliberately limited, or future-you will “fix” it back into the problem it was avoiding. (I am writing this down precisely so I do not do that.)
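For reference, one way this kind of scoping can be expressed in UCI is sketched below. It relies on the dnsmasq section's `interface` list, which limits the interfaces dnsmasq listens on; the network names guest and iot and the function name are hypothetical.

```shell
#!/bin/sh
# Restrict dnsmasq to the listed logical networks.
# Usage: scope_dnsmasq <network>...
scope_dnsmasq() {
    # Clear any previous list; ignore the error if none was set.
    uci -q delete "dhcp.@dnsmasq[0].interface" || true
    for net in "$@"; do
        uci add_list "dhcp.@dnsmasq[0].interface=$net" || return 1
    done
    uci commit dhcp
}

# Usage, on the access-point router:
# scope_dnsmasq guest iot && /etc/init.d/dnsmasq restart
```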
11. No independent path to the AP
Symptom. When the mesh is down, the access-point router is completely unreachable: no internet, no management.
Cause. That router reaches everything, including its own management plane, through the gateway over the mesh. Its wired uplink was disconnected once the mesh was verified. There is no out-of-band path by design.
Mitigation. Know the recovery move in advance: reconnect the wired uplink and bring the management interface up locally. Knowing this beforehand turns a panic into a two-minute chore.
Lesson. A single dependency you chose on purpose is still a single point of failure. Decide deliberately whether you want an out-of-band path, and if you accept not having one, write down the recovery procedure before you need it, not while the link is down and you cannot reach anything.
The thread running through all of them
If there is one habit these failures rewarded, it is building diagnostics that bisect rather than guess. A layered network (radio, then tunnel, then bridge) fails layer by layer, and almost every failure above was “a lower layer is green but the one above is dead.” The fast path to the cause is a per-layer check you can run, lowest layer first, in under a minute:
iw dev <mesh-iface> station dump | grep -E 'plink|signal' # radio up?
wg show wg_mesh | grep -E 'handshake|transfer' # tunnel up?
ip -d link show type vxlan # bridges exist?
ip route | grep 169.254.100.1 # rogue route back?
The second habit is humbler: write down the things that are intentional but surprising, such as the disabled firewall, the scoped DHCP, and the missing out-of-band path. Half of debugging your own network is remembering which weird behaviors you chose. The other half is physics, and you cannot grep that.