Monday, January 25, 2010

Troubleshooting DMVPN

DMVPN is a great suite of protocols, but from time to time something goes wrong. Here are a few tips on how to troubleshoot it. This post will not tell you how to troubleshoot every part, but it should guide you in narrowing the problem down.

In a typical DMVPN scenario you will have the following "layers", each dependent on all the ones before it:
- Physical and IP
- Crypto
- GRE
- NHRP
- Routing protocol
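
Because each layer depends on the ones before it, troubleshooting comes down to finding the lowest layer that is not confirmed working. A minimal sketch of that idea in Python follows; the function name and the status dictionary are hypothetical, made up for illustration, and you fill in the status by hand from your own checks.

```python
# Layers in dependency order, as listed above.
LAYERS = ["Physical and IP", "Crypto", "GRE", "NHRP", "Routing protocol"]


def first_failing_layer(status):
    """Return the lowest layer that is not confirmed working, or None.

    `status` maps a layer name to True (verified working) or False (broken).
    A layer missing from `status` is treated as not yet verified.
    """
    for layer in LAYERS:
        if not status.get(layer, False):
            return layer
    return None


# Hypothetical example: everything up to GRE checks out, NHRP does not.
observed = {
    "Physical and IP": True,
    "Crypto": True,
    "GRE": True,
    "NHRP": False,
    "Routing protocol": False,
}
print(first_failing_layer(observed))
```

The routing protocol is also marked as failing in this example, but the sketch points at NHRP first, since there is little point in debugging the routing protocol while a layer below it is broken.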

1. Physical and IP - I'm putting those together since they are not really specific to DMVPN, but you need to check that they work.
1.1 Check reachability from spoke to hub with a simple ping or traceroute.
1.2 Typical problem: IPsec does not start establishing.
Do some basic testing - ping from spoke to hub and make sure no firewall on the way is blocking UDP/500, UDP/4500 (if NAT-T is needed), or ESP/AH.
If everything is configured but the tunnel is not initiating... Did you configure the NHRP network ID ("ip nhrp network-id")?
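
As a rough checklist of what a firewall in the path has to let through, here is a small Python sketch. The function and parameter names are hypothetical, and the list is deliberately conservative: it always includes native ESP (IP protocol 50) even when NAT-T is in use, on the assumption that permitting it does no harm.

```python
def required_flows(nat_t=False, use_ah=False):
    """Return (protocol, number, purpose) tuples a firewall should permit.

    For "udp" entries the number is a UDP port; for "ip" entries it is
    the IP protocol number.
    """
    flows = [
        ("udp", 500, "IKE (ISAKMP)"),
        ("ip", 50, "ESP"),
    ]
    if nat_t:
        # With NAT-T, IKE and ESP are carried inside UDP/4500.
        flows.append(("udp", 4500, "NAT-T"))
    if use_ah:
        flows.append(("ip", 51, "AH"))
    return flows


for proto, number, purpose in required_flows(nat_t=True):
    print(proto, number, purpose)
```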

A typical exercise here, and at any level, is to verify CEF switching statistics with "show cef drop" on old IOS versions or "show ip cef switching statistics feature" on newer ones.

2. Crypto (IPsec) - once you know that nothing is blocked, crypto should start establishing.
- Check that you have phase 1 SAs with "show crypto isakmp sa det". The state you're usually looking for is QM_IDLE (or no IKE SA at all if the lifetime is very short).
- Check IPsec SAs with "show crypto ipsec sa" - both inbound and outbound SPIs should be there and should be mirrored on the other side of the tunnel. (The inbound SPI on the spoke will be the outbound SPI on the hub and vice versa.)
"debug crypto ipsec" and "debug crypto isakmp" are your friends.
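
The SPI mirroring check can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the SPI values are made up, and you would copy the real ones by hand from "show crypto ipsec sa" on each side.

```python
def _as_int(spi):
    """Accept an SPI as an int or as a hex string such as '0x1A2B3C4D'."""
    if isinstance(spi, int):
        return spi
    return int(str(spi), 16)


def spis_mirrored(spoke, hub):
    """True if the spoke's inbound SPI is the hub's outbound SPI and vice versa.

    `spoke` and `hub` are dicts with 'inbound' and 'outbound' keys.
    """
    return (
        _as_int(spoke["inbound"]) == _as_int(hub["outbound"])
        and _as_int(spoke["outbound"]) == _as_int(hub["inbound"])
    )


# Made-up SPI values, for illustration only.
spoke = {"inbound": "0x1A2B3C4D", "outbound": "0x5E6F7081"}
hub = {"inbound": "0x5E6F7081", "outbound": "0x1A2B3C4D"}
print(spis_mirrored(spoke, hub))
```

If the check fails, the two sides are not looking at the same pair of SAs, which usually means one side has stale or renegotiated SAs.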

Two notable mentions here:
2.1 If in the debugs you see tunnels establishing properly but getting torn down within a few minutes, it most likely means that the NHRP relationship is not establishing.
2.2 Crypto socket - there is a bit of magic called a crypto socket that binds crypto and NHRP together. You can debug it with "debug crypto socket". Problems with the crypto socket can cause 2.1, but can usually be mitigated in the short term by removing the tunnel interface configuration and adding it back again. There have been many such cases, affecting different IOS versions, caused by multiple bugs on the Cisco side.

There is also a whole subset of problems with crypto accelerator cards that can show up here. Check "show crypto engine accelerator statistic" and "show crypto engine configuration" or "show crypto eli" - these will show you statistics and which accelerator is currently being used. You are generally looking for errors.

3. GRE - here's a fun fact: I've never seen a problem with GRE encapsulation or processing. But I would start by monitoring "show interface tunnel X" for input or output drops.
One problem you may encounter is .... NAT.
3.1 I've seen a scenario on a fairly recent 12.4T software where NAT was done for GRE traffic (no tunnel protection scenario). Check "sh ip nat trans".
3.2 If by any chance you're using "ip nat outside/inside" on the tunnel interface, check that you're not NATing more traffic than you intended.

4. NHRP - remember that it is the spoke that initiates NHRP registration by sending a registration request. Because the spoke has a static NHRP mapping, "show ip nhrp brief" on the spoke will always show a mapping (as opposed to the hub, where the spoke's mapping appears only after it has registered).
Useful debugs:
debug nhrp pack
debug nhrp ext 
debug nhrp err
debug nhrp rate
For each NHRP registration request you should see a packet encapsulated into IPsec (check the counters in "show crypto ipsec sa"). If that's not the case, enable the debug from 2.2 and get in touch with Cisco TAC.

A good value for the NHRP holdtime would be around 300 seconds (as opposed to the 7200 default).
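
To see what the holdtime change means in practice, here is a small Python sketch. It assumes the spoke re-registers every one-third of the holdtime, which is my understanding of the IOS default when no explicit registration timeout is configured; treat that ratio as an assumption and verify it on your software version. The function name is hypothetical.

```python
def nhrp_timers(holdtime_s):
    """Return the holdtime and the assumed default registration interval.

    Assumption: with no explicit registration timeout configured, the
    spoke sends a registration request every holdtime / 3 seconds.
    """
    return {
        "holdtime_s": holdtime_s,
        "registration_interval_s": holdtime_s // 3,
    }


for holdtime in (7200, 300):
    print(nhrp_timers(holdtime))
```

Under that assumption a shorter holdtime means stale mappings age out sooner and the spoke refreshes its registration more often.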

5. Routing protocol - once you know all the "layers" below are working, there is the routing protocol (RP) level that makes it all tick. I've seen a range of problems here: some bugs, some platform specific (an ASR hub taking longer to converge compared to a 7200 with the same config). They range from RP flapping (which can be driven by NHRP or load) to downright instability of the RP once spokes start connecting to a hub. It can be a bug or a platform limitation; one could write a book about this :-)

This post was meant to show you some common problems and how to track down the failing component. Hope it helps. If you're interested in learning more, let me know.
