How to troubleshoot packet loss and latency for Internet VPN

Jephe Wu - http://linuxtechres.blogspot.com

Objective: use free, open source tools to troubleshoot a slow Internet VPN and pinpoint the router where packets are being lost.
Environment: OpenBSD and FreeBSD as VPN firewalls. When accessing servers such as web servers and Oracle database servers through the Internet VPN, we experienced very slow connections.


Steps:

The following is the vpn network diagram:
10.0.5.x__10.0.5.1||1.2.3.4++++++++++++++++5.6.7.8||10.0.6.1__10.0.6.X

1. Check latency and packet loss from host 1.2.3.4 to 5.6.7.8

a. Check whether you can ping from 1.2.3.4 to 5.6.7.8. If you can, note the latency and packet loss rate.
The latency reported by ping may not be accurate, because routers can be configured to rate-limit ICMP and give priority to TCP traffic.
b. If ping is blocked, use tcptraceroute or traceroute -T (on CentOS 5), or tracetcp on Windows (http://tracetcp.sourceforge.net/). The latency here is more accurate because the probes are real TCP traffic from sender to receiver, although the return traffic from intermediate hops is still an ICMP TTL exceeded message.
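
When ICMP is rate-limited, timing the TCP handshake itself gives a rough latency and loss figure for real TCP traffic. The following is a minimal Python sketch of that idea, not a replacement for tcptraceroute. The function names (tcp_connect_rtt, summarize, probe) and the default timeout, count and interval are assumptions made for illustration. Note that a refused connection is counted as a failure here, just like a timeout.

```python
import socket
import time
from typing import List, Optional


def tcp_connect_rtt(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        # create_connection completes the three-way handshake before returning
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # covers timeouts, refused connections and unreachable hosts
        return None


def summarize(samples: List[Optional[float]]) -> dict:
    """Compute the failure rate and min/avg/max latency from a list of samples."""
    ok = [s for s in samples if s is not None]
    lost = len(samples) - len(ok)
    return {
        "sent": len(samples),
        "lost": lost,
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        "min_ms": min(ok) if ok else None,
        "avg_ms": sum(ok) / len(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


def probe(host: str, port: int, count: int = 10, interval: float = 0.2) -> dict:
    """Open `count` TCP connections in a row and summarize the results."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_rtt(host, port))
        time.sleep(interval)
    return summarize(samples)
```

For example, probe("5.6.7.8", 443) would test the remote VPN endpoint from the diagram, assuming something is listening on port 443 there. Pick a port that is actually open on your own target.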

How do tcptraceroute and tracetcp work?

When you run a command such as 'tracetcp www.redhat.com:443', Wireshark captures the traffic below. For each hop, the tool sends 3 TCP SYN packets, with the TTL starting at 1 and increasing by one per hop. When a probe reaches the final destination and gets a SYN/ACK reply from the destination host, the tool immediately tears down the connection.

13    3.732592000    192.168.100.20    184.85.48.112    TCP    20527 > https [SYN] Seq=0 Win=16383 Len=0  (time to live is 1 in ip header)   
14    3.734182000    192.168.100.1    192.168.100.20    ICMP    Time-to-live exceeded (Time to live exceeded in transit)
15    4.232399000    192.168.100.20    184.85.48.112    TCP    24043 > https [SYN] Seq=0 Win=16383 Len=0  (time to live is 1 in ip header)   
16    4.241227000    192.168.100.1    192.168.100.20    ICMP    Time-to-live exceeded (Time to live exceeded in transit) 
17    4.732323000    192.168.100.20    184.85.48.112    TCP    13233 > https [SYN] Seq=0 Win=16383 Len=0  (time to live is 1 in ip header)   
18    4.735323000    192.168.100.1    192.168.100.20    ICMP    Time-to-live exceeded (Time to live exceeded in transit)     
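
The per-hop latency reported by these tools is simply the gap between each SYN probe and the ICMP reply it triggers. As a rough illustration, the Python sketch below recomputes it from the capture lines above (with the TTL annotations removed). It assumes whitespace-separated columns in the order number, time, source, destination, protocol, info, and it pairs each probe with the next ICMP reply. The function name probe_rtts is made up for this example.

```python
from typing import List, Tuple

CAPTURE = """\
13 3.732592000 192.168.100.20 184.85.48.112 TCP 20527 > https [SYN] Seq=0 Win=16383 Len=0
14 3.734182000 192.168.100.1 192.168.100.20 ICMP Time-to-live exceeded (Time to live exceeded in transit)
15 4.232399000 192.168.100.20 184.85.48.112 TCP 24043 > https [SYN] Seq=0 Win=16383 Len=0
16 4.241227000 192.168.100.1 192.168.100.20 ICMP Time-to-live exceeded (Time to live exceeded in transit)
17 4.732323000 192.168.100.20 184.85.48.112 TCP 13233 > https [SYN] Seq=0 Win=16383 Len=0
18 4.735323000 192.168.100.1 192.168.100.20 ICMP Time-to-live exceeded (Time to live exceeded in transit)
"""


def probe_rtts(capture: str) -> List[Tuple[str, float]]:
    """Pair each TCP SYN probe with the next ICMP reply; return (router, rtt_ms)."""
    results = []
    pending = None  # timestamp of the SYN that is still waiting for a reply
    for line in capture.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        _num, ts, src, _dst, proto, info = fields
        if proto == "TCP" and "[SYN]" in info:
            pending = float(ts)
        elif proto == "ICMP" and "exceeded" in info and pending is not None:
            # round to microseconds to hide floating point noise
            results.append((src, round((float(ts) - pending) * 1000.0, 3)))
            pending = None
    return results


for router, rtt in probe_rtts(CAPTURE):
    print(f"{router}  {rtt:.3f} ms")
```

For the three probes above this prints 1.590 ms, 8.828 ms and 3.000 ms, all answered by the first hop 192.168.100.1.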

2. Use mtr or pathping to check the packet loss rate

You can install mtr (http://en.wikipedia.org/wiki/MTR_%28software%29) on Linux/FreeBSD/OpenBSD to check the packet loss rate along the traced path. On Windows, WinMTR and pathping provide similar functionality.

MTR relies on ICMP Time Exceeded (type 11) packets coming back from routers, or ICMP Echo Reply packets when the packets have hit their destination host.

2082    191.558234    192.168.100.20    184.85.48.112    ICMP    Echo (ping) request
2083    191.590458    203.117.34.14    192.168.100.20    ICMP    Time-to-live exceeded (Time to live exceeded in transit)
2084    191.673071    192.168.100.20    184.85.48.112    ICMP    Echo (ping) request
2085    191.804462    192.168.100.20    184.85.48.112    ICMP    Echo (ping) request
2086    191.874170    198.32.176.127    192.168.100.20    ICMP    Time-to-live exceeded (Time to live exceeded in transit)
........
2090    191.804462    192.168.100.20    184.85.48.112    ICMP    Echo (ping) request
2091    174.139518    184.85.48.112    192.168.100.20    ICMP    Echo (ping) reply

Note: mtr sends ICMP echo requests to the destination host with the TTL incrementing from 1, and uses the reply from each hop to compute the round-trip time and packet loss rate for that hop.
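
To make the bookkeeping concrete, here is a small Python sketch of how per-hop statistics could be aggregated from probe results keyed by TTL. It is not mtr's actual code. The function name hop_report, the input layout (TTL mapped to a list of round-trip times, with None for a probe that got no reply) and the sample numbers are all hypothetical.

```python
from typing import Dict, List, Optional


def hop_report(probes: Dict[int, List[Optional[float]]]) -> List[dict]:
    """Aggregate per-hop loss and latency.

    `probes` maps a TTL value to the RTTs (in ms) of the probes sent with
    that TTL; None means no reply came back for that probe.
    """
    rows = []
    for ttl in sorted(probes):
        rtts = probes[ttl]
        replies = [r for r in rtts if r is not None]
        sent = len(rtts)
        rows.append({
            "hop": ttl,
            "sent": sent,
            "loss_pct": 100.0 * (sent - len(replies)) / sent if sent else 0.0,
            "avg_ms": sum(replies) / len(replies) if replies else None,
            "best_ms": min(replies) if replies else None,
            "worst_ms": max(replies) if replies else None,
        })
    return rows


# Hypothetical sample: hop 2 answers only half of the probes, hop 3 never answers.
sample = {
    1: [1.0, 2.0, 3.0, 2.0],
    2: [10.0, None, 14.0, None],
    3: [None, None, None, None],
}
for row in hop_report(sample):
    print(row)
```

With this layout a hop that never replies shows 100% loss and no latency figures.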

3. The importance of having no/low packet loss, and how to read the mtr report for packet loss

a. Packet loss kills throughput.
b. A slower connection with zero packet loss can easily outperform a faster connection with some packet loss.
c. Packet loss on the last hop, the destination, is what matters most. Loss can also happen on the return path, which may be completely different from the outgoing path.
d. Sometimes routers in between do not send ICMP "TTL expired in transit" messages. The trace then shows 3 asterisks for that hop, which is normal.
e. Some routers specifically block (or down-prioritize) ICMP echo requests, or do the same for packets whose TTL has expired. These routers (or the final destination) might show 100% packet loss.
f. A router may also be configured to limit the number of ICMP responses it sends, in an effort to mitigate DoS attacks.
g. Just because you see a hop with high loss doesn't mean it's slowing down "real" traffic; it may only be throwing away ICMP.
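
Points c to g boil down to one rule of thumb: loss at a hop only matters if it carries through to the final hop. The Python sketch below encodes that rule as a simple heuristic. It is an illustration rather than a definitive diagnosis, and the function name, the 1% threshold and the sample loss figures are all assumptions.

```python
from typing import List


def hops_with_persistent_loss(loss_pct: List[float], threshold: float = 1.0) -> List[int]:
    """Return 1-based hop numbers whose loss carries through to the last hop.

    loss_pct[i] is the loss percentage reported for hop i+1. A hop is flagged
    only if it and every later hop show loss at or above the threshold.
    Isolated loss at an intermediate hop is treated as ICMP rate limiting.
    """
    flagged = []
    for i in range(len(loss_pct)):
        if all(p >= threshold for p in loss_pct[i:]):
            flagged.append(i + 1)
    return flagged


# Hypothetical reports:
print(hops_with_persistent_loss([0.0, 60.0, 0.0, 0.0, 0.0]))    # loss only at hop 2
print(hops_with_persistent_loss([0.0, 0.0, 10.0, 12.0, 11.0]))  # loss from hop 3 onward
```

The first report flags nothing, because the 60% loss at hop 2 does not continue to the destination. The second flags hops 3, 4 and 5, so the first flagged hop (3) is the likely place where real loss begins. Remember that the loss may still be on the return path.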

References:
a. http://help.rr.com/hmsfaqs/e_packetloss.aspx
b. http://library.linode.com/linux-tools/mtr/ -- how to read an mtr report


4. How to configure the VPN firewall for ICMP traffic with the OpenBSD or FreeBSD packet filter (pf)


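# $ext is assumed to be a macro for the external (Internet-facing) interface
pass out log quick
# accept ICMP replies: echo reply, time exceeded, destination unreachable
pass in log quick on $ext inet proto icmp all icmp-type { echorep, timex, unreach }
# accept UDP and ICMP echo requests only from the remote VPN endpoint 1.2.3.4
pass in log quick on $ext inet proto udp from 1.2.3.4 to $ext keep state
pass in log quick on $ext inet proto icmp from 1.2.3.4 to $ext icmp-type echoreq keep state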


References:
ICMP filtering on the firewall -
http://www.richweb.com/icmp_filter