MTR, also known as Matt’s Trace Route, is an enhanced traceroute utility which, after making the initial run, continues to rerun the traceroute and calculates hop-specific packet loss and latencies.
Unfortunately, virtually every time someone calls me and mentions “packet loss” and “MTR” in the same breath, it’s because they do not understand the output.
I’m going to assume you already know what a traceroute is, and what it does. MTR runs a traceroute over and over, indefinitely, in order to identify possible faulty routers or links. For example, this is an MTR from my server to www.linx.net:
My traceroute [v0.72]
mashed (0.0.0.0) Mon Sep 8 10:14:13 2008
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. gw0.potato-people.com 0.0% 11 0.6 1.0 0.5 1.4 0.4
2. gi0-3.br1.heron.bytel.net.uk 0.0% 11 98.8 13.9 0.5 98.8 31.1
3. vlan1.br0.heron.bytel.net.uk 0.0% 11 2.3 1.9 1.0 2.6 0.5
4. collector.linx.net 0.0% 10 19.6 18.4 17.3 20.0 1.0
5. pink.linx.net 0.0% 10 18.5 18.1 17.2 19.3 0.7
Pretty simple – each hop is identified, and then MTR repeats this (note the “Snt”, or sent packets column) and records the loss and latencies.
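If you want to work with these numbers rather than just eyeball them, each hop line can be split into its columns. The following is only a minimal sketch in Python: the Hop class and parse_hop function are names I have made up for illustration, and it assumes the hop line looks exactly like the ones above (a hop that never answers, which MTR shows as “???”, is not handled).

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """One row of MTR output (hypothetical container, not part of MTR)."""
    index: int
    host: str
    loss_pct: float
    sent: int
    last: float
    avg: float
    best: float
    worst: float
    stdev: float


def parse_hop(line: str) -> Hop:
    """Parse a hop line such as
    '1. gw0.potato-people.com 0.0% 11 0.6 1.0 0.5 1.4 0.4'.
    """
    parts = line.split()
    if len(parts) != 9:
        raise ValueError(f"unexpected hop line: {line!r}")
    index, host, loss, sent, *timings = parts
    # The five remaining columns are Last, Avg, Best, Wrst and StDev (ms).
    last, avg, best, worst, stdev = (float(t) for t in timings)
    return Hop(
        index=int(index.rstrip(".")),
        host=host,
        loss_pct=float(loss.rstrip("%")),
        sent=int(sent),
        last=last,
        avg=avg,
        best=best,
        worst=worst,
        stdev=stdev,
    )


hop = parse_hop("2. gi0-3.br1.heron.bytel.net.uk 0.0% 11 98.8 13.9 0.5 98.8 31.1")
print(hop.host, hop.loss_pct, hop.sent, hop.worst)
```

Note how hop 2 has a worst time of 98.8 ms but an average of 13.9 ms: one slow reply, not a slow link.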
Packet Loss
If we saw a sudden jump to 50% loss at hop 3 and at every hop beyond it, then we would know there is a problem between hops 2 and 3, or at hop 3 itself. For example:
My traceroute [v0.72]
mashed (0.0.0.0) Mon Sep 8 10:14:13 2008
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. gw0.potato-people.com 0.0% 11 0.6 1.0 0.5 1.4 0.4
2. gi0-3.br1.heron.bytel.net.uk 0.0% 11 98.8 13.9 0.5 98.8 31.1
3. vlan1.br0.heron.bytel.net.uk 50.0% 11 2.3 1.9 1.0 2.6 0.5
4. collector.linx.net 50.0% 10 19.6 18.4 17.3 20.0 1.0
5. pink.linx.net 50.0% 10 18.5 18.1 17.2 19.3 0.7
Measuring Routers
Unfortunately, what I see more often than not is something like this:
My traceroute [v0.72]
mashed (0.0.0.0) Mon Sep 8 10:14:13 2008
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. gw0.potato-people.com 0.0% 11 0.6 1.0 0.5 1.4 0.4
2. gi0-3.br1.heron.bytel.net.uk 9.0% 11 98.8 13.9 0.5 98.8 31.1
3. vlan1.br0.heron.bytel.net.uk 3.0% 11 2.3 1.9 1.0 2.6 0.5
4. collector.linx.net 0.0% 10 19.6 18.4 17.3 20.0 1.0
5. pink.linx.net 0.0% 10 18.5 18.1 17.2 19.3 0.7
This example shows lost packets at hops 2 and 3 but – and here’s the important part – not beyond hops 2 or 3. In this case, the MTR is measuring the CPU load of the router at those hops, not the packet loss on the connection. Check hop 5 – no packets have been dropped at the actual destination.
You see, nearly all routers, much like computers, have a list of priorities of things they have to deal with. Forwarding packets between ports is the highest priority. Things such as routing protocols come second, and the management interface (whether it be by web, telnet or serial console) comes third. Responding to packets sent directly to the router comes long after everything else.
So, if a router is particularly busy and has a lot of packets to forward, it’ll drop the lowest priority things to get a few more CPU cycles. This means the first thing to get dropped from its list of things to do, when under stress, is responding to packets sent directly to the router.
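The rule of thumb from the last two sections can be written down as a short check: loss at a hop only counts if every later hop, right through to the destination, also shows loss. This is a rough sketch in Python, not anything MTR does itself; classify_loss and the three labels are names I have invented, and the input is simply the Loss% column read top to bottom.

```python
# Hypothetical labels, chosen for this illustration only.
OK = "ok"
REAL = "loss persists to destination"
BUSY = "router deprioritising replies"


def classify_loss(loss_by_hop):
    """Label each hop, given Loss% values ordered from first hop to destination."""
    labels = []
    for i, loss in enumerate(loss_by_hop):
        later = loss_by_hop[i + 1:]
        if loss == 0:
            labels.append(OK)
        elif all(l > 0 for l in later):
            # Every hop after this one is losing packets too, so traffic
            # passing through is affected, not just replies from this router.
            labels.append(REAL)
        else:
            # Some later hop is clean, so packets are getting through fine;
            # this router is just too busy to answer us.
            labels.append(BUSY)
    return labels


# Loss% columns from the two examples above.
print(classify_loss([0.0, 0.0, 50.0, 50.0, 50.0]))
print(classify_loss([0.0, 9.0, 3.0, 0.0, 0.0]))
```

The first list is the genuine fault starting at hop 3. The second is the case I get called about, where hops 2 and 3 are merely slow to answer and the destination sees no loss at all.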
ICMP is lossy
Ping, traceroute and MTR all rely on ICMP, and ICMP is very, very lossy. That means that packets will drop, and should be expected to. In the example below, we can see a level of packet loss across all hops. However, check the “Snt” column – this MTR has been running for some time, and sent over 1400 packets to each hop. This MTR measures nothing more than the lossy nature of ICMP over a long time period. Pure background noise.
My traceroute [v0.72]
mashed (0.0.0.0) Mon Sep 8 10:38:03 2008
Keys: Help Display mode Restart statistics Order of fields quit
Packets Pings
Host Loss% Snt Last Avg Best Wrst StDev
1. gw0.potato-people.com 0.8% 1439 2.6 1.0 0.5 12.0 1.1
2. gi0-3.br1.heron.bytel.net.uk 1.4% 1439 1.1 4.3 0.3 217.2 21.5
3. vlan1.br0.heron.bytel.net.uk 1.3% 1438 1.6 6.0 0.6 208.5 24.2
4. collector.linx.net 1.3% 1438 18.3 27.5 16.2 395.3 36.0
5. pink.linx.net 1.5% 1438 18.9 18.3 16.2 27.9 1.0
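To put those percentages in perspective, it helps to turn them back into packet counts. A quick sketch in Python, where lost_packets is a made-up helper name and the result is only an estimate, since MTR rounds Loss% to one decimal place:

```python
def lost_packets(loss_pct, sent):
    """Approximate number of probes that went unanswered.

    Loss% is displayed rounded, so the count is an estimate.
    """
    return round(sent * loss_pct / 100)


# (host, Loss%, Snt) copied from the long-running MTR above.
hops = [
    ("gw0.potato-people.com", 0.8, 1439),
    ("gi0-3.br1.heron.bytel.net.uk", 1.4, 1439),
    ("vlan1.br0.heron.bytel.net.uk", 1.3, 1438),
    ("collector.linx.net", 1.3, 1438),
    ("pink.linx.net", 1.5, 1438),
]

for host, loss_pct, sent in hops:
    count = lost_packets(loss_pct, sent)
    print(f"{host}: about {count} of {sent} probes unanswered")
```

Every hop, including my own gateway, is missing a similar handful of replies out of more than 1400 probes, which is what background noise looks like.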
Turning off ICMP
It’s for these very reasons that an increasing number of ISPs are disabling the ability to do traceroutes across their network. It used to be that this was done for security – it’s much harder to hack into someone’s network if you do not know the addresses of any of the routers or switches – but now it’s done for a combination of security and to stop calls from customers who don’t know how to interpret the results of a tool that, for example, some VoIP company said they should run.