Posted on Monday, 28th January 2013 by Michael

Troubleshooting network issues by graphing network statistics over time.

Recently one of my clients has been having a lot of random disconnects and internet drops since implementing BGP. At random times throughout the day they would notice that a large number of remote users were disconnected from their remote solution offering and, at the same time, some internal users lost internet connectivity. The keyword in both scenarios is some, never all. This led me to believe that at some point the outside routes were changing and traffic was being dropped, as we had also started seeing out-of-state packets on the firewalls.

I came up with the idea of running a continuous traceroute from a single source to see whether we fail over between carriers during these incidents. Basically, get a baseline of what a good day looks like, then keep the traceroute running to record whether the path changes and which carrier is handling our requests at the time of a failure.

At first I started coding a solution, then I stumbled upon a Linux utility called MTR. Its description reads as follows:

“mtr, better than traceroute and ping combined

mtr combines the functionality of the traceroute and ping programs in a single network diagnostic tool.

As mtr starts, it investigates the network connection between the host mtr runs on and HOSTNAME by sending packets with purposely low TTLs. It continues to send packets with low TTL, noting the response time of the intervening routers. This allows mtr to print the response percentage and response times of the internet route to HOSTNAME. A sudden increase in packet loss or response time is often an indication of a bad (or simply over-loaded) link.”

The only issue with it is that the reporting is lacking. You can run a single report of as many cycles as you want, but you can’t run it in the background and have it report continuously. To overcome this I wrote a simple shell script that launches mtr, lets it run for a set number of cycles and then logs the results. Then I placed it in cron to run at a regular interval. The script checks whether mtr is already running, and if it is, it does nothing. Once it sees that mtr has stopped, it launches it again for the same number of cycles and logs the next report. After it has run a few times we can look through the output for issues and try to correlate them with the problem we are experiencing.

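#!/bin/bash

# Check whether mtr is already running (output discarded; only the exit status matters)
ps -ef | grep -v grep | grep mtr > /dev/null

# grep exits with 1 when nothing matched, i.e. mtr is not running
if [ $? -eq 1 ]
then
    # Write a timestamp header for this run, then start mtr in the background
    echo "############`date`############" >> mtr-SERVER.log
    /usr/sbin/mtr --report --report-cycles=1000 xxx.xxx.xxx.xxx >> mtr-SERVER.log &
else
    echo "eq 0 - mtr found - do nothing" >> mtr-running.log
fi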

You will want to change the 1000 in the report cycles option to the value you want; this is how many traces mtr will run before it logs the report. Next, change xxx.xxx.xxx.xxx to the IP address or hostname you want to run the traces to. Finally, you can change the log name if you want. Make sure you do not use mtr in the name of the script, because the grep for mtr would then match the script itself and mtr would never be started. I called mine check.sh.
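The warning about the script name exists because the check matches on the text mtr in the process list. One alternative, which is not what the script above does, is to guard the run with a lock directory instead. The sketch below is only an illustration: the lock path and the function names (acquire_lock, release_lock, run_locked) are made up for this example, and the exit code 75 is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical lock location; adjust to taste.
LOCK_DIR="${LOCK_DIR:-/tmp/mtr-check.lock}"

# mkdir is atomic: it succeeds for exactly one caller and fails if the
# directory already exists, so it can serve as a simple lock.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

release_lock() {
    rmdir "$1" 2>/dev/null
}

# run_locked LOCK COMMAND [ARGS...]
# Runs the command in the foreground while holding the lock.
# If the lock is already held, does nothing and returns 75.
run_locked() {
    local lock="$1"
    shift
    if acquire_lock "$lock"; then
        "$@"
        local rc=$?
        release_lock "$lock"
        return "$rc"
    fi
    echo "lock $lock is held - do nothing" >&2
    return 75
}
```

In check.sh this could be used as run_locked "$LOCK_DIR" /usr/sbin/mtr --report --report-cycles=1000 xxx.xxx.xxx.xxx >> mtr-SERVER.log, with mtr left in the foreground so the lock is held for the whole run. The trade-off is that if the script is killed mid-run the lock directory stays behind and has to be removed by hand.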

Once you have edited and saved the script, you need to put it into cron. To do this, issue the command crontab -e and enter the following:

*/10 * * * * /path_to_your_script/check.sh > /dev/null 2>&1

I chose 10 minutes because I noticed that running it 1000 times takes close to that, if not longer, and it helps cut down on resource usage.
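To get a feel for how the cycle count and the cron interval interact, here is a rough back-of-the-envelope sketch. It assumes mtr's default interval of one second between cycles and ignores start-up and time-out overhead, so real runs may differ; the function names are made up for this example.

```shell
#!/bin/bash
# Rough estimate only: assumes each cycle takes one interval
# (mtr's default interval is 1 second) and ignores any overhead.

# estimate_run_seconds CYCLES INTERVAL_SECONDS
estimate_run_seconds() {
    echo $(( $1 * $2 ))
}

# next_cron_start RUN_SECONDS CRON_PERIOD_SECONDS
# First cron tick at or after the end of the run, counted in seconds
# from the tick that started the run.
next_cron_start() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

run=$(estimate_run_seconds 1000 1)
next=$(next_cron_start "$run" 600)
echo "run takes about ${run}s; next launch ${next}s after the start; idle gap $(( next - run ))s"
```

Under those assumptions a 1000-cycle run lasts about 16 minutes 40 seconds, so with a 10-minute cron entry a new run starts every 20 minutes and roughly 200 seconds go unmonitored between runs. Lowering the cycle count or the cron interval shrinks that gap.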

After a few runs your log will have entries that look like this:

########## Mon Jan 28 09:18:07 EST 2013 ###########

HOST: scanner01                   Loss%   Snt   Last   Avg  Best  Wrst StDev
1.|-- xxx.xxx.xxx.xxx              0.0%  1000    0.2   0.3   0.2  63.8   3.1
2.|-- aaa.aaa.aaa.aaa              0.0%  1000    0.2   0.2   0.2  11.6   0.5
3.|-- bbb.bbb.bbb.bbb              0.0%  1000    0.3   0.3   0.2  17.8   0

########## Mon Jan 28 09:50:07 EST 2013 ###########

HOST: scanner01                   Loss%   Snt   Last   Avg  Best  Wrst StDev
1.|-- xxx.xxx.xxx.xxx              0.0%  1000    0.2   0.3   0.2  63.8   3.1
2.|-- aaa.aaa.aaa.aaa              0.0%  1000    0.2   0.2   0.2  11.6   0.5
3.|-- bbb.bbb.bbb.bbb              0.0%  1000    0.3   0.3   0.2  17.8   0

This will allow you to search the logs for the time frame of an incident and see what the tool has caught, if anything.
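As a sketch of what that searching could look like, the helper functions below pull information out of a log in the format shown above. They assume every run starts with a header line beginning with # and that hop lines look like 1.|-- host loss% ...; the function names are made up for this example.

```shell
#!/bin/bash

# show_runs LOG TEXT
# Print every run whose header line contains TEXT (e.g. "Jan 28 09:5").
show_runs() {
    awk -v when="$2" '
        /^#/ { show = (index($0, when) > 0) }
        show
    ' "$1"
}

# lossy_hops LOG
# Print the header and the hop line for every hop that reported loss.
lossy_hops() {
    awk '
        /^#/ { header = $0; next }
        # Field 3 is Loss%, e.g. "2.5%"; adding 0 converts it to a number.
        $1 ~ /^[0-9]+\.\|--$/ && $3 + 0 > 0 { print header; print }
    ' "$1"
}

# hop_paths LOG
# Print one line per run: the timestamp followed by the hops in order,
# which makes it easy to spot the run where the path changed.
hop_paths() {
    awk '
        function emit() { if (header != "") print header " : " path }
        /^#/ {
            emit()
            header = $0
            gsub(/^#+ *| *#+$/, "", header)
            path = ""
            next
        }
        $1 ~ /^[0-9]+\.\|--$/ { path = path (path == "" ? "" : " > ") $2 }
        END { emit() }
    ' "$1"
}
```

For example, show_runs mtr-SERVER.log "Jan 28 09:5" prints only the runs started between 09:50 and 09:59 on that day, and comparing adjacent lines of hop_paths mtr-SERVER.log shows whether the route was different around the time of an incident.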

Hopefully this helps someone else. If you have any questions, please feel free to leave a comment.
