Archive for the ‘networking’ Category

Webtrace.info - Traceroute on steroids

Wednesday, June 4th, 2008

There are a lot of traceroute programs out there. This one called WinMTR was recently recommended by Akamai support during one of the troubleshooting sessions. Its based of another Linux tool called mtr (Matt’s traceroute) which is another one I had never heard off.

I liked it so much, that ended up making an enhanced web interface to it. Check it out here at Webtrace.info.

image

Webtrace provides extra networking information which is really helpful for folks who are trying to investigate networking issues. There are a lot of hyper links which allows them to quickly drill down into issues (like who is loosing packets).

Technorati Tags: ,

Internet Health monitoring Reports

Sunday, July 9th, 2006

I was looking for worldwide internet health statistics and found some interesting links.

General Connectivity Reports

BGP and DNS Reports

Where is my root dns server ?

Sunday, July 9th, 2006

I’m sure you have heard that there are 13 root servers in the world. This cache file (root hint) provided by internic/IANA http://www.internic.net/zones/named.root should confirm that. So how does these 13 servers brave a DDOS attack.

Aparently 6 of the 13 root servers are mirrored using Anycast routing to loadbalance between multiple servers. The F Root server itself has about 37 mirrors in the world. Anycast routing is implemented using BGP by simultaneously announcing the same destination IP range from many different places on the internet. So even though an IP might be registered for a location here in US, if someone announces that a route to the same IP block in Tokyo, hosts in or around that country will try to pick the cheapest route to get to a DNS server. DDOS attacks against root dns servers have happened in the past, and will continue to happen in future. Anycast routing is probably why these “13″ DNS servers are still alive today.

The next question some might ask is why we can’t have more than 13 IP addresses for root servers… or why can’t we just have a large root hint (cache). The answer is simple. For DNS to work using UDP protocol (which is stateless) there is a recommended upper limit on the size of a DNS packet (512 bytes). TCP/IP, which is much more expensive because of its overhead, is the recommended protocol for queries/replies beyond that packet size. The root server administrators understand this very well (who else will know better) and decided to restrict the total number of servers to 13 which can easily be embedded as a list of IPs inside a 512 byte UDP packet if required.

Here is a map of the 13 registered root servers on the global map. A complete list of root servers are listed at http://www.root-servers.org/.

Dont mess with my packets

Monday, March 6th, 2006

We had some emergency network maintenance done over the weekend which went well except that I started noticing that I couldn’t “cat” a file on a server for some reason. Every time I login to the box everything would go fine, until I tried to cat a large log file which would freeze my terminal. I tried fsck (like chkdsk), reboots and everything else I could think off without any success. Regardless of what I did, my console would freeze as soon as I tried to cat this log file.

My first impression was that the network died, then when I was able to get back in I thought may be the file was corrupted, or even worse, that we got hacked and “cat” itself was corrupted. To make sure I was not hacked, I tried to “cat /etc/paswd”. And that worked fine. Then I tried to cat a different file in the logs directory and found that to freeze too. I figured that something is wrong with the box and gave up on it for the night, and decided to worry about it on Monday morning. Which was today.I go in to work this morning, and find a whole bunch of users complaining that they can’t go to any webserver on a particular loadbalancer in a this part of the DMZ. So, now I have a network modification, a bad unix file system and a loadbalancer (with few webservers behind it) all malfunctioning at the same time. With adrenalin kicked in, blood pressure rising, and 2 cups of coffee, I figured that there had to be something common between all of these.
After a little bit of investigation I found out that none of the users in my network are able to get to any of the servers in the target network using web. And though ssh is working fine, we couldn’t “cat” any large file on any of the servers in that network. Weird.
I tried to recollect a previous incident where some packets were not getting through a firewall which made the ssh session freeze. If every server on the same network has the same problem, it had to be a problem with one of the routers or firewall in between. So I did the next logical thing, which was to setup tcpdump on both sides of communication. This would allow me to sniff traffic at the moment the “freeze” happens.
Sure enough I see a whole bunch of packets going by, until I do a “cat logfile”. Thats when hell freezes over.

11:07:10.955656 server1.634 > server2.22: . ack 4046 win 24840  (DF) [tos 0x10]
11:07:10.958896 server1.634 > server2.22: . ack 4046 win 24840  (DF) [tos 0x10]
11:07:10.959221 server1.634 > server2.22: . ack 4046 win 24840  (DF) [tos 0x10]
11:07:10.959252 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:10.959538 server1.634 > server2.22: . ack 4046 win 24840  (DF) [tos 0x10]
11:07:10.959573 server2.22 > server1.634: . 6498:7878(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:10.962011 server1.634 > server2.22: . ack 4046 win 24840  (DF) [tos 0x10]
11:07:10.962040 server2.22 > server1.634: . 7878:9258(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:11.443579 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:12.433550 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:14.413493 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:18.373444 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:26.303489 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:42.172971 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]

In the sniff above “server2″ is the server which is freezing and server1 was my desktop from where I was logging into. The interesting thing about the capture I did on my desktop was, that it accounted for all the packets which you see here except the last few packets which have the “ack 1607″ string in them. For those who don’t understand tcpdump, this is a capture of repeating packets which are not getting acknowledged by the other end.
So now we knew for sure that it has to be a routing or firewalling glitch of some kind. But it still didn’t explain why it was repeating. On a hunch I looked at the firewall logs to see if there is anything there about why its dropping my packets. May be it thinks that all of these servers are attacking it or something. It didn’t revile anything.

Mar 6 11:09:56 [10.1.10.5.2.2] Mar 06 2006 13:05:24: %PIX-4-106023: Deny icmp src inside:router1 dst vlan server2.22 (type 3, code 4) by access-group “inside”

But what I did see, is that once in a while, there is a weird log entry from the PIX (cisco firewall) complaining about an ICMP packet being dropped due to an ACL restriction. ICMP is a great protocol and almost every kid in the world knows how to use ping to find if a remote host is alive. What its also used for is error reporting and tracerouting. In our network we had ICMP enabled in such a way that errors being reported to the admin network are allowed to go through. And since there are too many reasons why errors should be going into a DMZ, they are generally blocked by edge-routers or firewalls. So the ACL which dropped the packet wasn’t that surprising. But what the heck is “type 3, code 4″ ?

Type 3, Code 4 according to RFCs is “The datagram is too big. Packet fragmentation is required but the DF bit in the IP header is set.” Fragmentation is the process of breaking down of large packets into smaller packets so that it can travel through network media which have different packet size limits. Finally, we know the reason why the packets were getting dropped. Apparently for some reason “DF” flag was getting set on the packets. DF (Dont Fragment) flag is a bit inside IP header which tells all intermediary devices not to ever “fragment” that particular IP packet.

Based on the PIX logs, it seems router1 dropped the packet and generated a “type 3, code 4″ error indicating the reasons why it dropped. Under normal scenarios any sniffer would have noticed an ICMP error packet coming back to server2. But since this was in a DMZ, and since inbound ICMP errors are getting dropped there was no way to know the reasons why some of these packets were going through.
The solution to this problem, apparently was to force the DF flag to be removed which then resolved all the connectivity problems. We also found out that all of our problems started sometime after the maintenance window during which some key networking devices were reconfigured.