Don't mess with my packets

We had some emergency network maintenance done over the weekend, which went well except that I started noticing I couldn't "cat" a file on one of the servers. Every time I logged into the box everything would go fine, until I tried to cat a large log file, at which point my terminal would freeze. I tried fsck (the unix equivalent of chkdsk), reboots and everything else I could think of, without any success. Regardless of what I did, my console would freeze as soon as I tried to cat that log file.

My first impression was that the network had died. Then, when I was able to get back in, I thought maybe the file was corrupted, or even worse, that we had been hacked and "cat" itself was compromised. To make sure we weren't hacked, I tried to "cat /etc/passwd", and that worked fine. Then I tried to cat a different file in the logs directory and found that it froze too. I figured something was wrong with the box, gave up on it for the night, and decided to worry about it on Monday morning. Which was today.

I go in to work this morning and find a whole bunch of users complaining that they can't get to any webserver behind a particular loadbalancer in this part of the DMZ. So now I have a network modification, a bad unix file system and a loadbalancer (with a few webservers behind it) all malfunctioning at the same time. With adrenaline kicking in, blood pressure rising, and two cups of coffee, I figured there had to be something common between all of these.
After a little bit of investigation I found that none of the users on my network could get to any of the servers in the target network over the web. And though ssh was working fine, we couldn't "cat" any large file on any of the servers in that network. Weird.
I recalled a previous incident where some packets not getting through a firewall made an ssh session freeze. If every server on that network had the same problem, it had to be one of the routers or firewalls in between. So I did the next logical thing, which was to set up tcpdump on both sides of the connection. This would let me sniff the traffic at the moment the "freeze" happened.
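For the record, the capture was nothing fancy; something along these lines on each end, though the interface name here is just a placeholder and not necessarily what these boxes used:

# on server2, and the same filter on my desktop ("eth0" is a guess)
tcpdump -i eth0 host server1 and port 22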
Sure enough, I see a whole bunch of packets going by, until I do a "cat logfile". That's when hell freezes over.
11:07:10.955656 server1.634 > server2.22: . ack 4046 win 24840  (DF) [tos 0x10]
11:07:10.958896 server1.634 > server2.22: . ack 4046 win 24840 (DF) [tos 0x10]
11:07:10.959221 server1.634 > server2.22: . ack 4046 win 24840 (DF) [tos 0x10]
11:07:10.959252 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:10.959538 server1.634 > server2.22: . ack 4046 win 24840 (DF) [tos 0x10]
11:07:10.959573 server2.22 > server1.634: . 6498:7878(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:10.962011 server1.634 > server2.22: . ack 4046 win 24840 (DF) [tos 0x10]
11:07:10.962040 server2.22 > server1.634: . 7878:9258(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:11.443579 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:12.433550 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:14.413493 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:18.373444 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:26.303489 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]
11:07:42.172971 server2.22 > server1.634: . 4046:5426(1380) ack 1607 win 24840 (DF) [tos 0x10]

In the sniff above, "server2" is the server that was freezing and server1 was the desktop I was logging in from. The interesting thing about the capture I did on my desktop was that it accounted for all the packets you see here except the last few with the "ack 1607" string in them. For those who don't read tcpdump, those are the same data packet being retransmitted over and over by server2 and never acknowledged by the other end, because it never got there.
So now we knew for sure that it had to be a routing or firewalling glitch of some kind. But that still didn't explain why the packets were repeating. On a hunch I looked at the firewall logs to see if there was anything there about why it was dropping my packets. Maybe it thought all of these servers were attacking it or something. It didn't reveal anything of the sort.

Mar 6 11:09:56 [10.1.10.5.2.2] Mar 06 2006 13:05:24: %PIX-4-106023: Deny icmp src inside:router1 dst vlan server2.22 (type 3, code 4) by access-group "inside"

But what I did see, once in a while, was a weird log entry from the PIX (cisco firewall) complaining about an ICMP packet being dropped due to an ACL restriction. ICMP is a great protocol, and almost every kid in the world knows how to use ping to find out if a remote host is alive. What it's also used for is error reporting and tracerouting. In our network we had ICMP enabled in such a way that errors reported to the admin network are allowed through. And since there are very few good reasons for ICMP errors to be going into a DMZ, they are generally blocked by edge routers or firewalls. So an ACL dropping such a packet wasn't that surprising. But what the heck is "type 3, code 4"?

Type 3, Code 4, according to the RFCs, means "The datagram is too big. Packet fragmentation is required but the DF bit in the IP header is set." Fragmentation is the process of breaking large packets into smaller ones so they can travel through network links with different packet size (MTU) limits. Finally, we knew why the packets were getting dropped. For some reason the "DF" flag was being set on the packets. The DF (Don't Fragment) flag is a bit in the IP header which tells all intermediary devices never to "fragment" that particular IP packet; if a packet with DF set is too big for the next hop, the device has to drop it and report the problem back with exactly this kind of ICMP error.
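In hindsight, you can see this kind of black hole from the client side with nothing more than ping, by forcing the DF bit with a large payload. The size here is just for illustration; anything bigger than the path MTU will do:

# -M do prohibits fragmentation (sets DF), -s sets the payload size (Linux iputils ping)
ping -M do -s 1400 server2

On a sane network you either get replies or a "Frag needed" style error telling you the real MTU; when a firewall is eating the ICMP errors, you get nothing at all, which is exactly the silence we were staring at.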

Based on the PIX logs, it seems router1 dropped the packet and generated a "type 3, code 4" error explaining why it did so. Under normal circumstances any sniffer would have seen that ICMP error packet coming back to server2. But since this was in a DMZ, and since inbound ICMP errors were getting dropped by the firewall, server2 had no way to know why some of its packets weren't getting through. It just kept retransmitting the same full-size segment over and over, which is exactly the freeze we were seeing.
The solution, apparently, was to force the DF flag to be cleared, which resolved all the connectivity problems. We also found out that all of our problems started sometime after the maintenance window during which some key networking devices were reconfigured.
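The change itself was made on the networking gear, so I won't pretend to quote the exact config here. But as an illustration of what "don't set DF" means on the host side, this is the Linux knob that turns off path MTU discovery (and with it the DF bit on outgoing packets):

# illustration only -- disables PMTU discovery so the kernel stops setting DF on outgoing traffic
sysctl -w net.ipv4.ip_no_pmtu_disc=1

Of course, the nicer long-term fix is to let the type 3, code 4 errors back through the firewall so path MTU discovery can actually do its job.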
