June 17, 2006

Sun AMD V20z hardware problems

Sun Microsystems was one of the first big vendors to ship 64-bit AMD servers, the V20z, which quickly replaced our ancient SPARC servers. Compared to the old E220s and E420s, the AMD servers were about 3 to 5 times faster, depending on what we wanted them to do.

The first round of V20z's we deployed saved us a lot of rack space, but the heat output and power requirements were a little higher than expected. Though the V20z's did reduce our footprint in the racks, the heat they generated forced us to leave room above each server, where the ventilation holes are placed. For all practical purposes, we couldn't use it as a 1U system.

We ordered a second round of V20z's a few months back, and though we were prepared for the extra rack space, we stumbled upon a whole new problem this time. We noticed that some of these servers were randomly rebooting, especially at times of high activity. We were using a mirror image of the SUSE installation from the first set of servers, which rules out any change on the software/OS side. What's funny is that some of these servers were so predictably faulty that a simple "tar -xvzf filename.tgz" would kill them. Putting the boot drive from a faulty server into a perfectly working server confirmed that it wasn't the OS or the hard disk that was faulty, but the server hardware itself.

These problems have been going on for at least a couple of months, and we have had a case open with Sun for a few weeks now. Among the things we have tried: updating the firmware on various V20z components, swapping the memory modules around, adding more space for ventilation, and even checking the voltage regulator to see if it's defective. These servers are brand new, and of the 30 or so we bought, we can consistently reproduce the problem on 6 of them. In fact, we had two Sun engineers come on site and see it for themselves, and yet it's hard for them to agree that they need to replace the servers.

So the question is, how long does it take for someone to admit a mistake and give us a replacement? Does Sun realize that while they ask us to upgrade firmware on our servers and go through other time-consuming steps, 20% of these servers can't be used at all? Do they understand that if we just wanted to keep them unused, we would not have bought them in the first place?

Our company has tried to escalate this problem with Sun many times, and the guy on the other end just refuses to sign off on the replacements.

Which leads me to the next question: how many other servers out there have this problem? If you have this problem, could you please reply to this blog, or let me know by email? If 20% of the servers sold to us were badly defective, there have to be others out there who are having the same problem.

We have spent between 300 and 600 man-hours trying to debug this problem and setting up workarounds, instead of having the issue resolved. Posting this blog online is not just an act of desperation on my part, but also a message to Sun Microsystems that they are not the only server vendor out there.