Sun AMD V20z hardware problems

Sun Microsystems was one of the first big companies to come out with 64-bit AMD servers, the V20z, which quickly replaced our ancient SPARC servers. Compared to the old E220s and E420s, the AMD servers were about 3 to 5 times faster, depending on the workload.

The first round of V20zs we deployed saved us a lot of rack space, but the cooling and power requirements were a little higher than expected. Though the V20zs did reduce our footprint in the racks, the heat they generated forced us to leave room above each server, where the ventilation holes are placed. For all practical purposes, we couldn't use them as 1U systems.

We ordered a second round of V20zs a few months back, and though we were prepared for the extra rack space, we stumbled upon a whole new problem this time. We noticed that some of these servers were randomly rebooting, especially at times of high activity. We were using a mirror image of the SUSE installation from the first set of servers, which rules out any change on the software/OS side. What's funny is that some of these servers were so predictably faulty that a simple "tar -xvzf filename.tgz" would kill them. Putting the boot drive from a faulty server into a perfectly working server confirmed that it wasn't the OS or the hard disk that was faulty, but the server hardware itself.

These problems have been going on for at least a couple of months, and we have had a case open with Sun for a few weeks now. So far we have tried updating the firmware on various V20z components, swapping memory modules around, adding more space for ventilation, and even checking the voltage regulator to see if it's defective. These servers are brand new, and of the 30 or so we bought, we can consistently reproduce the problem on 6 of them. In fact, we had two Sun engineers come on site and see it for themselves, and yet it's still hard for them to agree that the servers need to be replaced.

So the question is, how long does it take for someone to admit a mistake and give us a replacement? Does Sun realize that while they ask us to upgrade firmware on our servers and go through other time-consuming steps, 20% of these servers can't be used at all? Do they understand that if we just wanted to keep them unused, we would not have bought them in the first place?

Our company has tried to escalate this problem with Sun so many times, and the guy on the other end just refuses to sign off on the replacements.

Which leads me to the next question: how many other servers out there have this problem? If you have this problem, could you please reply to this blog or let me know by email? If 20% of the servers sold to us were badly defective, there have to be others out there who are having the same problem.

We have spent between 300 and 600 man-hours trying to debug this problem and setting up workarounds instead of resolving the issue. Posting this blog entry online is not just an act of desperation on my part, but also a message to Sun Microsystems that they are not the only server vendor out there.


tom said…
Hello, I have similar problems with my 2 v20z :(
Maybe any solution?
royans said…
It's been months and we still have no confirmed solution. The best current guess is that the V20zs are very, very sensitive to the type of memory installed. It seems one of the problems is that all the memory banks need to hold modules of exactly the same model and manufacturer. Sun has been replacing memory modules for us, and that seems to have fixed a few of the boxes. However, we still have a few that are unpredictable.

And that's just hardware... I still have a story about a Java support call I made the other week. It's frustrating as hell.
john said…
We bought 20 V20zs, and so far 3 have rebooted themselves. Sun is playing dumb with us about replacements (reminds me of the 3500 issues in the past). We're setting up syslog services to hopefully record the root cause(s). Does anyone have any system messages from just before the reboots?
tom said…
Hello again,
It seems that I'm lucky, because it was a problem with SecurePlatform NGX HFA_03, not with servers. After HFA_04 installation it works ok.
