James Hamilton: Data center infrastructure innovation

Summary of James Hamilton’s keynote talk at Velocity 2010

  • Pace of innovation – The pace of datacenter innovation is increasing. The strong focus on infrastructure innovation is increasing reliability and reducing resource consumption, which ultimately drives down cost.
  • Where does the money go?
    • 54% on servers, 8% on networking, 21% on power distribution and cooling, 13% on power, 5% on other infrastructure requirements (the figures are rounded, so they add up to 101%)
    • 34% of costs are related to power (21% + 13%)
    • Cost of power is trending up
  • Cloud efficiency – server utilization in our industry is in the 10 to 15% range
    • Avoid holes in infrastructure use
    • Break jobs into smaller chunks and queue them wherever possible
  • Power distribution – 11 to 12% lost in distribution
    • Rules to minimize power distribution losses
      • Oversell power – set up more servers than the available power could run at full load. A regular datacenter never needs 100% of its servers at the same time.
      • Avoid voltage conversions
      • Increase efficiency of conversions
      • High voltage as close to load as possible
      • Size voltage regulators to load and use efficient parts
      • High-voltage direct current offers a small potential gain
  • Mechanical systems – One of the biggest savings is in cooling
    • What parts are involved? – Cooling towers, heat exchangers, pumps, evaporators, compressors, condensers, and so on.
    • The efficiency of these systems, and the power they need, depends on the difference between the desired temperature and the current room temperature
    • Separate hot and cold aisles… insulate them (don’t break the fire codes)
    • Increase the operating temperature of servers
      • Most are between 61 and 84F
      • Telco standard is 104F (Game consoles are even higher)
  • Temperature
    • Limiting factors to high temp operation
      • Higher fan power trade-off
      • More semiconductor leakage current
      • Possible negative failure rate impact
    • Avoid direct expansion cooling entirely
      • Air side economization 
      • Higher data center temperature
      • Evaporative cooling
    • Requires filtration
      • Particulate and chemical pollution
  • Networking gear
    • Current networks are over-subscribed
      • Forces workload placement restrictions
      • Goal: all points in datacenter equidistant.
    • Mainframe model goes commodity
      • Competition at each layer rather than vertical integration
    • OpenFlow: open S/W platform
      • Moves from a distributed control plane to central control
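
The cost figures in the summary above are easier to follow with the arithmetic written out. The sketch below uses only the percentages quoted in the talk summary. The `cost_multiplier` helper is a hypothetical illustration, not something from the keynote: it assumes that infrastructure cost is fixed, so the cost per unit of useful work scales with 1/utilization.

```python
# Cost shares (percent of total datacenter cost) as quoted in the summary.
cost_share = {
    "servers": 54,
    "networking": 8,
    "power_distribution": 21,  # the 21% line item
    "power": 13,
    "other_infrastructure": 5,
}

# The quoted figures are rounded, so they do not sum to exactly 100.
total = sum(cost_share.values())

# "34% of costs related to power" is the two power line items combined.
power_related = cost_share["power_distribution"] + cost_share["power"]


def cost_multiplier(utilization):
    """Hypothetical helper: how much more each unit of useful work costs
    at a given utilization, compared with a fully used server.
    Assumes fixed infrastructure cost (an assumption of this sketch)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization


if __name__ == "__main__":
    print("sum of quoted shares:", total)
    print("power-related share:", power_related)
    for u in (0.10, 0.15):
        print(f"utilization {u:.0%}: cost multiplier {cost_multiplier(u):.1f}")
```

Under that assumption, raising utilization from the 10 to 15% range has a larger effect on cost per unit of work than any single line item in the breakdown, which is why filling the holes in infrastructure use matters.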

Understanding Cloud computing efficiency

Picking a cloud service is, unfortunately, at times far more complex than picking a brand new car. I remember how torn I was between a Honda hybrid, which came with some tax rebates and a carpool sticker, and a non-hybrid one, which was significantly cheaper. Understanding the short-term and long-term benefits is the key.

Today AWS is not the only game in town. There are lots of other reliable (or some flavor of reliable) options; GoGrid, Joyent, Microsoft and Google App Engine are some.

Here are the key differences which one should understand before deciding which one to go for.

* IAAS (Infrastructure as a service) providers like AWS (EC2) and Rackspace provide virtual infrastructure which you can manage and control. In most cases you are billed by the time unit, and you control how many resources are available to your application. PAAS (Platform as a service), on the other hand, only provides APIs for your application. PAAS-based infrastructure is usually billed by the number of requests or by the CPU cycles spent serving those requests. Microsoft’s Azure places itself somewhere in between these two paradigms, which makes this even more interesting.

* If your application’s resource requirements fluctuate a lot on a daily basis and you don’t want to invest in building a scalable architecture and the logic to manage/monitor the process of scaling up and down, then a PAAS-based service might help you. But if you want higher performance and more control of your code and infrastructure (and the way it scales), then IAAS is the way to go.

* If you have consistent load throughout the year, you should think about reserving resources for a longer term if possible. It could turn out to be cheaper. But at the same time, the more servers/resources you reserve, the more expensive it gets. There is a point at which it might be cheaper to host the infrastructure yourself.

* If your application has short but high CPU resource peaks, you should look for a vendor which doesn’t impose a performance ceiling. “The BitSource” ran performance tests comparing Rackspace and Amazon EC2 which explain this problem very well.

* Finally, if you already have a large computing infrastructure within your organization and want more “long term” computing resources, then based on the studies I have seen, it’s cheaper to set up and manage new servers/storage within the organization than to outsource them to AWS/Rackspace.
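
The reserve-or-not decision for consistent load comes down to a break-even calculation. Here is a minimal sketch, assuming a simple pricing model with an hourly on-demand rate, and a reserved option made up of an upfront fee plus a discounted hourly rate. All prices and function names are hypothetical and do not reflect any vendor's actual pricing; amounts are in cents to keep the arithmetic exact.

```python
HOURS_PER_YEAR = 24 * 365


def on_demand_cost(hours, hourly_rate):
    """Total cost when paying only for the hours used."""
    return hours * hourly_rate


def reserved_cost(hours, upfront_fee, discounted_rate):
    """Total cost with an upfront fee plus a lower hourly rate."""
    return upfront_fee + hours * discounted_rate


def break_even_hours(hourly_rate, upfront_fee, discounted_rate):
    """Hours of use above which the reserved option becomes cheaper."""
    if discounted_rate >= hourly_rate:
        raise ValueError("reserved rate must be lower than the on-demand rate")
    return upfront_fee / (hourly_rate - discounted_rate)


if __name__ == "__main__":
    # Hypothetical prices, in cents, for a one-year term.
    hourly, upfront, discounted = 10, 30000, 4

    hours = break_even_hours(hourly, upfront, discounted)
    print(f"break-even after {hours:.0f} hours "
          f"({hours / HOURS_PER_YEAR:.0%} of the year)")

    for used in (2000, HOURS_PER_YEAR):
        od = on_demand_cost(used, hourly) / 100
        rs = reserved_cost(used, upfront, discounted) / 100
        print(f"{used} hours: on-demand ${od:.2f}, reserved ${rs:.2f}")
```

The same comparison extends to hosting the infrastructure yourself: treat the hardware purchase as the upfront fee and the power, space, and staff costs as the hourly rate.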

At the end of the day, remember that vendors are there to make money as well. If you plan to make a significant long-term investment in cloud services, you should do some research to make sure that it really is the cheapest solution.