Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in the recent weeks. And part of that is due to an ongoing effort to expand their business for which they seem to be hiring like crazy. Here is the yet another interesting deck of slides which mentions stuff across both Dev and Ops.

One of the most interesting deck of slides I’ve seen in recent past.

James Hamilton: Data center infrastructure innovation

Summary from James’ keynote talk at Velocity 2010 James Hamilton

  • Pace of Innovation – Datacenter pace of innovation is increasing.  The high focus on infrastructure innovation is driving down the cost, increasing reliability and reducing resource consumption which ultimate drives down cost.
  • Where does the money go ?
    • 54% on servers, 8% on networking, 21% on power distribution, 13% on power, 5% on other infrastructure requirements
    • 34% costs related to power
    • Cost of power is trending up
  • Clouds efficiency – server utilization in our industry is around 10 to 15% range
    • Avoid holes in the infrastructure use
    • Break jobs into smaller chunks, queue them where ever possible
  • Power distribution – 11 to 12% lost in distribution
    • Rules to minimize power distribution losses
      • Oversell power – setup more servers than power available. 100% of servers never required in a regular datacenter.
      • Avoid voltage conversions
      • Increase efficiency of conversions
      • High voltage as close to load as possible
      • Size voltage regulators to load and use efficient parts
      • High voltage direct current a small potential gain
  • Mechanical Systems – One of the biggest saving is in cooling
    • What parts are involved ? – Cooling tower, heat exchanges, pumps, evaporators, compressors, condensers, pumps… and so on.
    • Efficiency of these systems and power required to get this done depends on the difference in the desired temperature and the current room temperature
    • Separate hot and cold isles… insulate them (don’t break the fire codes)
    • Increase the operating temperature of servers
      • Most are between 61 and 84
      • Telco standard is 104F (Game consoles are even higher)
  • Temperature
    • Limiting factors to high temp operation
      • Higher fan power trade-off
      • More semiconductor leakage current
      • Possible negative failure rate impact
    • Avoid direct expansion cooling entirely
      • Air side economization 
      • Higher data center temperature
      • Evaporative cooling
    • Requires filtration
      • Particulate and chemical pollution
  • Networking gear
    • Current networks are over-subscribed
      • Forces workload placement restrictions
      • Goal: all points in datacenter equidistant.
    • Mainframe model goes commodity
      • Competition at each layer rather than vertical integration
    • Openflow: open S/W platform
      • Distributed control plane to central control