Latest Publications

Scalability updates for Aug 27th 2010

My updates have been slow recently because of other things I’m involved in. If you want more frequent updates on what I’m reading, feel free to follow me on Twitter or Buzz.

Here are some of the big ones I have mentioned on my Twitter/Buzz feeds.

Continuous deployments may not be for everyone: Culture

If you have read this blog before, you know how much I admire those who use continuous deployments in production. Doing that at scale is even more impressive. But the message that sometimes gets lost is that continuous deployments may not be for everyone.

Most continuous integration environments do all of their deployments from trunk, which means every check-in has to be production quality. Digg’s Andrew Bayer gives a good explanation of how they do code reviews and pre-code check-ins before code is merged into trunk.

Site uptime and reliability depend on a comprehensive QA process to protect against unintentional mistakes. For rapid deployments, one has to abandon manual QA processes in favor of fully automated testing, with the goal of getting close to 100% code coverage. That’s hard if the code is not written in a way that can be tested easily.

image

But unit and integration tests alone cannot guarantee quality. In addition to testing code which has been implemented in the application, there need to be tests that look for things which shouldn’t be implemented. For example, it would be nice to have tests that look for non-parameterized SQL calls in parts of the code where they shouldn’t exist. If you know there is a wrong way to do something, write a test case for it so that it’s caught as soon as someone does it.
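As a rough illustration of a test that looks for something which shouldn’t be there, here is a minimal Python sketch that flags `execute(...)` calls whose SQL string is built with formatting or concatenation. The regular expression, the `find_unsafe_sql` name and the sample snippet are all assumptions made for this example; a real check would need to match your own codebase and database library.

```python
import re

# Hypothetical heuristic: an execute(...) call whose first argument is an
# f-string, or a string literal followed by %, + or .format( on the same line.
UNSAFE_SQL = re.compile(
    r"""\.execute\(\s*(?:f["']|["'][^"']*["']\s*(?:%|\+|\.format\())"""
)


def find_unsafe_sql(source: str) -> list:
    """Return 1-based line numbers that look like non-parameterized SQL."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if UNSAFE_SQL.search(line)
    ]


# Sample code to scan: line 2 is parameterized, lines 3 to 5 are not.
SAMPLE = '''
cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
cur.execute("SELECT * FROM users WHERE id = %s" % user_id)
cur.execute(f"SELECT * FROM users WHERE name = '{name}'")
cur.execute("SELECT * FROM users WHERE id = " + user_id)
'''

if __name__ == "__main__":
    print(find_unsafe_sql(SAMPLE))
```

A check like this can run as an ordinary test case over the source tree and fail the build when the list is not empty. It is a line-based heuristic, so it will miss queries assembled across several lines.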

Some of this would be easy to do if you already follow a test driven development process where you have to write tests before you write code.

The biggest difference between an organization which follows continuous deployment and one which doesn’t is in how QA is done. QA becomes a shared responsibility where everyone has to contribute. No matter how many tools or guidelines one publishes, if the teams using this process don’t believe in it, the quality and availability of the website will suffer. Pascal-Louis Perez (from KaChing) used a diagram like the one here to explain how this “culture” is at the heart of continuous deployment.

“Culture” also explains why most of the older organizations that follow a more traditional form of deployment are having a hard time understanding and adapting to this process.

Are you using continuous deployments in your environment? What was your biggest hurdle?

TCP and the Lower Bound of web performance

One of the less discussed but highly informative and thought-provoking talks at Velocity 2010 was the one about TCP, latency, window sizes and their relation to web performance. The slides for this talk by John Rauser can be found here. And thanks to Mike Bailey, there is a video recording as well.

Follow the slides as you watch the video to understand the talk.

TCP and the Lower Bound of Web Performance – John Rauser from Goodfordogs on Vimeo.
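To get a feel for how latency and window size bound load time, here is a simplified Python sketch (not taken from the talk). It assumes a single connection, no packet loss, a congestion window that doubles every round trip, one round trip for the handshake, an initial window of 3 segments and a 1460-byte segment size. The function names and defaults are assumptions chosen for illustration.

```python
def slow_start_round_trips(payload_bytes: int, initial_cwnd: int = 3, mss: int = 1460) -> int:
    """Round trips needed to deliver payload_bytes during TCP slow start.

    Assumes no loss and a congestion window (in segments) that doubles
    after every round trip.
    """
    segments_left = -(-payload_bytes // mss)  # ceiling division
    cwnd = initial_cwnd
    rounds = 0
    while segments_left > 0:
        segments_left -= cwnd  # one window's worth of segments per round trip
        cwnd *= 2
        rounds += 1
    return rounds


def min_transfer_ms(payload_bytes: int, rtt_ms: float, initial_cwnd: int = 3, mss: int = 1460) -> float:
    """Rough lower bound: one handshake round trip plus the data round trips."""
    return (1 + slow_start_round_trips(payload_bytes, initial_cwnd, mss)) * rtt_ms


if __name__ == "__main__":
    page = 320 * 1024  # a 320 KB page fetched over one connection
    for cwnd in (3, 10):
        print(cwnd, slow_start_round_trips(page, cwnd), min_transfer_ms(page, 80, cwnd))
```

Under these assumptions the bound depends only on round-trip time and window growth, not on bandwidth, so a larger initial window or a shorter round trip lowers it while a faster link does not.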

All Velocity conference 2010 Slides/Notes

Here are all the slides/PDFs I’ve come across from the first 2 days at Velocity. Please let me know if I missed any.

    • Slides

    Speeding up 3rd party widgets using ASWIFT

    This is a summary of the talk by Arvind Jain and Michael Kleber from Google at Velocity about writing widgets using a same-domain iframe created with document.write. They reported speed improvements of over 90% in loading widgets with this change.

    • Web is slow
      • Avg page load time 4.9s
      • 44 resources, 7 dns requests, 320kb
      • Lot of 3rd party widgets
        • digg/facebook/etc
    • Measurements of 3rd party widgets
      • Digg widget
        • 9 HTTP requests, 52 kB
        • scripts block the main page from downloading
        • stylesheets block the main page from rendering in IE
      • AdSense takes up 12.8% of page load time
      • Analytics takes up < 5% (move to async widget)
      • DoubleClick takes up 11%
    • How to make Google AdSense “fast by default”
      • Goals / Challenges
        • Minimize blocking the publisher’s page
        • Show the ad right where the code is inserted
        • Must run in the publisher’s domain
      • Solution (ASWIFT) – Asynchronous Script Written into IFrame Tag
        • Make show_ads.js a tiny loader script
        • Loader creates a same-domain iframe (using document.write)
        • Loads the rest of the show_ads into the iframe by document.write() of a <script> tag
        • This loading of iframe is asynchronous.
      • Browser specific surprises
        • Problems with parallel downloads of same script in IE
        • Iframe creation inside <head> in Firefox has a problem
        • Requesting headers in Chrome was buggy
        • Forward-Back-Reload behavior is buggy (refetching instead of using cache)
        • document.domain vs friendly iframes

    Urs Hölzle from Google on “Speed Matters”

    From Urs’ talk at the Velocity 2010 conference [ More info : Google, datacenterknowledge ]

    • Average web page – 320kb, 44 resources, 7 dns lookups, doesn’t compress a third of its content
    • Aiming for 100ms page load times for Chrome
    • Chrome: HTML5, V8 JS engine, DNS prefetching, VP8 codec, open source, spurs competition
    • TCP improvements
      • Fast start (higher initial congestion window)
      • Quick loss recovery (lower retransmit timeouts)
      • Makes Google products 12% faster
      • No handshake delay (app payload in SYN packets)  [ Didn’t know this was possible !!! ]
    • DNS improvements
      • Propagate client IP in DNS requests (to allow servers to better map users to the closest servers)
    • SSL improvements
      • False start (reduce 1 round trip from handshake)
        • 10% faster (for Android implementation)
      • Snap start (zero round trip handshakes, resumes)
      • OCSP stapling (avoid inline roundtrips)
    • HTTP improvements (SPDY):
      • Header compression
      • Stream multiplexing and prioritization
      • Server push/hints
      • 25% faster
    • Test done
      • Downloaded the same “top 25” pages via HTTP and SPDY over a simulated 2Mbps DSL link with 0% packet loss – the number of packets sent dropped by 40%
      • On low bandwidth links, headers are surprisingly costly. Can add 1 second of latency.
    • Public DNS:
      • reduces recursive resolve time by continuously refreshing cache
      • Increases availability through adequate provisioning
    • Broadband pilot testing going on
      • Fix the “last mile” complaint
      • Huge increase of 100x
    • More developer tools by Google
      • Page speed, speed tracer, closure compiler, Auto spriter
    • More awareness about performance

    James Hamilton: Data center infrastructure innovation

    Summary from James Hamilton’s keynote talk at Velocity 2010

    • Pace of innovation – The pace of datacenter innovation is increasing. The strong focus on infrastructure innovation is increasing reliability and reducing resource consumption, which ultimately drives down cost.
    • Where does the money go ?
      • 54% on servers, 8% on networking, 21% on power distribution and cooling, 13% on power, 5% on other infrastructure requirements
      • 34% costs related to power
      • Cost of power is trending up
    • Cloud efficiency – server utilization in our industry is in the 10 to 15% range
      • Avoid holes in the infrastructure use
      • Break jobs into smaller chunks, queue them wherever possible
    • Power distribution – 11 to 12% lost in distribution
      • Rules to minimize power distribution losses
        • Oversell power – set up more servers than the available power can supply at full load. 100% of servers are never required at the same time in a regular datacenter.
        • Avoid voltage conversions
        • Increase efficiency of conversions
        • High voltage as close to load as possible
        • Size voltage regulators to load and use efficient parts
        • High voltage direct current is a small potential gain
    • Mechanical systems – One of the biggest savings is in cooling
      • What parts are involved? – Cooling tower, heat exchangers, pumps, evaporators, compressors, condensers… and so on.
      • Efficiency of these systems and the power they require depend on the difference between the desired temperature and the current room temperature
      • Separate hot and cold aisles… insulate them (don’t break the fire codes)
      • Increase the operating temperature of servers
        • Most are between 61 and 84F
        • Telco standard is 104F (game consoles are even higher)
    • Temperature
      • Limiting factors to high temp operation
        • Higher fan power trade-off
        • More semiconductor leakage current
        • Possible negative failure rate impact
      • Avoid direct expansion cooling entirely
        • Air side economization 
        • Higher data center temperature
        • Evaporative cooling
      • Requires filtration
        • Particulate and chemical pollution
    • Networking gear
      • Current networks are over-subscribed
        • Forces workload placement restrictions
        • Goal: all points in datacenter equidistant.
      • Mainframe model goes commodity
        • Competition at each layer rather than vertical integration
      • Openflow: open S/W platform
        • Distributed control plane to central control

    Web performance Metrics 101

    This talk by Sean and Alistair is one of the talks I couldn’t attend today due to conflicts, but I’m glad the slides are already up.

    Performance measurement is often the starting point for most web applications and that can’t be done without understanding what goes on between the browser and the server.

    Thoughts on scalable web operations

    Interesting observations/thoughts on web operations collected from a few sessions at Velocity conference 2010 [ most are from a talk by Theo Schlossnagle, author of “Scalable Internet Architectures” ]

    • Optimization
      • Don’t over-optimize. It could take precious resources away from critical functions.
      • Don’t scale early. Planning for more than 10 times the load you currently have or are planning to support might be counter-productive in most cases. An RDBMS is fine until you really need something which can’t fit on 2 or 3 servers.
      • Optimize performance on single node before you optimize and re-architect a solution for horizontal scalability.
    • Tools
      • Tools are what a master craftsman makes… tools don’t make a craftsman a master.
      • Tools never solve a problem; their correct use does.
      • Master the tools which need to be (or could be) used in production at short notice. Looking up the man page for these tools during an outage isn’t ideal.
    • Cookies
      • Use cookies to store data wherever possible.
      • Sign them if you are concerned about tampering
      • Encrypt them if you are concerned about users having visibility into them
      • It’s cheaper to use the user’s browser as a datastore replication node than to build redundant servers
    • Datastores
      • NoSQL is not the solution for everything [ example: so long MongoDB ]
      • Ditto RDBMS
      • Ditto everything else
      • Get the requirements, understand the problem and then pick the solution. Instead of the other way around.
    • Automation
      • When you find yourself doing something more than 2 times, write scripts to automate it
      • When users report failures before monitoring systems do, write better monitoring tools.
    • Revision control
      • Put as much as possible under revision control.
      • It provides an audit trail to help understand what happened before. One can’t remember everything. It is an excellent place to search during hard-to-solve production problems.
    • Networking
      • Think in packets and not bytes to save load time.
      • There is little point in compressing a 400-byte CSS file: a single IP packet can carry about 1300 bytes of data, so the file fits in one packet whether or not it is compressed.
      • In fact, compression and decompression take away precious CPU resources on both the server and the client.
      • Instead, think of embedding short CSS files in the HTML to save a few extra packets.
    • Caching
      • Static objects
        • Cache all static objects forever
        • Add random numbers/strings to object names to force a reload of the object.
          • For example, instead of requesting “/images/myphoto.jpg” request “/images/myphoto.123245.jpg”
          • Remove the random ID using something like an htaccess rewrite rule
        • Use CDNs wherever possible, but make sure you understand all the objects that are part of your page before you shove the problem to a CDN. Pointless redirects can steal away precious loading time.
    • People
      • When you hire for an operations team, never hire someone who can’t remember a single production issue they caused. People learn the most from mistakes, so recognize people who have been on the hot seat and have fixed their mistakes.
      • Allow people to take risks in production and watch how they recover. Taking risks is part of adapting to new ideas, and letting people fail helps them understand how to improve.
    • Systems
      • Know your system’s baseline. An instant/snapshot view of a system’s current statistics is never sufficient to fully classify its current state (for example, is a load average of 10 abnormal on server XYZ?).
      • Use tools which periodically poll and archive data to give you this information
    • Moderation
      • Moderate the tools and process you use
      • Moderate the moderation
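To make the “sign them if you are concerned about tampering” advice from the Cookies section concrete, here is a minimal Python sketch using an HMAC from the standard library. The `sign_cookie` and `verify_cookie` names, the cookie layout and the hard-coded key are assumptions for illustration only. Signing does not hide the value from the user; hiding it would need encryption, which this sketch does not attempt.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, keep it server-side


def sign_cookie(value: str, key: bytes = SECRET_KEY) -> str:
    """Return 'payload.signature' where the signature is an HMAC-SHA256."""
    payload = base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")
    mac = hmac.new(key, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify_cookie(cookie: str, key: bytes = SECRET_KEY) -> Optional[str]:
    """Return the original value, or None if the cookie was tampered with."""
    payload, sep, mac = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids leaking how much matched.
    if not hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return None
    return base64.urlsafe_b64decode(payload.encode("ascii")).decode("utf-8")


if __name__ == "__main__":
    cookie = sign_cookie("cart=3")
    print(verify_cookie(cookie))        # the original value
    print(verify_cookie("x" + cookie))  # None, because the payload changed
```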
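The cache-busting trick from the Caching section (request “/images/myphoto.123245.jpg”, serve “/images/myphoto.jpg”) can be sketched in Python as a pair of helpers. This only mirrors the idea of the rewrite rule; it is not an htaccess configuration, and the function names and the “digits before the extension” convention are assumptions for this example.

```python
import re

# Hypothetical convention: "<stem>.<digits>.<ext>" carries a version token.
VERSION_TOKEN = re.compile(r"^(?P<stem>.+)\.(?P<version>\d+)\.(?P<ext>[A-Za-z0-9]+)$")


def versioned_url(path: str, version: int) -> str:
    """Insert a version token before the file extension."""
    stem, dot, ext = path.rpartition(".")
    if not dot:
        raise ValueError("path has no file extension: " + path)
    return f"{stem}.{version}.{ext}"


def strip_version(path: str) -> str:
    """Map a versioned path back to the real file, as a rewrite rule would."""
    match = VERSION_TOKEN.match(path)
    if match is None:
        return path  # no version token, serve the path unchanged
    return f"{match.group('stem')}.{match.group('ext')}"


if __name__ == "__main__":
    url = versioned_url("/images/myphoto.jpg", 123245)
    print(url, strip_version(url))
```

Because the token changes whenever the object changes, the object itself can be cached forever while browsers still pick up new versions. A file whose real name already ends in digits before the extension would be misread by this pattern, so the naming convention needs to rule that out.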

    What did I miss ? :) Let me know and I’ll add it here…

    Pingdom: Software behind facebook

    Pingdom has an interesting post listing the various components that run Facebook: “Exploring the software behind Facebook, the world’s largest site”.

    A few interesting statistics listed:

      • Facebook serves 570 billion page views per month (according to Google Ad Planner).
      • There are more photos on Facebook than all other photo sites combined (including sites like Flickr).
      • More than 3 billion photos are uploaded every month.
      • Facebook’s systems serve 1.2 million photos per second. This doesn’t include the images served by Facebook’s CDN.
      • More than 25 billion pieces of content (status updates, comments, etc) are shared every month.
      • Facebook has more than 30,000 servers (and this number is from last year!)

    I’m not sure Facebook is really the “largest site” based on servers alone, but it’s definitely the largest based on unique users in the US.
