Thoughts on scalable web operations

Interesting observations/thoughts on  web operations collected from a few sessions at Velocity conference 2010 [ most are from a talk by Theo Schlossnagle, author of “Scalable internet architectures” ]

  • Optimization O'Reilly Radar Logo
    • Don’t over optimize. Could take away precious resources away from critical functions. 
    • Don’t scale early. Planning for more than 10 times the load you currently have or are planning to support might be counter-productive in most cases. RDBMS is fine until you really need something which can’t fit on 2 or 3 servers.
    • Optimize performance on single node before you optimize and re-architect a solution for horizontal scalability.
  • Tools
    • Tools are what a master craftsman makes… tools don’t make a craftsman a master.
    • Tools can never solve a problem, its correct use does.
    • Master the tools which need to be (could be ) used in production at short notice. Looking for man page for these tools during an outage isn’t ideal.
  • Cookies
    • Use cookies to store data wherever possible.
    • Sign them if you are concerned about tampering
    • Encrypt them if you are concerned about users having visibility into it
    • Its cheaper to use user’s browser as a datastore replication node, than build redundant servers
  • Datastores
    • NoSQL is not the solution for everything [ example: so long MongoDB ]
    • Ditto RDBMS
    • Ditto everything else
    • Get the requirements, understand the problem and then pick the solution. Instead of the other way around.
  • Automation
    • When you find yourself doing something more than 2 times, write scripts to automate it
    • When users report failures before monitoring systems do, write better monitoring tools.
  • Revision control
    • Revision control as much as possible.
    • Provides audit trail to help understand what happened before. One can’t remember everything. Excellent place to search during hard to solve production problems.
  • Networking
    • Think in packets and not bytes to save load time.
    • There is no point in compressing a CSS file which is 400 bytes since the smallest data IP packet will store is about 1300 bytes (rest of the packet is padded with empty bytes if the data being sent is smaller).
    • In fact compression and decompression will take away precious CPU resources on server and the client.
    • Instead think of embedding short CSS files in HTML to save a few extra packets.
  • Caching
    • Static objects
      • Cache all static objects for ever
      • Add random numbers/strings to objects to force a reload of the object.
        • For example instead of requesting “/images/myphoto.jpg” request “/images/myphoto.123245.jpg”
        • Remove the random ID using something like an htaccess rewrite rule
      • Use CDNs where ever possible, but make sure you understand all the objects part of your page before you shove the problem to a CDN. pointless redirects can steal away previous loading time.
  • People
    • When you hire someone for operations team, never hire someone who can’t remember a single production issue he/she was caused. People learn the most from mistakes, so recognizing people who have been on the hot seat and have fixed their mistakes.
    • Allow people to take risks in production and watch them how they recover from it. Taking risk is part of adapting to new ideas, and letting them fail helps them understand how to improve.
  • Systems
      • Know your systems baseline. An instant/snapshot view of a system’s current statistics is never sufficient to fully classify a systems current state. ( for example is 10 load average abnormal on server XYZ ?)
      • Use tools which periodically poll and archive data to help you give this information
    • Moderation
      • Moderate the tools and process you use
      • Moderate the moderation

    What did I miss ? 🙂 Let me know and I’ll add it here…