The trouble with ubiquitous computing

The idea of “ubiquitous computing” most people dream about doesn’t usually include the trouble of patching devices every week. It rarely mentions that new bugs would be found daily, and that most of the fixes would arrive weeks, if not months, after the bugs were discovered.

Windows XP

Windows XP has been in the news recently because Microsoft has finally pulled support for the aging OS. About 30% of all active desktops are still on XP, and we now know of a new security bug which will never be fixed for those users.

XP may eventually become the epitome of unpatched, buggy software because of the visibility this issue got, but I feel this is just the tip of the iceberg. For every XP machine out there, I bet there is at least one unpatched networking device just waiting for someone to exploit it, and that number is growing very fast. Some of these bugs are just that… bugs, but I suspect most of them are due to less than reputable code and design quality. It’s a wild west out there, and this has to stop.

The other problem with ubiquitous computing is that the number of devices per household is growing rapidly, and manually updating every single one is getting close to impossible. We need to get to a place where users don’t have to worry about updating devices by hand. The industry as a whole needs to do a better job of building the update automation and testing this requires, which demands a significantly higher investment of resources by manufacturers. Apple with its iOS update infrastructure and Google with its Chrome updates have shown that it’s possible to do this at scale.

So what can we as users do? For a start, we may have an obligation to ask about auto-updates when we buy new devices. For connected devices at least, shipping updates shouldn’t be “optional”. Vote for the right manufacturer with your wallet.


Thoughts on scalable web operations

Interesting observations and thoughts on web operations, collected from a few sessions at the Velocity 2010 conference [most are from a talk by Theo Schlossnagle, author of “Scalable Internet Architectures”]

  • Optimization
    • Don’t over-optimize. It can take precious resources away from critical functions.
    • Don’t scale early. Planning for more than 10 times the load you currently have, or are planning to support, is counter-productive in most cases. An RDBMS is fine until you really need something which can’t fit on 2 or 3 servers.
    • Optimize performance on a single node before you re-architect the solution for horizontal scalability.
  • Tools
    • Tools are what a master craftsman makes… tools don’t make a craftsman a master.
    • Tools can never solve a problem; their correct use does.
    • Master the tools which need to be (or could be) used in production at short notice. Hunting for man pages for these tools during an outage isn’t ideal.
  • Cookies
    • Use cookies to store data wherever possible.
    • Sign them if you are concerned about tampering (see the HMAC sketch after this list).
    • Encrypt them if you are concerned about users having visibility into them.
    • It’s cheaper to use the user’s browser as a datastore replication node than to build redundant servers.
  • Datastores
    • NoSQL is not the solution for everything [ example: so long MongoDB ]
    • Ditto RDBMS
    • Ditto everything else
    • Get the requirements, understand the problem, and then pick the solution, instead of the other way around.
  • Automation
    • When you find yourself doing something more than twice, write scripts to automate it.
    • When users report failures before monitoring systems do, write better monitoring tools.
  • Revision control
    • Revision control as much as possible.
    • Provides an audit trail to help understand what happened before; one can’t remember everything. It’s an excellent place to search during hard-to-solve production problems.
  • Networking
    • Think in packets and not bytes to save load time.
    • There is no point in compressing a CSS file which is 400 bytes: it already fits in a single packet (a typical Ethernet MTU is about 1500 bytes), so compressing it doesn’t reduce the number of packets sent.
    • In fact, compression and decompression will take away precious CPU resources on the server and the client.
    • Instead, think of embedding short CSS files in the HTML to save a few extra packets.
  • Caching
    • Static objects
      • Cache all static objects forever.
      • Add random numbers/strings to object names to force a reload when the object changes.
        • For example, instead of requesting “/images/myphoto.jpg”, request “/images/myphoto.123245.jpg”.
        • Remove the random ID on the server using something like an .htaccess rewrite rule (see the sketch after this list).
      • Use CDNs wherever possible, but make sure you understand all the objects that are part of your page before you shove the problem to a CDN. Pointless redirects can steal away precious loading time.
  • People
    • When you hire someone for an operations team, never hire someone who can’t remember a single production issue he/she caused. People learn the most from mistakes, so recognize people who have been in the hot seat and have fixed their mistakes.
    • Allow people to take risks in production and watch how they recover from them. Taking risks is part of adapting to new ideas, and letting people fail helps them understand how to improve.
  • Systems
      • Know your system’s baseline. An instant/snapshot view of a system’s current statistics is never sufficient to fully classify its current state. (For example, is a load average of 10 abnormal on server XYZ?)
      • Use tools which periodically poll and archive data to give you this information (a minimal sketch appears after this list).
    • Moderation
      • Moderate the tools and processes you use
      • Moderate the moderation
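
As promised above, here is a minimal sketch of signing a cookie value so tampering can be detected. The secret key and helper names are hypothetical; any server-side language with an HMAC library works the same way:

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from configuration, never hard-code it.
SECRET_KEY = b"change-me"

def sign_cookie(value):
    """Append an HMAC-SHA256 signature so tampering can be detected."""
    sig = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return value + "|" + sig

def verify_cookie(cookie):
    """Return the original value if the signature checks out, else None."""
    value, _, sig = cookie.rpartition("|")
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during comparison
    return value if hmac.compare_digest(sig, expected) else None
```

Note that signing only detects tampering; if the cookie contents shouldn’t be readable by the user, encrypt them as well.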
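
For the cache-busting trick, a rewrite rule along these lines strips the random ID before the file is looked up on disk. This is a sketch for Apache’s mod_rewrite; the extension list is illustrative:

```apache
# Serve /images/myphoto.123245.jpg from /images/myphoto.jpg
RewriteEngine On
RewriteRule ^(.+)\.\d+\.(jpe?g|png|gif|css|js)$ $1.$2 [L]
```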
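
And for baselines, even something as simple as this hypothetical collector (the log path and interval are made up) gives you history to compare against when someone asks whether a load average of 10 is abnormal:

```python
import os
import time

LOG = "/var/log/loadavg.log"  # hypothetical location

# Append a timestamped 1-minute load average once a minute, forever.
while True:
    one_min = os.getloadavg()[0]
    with open(LOG, "a") as f:
        f.write("%d %.2f\n" % (int(time.time()), one_min))
    time.sleep(60)
```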

    What did I miss? 🙂 Let me know and I’ll add it here…

Automated, faster, repeatable, scalable deployments

    While efficient automated deployment tools like Puppet and Capistrano are a big step in the right direction, they are not the complete solution for an automated deployment process. This post will explore some of the less-discussed issues which are just as important for automated, fast, repeatable, scalable deployments.

    Rapid Build and Integration with tests

    • Use source control to build an audit trail: put everything possible in it, including configurations and deployment scripts.
    • Continuous builds triggered by code check-ins can detect and report problems early.
      • Use tools which provide targeted feedback about build failures. It reduces noise and improves overall quality faster.
      • The faster the build happens after a check-in, the better the chances of bugs getting fixed quickly. Delays can be costly, since broken builds could impact other developers as well.
      • Build smaller components (fail fast).
    • Continuous integration tests of all components can detect errors which may not be caught at build time.

    Automated database changes

    Can database changes be automated? This is probably one of the most interesting challenges for automation, especially if the app requires data migrations which can’t be rolled back. While it would be nice to have only incremental changes introduced in each deployment (guaranteed to be forward and backward compatible), there might be some need for non-trivial changes once in a while. As long as there is a process to separate the trivial from the non-trivial changes, it might be possible to automate most of the database changes.

    Tracking which migrations have been applied and which are pending is a very application-specific problem for which there are no silver bullets, but the basic bookkeeping usually looks something like the sketch below.
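
A minimal sketch of that bookkeeping, using sqlite3 as a stand-in for the real database; the migration names and SQL are hypothetical:

```python
import sqlite3  # stand-in for the real database; the pattern is the same

# Hypothetical migrations; in practice these would be files on disk,
# ordered by a numeric or timestamp prefix.
MIGRATIONS = [
    ("001_create_users", "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_email", "ALTER TABLE users ADD COLUMN email TEXT"),
]

def apply_pending(conn):
    """Apply, in order, every migration that hasn't been recorded yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (id TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT id FROM schema_migrations")}
    for mig_id, sql in MIGRATIONS:
        if mig_id not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (id) VALUES (?)", (mig_id,))
            conn.commit()

apply_pending(sqlite3.connect("app.db"))
```

The hard part, of course, is not the bookkeeping but deciding which migrations are safe to run unattended.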


    Configuration management

    Environment-specific properties

    It’s not unusual to have different sets of configuration for dev and production. But creating different build packages for different target environments is not the right solution. If you need to change properties between environments, pick a better way to do it:

    • Either externalize the configuration properties to a file/directory location outside your app folder, so that repeated deployments don’t overwrite them (see the sketch after this list).
    • Or, update the right properties automatically during deployment, using a deployment framework which is capable of that.
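
A minimal sketch of the externalized approach in Python; the paths and section names are hypothetical:

```python
import configparser
import os

# Hypothetical search order: a deploy-safe external location first,
# then the default copy bundled with the app.
CANDIDATES = [
    "/etc/myapp/app.properties",  # outside the app folder; survives redeploys
    os.path.join(os.path.dirname(os.path.abspath(__file__)), "app.properties"),
]

def load_config():
    parser = configparser.ConfigParser()
    for path in CANDIDATES:
        if os.path.exists(path):
            parser.read(path)
            return parser
    raise FileNotFoundError("no configuration file found")

config = load_config()
db_host = config.get("database", "host", fallback="localhost")
```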
    Pushing at deployment time or pulling at run time

    In some cases, pulling new configuration files dynamically after application startup makes more sense. This is especially true for applications on infrastructure like AWS/EC2: if the application is already deployed on the base OS image, it will come up automatically when the system boots. Some folks keep only minimal information in the base OS image and use a datastore like S3 to download the latest configuration from. In a private network where using S3 is not possible, you could replace it with some kind of shared store like SVN/NFS/FTP/SCP/HTTP, etc.
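
A minimal pull-at-boot sketch using boto3 (the current AWS SDK for Python); the bucket, key, and local path are hypothetical, and credentials are assumed to come from the instance role:

```python
import boto3  # assumes AWS credentials are available, e.g. via an instance role

BUCKET = "myapp-config"                   # hypothetical bucket
KEY = "production/app.properties"         # hypothetical key
LOCAL_PATH = "/etc/myapp/app.properties"  # hypothetical local path

def pull_config():
    """Fetch the latest configuration from S3 at boot, before the app starts."""
    s3 = boto3.client("s3")
    s3.download_file(BUCKET, KEY, LOCAL_PATH)

if __name__ == "__main__":
    pull_config()
```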

    Deployment frameworks

    3rd Party frameworks
    • Fabric – Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
    • Puppet – Put simply, Puppet is a system for automating system administration tasks.
    • Capistrano – It is designed with repeatability in mind, letting you easily and reliably automate tasks that used to require login after login and a small army of custom shell scripts.  ( also check out webistrano )
    • Bcfg2 – Bcfg2 helps system administrators produce a consistent, reproducible, and verifiable description of their environment, and offers visualization and reporting tools to aid in day-to-day administrative tasks.
    • Chef – Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.
    • Slack – slack is an evolution from the usual "put files in some central directory" that is fairly common practice.
    • Kokki – System configuration management framework influenced by Chef
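
To make the flavor of these tools concrete, here is a minimal Fabric (1.x-style) fabfile; the host names and paths are hypothetical:

```python
# fabfile.py -- run with: fab deploy
from fabric.api import cd, env, run

env.hosts = ["web1.example.com", "web2.example.com"]  # hypothetical hosts

def deploy():
    """Pull the latest code and restart the app on every host in env.hosts."""
    with cd("/srv/myapp"):            # hypothetical deploy directory
        run("git pull origin master")
        run("./restart.sh")           # hypothetical restart script
```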
    Custom or Mixed frameworks

    The tools listed above are not the only ones available. Simple bash/sh scripts, Ant scripts, and even tools like CruiseControl and Hudson can be used for automated deployments. Here are some other interesting observations:

    • Building huge monolithic applications is a thing of the past. Understanding how to break them up into self-contained, less interdependent components is the challenge.
    • If all of your servers get the exact same copy of the application and configuration, then you don’t need to worry about configuration management. Just find a tool which deploys files fast.
    • If your deployments have a lot of inter-dependencies between components, then choose a tool which gives you a visual interface to the deployment process, if required.
    • Don’t be shy to write wrapper scripts to automate more tasks.
    Push/Pull/P2P Frameworks

    Grig has an interesting post about Push vs Pull deployments where he lists the pros and cons of both approaches. What he didn’t mention is P2P, which is the way Twitter is going for its deployments. P2P has advantages of both the push and pull architectures but comes with its own set of challenges. I haven’t seen an open-source tool using P2P yet, but I’m sure it’s not too far out.

    Outage windows

    Though deployments are easier with long outage windows, those are hard to come by. In an ideal world, one would have a parallel set of servers to cut over to with the flip of a switch. Unfortunately, if user data is involved, this is almost impossible to do. The next best alternative is to do “rolling updates” in small batches of servers. This can be challenging because the deployment tool needs to make sure the app really has completed initialization before it moves on to the next batch of servers, as in the sketch below.
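
A minimal rolling-update loop with a health check; the host list, batch size, and health endpoint are hypothetical, and deploy_to() stands in for whatever tool actually pushes the code:

```python
import time
import urllib.request

HOSTS = ["web1", "web2", "web3", "web4"]  # hypothetical hosts
BATCH_SIZE = 2
HEALTH_TIMEOUT = 120  # seconds to wait for a batch to become healthy

def is_healthy(host):
    """True once the app answers its (hypothetical) health-check endpoint."""
    try:
        with urllib.request.urlopen("http://%s/health" % host, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def deploy_to(host):
    """Placeholder for the real deployment step (Fabric, Capistrano, ...)."""
    print("deploying to %s..." % host)

def rolling_update():
    for i in range(0, len(HOSTS), BATCH_SIZE):
        batch = HOSTS[i:i + BATCH_SIZE]
        for host in batch:
            deploy_to(host)
        # Don't touch the next batch until this one has fully initialized.
        deadline = time.time() + HEALTH_TIMEOUT
        for host in batch:
            while not is_healthy(host):
                if time.time() > deadline:
                    raise RuntimeError("%s failed to become healthy" % host)
                time.sleep(5)

rolling_update()
```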

    This can be further complicated by the fact that at times there are version dependencies between different applications. In such cases there needs to be a robust infrastructure to facilitate discovery of the right version of applications.

    Conclusion

    Deployment automation, in my personal opinion, is about the process, not the tool. If you have any interesting observations, ideas or comments, please feel free to write to me or leave a comment on this blog.