“Chrome Instant” feature could break your webapp

“Google Instant” wasn’t a groundbreaking idea by itself. We have all been using various forms of auto-complete for a while now. What makes it stand out is that, unlike all the previous kinds of auto-complete, this one is able to search the entire web archive at an amazing speed and still serve personalized, hyper-local results. You can get more information about its backend here and here.

It wasn’t surprising that Google put this feature inside Chrome itself. Take a look at this demo from Lifehacker. This is where it gets interesting…

 

At the beginning this looked very exciting. I was pleasantly surprised when Chrome brought up websites, in addition to auto-completing URLs, as I typed. The impact on the servers didn’t sink in until I was debugging an issue in my own application, which required me to look at the Apache logs. Look at the following log snippet. I found 17 calls, instead of just 1, made to my web application while I was typing the URL. All of this happened in 6 seconds, which is about the time it took me to type the URL.

[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?p HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?po HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?por HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port= HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1& HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&a HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&ap HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&app HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appn HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appna HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appnam HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname= HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0

There are two issues here that made me very concerned:

  1. Volume of requests: This is a no-brainer. The example above is not a normal use case, since we don’t expect users to type URLs every time they use a web application. But if the app has an easy-to-use API which users exercise this way, the impact of that small percentage of users will get magnified many fold very quickly. It may become important to figure out how to queue or throttle requests, and how to distinguish a user who is hitting the website with 10 requests per second from one who makes a single request. This problem could also go away if your app can already handle 5 to 20 times more traffic, which is probably the best solution.
  2. Robust APIs: This is a trickier one which developers need to plan for. Let’s say there was an API like this: “/api/transfermoney.php?from=account1&to=account2&amount=10000”. How much money will this API transfer if you type this URL in a browser which auto-executes partial URLs? (See the sketch below.)
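To make that concrete, here is a minimal sketch of the defensive pattern, written in Python with Flask (the endpoint, parameter names, and do_transfer helper are all hypothetical, not the API from the example): keep GET side-effect free, and only change state on an explicit POST with every parameter present.

```python
from flask import Flask, request, abort

app = Flask(__name__)

def do_transfer(src, dst, amount):
    """Hypothetical stand-in for the real business logic."""
    print(f"transferring {amount} from {src} to {dst}")

# Registering only POST means a GET issued by a URL-preloading browser
# gets an automatic 405 instead of moving money.
@app.route("/api/transfermoney", methods=["POST"])
def transfer_money():
    # Require every parameter explicitly: a truncated query string such
    # as "from=account1&to=acc" must never fall back to defaults.
    src = request.form.get("from")
    dst = request.form.get("to")
    amount = request.form.get("amount")
    if not (src and dst and amount):
        abort(400)
    do_transfer(src, dst, int(amount))
    return "OK"
```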

What broke the camel’s back was the fact that this particular feature was often flagged by Google’s own search engine as spammy/automated behavior. It got so bad that I had to switch to Firefox to do a simple Google search.

And here is an example of how my Google history is now polluted with things I didn’t really search for. In this example I was looking for “ohdoctah” after I heard about it on TWiT. The key here is that while Google might have thought about how to mine this polluted search data, other web applications might find it impossible to deal with without significant additional resources.

[screenshot: Google search history polluted with partially typed queries]

For now I’ve disabled the feature in the browser. I hope there is an easy solution to this problem; otherwise I don’t see this feature making it into the production version of Chrome anytime soon.

Thoughts on scalable web operations

Interesting observations/thoughts on web operations, collected from a few sessions at the Velocity 2010 conference [most are from a talk by Theo Schlossnagle, author of “Scalable Internet Architectures”].

  • Optimization
    • Don’t over-optimize. It could take precious resources away from critical functions.
    • Don’t scale early. Planning for more than 10 times the load you currently have, or plan to support, might be counter-productive in most cases. An RDBMS is fine until you really need something that can’t fit on 2 or 3 servers.
    • Optimize performance on a single node before you re-architect a solution for horizontal scalability.
  • Tools
    • Tools are what a master craftsman makes… tools don’t make a craftsman a master.
    • A tool never solves a problem; its correct use does.
    • Master the tools which may need to be used in production at short notice. Looking for the man page for these tools during an outage isn’t ideal.
  • Cookies
    • Use cookies to store data wherever possible.
    • Sign them if you are concerned about tampering (see the sketch after this list).
    • Encrypt them if you are concerned about users having visibility into them.
    • It’s cheaper to use the user’s browser as a datastore replication node than to build redundant servers.
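A minimal sketch of the signing idea, using nothing but Python’s standard library (the secret key and cookie layout are placeholders, not any specific framework’s scheme):

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder

def sign_cookie(value: str) -> str:
    """Append an HMAC so tampering is detectable on the next request."""
    mac = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}|{mac}"

def verify_cookie(cookie: str):
    """Return the original value, or None if the signature doesn't match."""
    value, _, mac = cookie.rpartition("|")
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(mac, expected) else None
```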
  • Datastores
    • NoSQL is not the solution for everything [example: so long MongoDB].
    • Ditto RDBMS.
    • Ditto everything else.
    • Get the requirements, understand the problem, and then pick the solution, instead of the other way around.
  • Automation
    • When you find yourself doing something more than twice, write scripts to automate it.
    • When users report failures before monitoring systems do, write better monitoring tools.
  • Revision control
    • Put as much as possible under revision control.
    • It provides an audit trail that helps you understand what happened before; one can’t remember everything. It’s an excellent place to search during hard-to-solve production problems.
  • Networking
    • Think in packets, not bytes, to save load time.
    • There is little point in compressing a CSS file which is 400 bytes: a typical packet carries on the order of 1,400–1,500 bytes of payload, so the file already fits in a single packet whether compressed or not.
    • In fact, compression and decompression will take away precious CPU resources on the server and the client.
    • Instead, think of embedding short CSS files in the HTML to save a few extra packets.
  • Caching
    • Static objects
      • Cache all static objects forever.
      • Add random numbers/strings to object URLs to force a reload when the object changes.
        • For example, instead of requesting “/images/myphoto.jpg”, request “/images/myphoto.123245.jpg”.
        • Remove the random ID using something like an .htaccess rewrite rule (sketched below).
      • Use CDNs wherever possible, but make sure you understand all the objects that are part of your page before you shove the problem to a CDN. Pointless redirects can steal away precious loading time.
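A sketch of such a rewrite rule for Apache’s mod_rewrite (the pattern and extension list are illustrative; adjust them to your layout):

```
RewriteEngine On
# Serve /images/myphoto.123245.jpg from the real file /images/myphoto.jpg
RewriteRule ^(.+)\.\d+\.(jpg|png|gif|css|js)$ $1.$2 [L]
```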
  • People
    • When you hire someone for the operations team, never hire someone who can’t remember a single production issue he/she caused. People learn the most from mistakes, so recognize people who have been in the hot seat and have fixed their mistakes.
    • Allow people to take risks in production and watch how they recover. Taking risks is part of adapting to new ideas, and letting people fail helps them understand how to improve.
  • Systems
    • Know your system’s baseline. An instant/snapshot view of a system’s current statistics is never sufficient to fully classify its state (for example, is a load average of 10 abnormal on server XYZ?).
    • Use tools which periodically poll and archive data to give you this information, such as the sketch below.
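As a trivial illustration of the polling idea (the file path and interval are arbitrary), a script like this builds the historical record that makes a snapshot meaningful:

```python
import os
import time

def record_baseline(path="/var/tmp/loadavg.log", interval=60):
    """Append the 1-minute load average forever, building the history
    needed to judge whether a given snapshot is actually abnormal."""
    while True:
        one_min, _, _ = os.getloadavg()  # Unix-only
        with open(path, "a") as f:
            f.write(f"{int(time.time())} {one_min:.2f}\n")
        time.sleep(interval)

if __name__ == "__main__":
    record_baseline()
```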
  • Moderation
    • Moderate the tools and processes you use.
    • Moderate the moderation.

What did I miss? 🙂 Let me know and I’ll add it here…

Automated, faster, repeatable, scalable deployments

While efficient automated deployment tools like Puppet and Capistrano are a big step in the right direction, they are not the complete solution for an automated deployment process. This post explores some of the less-discussed issues which are just as important for automated, fast, repeatable, scalable deployments.

Rapid builds and integration with tests

    • Use source control to build an audit trail: put everything possible in it, including configurations and deployment scripts.
    • Continuous builds triggered by code check-ins can detect and report problems early.
      • Use tools which provide targeted feedback about build failures. This reduces noise and improves overall quality faster.
      • The faster the build happens after a check-in, the better the chances that bugs get fixed quickly. Delays can be costly, since broken builds can impact other developers as well.
      • Build smaller components (fail fast).
    • Continuous integration tests of all components can detect errors which may not be caught at build time.

Automated database changes

Can database changes be automated? This is probably one of the most interesting challenges for automation, especially if the app requires data migrations which can’t be rolled back. While it would be nice to have only incremental changes introduced in each deployment (changes that are guaranteed to be forward and backward compatible), there may be a need for non-trivial changes once in a while. As long as there is a process to separate the trivial from the non-trivial changes, it should be possible to automate most database changes.

Tracking which migrations have been applied and which are pending is a very application-specific problem for which there are no silver bullets. Still, the common pattern is simple enough to sketch.
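A minimal sketch in Python with SQLite (the table and directory names are made up; real migration tools add locking, rollback, and error handling): record applied migrations in a table and apply pending ones in order.

```python
import os
import sqlite3

MIGRATIONS_DIR = "migrations"  # e.g. 001_create_users.sql, 002_add_index.sql

def migrate(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    # Apply pending migrations in lexical (i.e. numbered) order.
    for name in sorted(os.listdir(MIGRATIONS_DIR)):
        if name.endswith(".sql") and name not in applied:
            with open(os.path.join(MIGRATIONS_DIR, name)) as f:
                conn.executescript(f.read())
            conn.execute("INSERT INTO schema_migrations VALUES (?)", (name,))
            conn.commit()
    conn.close()
```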

     

Configuration management

Environment-specific properties

It’s not abnormal to have different sets of configuration for dev and production. But creating different build packages for different target environments is not the right solution. If you need to change properties between environments, pick a better way to do it:

    • Either externalize the configuration properties to a file/directory location outside your app folder, such that repeated deployments don’t overwrite properties (see the sketch below),
    • Or update the right properties automatically during deployment, using a deployment framework which is capable of that.
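One way to externalize properties, as a sketch (the paths and environment variable are hypothetical): read defaults shipped with the package, then overlay anything found in a well-known external location that deployments never touch.

```python
import configparser
import os

def load_config():
    config = configparser.ConfigParser()
    # Defaults shipped inside the application package.
    config.read("app/defaults.ini")
    # Environment-specific overrides live outside the deploy directory,
    # so repeated deployments never clobber them (path is hypothetical).
    external = os.environ.get("APP_CONFIG", "/etc/myapp/app.ini")
    config.read(external)  # silently skipped if the file doesn't exist
    return config
```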
Pushing at deployment time or pulling at run time

In some cases, pulling new configuration files dynamically after application startup might make more sense. This is especially true for applications on an infrastructure like AWS/EC2: if the application is already deployed on the base OS image, it will come up automatically when the system boots. Some folks keep only minimal information in the base OS image and use a datastore like S3 to download the latest configuration from. In a private network where using S3 is not possible, you could replace it with some kind of shared store over SVN/NFS/FTP/SCP/HTTP, etc.
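A sketch of the pull-at-boot idea using boto3 (the bucket, key, and target path are made up; the setups described above predate boto3, but the shape is the same):

```python
import boto3

def fetch_config(bucket="myapp-config", key="prod/app.ini"):
    """Pull the latest configuration before the app starts serving."""
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, "/etc/myapp/app.ini")

if __name__ == "__main__":
    fetch_config()
```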

Deployment frameworks

Third-party frameworks
    • Fabric – a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
    • Puppet – put simply, Puppet is a system for automating system administration tasks.
    • Capistrano – designed with repeatability in mind, letting you easily and reliably automate tasks that used to require login after login and a small army of custom shell scripts (also check out Webistrano).
    • Bcfg2 – helps system administrators produce a consistent, reproducible, and verifiable description of their environment, and offers visualization and reporting tools to aid in day-to-day administrative tasks.
    • Chef – a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.
    • Slack – an evolution from the usual “put files in some central directory” approach that is fairly common practice.
    • Kokki – a system configuration management framework influenced by Chef.
Custom or mixed frameworks

The tools listed above are not the only ones available. Simple bash/sh scripts, Ant scripts, and even tools like CruiseControl and Hudson can be used for automated deployments. Here are some other interesting observations:

    • Building huge monolithic applications is a thing of the past. Understanding how to break them up into self-contained, less interdependent components is the challenge.
    • If all of your servers get the exact same copy of the application and configuration, then you don’t need to worry about configuration management; just find a tool which deploys files fast.
    • If your deployments have a lot of inter-dependencies between components, then choose a tool which gives you a visual view of the deployment process.
    • Don’t be shy about writing wrapper scripts to automate more tasks (see the Fabric sketch below).
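As one concrete illustration of the simple-scripts point, here is a minimal Fabric 1.x-style sketch (the hosts, paths, and service name are all made up):

```python
# fabfile.py -- run with: fab deploy
from fabric.api import env, put, run, sudo

env.hosts = ["web1.example.com", "web2.example.com"]  # hypothetical hosts

def deploy():
    """Push a pre-built release tarball and restart the app."""
    put("build/myapp.tar.gz", "/tmp/myapp.tar.gz")
    run("tar -xzf /tmp/myapp.tar.gz -C /srv/myapp")
    sudo("service myapp restart")  # hypothetical init script
```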
Push/pull/P2P frameworks

Grig has an interesting post about push vs. pull where he lists the pros and cons of both systems. What he forgot to mention is P2P, which is the way Twitter is going for its deployments. P2P has the advantages of both the push and pull architectures but comes with its own set of challenges. I haven’t seen an open-source tool using P2P yet, but I’m sure it’s not too far out.

Outage windows

Though deployments are easier with long outage windows, those are hard to come by. In an ideal world one would have a parallel set of servers which one could cut over to with the flip of a switch. Unfortunately, if user data is involved, this is almost impossible to do. The next best alternative is to do “rolling updates” in small batches of servers. The reason this can be challenging is that the deployment tool needs to make sure the app really has completed initialization before it moves on to the next set of servers, as in the sketch below.
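A minimal sketch of that wait-for-initialization logic in Python (the host names, batch size, and /health URL are assumptions; the deploy step itself is supplied by the caller):

```python
import time
import urllib.request

SERVERS = ["app1.example.com", "app2.example.com",
           "app3.example.com", "app4.example.com"]  # hypothetical
BATCH_SIZE = 2

def healthy(host):
    """Treat the app as initialized once its health URL answers 200."""
    try:
        return urllib.request.urlopen(
            f"http://{host}/health", timeout=5).status == 200
    except OSError:
        return False

def rolling_update(deploy_one):
    """Deploy in small batches, waiting for each batch to come back up."""
    for i in range(0, len(SERVERS), BATCH_SIZE):
        batch = SERVERS[i:i + BATCH_SIZE]
        for host in batch:
            deploy_one(host)  # caller-supplied deploy step
        # Don't touch the next batch until this one is serving again.
        while not all(healthy(h) for h in batch):
            time.sleep(5)
```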

This can be further complicated by the fact that at times there are version dependencies between different applications. In such cases there needs to be a robust infrastructure to facilitate discovery of the right versions of applications.

Conclusion

Deployment automation, in my personal opinion, is about the process, not the tool. If you have any interesting observations, ideas or comments, please feel free to write to me or leave a comment on this blog.