Scalability and performance – It’s all about the customers

Todd pinged me to see how I felt about antirez’s suggestion in his post titled “On the webserver scalability and speed are almost the same thing”. While I disagree with parts of the post, I can understand why he believed what he wrote.

If someone were to design a state-of-the-art scalable webserver to replace an existing service (let’s say Apache) which can deliver content in 50 ms, then by my definition the new webserver should continue to serve that content in 50 ms even if the number of requests handled per second increases 100 times. Antirez is somewhat correct that just because you can scale doesn’t mean customers will always forgive you for slower performance. But as it happens, I’ve seen many different solutions in the last decade which ignored that concern, provided a scalable yet slower service, and still succeeded in generating enough revenue to prove that slower speed may not be a showstopper.

In my opinion it’s all about the customers. Let’s take S3/SimpleDB for example and compare it with MySQL, or take hosting content in the cloud (SAAS) vs hosting it internally on a corporate web server. Apart from being able to scale, the reason why customers are OK with using a slower service has to do with two important things which impact them:

  1. Customer expectations
  2. Customer perception

If customers expected responses from S3/SimpleDB to be as fast as MySQL, the service wouldn’t ever have succeeded. Yet we have Amazon minting money. What’s more, in most of these cases the customers were actually developers themselves who had their own set of customers. And most of those customers didn’t have any idea that their service provider was using S3/SimpleDB behind the scenes, which is much slower than MySQL or Oracle. If they had noticed their service taking 3 times as long to load, that would have been a huge problem (and a lot of customers do notice when service providers move to the cloud). And this is what most developers complain about… customer expectations are really high.

Some service providers are pretty good at explaining what customers will gain by accepting slower speed. For example, automated backups, auto-upgrades and cheaper hardware might all be good enough reasons for “customer expectation” to be lowered. Customers might accept a 100 ms load time if they are compensated in some way, such as a cheaper service.

Another way to meet expectations is by changing “customer perception”. If you haven’t heard web UI specialists talk about how important the “perception of speed” is in the web business, you should find such a talk to attend at one of the conferences. There are tons of ways to speed up services… some by complex caching/geo-loadbalancing/compression magic, others by making slight changes to the UI to make it look like things are happening fast. For example, a badly designed HTML page which requires all of its images to load before the page is rendered will feel a lot slower than one which renders text first and loads images later. Making AJAX calls behind the scenes, preloading data users might need, etc. are other forms of such optimizations which can change user perception. Designing your app to manage customer perception is hard, but not impossible.

Where I disagree with Antirez is his statement that scalability and performance are “almost” the same thing. They are extremely different, because a system which scales doesn’t have to speed up at all… in fact most customers would be happy if a 50 ms response time didn’t change over the years even if the number of users increased 1000 times.

Similar scalability issues exist in every part of our lives. Even your closest grocery chain has this problem. I will be happy with the speed at which they bill me today even if they grow 100 times over the next few years.

BTW, I briefly covered this topic a few years ago in a post called “What is scalability”. Feel free to read that as well if you are interested.

It’s logical – IAAS users will move to PAAS

Sysadmins love infrastructure control, and I have to say that there was a time when root access gave me a high. It wasn’t until I moved to a web operations team (and gave up my root access) that I realized I was more productive when I wasn’t dealing with day-to-day hardware and OS issues. After managing my own EC2/Rackspace instance for my blog for a few years, I came to another realization today: IAAS (infrastructure as a service) might be one of those fads which will give way to PAAS (platform as a service).

WordPress is an excellent blogging platform, and I manage multiple instances of it for my blogs (and one for my wife’s blog). I chose to run my own WordPress instance because I loved the same control I used to have as a sysadmin. I not only wanted to run my own plugins, configure my own features and play with different kinds of caching, I also wanted to choose my own Linux distribution (Ubuntu, of course) and make it work the way I always wanted my servers to work. But when it came to patching the OS, taking backups, and updating WordPress and the zillion other plugins, I found it a little distracting, slightly frustrating and extremely time consuming.

Last week I moved one of my personal blogs to blogger.com, and it’s possible that it may not be the last one. What’s important here is not that I picked blogger.com over wordpress.com, but the fact that I’m ready to give up control to be more productive. Amazon’s AWS started off as the first IAAS provider, but today they offer a whole lot of other managed services, like Elastic MapReduce, Amazon Route 53, Amazon CloudFront and Amazon Relational Database Service, which are more PAAS than IAAS.

IAAS is a very powerful tool in the hands of a professional systems admin. But I’m willing to bet that over the next few years fewer organizations will worry about kernel versions and Linux distributions, and will instead be happy with a simple API to upload “.war” files (if they are running Tomcat, for example) into some kind of cloud-managed Tomcat instances (like how Hadoop runs in Elastic MapReduce). Google App Engine (Java and Python) and Heroku (Ruby based; Salesforce bought them) are two examples of such services today, and I’ll be surprised if AWS doesn’t launch something (or buy someone out) within the next year to do the same.
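To make that concrete, here is a purely hypothetical sketch of what such a deploy API could feel like. No such AWS service existed when this was written; the endpoint, URL and auth header are all made up for illustration:

```java
import java.io.FileInputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical: neither this endpoint nor the auth scheme exists anywhere;
// it only illustrates the "push a .war, get a running app" idea.
public class WarUploader {
    public static void main(String[] args) throws Exception {
        URL deployUrl = new URL("https://paas.example.com/apps/myapp/deploy");
        HttpURLConnection conn = (HttpURLConnection) deployUrl.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/java-archive");
        conn.setRequestProperty("Authorization", "Bearer MY_API_TOKEN");

        // Stream the war file up; the provider worries about the OS,
        // Tomcat versions, patching and scaling underneath.
        FileInputStream war = new FileInputStream("myapp.war");
        OutputStream out = conn.getOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        while ((read = war.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.close();
        war.close();

        System.out.println("Deploy status: " + conn.getResponseCode());
    }
}
```

The point is not this particular API shape, but that the entire sysadmin surface area shrinks to a single HTTP call.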

Google App Engine 1.4.0 pre-release is out

The complete announcement is here, but here are the changes for the Java SDK. The two big changes I liked are the new “Always On” feature and the fact that the “tasks” feature has graduated out of beta/testing.

  • The Always On feature allows applications to pay to keep 3 instances of their application always running, which can significantly reduce application latency.
  • Developers can now enable Warmup Requests. By specifying a handler in an app’s appengine-web.xml, App Engine will attempt to send a Warmup Request to initialize new instances before a user interacts with them. This can reduce the latency an end-user sees when your application initializes.
  • The Channel API is now available for all users.
  • Task Queue has been officially released, and is no longer an experimental feature. The API import paths that use ‘labs’ have been deprecated. Task queue storage will count towards an application’s overall storage quota, and will thus be charged for.
  • The deadline for Task Queue and Cron requests has been raised to 10 minutes. Datastore and API deadlines within those requests remain unchanged.
  • For the Task Queue, developers can specify task retry parameters in their queue.xml.
  • Metadata queries on the datastore for datastore kinds, namespaces, and entity properties are available.
  • URL Fetch allowed response size has been increased, up to 32 MB. Request
    size is still limited to 1 MB.
  • The Admin Console Blacklist page lists the top visitors rejected by the blacklist.
  • The automatic image thumbnailing service supports arbitrary crop sizes up to 1600px.
  • Overall average instance latency in the Admin Console is now a weighted average over QPS per instance.
  • Added a low-level AsyncDatastoreService for making calls to the datastore asynchronously (see the sketch after this list).
  • Added a getBodyAsBytes() method to QueueStateInfo.TaskStateInfo, which returns the body of the task state as a pure byte string.
  • The whitelist has been updated to include all classes from javax.xml.soap.
  • Fixed an issue sending email to multiple recipients. http://code.google.com/p/googleappengine/issues/detail?id=1623
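Since the new AsyncDatastoreService caught my eye, here is a minimal sketch of what using it might look like; the class and method names around the API calls are mine:

```java
import java.util.concurrent.Future;

import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

public class AsyncFetch {
    public Entity fetchWhileWorking(Key key) throws Exception {
        AsyncDatastoreService datastore =
                DatastoreServiceFactory.getAsyncDatastoreService();

        // get() returns immediately with a Future; the datastore RPC
        // proceeds in the background while we do unrelated work.
        Future<Entity> pending = datastore.get(key);

        doUnrelatedWork();

        // Block only at the point where the entity is actually needed.
        return pending.get();
    }

    private void doUnrelatedWork() {
        // ... e.g. render the parts of the page that don't need the entity ...
    }
}
```

The nice part is that the datastore round-trip overlaps with your own work, and you pay for the latency only once, when you actually need the result.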

Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in recent weeks. Part of that is due to an ongoing effort to expand their business, for which they seem to be hiring like crazy. Here is yet another interesting deck of slides, which covers stuff across both Dev and Ops.

It’s one of the most interesting decks of slides I’ve seen in the recent past.

Building your first cloud application on AWS

Building your first web application on AWS is like shopping for a car at Pep Boys, part by part. While the manuals to build one might be on aisle 5, the experience of having built one already is harder to buy.

Here are some interesting logistical questions which I don’t think get enough attention when people discuss building a new AWS-based service.

  1. Picking the right Linux distribution: Switching OS distributions later may not be simple if your applications rely on custom scripts. Picking and sticking with a single distribution will save a lot of lost time.
  2. Automated server builds: There are many ways to skin this cat. Chef, Puppet and Cfengine are all good… What’s important is to pick one early in the game.
  3. Multi-Availability-Zone support: Find out if multi-availability-zone support is important. This can impact the overall architecture of the solution and the tools used to implement it (see the sketch after this list).
  4. Data consistency requirements: Similar to the Multi-AZ question, it’s important to understand the data consistency tolerance of the application before one starts designing it.
  5. Datastore: There are different kinds of datastores available as part of AWS itself (SimpleDB, S3 and RDS). If you are planning to keep your options open about moving out of AWS at some point, think about picking a datastore which could move out with you with little effort. There are many NoSQL and RDBMS solutions to choose from.
  6. Backups: While some think it’s a waste of time to think about backups too early, I suspect those who don’t will spend way too much time on them later. A long-term backup strategy is an integral part of disaster recovery planning, without which you shouldn’t think of going live.
  7. Integration with external data sources: If this application is part of a larger cluster of applications running somewhere else, think about how data would be sent back and forth. There are lots of different options depending on how much data is involved (and how important protecting that data is).
  8. Monitoring/Alerting: Most standard out-of-the-box monitoring tools can’t handle dynamic infrastructure very well. There are, however, plugins available for many existing monitoring solutions which can handle the dynamic nature of the infrastructure. You could also choose one of the third-party monitoring services if you’d rather pay someone else to do it.
  9. Security: You should be shocked to see this at #9 on my list. If your service involves user data or some other kind of intellectual property, build a multi-tiered architecture to segment different parts of your application from targeted attacks. Security is also very important when picking the right caching and web server technologies.
  10. Development: Figure out how developers will use AWS. Will they share the same AWS account, share parts of the infrastructure, share the datastore, etc.? How will developer resource usage be monitored so that unintentional use of excessive resources can be flagged for alerting?
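As an illustration of the Multi-AZ point (#3) above, here is a rough sketch using the AWS SDK for Java; the AMI id, credentials and zone names are placeholders:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.Placement;
import com.amazonaws.services.ec2.model.RunInstancesRequest;

public class MultiAzLauncher {
    public static void main(String[] args) {
        AmazonEC2Client ec2 = new AmazonEC2Client(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // One instance per availability zone, so the loss of a single
        // AZ doesn't take the whole service down.
        for (String zone : new String[] {"us-east-1a", "us-east-1b"}) {
            RunInstancesRequest request = new RunInstancesRequest()
                    .withImageId("ami-12345678") // placeholder AMI id
                    .withInstanceType("m1.small")
                    .withMinCount(1)
                    .withMaxCount(1)
                    .withPlacement(new Placement(zone));
            ec2.runInstances(request);
        }
    }
}
```

Deciding to spread across zones like this is cheap on day one and painful to retrofit later, which is why it belongs in the early design discussion.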

Are there other subtle issues which I should have listed here? Let me know.

Shipping Trunk: For web applications

I had briefly blogged about this presentation from Velocity 2010 before. I wish they had released the video for this session. I went through the slide deck again today to see if Paul mentioned any of the problems organizations like ours are dealing with in their transition from quarterly releases to weekly/continuous releases.

One of the key observations Paul made during his talk is that most organizations still treat web applications as desktop software and have very strict quality controls, which may not be as necessary since releasing changes for a web app delivered as SAAS (software as a service) is much cheaper than releasing patches for traditional desktop software.

Here are some of the other points he made. For really detailed info, check out the slides linked below.

  • Deploy frequently, facilitating rapid product iteration
  • Avoid large merges and the associated integration testing
  • Easily perform A/B testing of functionality
  • Run QA and beta testing on production hardware
  • Launch big features without worrying about your infrastructure
  • Provide all the switches your operations team needs to manage the deployed system (see the sketch below)
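As a toy illustration of that last point about switches, here is a bare-bones feature-flag sketch in Java; a real implementation would persist flags and support gradual (percentage or user-bucket) rollout:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy in-memory switch registry; real systems back this with a
// datastore or config service so flags survive restarts.
public class FeatureFlags {
    private static final Map<String, Boolean> FLAGS =
            new ConcurrentHashMap<String, Boolean>();

    // New features default to off, so unfinished code can ship dark in trunk.
    public static boolean isEnabled(String feature) {
        Boolean enabled = FLAGS.get(feature);
        return enabled != null && enabled.booleanValue();
    }

    // Exposed to an ops dashboard or admin API, never to end users.
    public static void setEnabled(String feature, boolean enabled) {
        FLAGS.put(feature, enabled);
    }
}
```

Application code branches on FeatureFlags.isEnabled("new-checkout-flow"), so a big feature can be merged and deployed long before operations flips it on.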

Slides: Always ship Trunk: Managing Change In Complex Websites

Why Membase Uses Erlang

It’s totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say “Erlang has X”) does out of the box a lot of things we would have had to either build from scratch or piece together from existing libraries. Its dynamic type system and pattern matching (à la Haskell and ML) make Erlang code tend to be even more concise than Python and Ruby, two languages known for their ability to do a lot in few lines of code.

The single largest advantage to us of using Erlang has got to be its built-in support for concurrency. Erlang models concurrent tasks as “processes” that can communicate with one another only via message-passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. This alone makes an entire class of concurrency-related bugs completely impossible. While it doesn’t completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. While it’s certainly possible to implement the actor model in most if not all of the alternative environments I mentioned, and in fact such implementations exist, they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.

The complete original post can be found here: blog.membase.com
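Erlang gives you processes and mailboxes out of the box. For readers more at home in Java, here is a very rough analogy of the same message-passing discipline (all names are mine); it shows why shared-state locking disappears, though it captures none of Erlang’s lightweight processes or supervision:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A Java caricature of an Erlang process: one thread exclusively owns
// its state and talks to the world only through its mailbox.
public class CounterActor implements Runnable {
    private final BlockingQueue<String> mailbox =
            new LinkedBlockingQueue<String>();
    private long count = 0; // touched only by this actor's own thread

    // Other threads never share state with the actor; they just send messages.
    public void send(String message) throws InterruptedException {
        mailbox.put(message);
    }

    public void run() {
        try {
            while (true) {
                String message = mailbox.take(); // block until a message arrives
                if ("increment".equals(message)) {
                    count++;
                } else if ("stop".equals(message)) {
                    System.out.println("final count: " + count);
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because count is never shared, there is nothing to lock, which is the class of bugs the actor model designs away.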

Cassandra’s future @facebook and links to other NoSQL slides

I heard an unconfirmed rumor that Facebook is moving away from Cassandra. I’m not sure why, or to what, but rumors like this are a concern regardless. After Twitter’s backing off and Digg’s troubles, which were indirectly linked either to Cassandra’s maturity as a production solution or to their understanding of Cassandra’s capabilities, it would raise even more eyebrows if Facebook really does abandon Cassandra. Cassandra was created at Facebook, which open-sourced it, but it’s my understanding that today most of the development on the open-source Cassandra happens outside its walls. Rackspace is a big sponsor (maybe not the largest anymore) of the open-source project, and Riptano, which has built a whole company around the project, has done a tremendous job of promoting it.

Scalability links for October 30th:

Scalability links for October 18th:

Continuous deployments may not be for everyone: Culture

If you have read this blog before, you know how much I admire those who use continuous deployments in production. Doing it at scale is even more impressive. But the message which sometimes gets lost is that continuous deployments may not be for everyone.

Most continuous deployment environments do all of their deployments from trunk, which means every check-in has to be production quality. Digg’s Andrew Bayer gives a good explanation of how they do code reviews and pre-commit checks before code is merged into trunk.

Site uptime and reliability depend on a comprehensive QA process to protect against unintentional mistakes. And for rapid deployments, one has to abandon manual QA processes in favor of 100% automated testing, with the goal of getting close to 100% code coverage. That’s hard if the code is not written in a way which can be tested easily.

[Diagram: “culture” at the heart of continuous deployment]

But unit and integration tests alone cannot guarantee quality. In addition to testing the code which has been implemented in the application, there need to be tests which look for things which shouldn’t be implemented. For example, it would be nice to have tests which look for non-parameterized SQL calls in parts of the code where they shouldn’t exist. If you know there is a wrong way to do something, write a test case for it so that it’s caught as soon as someone does it.
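Here is a rough sketch of what such a “test for the wrong way” could look like in Java; the source path and the (deliberately naive) regex are illustrative only:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

import junit.framework.TestCase;

// Fails the build if any source file appears to build SQL by string
// concatenation instead of using parameterized queries.
public class NoRawSqlTest extends TestCase {

    // Deliberately naive: a quoted SQL verb followed by concatenation.
    // A real check might use a proper parser or a static-analysis tool.
    private static final Pattern RAW_SQL =
            Pattern.compile("\"(SELECT|INSERT|UPDATE|DELETE)[^\"]*\"\\s*\\+");

    public void testNoConcatenatedSql() throws IOException {
        assertFalse("Found non-parameterized SQL in src/",
                containsRawSql(new File("src")));
    }

    private boolean containsRawSql(File dir) throws IOException {
        File[] files = dir.listFiles();
        if (files == null) {
            return false;
        }
        for (File f : files) {
            if (f.isDirectory()) {
                if (containsRawSql(f)) {
                    return true;
                }
            } else if (f.getName().endsWith(".java")
                    && RAW_SQL.matcher(read(f)).find()) {
                return true;
            }
        }
        return false;
    }

    private String read(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader reader = new BufferedReader(new FileReader(f));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }
}
```

Once a check like this runs with the rest of the suite, the wrong pattern is caught on the very check-in that introduces it, not in a post-mortem.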

Some of this would be easy if you already follow a test-driven development process, where you have to write tests before you write code.

The biggest difference between an organization which follows continuous deployment and one which doesn’t is in how QA is done. QA becomes a shared responsibility to which everyone has to contribute. No matter how many tools or guidelines one publishes, if the teams using this process don’t believe in it, the quality and availability of the website will suffer. Pascal-Louis Perez (from KaChing) used a diagram like the one above to explain how this “culture” is at the heart of continuous deployment.

“Culture” also explains why most of the older organizations which follow a more traditional form of deployment have a hard time understanding and adapting to this process.

Are you using continuous deployments in your environment? What was your biggest hurdle?