Scalability and performance – Its all about the customers

Todd pinged me to see how I felt about antirez’s suggession in his post titled “On the webserver scalability and speed are almost the same thing“. While I disagree with parts of the post, I can understand why he believed what he wrote.

If someone were to design a state of the art scalable webserver in place of an existing service (lets say apache) which can deliver content in 50 ms, then by my definition the new webserver should continue to serve that content at 50 ms even if the number of requests handled per second by the service increases 100 times. Antirez’s argument is somewhat correct that just because you can scale doesn’t mean that customers will always forgive you for slower performance. But as it happens I’ve seen many different solutions in the last decade which ignored that concern and provided  a scalable yet slower service and still succeeded in generating enough revenue to prove that slower speed may not be a show stopper.

In my opinion its all about the customers. Lets take S3/SimpleDB for example and compare it with mysql and take hosting content in the cloud (SAAS) vs hosting it internally on a corporate web server.  Apart from being able to scale, the reason why customers are ok with using a slower service has to do with two important things which impacts them.

  1. Customer expectations
  2. Customer perception.

If the customer expected responses from S3/SimpleDB should be as fast as Mysql, then the service wouldn’t ever have succeeded. Yet we have Amazon minting money. Whats worse is that in most of these cases the customers were actually developers themselves who had their own set of customers. And most of these customers didn’t have any idea that their service provider was using S3/SimpleDB behind the scenes which is much slower than Mysql or oracle. If they had  noticed their service taking 3 times as long to load, that would have been a huge problem ( and actually a lot of customers do notice when service providers move to the cloud). And this is what most developers complain about… the customer expectations are really high.

Some service providers are pretty good at explaining what customers will gain by accepting slower speed. For example automated backups, auto-upgrades, cheaper hardware might all be good enough reason for “Customer expectation” to be lowered. They might be willing for a 100ms load time if they are compensated in some way by cheaper service.

Another way to solve this is by trying to meet expectations by changing “customer perception”. If you haven’t heard talks by  web UI specialists talk  about how important the “perception of speed” is in the web business, then you should find a talk to attend at one of the conferences. There are tons of ways to speed up serivices… some by complex caching/geo-loadbalancing/compression magic, others by making slight changes to UI to make it look like things are happening fast. For example, a badly designed html page which requires all its images to load before the page is rendered is going to be noticed a lot faster than than one which lots and renders text first and renders images later. Making AJAX calls behind the scenes, preloading data users might need, etc are other forms of such optimizations which can change user perception. Designing your app meet the customer perception is hard but may not be impossible.

Where I disagree with Antirez’s arguments is statement that scalability and performance are “almost” the same thing. They are extremely different because a system which scales doesn’t have to speed up at all… in fact most customers would be happy if 50 ms response time doesn’t change over years even if the number of users has increased 1000 times.

Similar scalability issues exist in every part of our lives. Even you closes grocery chain has this problem. I will be happy with the speed at which they bill me today even if they grew 100 times over the next few years.

BTW I had briefly covered this topic a few years ago in a post called “What is scalabilty“. Feel free to read that as well if you are interested.

How facebook ships code

Stumbled on a fascinating post about how facebook ships code. This level of detail is rare from an organization as big as this.  This is a very long piece… but here are a few lines from it to entice you to click on it.

From framethink

  • as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago.  Nearly doubling staff in under a year!
  • the two largest teams are Engineering and Ops, with roughly 400-500 team members each.  Between the two they make up about 50% of the company.
  • product manager to engineer ratio is roughly 1-to-7 or 1-to-10
  • all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers.  estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.
  • after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)
  • any engineer can modify any part of FB’s code base and check-in at-will
  • very engineering driven culture.  ”product managers are essentially useless here.” is a quote from an engineer.  engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime.
  • during monthly cross-team meetings, the engineers are the ones who present progress reports.  product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.”  they really want engineers to publicly own products and be the main point of contact for the things they built.
  • resourcing for projects is purely voluntary.
    • a PM lobbies group of engineers, tries to get them excited about their ideas.
    • Engineers decide which ones sound interesting to work on.
    • Engineer talks to their manager, says “I’d like to work on these 5 things this week.”
    • Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
  • Engineers handle entire feature themselves — front end javascript, backend database code, and everything in between.  If they want help from a Designer (there are a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on.  Same for Architect help.  But in general, expectation is that engineers will handle everything they need themselves.

Its logical – IAAS users will move to PAAS

Sysadmins love infrastructure control, and I have to say that there was a time when root access gave me a high. It wasn’t  until I moved to web operations team (and gave up my root access) that I realized that I was  more productive when I wasn’t dealing with day to day hardware and OS issues. After managing my own EC2/Rackspace instance for my blog for a few years , I came to another realization today that IAAS (infrastructure as a service) might be one of these fads which will give way to PAAS (Platform as a service).

WordPress is an excellent blogging platform, and I manage multiple instances of it for my blogs (and one for my  wife’s blog). I chose to run my own wordpress instance because I loved the same control which I used to have when I was a sysadmin. I not only wanted to run my own plugins, configured my own features, play with different kinds of caching features, I also wanted to choose my own linux distribution (Ubuntu ofcourse) and make it work the way I always wanted my servers to work.  But when it came to patching the OS, taking backups, updating wordpress and the zillion other plugins, I found it a little distracting, slightly frustrating and extremely time consuming.

Last week I moved one of my personal blogs to blogger.com and its possible that it may not be the last one. Whats important here is not that I picked blogger.com over wordpress.com, but the fact that I’m ready to give up control to be more productive. Amazon’s AWS started off as the first IAAS service provider, but today they provide a whole lot of other managed services like Elastic MapReduce, Amazon Route 53, Amazon cloudfront and Amazon Relational Database Service which are more of a PAAS than IAAS.

IAAS is a very powerful tool in the hands of professional systems admin. But I’m willing to bet that over the next few years lesser number organizations would be worried about kernel versions and linux distributions and would instead be happy with a simple API to upload “.war” files (if they are running tomcat for example) into some kind of cloud managed tomcat instances (like how hadoop runs in elastic mapreduce). Google App Engine (Java and Python) and Heroku (Ruby based, Salesforce bought them) are two examples of such service today and I’ll be surprised if  AWS doesn’t launch something  (or buy someone out) within the next year to do the same.

Windows Azure

Windows Azure is an application platform provided by Microsoft to allow others to run applications on Microsoft’s “cloud” infrastructure. Its finally open for business (as of Feb 1, 2010). Below are some links about Azure for those who are still catching up.Windows Azure logo.jpg

Wikipedia: Windows Azure has three core components: Compute, Storage and Fabric. As the names suggest, Compute provides computation environment with Web Role and Worker Role while Storage focuses on providing scalable storage (Blobs, Tables, Queue) for large scale needs.

The hosting environment of Windows Azure is called the Fabric Controller – which pools individual systems into a network that automatically manages resources, load balancing, geo-replication and application lifecycle without requiring the hosted apps to explicitly deal with those requirements.[3] In addition, it also provides other services that most applications require — such as the Windows Azure Storage Service that provides applications with the capability to store unstructured data such as binary large objects, queues and non-relational tables.[3] Applications can also use other services that are a part of the Azure Services Platform.

Cassandra for service registry/discovery service

My last post was about my struggle to find a good distributed ESB/Service-discovery solution built over open source tools which was simple to use and maintain. Thanks to reader comments (Dan especially) and some other email exchanges, it seems like building a custom solution is unavoidable if I really want to keep things simple.

Dan suggested that I could use DNS to find seed locations for config store which would work very well in a distributed network. If security wasn’t a concern this seed location could have been on S3 or SimpleDB, but the requirement that it needs to be secured on internal infrastructure forced me to investigate simple replicated/eventually-consistent databases which could be hosted internally in different data centers with little or no long term administration cost.

My search lead me to investigate a few different NOSQL options

But the one I finally settled on as a possible candidate was Cassandra. Unlike some of the others, since our application platform was based on java, Cassandra was simple to install and setup. The fact that Facebook used it to store 50TB of data across 150 servers helped us convince it was stable as well.

The documentation on this project isn’t as much as I would have liked, but I did get it running pretty fast. Building a service registry/discovery service on top of this is whats next on my mind..

More on Cassandra

If you are interested in learning more about cassandra I’ll recommend you to listen to this talk by Avinash Lakshman (facebook) and read a few other posts listed here.

Cassandra: Articles

  • Cassandra — Getting Started: Cassandra data model from a Java perspective

  • Using Cassandra’s Thrift interface with Ruby

  • Cassandra and Thrift on OS X: one, two, three

  • Looking to the Future with Cassandra: how Digg migrated their friends+diggs data set to Cassandra from mysql

  • Building Scalable Databases: Denormalization, the NoSQL Movement and Digg

  • WTF is a SuperColumn? An Introduction to the Cassandra Data Model

  • Meet Scalandra: Scala wrapper for Cassandra

  • Cassandra and Ruby: A Love Affair? – Engine Yard’s walk-through of the Cassandra gem

  • Up and Running with Cassandra: featuring data model examples of a Twitter clone and a multi-user blog, and ruby client code

  • Facebook Engineering notes and Cassandra introduction and LADIS 2009 paper

  • ArchitectureInternals

  • ArchitectureGossip

  • Cassandra: Presentations

  • Cassandra in Production at Digg from NoSQL East 09

  • Introduction to Cassandra at OSCON 09

  • What Every Developer Should Know About Database Scalability: presentation on RDBMS vs. Dynamo, BigTable, and Cassandra

  • IBM Research’s scalable mail storage on Cassandra

  • NOSQL VideoNOSQL Slides: More on Cassandra internals from Avinash Lakshman.

  • Video of a presentation about Cassandra at Facebook: covers the data model of Facebook’s inbox search and a lot of implementation details. Prashant Malik and Avinash Lakshman presenting.

  • Cassandra presentation at sigmod: mostly the same slides as above

  • If any of you have worked on cassandra, please let me know how that has been working out for you.