Scalability and performance – It’s all about the customers

Todd pinged me to see how I felt about antirez’s suggestion in his post titled “On the webserver scalability and speed are almost the same thing”. While I disagree with parts of the post, I can understand why he believes what he wrote.

If someone were to design a state-of-the-art scalable webserver to replace an existing service (let’s say Apache) which can deliver content in 50 ms, then by my definition the new webserver should continue to serve that content in 50 ms even if the number of requests handled per second increases 100 times. Antirez’s argument is somewhat correct: just because you can scale doesn’t mean customers will always forgive you for slower performance. But as it happens, I’ve seen many different solutions in the last decade which ignored that concern, provided a scalable yet slower service, and still succeeded in generating enough revenue to prove that slower speed may not be a showstopper.

In my opinion it’s all about the customers. Let’s take S3/SimpleDB, for example, and compare it with MySQL, or take hosting content in the cloud (SaaS) vs hosting it internally on a corporate web server. Apart from being able to scale, the reason why customers are OK with using a slower service has to do with two important things which impact them:

  1. Customer expectations
  2. Customer perception

If customers expected responses from S3/SimpleDB to be as fast as MySQL, the service would never have succeeded. Yet we have Amazon minting money. What’s more, in most of these cases the customers were actually developers themselves, with their own set of customers. And most of those end customers had no idea that their service provider was using S3/SimpleDB behind the scenes, which is much slower than MySQL or Oracle. If they had noticed their service taking 3 times as long to load, that would have been a huge problem (and actually a lot of customers do notice when service providers move to the cloud). And this is what most developers complain about… customer expectations are really high.

Some service providers are pretty good at explaining what customers will gain by accepting slower speed. For example, automated backups, auto-upgrades, and cheaper hardware might all be good enough reasons for “customer expectations” to be lowered. Customers might be willing to accept a 100 ms load time if they are compensated in some way, say by a cheaper service.

Another way to solve this is to meet expectations by changing “customer perception”. If you haven’t heard web UI specialists talk about how important the “perception of speed” is in the web business, you should find such a talk to attend at one of the conferences. There are tons of ways to speed up services… some by complex caching/geo-loadbalancing/compression magic, others by making slight changes to the UI to make it look like things are happening fast. For example, a badly designed HTML page which requires all its images to load before the page is rendered is going to feel a lot slower than one which loads and renders text first and renders images later. Making AJAX calls behind the scenes, preloading data users might need, etc., are other forms of such optimization which can change user perception. Designing your app to manage customer perception is hard, but it may not be impossible. A quick sketch of the render-text-first idea follows.
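To make the render-text-first idea concrete, here is a minimal sketch using only Python’s standard library; the page content is made up. The server flushes the readable text to the browser immediately and sends the heavy image markup afterwards, so the page feels fast even though the full response takes longer.

  # A sketch of progressive rendering using only Python's standard library.
  # The page content is hypothetical; the point is that text is flushed to
  # the browser immediately while the heavier image markup arrives later.
  from http.server import BaseHTTPRequestHandler, HTTPServer
  import time

  class ProgressiveHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          self.send_response(200)
          self.send_header("Content-Type", "text/html")
          self.end_headers()
          # The text the user came for renders right away...
          self.wfile.write(b"<html><body><h1>Article title</h1>")
          self.wfile.write(b"<p>Readable text shows up immediately.</p>")
          self.wfile.flush()
          time.sleep(2)  # pretend the hero image takes a while to produce
          # ...and the slow, heavy parts of the page arrive afterwards.
          self.wfile.write(b'<img src="/hero.jpg" alt="hero">')
          self.wfile.write(b"</body></html>")

  if __name__ == "__main__":
      HTTPServer(("localhost", 8080), ProgressiveHandler).serve_forever()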

Where I disagree with Antirez is his statement that scalability and performance are “almost” the same thing. They are extremely different, because a system which scales doesn’t have to speed up at all… in fact, most customers would be happy if the 50 ms response time stayed the same over the years even as the number of users increased 1000 times.

Similar scalability issues exist in every part of our lives. Even your closest grocery chain has this problem. I will be happy with the speed at which they bill me today even if they grow 100 times over the next few years.

BTW, I briefly covered this topic a few years ago in a post called “What is scalability”. Feel free to read that as well if you are interested.

@twitter annotations: What I learnt at the hackfest…

A few of us joined in at the new Twitter office in downtown SF (right next to Moscone Center) and were, for the first time, shown what Twitter is doing about “Twitter Annotations”. We probably created the first set of 3rd-party applications around this new API. During the hackathon I wore my “scalable web architecture” hat and thought about what I could learn from this experience, which I’ve summarized below.

Twitter annotations, from a developer’s viewpoint, are just an extension to the existing APIs which now allows posting additional structured content along with a tweet. The content stays within the context of the tweet and is retweeted/shared automatically with the main tweet. Twitter has some recommendations on how annotations should be structured; for example, they were talking about a “type”, which sounded very much like Open Graph’s “type/category” concept, with the difference that Twitter has left the field open for any kind of “type” users want. Facebook, if I remember right, had strongly recommended that users pick from a small set of “categories/types” which they published.

Twitter accepted these annotations in multiple formats, of which I tried the “simple” and “JSON” protocols. The “JSON” way was the most recommended/used medium of annotation during the whole hackathon. While the annotation structure (using JSON) did allow multiple “types” in the same tweet, there were a few limitations which were slightly constricting. The first big one was that the current implementation allowed only 512 bytes in the annotations field. The second was that the structure, while JSON, supported only a few levels of nesting, which was extremely restrictive for the use case I was trying to hack up. A hedged sketch of what building the JSON payload looked like is below.
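Since the API was experimental and unreleased, the exact wire format may well have changed; this sketch assumes the {"type": {attribute: value}} structure and the 512-byte cap described above, and the "movie"/"review" types are my own examples.

  # A hedged sketch of building the JSON-format annotations for a status
  # update. The annotations API was experimental and unreleased, so the
  # structure here reflects hackathon-era recommendations, not a shipped API.
  import json

  def build_annotations(*annotations):
      """Each annotation is a {"type": {attribute: value, ...}} mapping."""
      payload = json.dumps(list(annotations))
      if len(payload.encode("utf-8")) > 512:  # the limit mentioned above
          raise ValueError("annotations exceed the 512-byte limit")
      return payload

  params = {
      "status": "Watching a movie tonight",
      # "type" is free-form (movie, review, ...), unlike Facebook's
      # recommended fixed set of categories.
      "annotations": build_annotations(
          {"movie": {"title": "The Terminator", "rating": "R"}},
          {"review": {"stars": "4"}},  # multiple types in one tweet
      ),
  }
  print(params["annotations"])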

There were a few things I learnt during the whole 32-hour experience. The first was that Twitter had actually hosted these half-baked APIs on http://api.twitter.com and http://www.twitter.com, which, I’m glad to say, are still accessible using my account from outside Twitter’s buildings. Of course the hackers (us) had to be white-listed to get access, but from an operations viewpoint this is extremely gutsy, since one bad ACL code fragment could expose a lot of uncooked APIs to the whole world. This approach to testing is not new to Twitter and is frequently used for A/B testing in newer (more agile) organizations around the world.

The second was the fact that while Cassandra is in use at Twitter, they don’t use it as the primary datastore for all the tweets (yet). They have other uses for it which they didn’t elaborate on. The version of Cassandra they use is close to 0.6.2, which just got released. It also looked (from my discussion with one engineer) like Cassandra treats rack-awareness and datacenter-awareness in slightly different ways. In the documentation I had read previously, the two were the same for all practical purposes. In other words, I need to research this a little more, since optimizations in this area can boost Cassandra’s performance across datacenters.
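As a modern aside (today’s CQL, not the 0.6-era configuration I saw then), the two concepts are now clearly separated: replica counts are declared per datacenter, while a “snitch” maps nodes to racks so replicas inside each datacenter land on distinct racks. The keyspace and datacenter names below are hypothetical.

  # Modern Cassandra: per-datacenter replica counts via
  # NetworkTopologyStrategy; rack placement is handled by the snitch.
  from cassandra.cluster import Cluster  # DataStax cassandra-driver

  session = Cluster(["127.0.0.1"]).connect()
  session.execute("""
      CREATE KEYSPACE IF NOT EXISTS tweets WITH replication = {
          'class': 'NetworkTopologyStrategy',
          'us_east': 3,
          'us_west': 2
      }
  """)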

The third was that while Twitter uses cutting-edge tools for a lot of different things, they don’t have service discovery nailed yet. They are playing with ZooKeeper, and I believe they will use it eventually, but it’s not there yet. This by itself is amazing, because without service discovery the management of configuration and the rollout of configuration changes become centralized, which has its own advantages and disadvantages. At the organization where I work, we are playing with Cassandra as a service publication/discovery tool for monitoring and consuming services. The short discussion I had with Twitter folks about using Cassandra in such a way validated the work I’m doing with it. But I’m still puzzled why others are not thinking about Cassandra (or other eventually-consistent datastores) for service discovery. It sounded like ZooKeeper might be overkill for my organization, but I should take a look at it again. A minimal sketch of the idea is below.
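To illustrate the idea, here is a minimal sketch of service publication/discovery on top of Cassandra, using today’s DataStax cassandra-driver. The table layout and the TTL-based liveness scheme are my own assumptions for illustration, not anyone’s production design.

  # Service publication/discovery on Cassandra: each instance periodically
  # re-registers itself with a TTL; dead instances simply expire out of the
  # table -- an eventually-consistent heartbeat.
  from cassandra.cluster import Cluster

  session = Cluster(["127.0.0.1"]).connect()
  session.execute("""
      CREATE KEYSPACE IF NOT EXISTS discovery WITH replication =
          {'class': 'SimpleStrategy', 'replication_factor': 3}
  """)
  session.execute("""
      CREATE TABLE IF NOT EXISTS discovery.services (
          service text, endpoint text, PRIMARY KEY (service, endpoint))
  """)

  def register(service, endpoint, ttl=30):
      # Refreshing the row extends its lifetime; stopping lets it expire.
      session.execute(
          "INSERT INTO discovery.services (service, endpoint) "
          "VALUES (%s, %s) USING TTL " + str(int(ttl)),
          (service, endpoint))

  def discover(service):
      rows = session.execute(
          "SELECT endpoint FROM discovery.services WHERE service = %s",
          (service,))
      return [row.endpoint for row in rows]

  register("search-api", "10.0.1.17:8080")
  print(discover("search-api"))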

The fourth was that Twitter employs a lot of very smart, passionate people who are amazingly good with most of the network/application stack. They dove into things like browser/JavaScript/cookie behavior and then switched to dissecting network traffic using sniffer tools to debug a possible OAuth implementation bug, among other weird things. This just adds to the current popular wisdom that scalability/stability/security can’t be done in small silos.

The fifth and final thing I’d like to write about is the hackathon itself. It’s amazing how Twitter organized this hackathon, got a group of hackers to play with their new APIs, and gave them the ability to demo their hacks to the likes of Paul Graham and Ron Conway. In return they got very interesting product ideas and use cases for a feature which is still unpolished and unreleased. More importantly, they also got a bunch of hackers to intentionally and unintentionally break the feature and discover some serious and some very annoying bugs. They also got feedback on what does and doesn’t resonate with developers. In a way this is similar to what some other organizations (including Google) already do with their alpha/beta programs, but nothing beats the velocity of hacking up 10 to 20 almost-ready products around a brand-new feature in less than 32 hours.

P.S.: I’m terribly sorry for spamming my Twitter followers, who were bombarded with test messages for two days. Next time I’ll pick a test account 🙂

Hive @Facebook

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language also allows programmers familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
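To give a flavor of that pluggable mapper/reducer support, here is a small streaming-style mapper adapted from the Hive tutorial; Hive pipes rows to its stdin through QL’s TRANSFORM … USING clause. The column names are from the tutorial’s sample table, not Facebook’s setup.

  # weekday_mapper.py -- the kind of custom mapper Hive's QL can invoke via
  # SELECT TRANSFORM (userid, unixtime) USING 'python weekday_mapper.py'.
  # Reads tab-separated (userid, unixtime) rows on stdin and emits
  # (userid, weekday) rows on stdout.
  import sys
  import datetime

  for line in sys.stdin:
      userid, unixtime = line.strip().split("\t")
      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
      print("\t".join([userid, str(weekday)]))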

At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200 GB per day in March 2008, has grown to 15 TB per day today. Facebook realized early on that insights derived from simple algorithms on more data are better than insights from complex algorithms on smaller sets of data.

But the traditional approach of ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in the size to which it could scale. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive

Hadoop turned out to be superior in availability, scalability, and manageability. Its efficiency wasn’t that great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that although partial availability, resilience, and scale were more important than ACID properties at that point, they had a hard time finding Hadoop programmers within Facebook who could make use of the cluster.

It was this that eventually pushed Facebook to build a new way of querying data from Hadoop which doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries (the two below are the same query, written first in Hive’s FROM-first form and then in the standard SQL form):

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Hive’s long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that, it uses map-reduce for execution and HDFS for storage. The language is modeled on SQL and designed to be extensible and interoperable, while being able to outperform traditional processing mechanisms.

How it is used

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “Ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Facebook Hive/Hadoop statistics

The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25 TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8,400 cores and roughly 12.5 PB of raw storage, which translates to 4 PB of usable storage after replication. Each node in the cluster is an 8-core server with 12 TB of storage.
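Those numbers hang together if you assume HDFS’s default 3x replication, which is my assumption rather than a figure from the talk:

  # Back-of-the-envelope check of the cluster numbers above. The 3x
  # replication factor is HDFS's default and my assumption, not a figure
  # from the talk.
  raw_tb = 12.5 * 1000            # 12.5 PB of raw storage, in TB
  nodes = raw_tb / 12             # 12 TB per node  -> ~1,040 nodes
  cores = nodes * 8               # 8 cores per node -> ~8,300 (~8,400 quoted)
  usable_pb = 12.5 / 3            # 3x replication  -> ~4.2 PB (~4 PB quoted)
  print(round(nodes), round(cores), round(usable_pb, 1))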

All in all, Facebook takes in about 12 TB of new compressed data and scans about 135 TB of compressed data per day. More than 7,500 Hive jobs run each day, using about 80,000 compute hours.
