Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in recent weeks. Part of that is due to an ongoing effort to expand their business, for which they seem to be hiring like crazy. Here is yet another interesting deck of slides, this one covering material across both Dev and Ops.

It’s one of the most interesting decks of slides I’ve seen in the recent past.

“Chrome instant” feature could break your webapp

“Google Instant” wasn’t a groundbreaking idea by itself; we have all been using various forms of auto-complete for a while now. What makes it stand out is that, unlike previous kinds of auto-complete, this one searches the entire web at amazing speed and still serves personalized, hyper-local results. You can get more information about its backend here and here.

It wasn’t surprising, then, that Google put this feature inside Chrome itself. Take a look at this demo from Lifehacker. This is where it gets interesting…

 

At first this looked very exciting. I was pleasantly surprised when Chrome brought up websites, in addition to auto-completing URLs, as I typed. The impact on the servers didn’t sink in until I was debugging a problem in my own application, which required a look at the Apache logs. Look at the snippet below: instead of just 1 call, I found 17 made to my web application while I was typing the URL, all within about 6 seconds, which is roughly how long it took me to type it. (A quick way to count such hits in your own logs is sketched right after the snippet.)

[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?p HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?po HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?por HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port= HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1& HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&a HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&ap HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&app HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appn HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appna HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appnam HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname= HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
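
A quick way to spot this pattern in your own logs is to aggregate hits by script path. Here is a minimal sketch, assuming the standard common/combined Apache log format (the regex only looks at the request line):

    import re
    import sys
    from collections import Counter

    # Matches the request line inside a common/combined-format Apache access log entry.
    REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

    def count_hits_per_script(log_lines):
        """Count hits per script path, ignoring the query string."""
        counts = Counter()
        for line in log_lines:
            match = REQUEST_RE.search(line)
            if match:
                path = match.group(1).split("?")[0]  # drop the partially-typed query string
                counts[path] += 1
        return counts

    if __name__ == "__main__":
        # Usage: python count_hits.py < access_log
        for path, hits in count_hits_per_script(sys.stdin).most_common(20):
            print(f"{hits:6d}  {path}")

Run against the snippet above, it reports every partially typed URL as a separate hit on /cfmap/create.jsp.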

There are two issues here that made me very concerned:

  1. Volume of requests: This one is a no-brainer. The example above is not a typical use case, since we don’t expect users to type URLs every time they use a web application. But if the app has an easy-to-use API that users can drive this way, the impact of the small percentage of users who do will be magnified many fold very quickly. It may become important to figure out how to queue requests, and how to distinguish a user who is hitting the site with 10 requests per second from one who makes a single request. The whole problem also goes away if your app can already handle 5 to 20 times its current traffic, which is probably the best solution.
  2. Robust APIs: This is a trickier one that developers need to plan for. Let’s say there is an API like “/api/transfermoney.php?from=account1&to=account2&amount=10000”. How much money will this API transfer if you type this URL in a browser that auto-executes partial URLs? A defensive sketch follows this list.
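
Address-bar prefetching issues plain GET requests, so the simplest defence for a state-changing API is to refuse GET entirely and to reject any call that arrives with parameters missing. Below is a minimal sketch of that idea using Flask; the endpoint, parameter names and confirmation token are made up for illustration, not taken from any real banking API:

    from flask import Flask, request, abort

    app = Flask(__name__)

    REQUIRED_PARAMS = ("from_account", "to_account", "amount", "confirmation_token")

    @app.route("/api/transfermoney", methods=["POST"])  # GETs are rejected outright
    def transfer_money():
        # Refuse to act unless every expected parameter arrived intact.
        missing = [p for p in REQUIRED_PARAMS if not request.form.get(p)]
        if missing:
            abort(400, f"missing parameters: {', '.join(missing)}")

        amount = request.form.get("amount", type=int)
        if amount is None or amount <= 0:
            abort(400, "amount must be a positive integer")

        # ... perform the transfer here, ideally idempotently keyed on the token ...
        return {"status": "ok", "amount": amount}

A partially typed URL such as “/api/transfermoney?from_account=1&a” is answered with 405 Method Not Allowed and never reaches the transfer logic.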

What broke the camel’s back was that Google’s own search engine often flagged this feature’s traffic as spammy/automated. It got so bad that I had to switch to Firefox to do a simple Google search.

And here is an example of how my Google history is now polluted with things I didn’t really search for. In this example I was looking for “ohdoctah” after hearing about it on TWiT. The key here is that while Google might have thought about how to mine this polluted search data, other web applications might find it impossible to deal with without a significant addition of resources.


For now I’ve disabled the feature in the browser. I hope there is an easy solution to this problem; otherwise I don’t see this feature making it into the production version of Chrome anytime soon.

Operations Dashboards – KaChing

I’ve mentioned this before, but I’d like to do it again because I think these guys are awesome. If you have not listened to devopscafe’s podcasts, this might be the right time to check them out. Here is a video of one of their sessions with the folks at KaChing, who have been doing amazing stuff around continuous deployment.

Continuous Deployment and Operations Dashboards at kaChing from dev2ops.org on Vimeo.

Scalability updates for Aug 27th 2010

My updates have been slow recently due to other things I’m involved in. If you need more frequent updates on what I’m reading, feel free to follow me on Twitter or Buzz.

Here are some of the big ones I have mentioned on my twitter/buzz feeds.

Distributed systems and Unique IDs: Snowflake

Most of us who deal with traditional databases take auto-increments for granted. While auto-increments are simple on a consistent cluster, they become a challenge in a cluster of independent nodes that don’t share the same source of unique ids. An even bigger challenge is generating them in such a way that they stay roughly in sequence.

While this may be an old problem, I realized the importance of such a sequence only after using Cassandra in my own environment. Twitter, which has been using Cassandra in many interesting ways, has proposed a solution for it, which they are releasing as open source today.

Here are some interesting sections from their post announcing “Snowflake”.

The Problem

We currently use MySQL to store most of our online data. In the beginning, the data was in one small database instance which in turn became one large database instance and eventually many large database clusters. For various reasons, the details of which merit a whole blog post, we’re working to replace many of these systems with the Cassandra distributed database or horizontally sharded MySQL (using gizzard).

Unlike MySQL, Cassandra has no built-in way of generating unique ids – nor should it, since at the scale where Cassandra becomes interesting, it would be difficult to provide a one-size-fits-all solution for ids. Same goes for sharded MySQL.

Our requirements for this system were pretty simple, yet demanding:

We needed something that could generate tens of thousands of ids per second in a highly available manner. This naturally led us to choose an uncoordinated approach.

These ids need to be roughly sortable, meaning that if tweets A and B are posted around the same time, they should have ids in close proximity to one another since this is how we and most Twitter clients sort tweets.[1]

Additionally, these numbers have to fit into 64 bits. We’ve been through the painful process of growing the number of bits used to store tweet ids before. It’s unsurprisingly hard to do when you have over 100,000 different codebases involved.

 

Solution

To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number.

Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).

We encourage you to peruse and play with the code: you’ll find it on github. Please remember, however, that it is currently alpha-quality software that we aren’t yet running in production and is very likely to change.
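
To make that composition concrete, here is a minimal sketch of my own (not Twitter’s actual Scala implementation; the bit widths and epoch below are assumptions) showing how a timestamp, worker number and per-worker sequence can be packed into a single 64-bit, roughly time-sorted id:

    import threading
    import time

    # Assumed layout: 41 bits of milliseconds since a custom epoch,
    # 10 bits of worker id, 12 bits of sequence -- 63 bits, so the id
    # fits comfortably in a signed 64-bit integer.
    EPOCH_MS = 1262304000000          # Jan 1 2010 UTC, an arbitrary custom epoch
    WORKER_BITS, SEQUENCE_BITS = 10, 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    class IdGenerator:
        def __init__(self, worker_id):
            assert 0 <= worker_id < (1 << WORKER_BITS)
            self.worker_id = worker_id
            self.sequence = 0
            self.last_ms = -1
            self.lock = threading.Lock()

        def next_id(self):
            with self.lock:
                now_ms = int(time.time() * 1000)
                if now_ms == self.last_ms:
                    self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                    if self.sequence == 0:        # sequence exhausted for this millisecond
                        while now_ms <= self.last_ms:
                            now_ms = int(time.time() * 1000)
                else:
                    self.sequence = 0
                self.last_ms = now_ms
                return ((now_ms - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS)) \
                       | (self.worker_id << SEQUENCE_BITS) \
                       | self.sequence

    gen = IdGenerator(worker_id=1)    # the worker id would come from zookeeper or a config file
    print(gen.next_id())

Because the timestamp occupies the high bits, ids generated on different workers at roughly the same time sort close to one another, which is exactly the “roughly sortable” property described above.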

@twitter annotations: What I learnt at the hackfest…

A few of us joined in at the new Twitter office in downtown SF (right next to Moscone Center) and were shown, for the first time, what Twitter is doing with “Twitter Annotations”. We probably created the first set of 3rd-party applications around this new API. During the hackathon I spent some time wearing my “scalable web architecture” hat, thinking about what I could learn from the experience; I’ve summarized my notes below.

From a developer’s viewpoint, Twitter Annotations is just an extension to the existing APIs that allows posting additional structured content along with a tweet. The content stays within the context of the tweet and is retweeted/shared automatically with the main tweet. Twitter has some recommendations on how annotations should be structured; for example, they talked about a “type”, which sounded very much like Open Graph’s “type/category” concept, with the difference that Twitter has left the field open to whatever “type” users want. Facebook, if I remember right, strongly recommended that users stick to a small set of “categories/types” which it published. Twitter accepted annotations in multiple formats, of which I tried the “simple” and “JSON” protocols; JSON was the most recommended/used format during the hackathon. While the annotation structure (using JSON) did allow multiple “types” in the same tweet, there were a few limitations that felt constricting. The first big one was that the current implementation allowed only 512 bytes in the annotations field. The second was that the structure, while JSON, supported only a few levels of depth, which was extremely restrictive for the use case I was trying to hack up.
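
For illustration only, here is what an annotation payload of that “type plus attributes” shape might look like, with the 512-byte limit checked before posting. The field names are my own invention, not Twitter’s published spec:

    import json

    # A hypothetical annotation: one "type" namespace holding a few key/value attributes.
    annotation = [
        {
            "review": {
                "title": "Great coffee shop",
                "rating": "4",
                "url": "http://example.com/reviews/123",
            }
        }
    ]

    payload = json.dumps(annotation)
    # The hackathon build rejected annotations larger than 512 bytes.
    assert len(payload.encode("utf-8")) <= 512, "annotation too large"
    print(payload)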

There were a few things I learnt during the 32-hour experience. The first was that Twitter had actually hosted these half-baked APIs on http://api.twitter.com and http://www.twitter.com, which, I’m glad to say, are still accessible using my account from outside Twitter’s buildings. Of course the hackers (we) had to be whitelisted to get access, but from an operations viewpoint this is extremely gutsy, since one bad ACL code fragment could expose a lot of uncooked APIs to the whole world. This approach to testing is not new to Twitter and is frequently used for A/B testing in newer (more agile) organizations around the world.

The second was that while Cassandra is in use at Twitter, they don’t use it as the primary datastore for all tweets (yet). They have other uses for it which they didn’t elaborate on. The version of Cassandra they use is close to 0.6.2, which just got released. It also looked (from my discussion with one engineer) like Cassandra treats rack-awareness and datacenter-awareness in slightly different ways; in the documentation I had read previously, the two were the same for all practical purposes. In other words, I need to research this a little more, since optimizations in this area can boost Cassandra’s performance across datacenters.

The third was that while Twitter uses cutting-edge tools for a lot of different things, they don’t have service discovery nailed yet. They are playing with ZooKeeper, and I believe they will use it eventually, but it’s not there yet. This by itself is amazing, because without service discovery the management and rollout of configuration changes becomes centralized, which has its own advantages and disadvantages. At the organization I work for, we are playing with Cassandra as a service publication/discovery tool for monitoring and consuming services. The short discussion I had with Twitter folks about using Cassandra in such a way validated the work I’m doing with it. But I’m still puzzled why others are not thinking about Cassandra (or another eventually-consistent datastore) for service discovery. It sounded like ZooKeeper might be overkill for my organization, but I should take a look at it again.

The fourth was that Twitter employs a lot of very smart, passionate people who are amazingly good with most of the network/application stack. They dove into things like browser/JavaScript/cookie behaviour and then switched to dissecting network traffic with sniffer tools to debug a possible OAuth implementation bug and other weird things. This just adds to the current popular wisdom that scalability/stability/security can’t be done in small silos.

The fifth and final thing I’d like to write about is the hackathon itself. It’s amazing how Twitter organized this hackathon, got a group of hackers to play with their new APIs, and gave them the ability to demo their hacks to the likes of Paul Graham and Ron Conway. In return they got very interesting product ideas and use cases for a feature which is still unpolished and unreleased. More importantly, they also got a bunch of hackers to intentionally and unintentionally break the feature and discover some serious and some very annoying bugs. They also got feedback on what does and doesn’t resonate with developers. In a way this is similar to what some other organizations (including Google) already do with their alpha/beta programs, but nothing beats the velocity of hacking up 10 to 20 almost-ready products around a brand new feature in less than 32 hours.


P.S.: I’m terribly sorry for spamming my twitter followers who were bombarded with twitter test messages for two days. Next time I’ll pick a test account 🙂

Google Storage : What it really is…

Yesterday Google formally announced Google Storage to a few (5000?) of us at Google I/O. Here is the gist of it as I see it from the various discussions/talks I attended.

To begin with, I have to point out that there is almost nothing new in what Google is proposing to provide; Amazon has been doing this for years with S3. The key difference is that if you are a Google customer, you won’t have to look elsewhere for a storage service like this one.

Let’s get the technical details out of the way:

  • It tries to implement a strong consistency model (the C and A of CAP: Consistent and Available), which means the data you store is automatically replicated in a consistent way across multiple datacenters
    • Currently it replicates to multiple locations within the US. In the future Google plans to replicate across continents.
    • Currently there are no controls over how or where replication happens. They plan to learn from usage during the beta period and develop controls over time.
  • There are two basic building blocks:
    • Buckets – Containers
        All objects are stored in a flat container. However, the tools understand “/” and “*” (wildcards) and do the right thing when used correctly
    • Objects – objects/files inside those containers
  • Implements RESTful APIs (GET/PUT/POST/DELETE/HEAD/etc) – a rough sketch of what a raw call looks like follows this list
    • All resources are identified by a URI
  • There is no theoretical size limit on buckets or their contents; however, a 100GB limit per account will be imposed during the beta phase.
  • It’s of course built on Google’s well-tested, scalable, highly available infrastructure
  • It provides multiple, flexible authentication and sharing models
    • Does support standard public/private key based auth
    • Will also integrate with some kind of groups, which will allow objects to be shared with or controlled by multiple identities.
    • ACLs can be applied to both Buckets and Objects
      • Buckets
        • Control who can list objects
        • Who can create/delete objects
        • Who can read/write into the bucket
      • Objects
        • Who can read
        • Who can read/write
  • Tools
    • There were two tools mentioned during the talk
      • GS manager looks like a web application which allows an admin to manage this service
      • GS util is more like the shell tools AWS provides for S3.
        • As I mentioned before, GS util accepts wildcards
          • So something like this is possible
            • gsutil cp gs://gs2010/*  /home/rkt/gs2010
  • The service was created with “data liberation” as one of the goals. As the previous command shows, it takes just one line to transfer all of your data out.
  • A resume feature (for when the connection breaks during a big upload) is not available yet, but that’s on the roadmap.
  • The groups feature was discussed a lot, but it’s not ready in the current release
  • Versioning is not available. It wasn’t clear whether it’s on the roadmap or how long before it’s implemented.
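
Since the service is plain REST over HTTP, talking to it without the tools should look roughly like the sketch below. The base URL and authorization header here are my assumptions for illustration; the real endpoint and request-signing scheme are in Google’s documentation:

    import requests

    BASE_URL = "https://commondatastorage.googleapis.com"             # assumed endpoint
    AUTH_HEADER = {"Authorization": "GOOG1 <access-key>:<signature>"}  # placeholder auth

    def put_object(bucket, name, data):
        """Upload an object into a bucket: PUT /<bucket>/<object>."""
        resp = requests.put(f"{BASE_URL}/{bucket}/{name}", data=data, headers=AUTH_HEADER)
        resp.raise_for_status()

    def get_object(bucket, name):
        """Download an object: GET /<bucket>/<object>."""
        resp = requests.get(f"{BASE_URL}/{bucket}/{name}", headers=AUTH_HEADER)
        resp.raise_for_status()
        return resp.content

    put_object("gs2010", "hello.txt", b"hello google storage")
    print(get_object("gs2010", "hello.txt"))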

A few other notes:

  • It’s not clear how this plays with the storage Google currently provides for Gmail/Docs. From what I heard, this is not related to that storage service at all and there are no plans to integrate the two.
  • The service is free during the beta period to all developers who get access to it, but when it’s released it will follow a pricing model similar to others in the industry. The pricing model is already published on their website.
  • The speakers and product managers didn’t comment on whether storage access from Google App Engine would be charged (or at what rate)
  • They do provide MD5 signatures as a way of verifying that an object on the client is the same as the object on the server, but MD5 isn’t used for storing the files themselves (so MD5 collisions shouldn’t be a problem)
  • The US Navy is already using the service, with about 80TB of data on Google Storage, and from what I heard they seemed pretty happy talking about it.

I suspect this product will be in beta for a while before they release it out in the open.

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel all have their origins at Google, and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (maybe unintentionally) at an event last year.

Instead of caching data closer to the user, it looks like Google is trying to take “the data” to the user. If you use Gmail or a Google Docs service, then with this framework Google could auto-magically “move” one of the master copies of your data to the nearest Google datacenter, without really having to cache anything locally. And because they are building one single datastore cluster around the world, instead of hundreds of smaller ones for different applications, they may not need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk Jeff Dean gave at a symposium held at Cornell. Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • support mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.


 

Talk on “database scalability”

This is a very interesting talk by Jonathan Ellis on database scalability. He designed and implemented multi-petabyte storage for Mozy and is currently the project chair for Apache Cassandra.

  • Scalability is not about improving latency, but about increasing throughput
  • But overall performance shouldn’t degrade
  • Throw hardware, not people at the problem
  • Traditional databases use b-tree indexes, but this requires the entire index to be kept in memory in one place.
  • Easy bandaid #1 – SSD storage is better for b-tree indexes that need to hit disk
  • Easy bandaid #2 – Buy a faster server every 2 years, as long as your userbase doesn’t grow faster than Moore’s law
  • Easy bandaid #3 – Use distributed caching to handle hotspots
  • Memcached server failures can change where hashed keys are kept
  • Consistent hashing solves the problem by mapping keys to tokens; the tokens can move between servers as servers are added or removed, and apps can still figure out which keys live where
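
Here is a minimal sketch of the consistent hashing idea from the talk: servers and keys are hashed onto the same ring, a key belongs to the first server clockwise from its position, and adding or removing a server only moves the keys adjacent to it:

    import bisect
    import hashlib

    def ring_hash(value):
        """Map a string to a position on the hash ring."""
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, servers):
            self.ring = sorted((ring_hash(s), s) for s in servers)

        def add(self, server):
            bisect.insort(self.ring, (ring_hash(server), server))

        def remove(self, server):
            self.ring.remove((ring_hash(server), server))

        def server_for(self, key):
            # First server clockwise from the key's position, wrapping around the ring.
            tokens = [token for token, _ in self.ring]
            idx = bisect.bisect(tokens, ring_hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
    print(ring.server_for("user:42"))
    ring.add("cache4")                 # only keys near cache4's token move
    print(ring.server_for("user:42"))

Real implementations usually hash each server to many points on the ring (“virtual nodes”) so load stays even as servers come and go, but the core idea is the same.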

Brewer’s CAP Theorem on distributed systems

Large distributed systems run into a problem which smaller systems don’t usually have to worry about. “Brewer’s CAP Theorem” [Ref 1] [Ref 2] [Ref 3] defines this problem in a very simple way.

It states that though it’s desirable to have Consistency, High Availability and Partition tolerance in every system, unfortunately no system can achieve all three at the same time.

Consistent: A fully consistent system is one that can guarantee that once you store a state (let’s say “x=y”), it will report the same state in every subsequent operation until the state is explicitly changed by something outside the system. [Example 1] A single MySQL database instance is automatically fully consistent, since there is only one node keeping the state. [Example 2] If two MySQL servers are involved, and the system is designed so that all keys from “a” to “m” are kept on server 1 and keys from “n” to “z” are kept on server 2, the system can still easily guarantee consistency. [Example 3] Now let’s set up the two DBs as master-master replicas. If one of the databases gets a “row insert” request, that information has to be committed to the second system before the operation is considered complete. To get 100% consistency in such a replicated environment, communication between nodes is paramount, and the overall performance of the system can drop as the number of required replicas goes up.

Available: The databases in [Example 1] and [Example 2] are not highly available. In [Example 1], if the node goes down there would be 100% data loss. In [Example 2], if one node goes down you would have 50% data loss. [Example 3] is the only setup that solves this problem: a simple MySQL replication setup [multi-master mode] can provide very high availability. Increasing the number of nodes with copies of the data directly increases the availability of the system. Availability is not just protection from hardware failure; replicas also help in load-balancing concurrent operations, especially read operations. “Slave” MySQL instances are a perfect example of such replicas.

Partition-tolerance: So you got consistency and availability by replicating data. Now let’s say you have the two MySQL servers from Example 3 in two different datacenters, and you lose network connectivity between the datacenters, making both databases incapable of synchronizing state with each other. Would the two DBs be fully functional in such a scenario? If you somehow do manage to allow read/write operations on both databases, it can be shown that the two servers won’t be consistent anymore. A banking application that needs the “state of your account” to be correct at all times is a perfect example of where inconsistent records are bad. If I withdraw 1000 bucks in California, it should be instantly reflected at the NY branch so that the system accurately knows how much more I can withdraw at any given time. If the system fails to do this, it could cause problems that make some customers very unhappy. If the bank decides consistency is very important and disables write operations during the outage, the cluster loses “availability”, since bank accounts at both branches are now frozen until the network comes up again.

This gets more interesting when you realize that the C.A.P. properties don’t have to be applied in an “all or nothing” fashion. Different systems can choose different levels of consistency, availability and partition tolerance to meet their business objectives. Increasing the number of replicas, for example, increases availability, but it can at the same time reduce partition tolerance or consistency.
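
Dynamo-style systems expose exactly this dial as tunable N, R and W values: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap, and therefore to see the latest value, whenever R + W > N; shrinking R or W buys availability and latency at the cost of consistency. A toy sketch of the idea (my own illustration, not any particular product’s protocol):

    import random

    class Replica:
        def __init__(self):
            self.value, self.version = None, 0

    class QuorumStore:
        """Toy N/R/W replicated register; no failure handling, illustration only."""
        def __init__(self, n, r, w):
            self.replicas = [Replica() for _ in range(n)]
            self.r, self.w = r, w

        def write(self, value, version):
            # A real system would stop after W acknowledgements; here we just update W replicas.
            for rep in random.sample(self.replicas, self.w):
                rep.value, rep.version = value, version

        def read(self):
            # Consult R replicas and return the value with the newest version seen.
            newest = max(random.sample(self.replicas, self.r), key=lambda rep: rep.version)
            return newest.value

    store = QuorumStore(n=3, r=2, w=2)   # R + W > N, so every read overlaps the last write
    store.write("x=y", version=1)
    print(store.read())                  # always prints "x=y"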

When you are discussing distributed systems (or distributed storage), you should try to identify which of the three properties the system is trying to achieve. BigTable, used by Google App Engine, and HBase, which runs on top of Hadoop, claim to be always consistent and highly available. Amazon’s Dynamo, which is used by the S3 service, and datastores like Cassandra instead sacrifice consistency in favor of availability and partition tolerance.

The CAP theorem doesn’t just apply to databases. Even simple web applications that store state in session objects have this problem. There are many “solutions” out there which let you replicate session objects and make session data “highly available”, but they all suffer from the same basic trade-off, and it’s important to understand it.

Recognizing which of the “C.A.P.” properties your business really needs should be the first step in building any successful distributed, scalable, highly available system.