Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in the recent weeks. And part of that is due to an ongoing effort to expand their business for which they seem to be hiring like crazy. Here is the yet another interesting deck of slides which mentions stuff across both Dev and Ops.

One of the most interesting deck of slides I’ve seen in recent past.

“Chrome instant” feature could break your webapp

The “Google instant” wasn’t a ground breaking idea by itself. We have all been using various forms of imageauto-completes for a while now. What makes it stand out is that unlike all the previous kinds of auto-completes, this one is able to search the entire web archive, at an amazing speed and still be able to serve personalized, hyper-local results.  You can get more information about its backend here and here.

It wasn’t surprising that Google even put this feature inside chrome itself. Take a look at this demo from lifehacker. This is where it gets interesting…

 

At the beginning this looked very exciting. I was pleasantly surprised when chrome brought up websites, in addition to auto-completing URLs,  as I typed. The impact on the servers didn’t sink in until I was debugging a bug in my own application which required me to take a look at the apache logs. Look at the following log snippet from apache. Not surprisingly, I found 17 calls instead of just 1 made to my web application while I was typing the URL. All of this happened in 6 seconds, which is about the time it took me to type the URL.

[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?p HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?po HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?por HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port= HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1& HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&a HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&ap HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&app HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appn HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appna HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appnam HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname= HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0

There are two issues here which made me very concerned

  1. Volume of requests: This is a no brainer. The example I used above is not a normal use case since we don’t expect users to type URLs every time they use web-applications. But if the app has an easy to use API which can be used by users in this way, the impact of that small percentage of users who use will get magnified many folds very quickly. It may get very important to figure out how to queue requests, and also important to figure out how to distinguish between users who are spamming the website with 10 requests per second from the user who makes 1 request. All this problem could also go away if your app can actually handle 5 to 20 times more traffic already, which is probably the best solution.
  2. Robust APIs: This is a more tricky one which developers need to plan for. Lets say there was an API like this “/api/transfermoney.php?from=account1&to=account2&amount=10000”. How much money will this API transfer if you type this url in a browser which auto-executes partial URLs ?

What broke the camels back was the fact this particular feature was often flagged by Google’s own search engine as being spammy/automated.  It got so bad that I had to switch to firefox to do a simple google search.  image

And here is an example of how my Google history is now polluted with things I didn’t really search for. In this example I was looking for “ohdoctah” after I heard about it on twit. The key here is that while Google might have thought about how to mine this polluted search data, other web applications might find this impossible to deal with without significant addition in resources. 

image

For now I’ve disabled the feature in the browser. I hope that either there is an easy solution to this problem, otherwise I don’t see this feature making it into the production version of Chrome soon.

Operations Dashboards – KaChing

I’ve mentioned this before, but like to do it again because I think these guys are awesome. If you have not listened to devopscafe’s podcasts, this might be the right time to take a look at it. Here is a video of one of their sessions with folks at KaChing who have been doing amazing stuff around continuous deployments.

Continuous Deployment and Operations Dashboards at kaChing from dev2ops.org on Vimeo.

Scalability updates for Aug 27th 2010

My updates have been slow recently due to other things I’m involved in. If you need more updates around what I’m reading, please feel free to follow me on twitter or buzz.

Here are some of the big ones I have mentioned on my twitter/buzz feeds.

Distributed systems and Unique IDs: Snowflake

Most of us who deal with traditional databases take auto-increments for granted. While auto-increments are simple on consistent clusters, it can become a challenge in a cluster of independent nodes which don’t use the same source for the unique-ids. Even bigger challenge is to do it in such a way so that they are roughly in sequence.

While this may be an old problem, I realized the importance of such a sequence only after using Cassandra in my own environment. Twitter, which has been using Cassandra in many interesting ways has proposed a solution for it which they are releasing as open source today.

Here are some interesting sections from their post announcing “Snowflake”.

The Problem

We currently use MySQL to store most of our online data. In the beginning, the data was in one small database instance which in turn became one large database instance and eventually many large database clusters. For various reasons, the details of which merit a whole blog post, we’re working to replace many of these systems with the Cassandra distributed database or horizontally sharded MySQL (using gizzard).

Unlike MySQL, Cassandra has no built-in way of generating unique ids – nor should it, since at the scale where Cassandra becomes interesting, it would be difficult to provide a one-size-fits-all solution for ids. Same goes for sharded MySQL.

Our requirements for this system were pretty simple, yet demanding:

We needed something that could generate tens of thousands of ids per second in a highly available manner. This naturally led us to choose an uncoordinated approach.

These ids need to be roughly sortable, meaning that if tweets A and B are posted around the same time, they should have ids in close proximity to one another since this is how we and most Twitter clients sort tweets.[1]

Additionally, these numbers have to fit into 64 bits. We’ve been through the painful process of growing the number of bits used to store tweet ids before. It’s unsurprisingly hard to do when you have over 100,000 different codebases involved.

 

Solution

To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number.

Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).

We encourage you to peruse and play with the code: you’ll find it on github. Please remember, however, that it is currently alpha-quality software that we aren’t yet running in production and is very likely to change.

@twitter annotations : What I learnt at the hackfest….

A few of us joined in at the new Twitter office in downtown SF (right next to Moscone Center) and were for the first time shown what Twitter is doing about  “Twitter Annotations”. We probably created the first set of 3rd party applications around this new API. During the Hackathon I spent some time to wear my “Scalable web architecture” hat to think what I could learn from this experience which I’ve summarized below.Twitter

Twitter annotations from a developer’s view point is just an extension to existing APIs which now allows posting of additional structured content along with “tweet”. The content stays within the context of the tweet and will be retweeted/shared automatically with the main tweet. Twitter has some recommendations on how the annotations should be structured, for example they were talking about “type” which sounded very much like Open Graph’s “type/category” concept, with the difference that Twitter has left the field open for any kind of “type” users want. Facebook, if I remember right, had strongly recommended users to use a small set of “categories/types” which they published. Twitter accepted these annotations in multiple formats of which I tried the “simple” and “JSON” protocols. The “JSON” way was the most recommended/used medium of annotation during the whole hackathon. While annotation structure (using JSON) did allow multiple “types” in the same tweet, there were a few limitations which were slightly constricting. The first big one was that the current implementation allowed only 512 bytes in the annotations field. The second limitation was that the structure, while its JSON, it only supported a few levels of depth in the structured annotation. This was extremely restrictive for the use case I was trying to hack up.

There were a few things I learnt during the whole 32 hour experience. The first one was that twitter had actually hosted these half baked APIs on http://api.twitter.com and http://www.twitter.com, which I’m glad to say is still accessible using my account from outside twitter’s buildings. Of course the hackers(we) had to be white-listed to get access to use it, but from an operations view point this is extremely gutsy since one bad ACL code fragment could expose a lot of uncooked APIs to the whole world. This approach of testing is not new to Twitter and is frequently used for A/B testing in newer (more agile) organizations around the world.

The second was the fact that while Cassandra is in use at twitter, they don’t use it as the primary datastore for all the tweets (yet). They have other uses for it which they didn’t elaborate. The version of Cassandra they use is close to 0.6.2 which just got released. It also looked (from my discussions with one engineer) like cassandra treated rack-awareness and datacenter-awareness in a slightly different way. In the previous documentations I read, they both were the same for all practical purposes. In other words, I need to research this a little more since optimizations in this area can boost Cassandra’s performance across datacenters.

The third was that while Twitter uses cutting edge tools for a lot of different things, they don’t have service discovery nailed yet. They are playing with zookeeper, and I believe they will use that eventually, but its not there yet. This by itself is amazing because without service discovery, the management of configuration and rolling out configuration changes becomes centralized which has its own advantages and disadvantages. At the organization I work, we are playing with cassandra as a service publication/discovery tool for monitoring and consuming services. The short discussion I had with twitter folks about using cassandra in such a way validated the work work I’m doing with cassandra. But I’m still puzzled why others are not thinking about cassandra (or other eventually-consistent datastore) for service discovery. It sounded like Zookeeper might be an overkill for my organization, but I should take a look at it again.

The fourth was that Twitter employs a lot of very smart/passionate people who are amazingly good with most of the network/application stack. They dove into things like browser/javascript/cookies and then switched to dissecting network traffic using sniffer tools to debug a possible Oauth implementation bug and other weird things. This just adds to the current popular wisdom that scalability/stability/security can’t be done in small little silos.

The fifth and final I’d like to write about is the hackthon itself. Its amazing how Twitter organized this hackathon, got a group of hackers to play with their new APIs and gave them ability to demo their hacks to the likes of Paul Graham  and Ron Conway. In return they got very interesting product-ideas and use-cases for a feature which is still unpolished and unreleased. But more importantly they also got a bunch of hackers to intentionally and unintentionally break the feature and discover some serious and some very annoying bugs. They also got feedback on what does and doesn’t resonates with developers. In a way this is similar to what some other organizations (including Google) already do with their alpha/beta program, but nothing beats the velocity of hacking up 10 to 20 almost-ready products around a brand new feature in less than 32 hours.

References

P.S.: I’m terribly sorry for spamming my twitter followers who were bombarded with twitter test messages for two days. Next time I’ll pick a test account 🙂

Google Storage : What it really is…

Yesterday Google formally announced Google Storage to a few (5000?) of us at Google I/O. Here is the gist of this as I see it from the various discussions/talks I attended.

To begin with, I have to point out that there is almost nothing new in what Google has proposed to provide. Amazon has been doing this for years with its S3.  The key difference is that if you are a google customer you won’t have to look elsewhere for storage services like this one.

Lets get the technical details out

  • Its tries to implement a Strong consistency model (CA of the CAP: Consistent and Available). Which means data you store is automatically replicated in a consistent way across multiple datacenter
    • Currently it replicates to multiple locations within US. In future it does plan to replicate across continents.
    • Currently there are no controls to control how replication happens or to where. They plan to learn from usage in beta period and develop controls over time.
  • There are two basic building blocks for objects Google Code Labs
    • Buckets – Containers
        All objects are stored in flat container. However, the tools understand “/” and “*” (wild cards) and does the right thing when used correctly
    • Objects – objects/files inside those containers
  • Implements RESTful APIs (GET/PUT/POST/DELETE/HEAD/etc)
    • All resources are identified by a URI
  • No theoretical size limit of Buckets or containers. However a 100GB limit per account would be imposed during beta phase.
  • Its of course built on Google very well tested, scalable, highly available infrastructure
  • It provides multiple, flexible authentication and sharing models
    • Does support standard public/private key based auth
    • Will also have integration with some kind of groups which will allow object to be shared with  or controlled by with multiple identities.
    • ACLs can be applied to both Buckets and Objects
      • Buckets
        • Control who can list objects
        • Who can create/delete objects
        • Who can read/write into the bucket
      • Objects
        • Who can read
        • Who can read/write
  • Tools
    • There were two tools mentioned during the talk
      • GS manager looks like a web application which allows an admin to manage this service
      • GS util is more like the shell tools AWS provides for S3.
        • As I mentioned before GS util accepts wild card
          • So something like this is possible
            • gsutil cp gs://gs2010/*  /home/rkt/gs2010
  • The service was created with “data liberation” as one of the goals. As shown by the previous command it takes just one line of code to transfer all of your data out.
  • Resume feature (if connection breaks during a big upload) is not available yet, but thats on the roadmap.
  • Groups feature was discussed a lot, but its not ready in the current release
  • Versioning feature is not available. Wasn’t clear if its on the roadmap or how long before its implemented.

Few other notes.

  • Its not clear how this plays with the “storage service” Google currently provides for Gmail/Docs storage. From what I heard this is not related to that storage service at all and there are no plans to integrate it.
  • The service is free in beta period to all developers who get access to it, but when its released it will follow a pricing model similar others in the industry. The pricing model is already published on their website
  • The speakers and the product managers didn’t comment on whether storage access from google apps engine would be charged (or at what rate)
  • They do provide MD5 signatures as a way of verifying if an object on the client is same as the object on the server, but its not used storing files itself. (So MD5 collisions issue shouldn’t be a problem)
  • US Navy is already using this service with about 80TB of data on Google Storage, and from what I heard they looked pretty happy talking about it.

I suspect this product will be in beta for a while before they release it out in the open.

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (may be un-intentionally) at an event last year.

Instead of caching data closer to user, it looks like Google is trying to take “the data” to the user. If you use GMail or a Google Doc service, then with this framework, Google could, auto-magically, “move” one of the master copies of your data to the nearest Google data center without really having to cache anything locally. And because they are building one single datastore cluster around the world, instead of building hundreds of smaller ones for different applications, it looks like they may not don’t need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk by Jeff Dean at Symposium held at Cornell. Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • support mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.

image

 
References

Talk on “database scalability”

This is a very interesting talk by Jonathan Ellis on database scalability. He designed and implemented multi-petabyte storage for Mozy and is currently the project chair for Apache Cassandra.

  • Scalability is not improving latency, but increasing throughput
  • But overall performance shouldn’t degrade
  • Throw hardware, not people at the problem
  • Traditional databases use b-tree indexes. But requires the entire index to be in-memory at the same place.
  • Easy bandaid #1– SSD storage is better for b-tree indexes which need to hit disk
  • Easy bandaid #2 – Buy faster server every 2 years. As long as your userbase doesn’t grow faster that Moore’s law
  • Easy bandaid #3 – Use caching to handle hotspots (Distributed)
  • Memcache server failures can change where hashing keys are kept
  • Consistent hashing solves the problem by mapping keys to tokens. The tokens can move around to more or less server. Apps would be able to figure out which keys are where.

Scaling PHP : HipHop and Quercus

While PHP is very popular, it unfortunately doesn’t perform as some of its competitors. One of the ways to make things faster is to write PHP Extensions in C++. In this post we will describe two different ways developers can solve this problem and the milage you might get from either model may vary.

Since Facebook is mostly running PHP, it noticed this problem pretty early, but instead of asking its developers to move from PHP to C++ one of their developers hacked up a solution to transform PHP code into C++.

Yesterday, Facebook announced they are opening up HipHop, a source code transformer, which changes PHP code into a more optimized C++ code and uses g++ to compile it. With some minor sacrifices (no eval support)  they noticed they were able to get 50% performance improvement. And since they serve 400 billion page views every month, that kind of saving can free up a lot of servers.

More info on HipHop

Quercus on the other hand is a 100% java implementation of PHP 5. What makes this more interesting is that Quercus can now run in Google App Engine pretty much the same way JSPs can.

Caucho’s Quercus presents a new mixed Java/PHP approach to web applications and services where Java  Caucho Technologyand PHP tightly integrate with each other. PHP applications can choose to use Java libraries and technologies like JMS, EJB, SOA frameworks, Hibernate, and Spring. This revolutionary capability is made possible because 1) PHP code is interpreted/compiled into Java and 2) Quercus and its libraries are written entirely in Java. This architecture allows PHP applications and Java libraries to talk directly with one another at the program level. To facilitate this new Java/PHP architecture, Quercus provides and API and interface to expose Java libraries to PHP.

The demo of Quercus running on GAE was very impressive. Any pure PHP code which doesn’t need to interact with external services would work beautifully without any issues on GAE. But the absence of Mysql in GAE means SQL queries have to be mapped to datastore (bigtable) which might require a major rewrite to parts of the application. But its not impossible, as they have shown by making wordpress run on GAE (crawl might be a better word though).

While Quercus is opensource and is as fast as regular PHP code in interpreted mode, the compiler which is way faster is not free. Regardless Quercus is a step in the right direction, and I sincerely hope PHP support on GAE is here to stay.