Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in recent weeks. Part of that is due to an ongoing effort to expand their business, for which they seem to be hiring like crazy. Here is yet another interesting deck of slides, which covers topics across both Dev and Ops.

It’s one of the most interesting decks I’ve seen in the recent past.

Building your first cloud application on AWS

Building your first web application on AWS is like shopping for a car at Pep Boys, part by part. While the manuals to build one might be on aisle 5, the experience of having built one already is harder to buy.

Here are some logistical questions which I don’t think get enough attention when people discuss building a new AWS-based service.

  1. Picking the right Linux distribution: Switching distributions later may not be simple if your applications depend on custom scripts. Picking a single distribution and sticking with it will save a lot of lost time.
  2. Automated server builds: There are many ways to skin this cat. Chef, Puppet and Cfengine are all good… What’s important is to pick one early in the game.
  3. Multi-Availability-Zone support: Find out early whether multi-availability-zone support is important. This can impact the overall architecture of the solution and the tools used to implement it.
  4. Data consistency requirements: Similar to the multi-AZ question, it’s important to understand the data consistency tolerance of the application before one starts designing it.
  5. Datastore: There are different kinds of datastores available as part of AWS itself (SimpleDB, S3 and RDS). If you are planning to keep your options open about moving out of AWS at some point, pick a datastore which you can take with you with little effort. There are many NoSQL and RDBMS solutions to choose from.
  6. Backups: While some think it’s a waste of time to think about backups too early, I suspect those who don’t will spend far too much time on them later. A long-term backup strategy is an integral part of disaster recovery planning, without which you shouldn’t go live.
  7. Integration with external data sources: If this application is part of a larger cluster of applications running somewhere else, think about how data will be sent back and forth. There are many options, depending on how much data is involved (and how important protecting that data is).
  8. Monitoring/Alerting: Most standard out-of-the-box monitoring tools can’t handle dynamic infrastructure very well. There are, however, plugins available for many existing monitoring solutions which can handle the dynamic nature of the infrastructure. You could also use one of the third-party monitoring services if you’d rather pay someone else to do it.
  9. Security: You should be shocked to see this at #9 on my list. If your service involves user data or some other kind of intellectual property, build a multi-tiered architecture to shield different parts of your application from targeted attacks. Security is also very important when picking the caching and web server technologies.
  10. Development: Figure out how developers will use AWS. Will they share the same AWS account, share parts of the infrastructure, share the datastore, etc.? How will developer resources be monitored so that unintentional use of excessive resources can be flagged? (See the sketch below.)
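As a minimal sketch of that last point, the snippet below groups running EC2 instances by an owner tag so that unusually large footprints can be flagged. It assumes the boto3 SDK with configured credentials; the tag key and threshold are made up for illustration, not a prescribed setup.

```python
# Sketch: flag developers running an unusually large number of EC2 instances.
# Assumes boto3 is installed, AWS credentials are configured, and instances
# carry an "owner" tag; the tag key and threshold below are illustrative.
from collections import Counter
import boto3

ec2 = boto3.client("ec2")
owners = Counter()

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            owners[tags.get("owner", "untagged")] += 1

THRESHOLD = 10  # hypothetical per-developer limit
for owner, count in owners.most_common():
    flag = "  <-- flag for review" if count > THRESHOLD else ""
    print(f"{owner}: {count} running instances{flag}")
```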

Are there other subtle issues which I should have listed here? Let me know.

 

Why Membase Uses Erlang

It’s totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say “Erlang has X”) does out of the box a lot of things we would have had to either build from scratch or attempt to piece together existing libraries to do. Its dynamic type system and pattern matching (à la Haskell and ML) make Erlang code tend to be even more concise than Python and Ruby, two languages known for their ability to do a lot in few lines of code.

The single largest advantage to us of using Erlang has got to be its built-in support for concurrency. Erlang models concurrent tasks as “processes” that can communicate with one another only via message-passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. This alone makes an entire class of concurrency-related bugs completely impossible. While it doesn’t completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. While it’s certainly possible to implement the actor model in most if not all of the alternative environments I mentioned, and in fact such implementations exist, they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.

The complete original post can be found here: blog.membase.com
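The quoted post describes Erlang’s actor model; as a rough illustration of the same message-passing idea in Python (threads and queues standing in for Erlang processes, and in no way how Membase itself is built), here is a tiny sketch where a worker owns its state and communicates only through queues rather than shared memory.

```python
# A rough Python analogue of the message-passing ("actor") style described
# above: the worker owns its state and communicates only via queues.
# This is an illustration, not how Membase/Erlang actually implement it.
import threading
import queue

def counter_actor(inbox, outbox):
    count = 0  # private state; no other thread touches it
    while True:
        message = inbox.get()
        if message == "stop":
            outbox.put(("final", count))
            return
        if message == "increment":
            count += 1

inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=counter_actor, args=(inbox, outbox))
worker.start()

for _ in range(5):
    inbox.put("increment")
inbox.put("stop")

print(outbox.get())  # ("final", 5)
worker.join()
```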

Continuous deployments may not be for everyone: Culture

If you have read this blog before, you know how much I admire those who use continuous deployment in production. Doing it at scale is even more impressive. But the message which sometimes gets lost is that continuous deployment may not be for everyone.

Most continuous deployment environments do all of their deployments from trunk, which means every check-in has to be production quality. Digg’s Andrew Bayer gives a good explanation of how they do code reviews and pre-check-in testing before code is merged into trunk.

Site uptime and reliability depend on a comprehensive QA process to protect against unintentional mistakes. And for rapid deployments, one has to abandon manual QA in favor of 100% automated testing, with the goal of getting close to 100% code coverage. That’s hard if the code is not written in a way which can be tested easily.


But unit and integration tests alone cannot guarantee quality. In addition to testing the code which has been implemented in the application, there need to be tests which look for things that shouldn’t be implemented. For example, it would be nice to have tests that look for non-parameterized SQL calls in parts of the code where they shouldn’t exist. If you know there is a wrong way to do something, write a test case for it so that it’s caught as soon as someone does it.
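As a hedged sketch of that idea, a test like the one below can fail the build when it spots SQL built via string formatting rather than parameterized queries. The regex and the "app" source root are assumptions for illustration; this is a lint-style heuristic, not a complete SQL-injection checker.

```python
# Sketch of a "things that should never be in the code" test: fail the build if
# SQL appears to be built with string formatting or concatenation instead of
# parameterized queries. The regex and the "app" source root are illustrative.
import pathlib
import re
import unittest

# Matches execute(...) called with an f-string, or with a string literal that is
# immediately %-formatted or concatenated: all signs of unparameterized SQL.
SUSPICIOUS_SQL = re.compile(
    r"""execute\w*\s*\(\s*(?:f["']|["'][^"']*["']\s*[%+])""",
    re.IGNORECASE,
)

class ForbiddenPatternTest(unittest.TestCase):
    def test_no_unparameterized_sql(self):
        offenders = []
        for path in pathlib.Path("app").rglob("*.py"):  # assumed source root
            for lineno, line in enumerate(path.read_text().splitlines(), 1):
                if SUSPICIOUS_SQL.search(line):
                    offenders.append(f"{path}:{lineno}: {line.strip()}")
        self.assertEqual(offenders, [],
                         "Possible unparameterized SQL:\n" + "\n".join(offenders))

if __name__ == "__main__":
    unittest.main()
```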

Some of this would be easy to do if you already follow a test-driven development process, where you have to write tests before you write code.

The biggest difference between an organization which follows continuous deployment and one which doesn’t is in how QA is done. QA becomes a shared responsibility to which everyone has to contribute. No matter how many tools or guidelines one publishes, if the teams using this process don’t believe in it, the quality and availability of the website will suffer. Pascal-Louis Perez (from KaChing) used a diagram like the one here to explain how this “culture” is at the heart of continuous deployment.

“Culture” also explains why most of the older organizations who follow a more traditional form of deployment are having a hard time understanding and adapting to this process.

Are you using continuous deployment in your environment? What was your biggest hurdle?

Thoughts on scalable web operations

Interesting observations and thoughts on web operations, collected from a few sessions at the Velocity 2010 conference [most are from a talk by Theo Schlossnagle, author of “Scalable Internet Architectures”].

  • Optimization
    • Don’t over-optimize. It could take precious resources away from critical functions.
    • Don’t scale early. Planning for more than 10 times the load you currently have or plan to support is counter-productive in most cases. An RDBMS is fine until you really need something which can’t fit on 2 or 3 servers.
    • Optimize performance on a single node before you re-architect the solution for horizontal scalability.
  • Tools
    • Tools are what a master craftsman makes… tools don’t make a craftsman a master.
    • Tools never solve a problem by themselves; their correct use does.
    • Master the tools which may need to be used in production at short notice. Looking up man pages for these tools during an outage isn’t ideal.
  • Cookies
    • Use cookies to store data wherever possible.
    • Sign them if you are concerned about tampering (see the signing sketch after this list).
    • Encrypt them if you are concerned about users having visibility into them.
    • It’s cheaper to use the user’s browser as a datastore replication node than to build redundant servers.
  • Datastores
    • NoSQL is not the solution for everything [ example: so long MongoDB ]
    • Ditto RDBMS
    • Ditto everything else
    • Get the requirements, understand the problem, and then pick the solution, instead of the other way around.
  • Automation
    • When you find yourself doing something more than 2 times, write scripts to automate it
    • When users report failures before monitoring systems do, write better monitoring tools.
  • Revision control
    • Put as much as possible under revision control.
    • It provides an audit trail to help understand what happened before; one can’t remember everything. It’s an excellent place to search during hard-to-solve production problems.
  • Networking
    • Think in packets and not bytes to save load time.
    • There is little point in compressing a CSS file which is 400 bytes: it fits in a single IP packet either way (a typical packet carries roughly 1,400 bytes of payload), so compression doesn’t reduce the number of packets sent.
    • In fact, compression and decompression will take away precious CPU resources on the server and the client.
    • Instead, think of embedding short CSS files in the HTML to save a few extra packets.
  • Caching
    • Static objects
      • Cache all static objects forever.
      • Add a version number or random string to the object’s URL to force a reload when it changes.
        • For example, instead of requesting “/images/myphoto.jpg” request “/images/myphoto.123245.jpg”.
        • Strip the version ID on the server using something like an .htaccess rewrite rule.
      • Use CDNs wherever possible, but make sure you understand all the objects that are part of your page before you shove the problem off to a CDN. Pointless redirects can steal away precious loading time.
  • People
    • When you hire someone for the operations team, never hire someone who can’t remember a single production issue he/she caused. People learn the most from mistakes, so recognize people who have been in the hot seat and fixed their own mistakes.
    • Allow people to take risks in production and watch how they recover. Taking risks is part of adapting to new ideas, and letting people fail helps them understand how to improve.
  • Systems
    • Know your system’s baseline. A snapshot view of a system’s current statistics is never sufficient to fully classify its state (for example, is a load average of 10 abnormal on server XYZ?).
    • Use tools which periodically poll and archive data to give you this information.
  • Moderation
    • Moderate the tools and processes you use.
    • Moderate the moderation.
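Regarding the cookie-signing point above, here is a minimal sketch of what signing and verifying a cookie value with an HMAC can look like in Python. The secret and the value|signature format are illustrative assumptions, not any specific framework’s implementation; real frameworks ship their own signed-cookie helpers.

```python
# Minimal sketch of signing a cookie value with an HMAC so tampering can be
# detected on the way back in. The secret and format are illustrative only.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical secret

def sign(value):
    mac = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}|{mac}"

def verify(signed_value):
    value, _, mac = signed_value.rpartition("|")
    expected = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(mac, expected) else None

cookie = sign("user_id=42")
print(cookie)                # "user_id=42|<hex digest>"
print(verify(cookie))        # "user_id=42"
print(verify(cookie + "x"))  # None: signature no longer matches
```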

    What did I miss? 🙂 Let me know and I’ll add it here…

    You don’t have to be Google to use NoSQL

    Ted Dziuba has a post titled “I can’t wait for NoSQL to Die”. The basic argument he makes is that one has to be at Google’s size to really benefit from NoSQL. I think he is missing the point.

    Here are my observations.

    • This is similar to the argument the traditional DB vendors were making when companies started switching away from the likes of Oracle/DB2 to MySQL. The difference between then and now is that back then it was the large established database vendors against the smaller (open-source) ones, and now it’s RDBMS vs. non-RDBMS datastores.
    • Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is that NoSQL datastores have no pre-defined schema. That doesn’t mean developers don’t have to think about the data structure before using a NoSQL solution, but it does give them the opportunity to add new columns, which weren’t thought of at design time, with little or no impact on the applications using the data. You can add and remove columns on the fly on most RDBMS as well, but those changes are usually considered significant. Also keep in mind that while NoSQL datastores can add columns at the row level, RDBMS solutions can only do it at the table level (see the sketch after this list).
    • Scalability: There are basically two ways to scale any web application.
      • The first way is to build the app and leave the scalability issues for later (let the DBAs figure it out). This is an expensive, iterative process which takes time to perfect. The issues around scalability and availability can be so complex that one may not be able to predict all of them until the system is used in production.
      • The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays off over time.
      • NoSQL is the third way of doing it.
        • It restricts programmers by allowing only those operations and data-structures which can scale
        • And programmers who figure out how to use it have found that these kinds of restrictions guarantee significantly higher horizontal scalability than a traditional RDBMS.
        • By forcing the database to be architected for scale before the product launches, it also reduces outages and post-deployment migrations.
    • High Availability: NoSQL is not just about scalability. It’s also about “high availability” at a lower cost.
      • While Ted did mention that some operations in Cassandra require a restart, he forgot to mention that it doesn’t require all the nodes to be restarted at the same time. The Cassandra datastore continues to be available even when many of its nodes are down. This is a common theme across most NoSQL-based datastores. [CASSANDRA-44]
      • High availability over long distances with flaky network connections is not trivial to implement using traditional RDBMS-based databases.
    • You don’t have to be Google to see benefits of using NoSQL.
      • If you are using S3 or SimpleDB on AWS, or the datastore on Google’s App Engine, then you are already using NoSQL. Many of the smaller startups are finding AWS/GAE to be cheaper than hosting their own servers.
        • One can still choose to use an RDBMS solution like RDS, but then they don’t get the high availability and scalability which S3/SimpleDB offer out of the box.
      • While scalability to terabytes may not be a requirement for many smaller organizations, high availability is absolutely essential for most organizations today. RDBMS-based solutions can do that, but setting up multi-master replication across two datacenters is non-trivial.
    • Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will succeed in cutting over from the RDBMS to the non-RDBMS world in one weekend. Reports of websites switching over to NoSQL overnight are sometimes grossly exaggerated. Most of these companies have been working on the switch for months, if not years, and they would have done extensive scalability, performance, availability and disaster-recovery tests before putting it in production.
    • RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon, especially in organizations which are already using it. In fact, most NoSQL datastores still haven’t figured out how to provide the level of security a traditional RDBMS provides. I think that’s the core reason why Google still uses an RDBMS for some of its operational needs.
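As a toy sketch of the schema point in the “Why NoSQL” bullet above (sqlite3 and an in-memory dict stand in for the two worlds; the table and field names are made up), a document-style store lets a single row grow a new field, while an RDBMS adds the column to the whole table:

```python
# Sketch of the schema point above: in a document/NoSQL-style store one row can
# grow a new field on its own, while an RDBMS adds columns for the whole table.
# sqlite3 stands in for the RDBMS; the dict stands in for a schemaless store.
import sqlite3

# "NoSQL-style": each value is just a document; only one user gains a new field.
users_nosql = {
    "u1": {"name": "alice"},
    "u2": {"name": "bob"},
}
users_nosql["u2"]["twitter_handle"] = "@bob"   # only this row changes

# RDBMS: the new column has to be added to the table as a whole.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id TEXT, name TEXT)")
db.execute("INSERT INTO users VALUES ('u1', 'alice'), ('u2', 'bob')")
db.execute("ALTER TABLE users ADD COLUMN twitter_handle TEXT")  # affects every row
db.execute("UPDATE users SET twitter_handle = '@bob' WHERE id = 'u2'")

print(users_nosql)
print(db.execute("SELECT * FROM users").fetchall())
```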

    Finally, it’s my personal opinion that “cloud computing” and the commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for developers to build datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, the “NoSQL” movement is definitely here to stay, and I won’t be surprised if it evolves further along the way.

     

    References

    1. Haters Gonna Hate
    2. Reddit: learning from mistakes
    3. Digg: Saying yes to NoSQL; Going steady with Cassandra
    4. Twitter @ 2009/07 : Up and running with cassandra
    5. Twitter @ 2010/03 : Ryan King about Twitter and Cassandra
    6. NoSQL vs RDBMS: Let the flames begin !
    7. Brewer’s CAP theorem on Distributed systems
    8. Database scalability
    9. What is scalability ?
    10. Thoughts on NoSQL

    Disaster Recovery: Impressive RPO and RTO objectives set by Google Apps Operations

    Unless you are running a fly-by-night shop, DR (disaster recovery) should be one of the top concerns for your operations team. In a “scalable architecture” world, the complexity of DR can become a disaster in itself.

    Yesterday Google announced that it finally has a DR plan for Google Apps. While this is nice, one should always take such messages with a pinch of salt until they prove they can do it. Look at Google App Engine, which also had a DR plan, yet still suffered a more-than-two-hour outage because of incomplete documentation, insufficient training and probably the lack of someone to make a quick, decisive call at the time of failure.

    But back to Google Apps for now. These guys are planning for an RPO of 0 seconds, which means multiple datacenters have to be in a consistent state at all times. And they want the RTO to be instant failover as well! This is an incredibly ambitious DR plan, and it requires technical expertise across all 7 layers of the OSI model to achieve.

    In larger businesses, companies will add a storage area network (SAN), which is a consolidated place for all storage. SANs are expensive, and even then, you’re out of luck if your data center goes down. So the largest enterprises will build an entirely new data center somewhere else, with another set of identical mail servers, another SAN and more people to staff them.

    But if, heaven forbid, disaster strikes both your data centers, you’re toast (check out this customer’s experience with a fire). So big companies will often build the second data center far away, in a different ‘threat zone’, which creates even more management headaches. Next they need to ensure the primary SAN talks to the backup SAN, so they have to implement robust bandwidth to handle terabytes of data flying back and forth without crippling their network. There are other backup options as well, but the story’s the same: as redundancy increases, cost and complexity multiplies.

    How do you know if your disaster recovery solution is as strong as you need it to be? It’s usually measured in two ways: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). RPO is how much data you’re willing to lose when things go wrong, and RTO is how long you’re willing to go without service after a disaster.

    For a large enterprise running SANs, the RTO and RPO targets are an hour or less: the more you pay, the lower the numbers. That can mean a large company spending the big bucks is willing to lose all the email sent to them for up to an hour after the system goes down, and go without access to email for an hour as well. Enterprises without SANs may be literally trucking tapes back and forth between data centers, so as you can imagine their RPOs and RTOs can stretch into days. As for small businesses, often they just have to start over.

    For Google Apps customers, our RPO design target is zero, and our RTO design target is instant failover. We do this through live or synchronous replication: every action you take in Gmail is simultaneously replicated in two data centers at once, so that if one data center fails, we nearly instantly transfer your data over to the other one that’s also been reflecting your actions.

    This is one of the most ambitious DR plans I’ve ever read about involving such a huge customer base. They not only have to replicate all the user data to multiple data centers, they have to do it synchronously (or almost synchronously), across huge distances (latency can slow down synchronous operations), without impacting users. And to top it all off, they have to do a complete site failover if the primary datacenter goes down.

    I am impressed, but I wouldn’t mind learning more about how they do it.

    The Reddit problem: Learning from mistakes

    Reddit has a very interesting post about what not to do when trying to build a scalable system. While the error is tragic, I think it’s an excellent design mistake to learn from.

    Though the post lacked a detailed technical report, we might be able to reconstruct what happened. They mentioned they are using the MemcacheDB datastore, with 2GB of RAM per node, to keep some data which they need very often.

    Don’t confuse MemcacheDB with memcached. While memcached is a distributed cache engine, MemcacheDB is actually a persistent datastore. And because they both use the same protocol, applications often use memcached libraries to connect to a MemcacheDB datastore.

    Both memcached and MemcacheDB rely on the clients to figure out how keys are distributed across multiple nodes. Reddit chose MD5 to hash the keys for their key/value pairs. The algorithm Reddit used to decide which node in the cluster a key should go to was likely dependent on the number of nodes in the system. One popular way to pick the node is the “modulo” function: key “k” is stored on node “n”, where “n = k modulo 3” (if k=101, then n=2).


    Though MemcacheDB uses BDB to persist data, it seems like they relied heavily on keeping all the data in RAM. At some point they probably hit the upper limit of what could be cached in RAM, which caused disk I/O and resulted in slower response times. In a scalable architecture, one should be able to add new nodes and have the system scale.

    Unfortunately, though this algorithm works beautifully during regular operation, it fails as soon as you add or remove a node (i.e., when the number of nodes changes). At that point you can’t guarantee that the data you previously stored on a given node will still map to the same node.

    And while this algorithm may still be OK for “memcached” cache clusters (a miss just repopulates the cache), it’s really bad for MemcacheDB, where the data is persistent and something like “consistent hashing” is needed so that keys don’t move around when nodes are added.
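As a rough sketch of why the naive scheme hurts (the key count, hash choice and virtual-node count are illustrative, not Reddit’s actual implementation), the snippet below counts how many keys change nodes when a cluster grows from 3 to 4 nodes under modulo placement versus a simple consistent-hash ring:

```python
# Sketch: fraction of keys that move when growing a cluster from 3 to 4 nodes,
# comparing naive "hash mod N" placement with a simple consistent-hash ring.
# Numbers and hash choice are illustrative, not Reddit's actual setup.
import bisect
import hashlib

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def modulo_node(key, num_nodes):
    return h(key) % num_nodes

def ring(num_nodes, points_per_node=100):
    # many virtual points per node so the ring is reasonably balanced
    return sorted(
        (h(f"node-{i}-vnode-{j}"), i)
        for i in range(num_nodes)
        for j in range(points_per_node)
    )

def ring_node(key, points):
    hashes = [p[0] for p in points]
    idx = bisect.bisect(hashes, h(key)) % len(points)
    return points[idx][1]

keys = [f"key-{i}" for i in range(10000)]

moved_modulo = sum(modulo_node(k, 3) != modulo_node(k, 4) for k in keys)
ring3, ring4 = ring(3), ring(4)
moved_ring = sum(ring_node(k, ring3) != ring_node(k, ring4) for k in keys)

print(f"modulo placement:   {moved_modulo / len(keys):.0%} of keys moved")
print(f"consistent hashing: {moved_ring / len(keys):.0%} of keys moved")
```

With modulo placement roughly three quarters of the keys land on a different node after the change, while the consistent-hash ring only moves the share of keys claimed by the new node (around a quarter here).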


    Reddit announced today that they have increased RAM on these MemcacheDB servers from 2GB to 6GB, which allows 94% of their DB to be kept in memory. They have realized their mistake (they probably figured this out a long time back) and are thinking about how to fix it. The simplest solution, adding a few nodes, requires re-hashing their keys, which would take days according to their estimate. And of course, just adding nodes without some kind of “consistent hashing” is still not a scalable solution.

    I personally learned two things:

    • Don’t mix up MemcacheDB and memcached. They are not designed to solve the same problem.
    • Don’t simply replace memcached with MemcacheDB without thinking twice.

    There are many different products out there today which do a better job at scaling, so I won’t be surprised if they abandon MemcacheDB completely as well.

    Brewer’s CAP Theorem on distributed systems

    Large distributed systems run into a problem which smaller systems don’t usually have to worry about. “Brewer’s CAP Theorem” [Ref 1] [Ref 2] [Ref 3] defines this problem in a very simple way.

    It states that though it’s desirable to have Consistency, High Availability and Partition tolerance in every system, unfortunately no system can achieve all three at the same time.

    Consistent: A fully consistent system is one which can guarantee that once you store a state (let’s say “x=y”), it will report the same state in every subsequent operation until the state is explicitly changed by something outside the system. [Example 1] A single MySQL database instance is automatically fully consistent, since there is only one node keeping the state. [Example 2] If two MySQL servers are involved, and the system is designed so that all keys starting “a” to “m” are kept on server 1 and keys “n” to “z” are kept on server 2, the system can still easily guarantee consistency. [Example 3] Now set the two DBs up as master-master replicas. If one of the databases accepts a “row insert” request, that information has to be committed to the second system before the operation is considered complete. To get 100% consistency in such a replicated environment, communication between the nodes is paramount, and the overall performance of the system can drop as the number of replicas required goes up.
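As a toy sketch of Examples 2 and 3 (in Python, with in-memory dicts standing in for the MySQL servers; nothing here is a real replication protocol), routing by key range keeps each key on exactly one node, while the replicated variant only acknowledges a write once every replica has committed it, which is exactly where the performance cost of consistency comes from:

```python
# Toy sketch of Example 2 (key-range partitioning) and Example 3 (synchronous
# replication). Dicts stand in for MySQL servers; this illustrates the
# consistency trade-off, it is not a real replication protocol.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def commit(self, key, value):
        self.data[key] = value  # pretend this is a durable write

# Example 2: keys "a"-"m" live on server1, "n"-"z" on server2; each key has
# exactly one home, so reads are trivially consistent.
server1, server2 = Node("server1"), Node("server2")

def range_partition_write(key, value):
    target = server1 if key[0].lower() <= "m" else server2
    target.commit(key, value)

# Example 3: master-master replicas; a write is only acknowledged after every
# replica has committed it, which keeps them consistent and makes each write
# slower as the number of replicas grows.
replicas = [Node("dc1"), Node("dc2")]

def replicated_write(key, value):
    for replica in replicas:
        replica.commit(key, value)   # must succeed on ALL replicas
    return "acknowledged"

range_partition_write("alice", {"balance": 100})
replicated_write("bob", {"balance": 250})
print(server1.data, server2.data)
print([r.data for r in replicas])
```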

    Available: The databases in [Example 1] and [Example 2] are not highly available. In [Example 1], if the node goes down there is 100% data loss. In [Example 2], if one node goes down you lose 50% of the data. [Example 3] is the only setup which solves that problem: a simple MySQL replication setup [multi-master mode] could provide 100% availability. Increasing the number of nodes holding copies of the data directly increases the availability of the system. Availability is not just protection from hardware failure; replicas also help in load-balancing concurrent operations, especially read operations. “Slave” MySQL instances are a perfect example of such “replicas”.

    Partition tolerance: So you got consistency and availability by replicating data. Let’s say you have the two MySQL servers from Example 3 in two different datacenters, and you lose network connectivity between the datacenters, making the two databases incapable of synchronizing state. Would the two DBs be fully functional in such a scenario? If you somehow do manage to allow read/write operations on both databases, it can be shown that the two servers won’t be consistent anymore. A banking application which keeps the “state of your account” at all times is a perfect example of where it’s bad to have inconsistent records. If I withdraw 1000 bucks in California, it should be instantly reflected in the NY branch so that the system accurately knows how much more I can withdraw at any given time. If the system fails to do this, it could cause problems which would make some customers very unhappy. If the bank decides consistency is very important and disables write operations during the outage, it loses “availability” of the cluster, since all the bank accounts at both branches are frozen until the network comes back up.

    This gets more interesting when you realize that the C.A.P. properties don’t have to be applied in an “all or nothing” fashion. Different systems can choose different levels of consistency, availability or partition tolerance to meet their business objectives. Increasing the number of replicas, for example, increases availability, but it could at the same time reduce partition tolerance or consistency.

    When you are discussing a distributed system (or distributed storage), you should try to identify which of the three properties it is trying to achieve. BigTable, used by Google App Engine, and HBase, which runs on top of Hadoop, claim to be always consistent and highly available. Amazon’s Dynamo, which is used by the S3 service, and datastores like Cassandra instead sacrifice consistency in favor of availability and partition tolerance.

    The CAP theorem doesn’t just apply to databases. Even simple web applications which store state in session objects have this problem. There are many “solutions” out there which allow you to replicate session objects and make session data “highly available”, but they all suffer from the same basic problem, and it’s important for you to understand it.

    Recognizing which of the “C.A.P.” rules your business really needs should be the first step in building any successful distributed, scalable, highly-available system.

    Cloud: Agility vs Security

    Networking devices on the edges have become smarter over time. So have the firewalls and switches used internally within the networks. Whether we like it or not, web applications over time have grown to depend on them.

    It’s impossible to build a flawless product, which is why it’s standard practice to disable all unused services on a server. Most organizations today try to follow the n-tier approach to create different logical security zones, with the core asset inside the most secure zone. The objective is to make it difficult for an attacker to get to the core asset without breaching multiple sets of firewalls.

    Frequent system patching, auditing file system permissions and setting up intrusion detection (host- or network-based) are some of the other mundane ways of keeping web applications safe from attack.

    Though the cloud has made deployment of on-demand infrastructure simpler, it’s hard to build a walled garden around a customer’s cluster of servers in the cloud in an efficient way anymore. The absence of such walled gardens and logical security zones means there are more points of entry into the infrastructure which could be exploited. If you replace 10 powerful internal servers with 100 small servers in the cloud, all of a sudden you might have to worry about protecting 100 individual servers instead of a couple of edge devices. In the worst case, one weak server in the cluster could expose the entire cluster to an attacker. Here are a few things to think about…

    • Host-based firewalls should allow only traffic which is required/expected (see the sketch after this list)
    • Non-essential services should be shut off on the server
    • Some kind of Intrusion detection might be important to have
    • Keys/passwords should be changed periodically
    • System patches (update OS image) need to be applied periodically
    • Authenticate/Authorize all inter-server communication
    • Maintain audit trail for all changes to images/servers if possible
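As a hedged sketch of the first point, using boto3 (the current AWS SDK for Python), a security group can be restricted so that only the traffic you actually expect is allowed in. The group name, ports and CIDR ranges below are illustrative assumptions, not a recommended policy.

```python
# Sketch: restrict an EC2 security group so only expected traffic is allowed.
# Uses boto3; the group name, ports and CIDR ranges below are illustrative.
import boto3

ec2 = boto3.client("ec2")

# Hypothetical group protecting a web tier: SSH from the office, HTTPS from anywhere.
# (In a non-default VPC you would also pass VpcId=... to create_security_group.)
response = ec2.create_security_group(
    GroupName="web-tier-minimal",
    Description="Only the traffic this tier actually needs",
)
group_id = response["GroupId"]

ec2.authorize_security_group_ingress(
    GroupId=group_id,
    IpPermissions=[
        {   # SSH only from a (made-up) office network
            "IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.0/24"}],
        },
        {   # HTTPS from everywhere
            "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        },
    ],
)
# Everything else is denied by default; no separate "deny" rules are needed.
```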

    An organization which is completely on the cloud may not have an IT department in its current form, but it might still have an operations team which sets the security policies, updates OS images, manages billing, monitors system health (and IDS) and trains developers to do things the right way.

    If your infrastructure is on the cloud, do write back with a note about what you do to protect your applications.

    Image source: AMagill