Why Membase Uses Erlang

It’s totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say “Erlang has X”) does a lot of things out of the box that we would otherwise have had to build from scratch or piece together from existing libraries. Its dynamic type system and pattern matching (à la Haskell and ML) make Erlang code tend to be even more concise than Python and Ruby, two languages known for their ability to do a lot in few lines of code.

The single largest advantage to us of using Erlang has got to be its built-in support for concurrency. Erlang models concurrent tasks as “processes” that can communicate with one another only via message passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. This alone makes an entire class of concurrency-related bugs completely impossible. While it doesn’t completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. While it’s certainly possible to implement the actor model in most if not all of the alternative environments I mentioned, and in fact such implementations exist, they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.
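
To make the actor idea concrete, here is a minimal sketch of actor-style message passing in plain Python rather than Erlang (the Actor class and message shapes are mine, purely for illustration; Erlang gives you this model natively, with far cheaper processes than OS threads):

  import queue
  import threading

  class Actor:
      """A toy actor: owns a private mailbox, shares no state, reacts to messages."""
      def __init__(self, handler):
          self.mailbox = queue.Queue()
          self.handler = handler
          threading.Thread(target=self._loop, daemon=True).start()

      def send(self, msg):
          self.mailbox.put(msg)                 # the ONLY way to talk to an actor

      def _loop(self):
          while True:
              self.handler(self.mailbox.get())  # one message at a time

  # A counter actor: its state is touched only by the actor's own thread.
  count = {"n": 0}
  def handle(msg):
      kind, reply_to = msg
      if kind == "incr":
          count["n"] += 1
      elif kind == "get":
          reply_to.put(count["n"])              # reply through the sender's queue

  counter = Actor(handle)
  reply = queue.Queue()
  for _ in range(3):
      counter.send(("incr", None))
  counter.send(("get", reply))
  print(reply.get())                            # -> 3, with no locks anywhere

Because each actor processes one message at a time and shares nothing, the usual lock-ordering mistakes have nowhere to occur, which is exactly the class of bugs the post says disappears.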

The complete original post can be found here: blog.membase.com

You don’t have to be Google to use NoSQL

Ted Dziuba has a post titled “I Can’t Wait for NoSQL to Die”. The basic argument he makes is that one has to be at Google’s scale to really benefit from NoSQL. I think he is missing the point.

Here are my observations.

  • This is similar to the argument the traditional DB vendors were making when companies started switching away from the likes of Oracle/DB2 to MySQL. The difference between then and now is that it used to be large established database vendors against the smaller (open-source) ones, and now it’s RDBMS vs non-RDBMS datastores.
  • Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is the fact that NoSQL datastores have no pre-defined schemas. That doesn’t mean that developers don’t have to think about the data structure before using a NoSQL solution, but it does give them the opportunity to add new columns which were not thought of at design time, with little or no impact on the applications using it. You can add and remove columns on the fly on most RDBMS as well, but those changes are usually considered significant. Also keep in mind that while NoSQL datastores can add columns at the row level, RDBMS solutions can only do it at the table level (see the sketch after this list).
  • Scalability: There are basically two traditional ways to scale any web application.
    • The first way is to build the app and leave the scalability issues for later (let the DBAs figure them out). This is an expensive, iterative process which takes time to perfect. The issues around scalability and availability can be so complex that one may not be able to predict all of them until the system is used in production.
    • The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays off over time.
    • NoSQL is a third way of doing it.
      • It restricts programmers to only those operations and data structures which can scale.
      • And programmers who have figured out how to work within those restrictions have found that they allow significantly higher horizontal scalability than a traditional RDBMS.
      • By settling the database architecture before the product is launched, it also reduces outages and post-deployment migrations.
  • High Availability: NoSQL is not just about scalability. It’s also about “high availability” at a lower cost.
    • While Ted did mention that some operations in Cassandra require a restart, he forgot to mention that it doesn’t require all the nodes to be restarted at the same time. The Cassandra datastore continues to be available even when many of its nodes are down. This is a common theme across most of the NoSQL datastores. [CASSANDRA-44]
    • High availability over long distances with flaky network connections is not trivial to implement using traditional RDBMS-based databases.
  • You don’t have to be Google to see the benefits of using NoSQL.
    • If you are using S3 or SimpleDB on AWS, or using the datastore on Google’s App Engine, then you are already using NoSQL. Many of the smaller startups are actually finding AWS/GAE to be cheaper than hosting their own servers.
      • One can still choose an RDBMS solution like RDS, but then they don’t get the benefit of the high availability and scalability which S3/SimpleDB offer out of the box.
    • While scalability to terabytes may not be a requirement for many of the smaller organizations, high availability is absolutely essential for most organizations today. RDBMS-based solutions can do that, but setting up multi-master replication across two datacenters is non-trivial.
  • Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will have success in cutting over from the RDBMS to the non-RDBMS world in one weekend. Reports of websites switching over to NoSQL overnight are sometimes grossly exaggerated. Most of these companies have been working on the migration for months if not years, and they run extensive scalability, performance, availability and disaster-recovery tests before putting it in production.
  • RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon, especially in organizations which are already using it. In fact, most NoSQL datastores still haven’t figured out how to implement the level of security a traditional RDBMS provides. I think that’s the core reason why Google is still using an RDBMS for some of its operational needs.
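
As a concrete illustration of the row-level schema point in the “Why NoSQL” bullet above, here is a minimal sketch in Python (the store and field names are invented for illustration): each “row” is just a bag of attributes, so a new column on one item touches nothing else.

  # Toy schemaless store: every item carries its own set of attributes.
  store = {}

  def put(item_id, **attrs):
      """Add or overwrite attributes on one item; no ALTER TABLE needed."""
      store.setdefault(item_id, {}).update(attrs)

  def get(item_id):
      return store.get(item_id, {})

  put("user:1", name="alice", email="a@example.com")
  put("user:2", name="bob")

  # A new "column" appears on one row only; user:2 and the code that
  # reads it are completely unaffected.
  put("user:1", twitter_handle="@alice")

  print(get("user:1"))  # has twitter_handle
  print(get("user:2"))  # does not, and that's fine

An RDBMS would instead add a NULL-able twitter_handle column to every row in the table, which is exactly the table-level change described above as significant.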

Finally, it’s my personal opinion that “cloud computing” and the commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for developers to build datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, the “NoSQL” movement is definitely here to stay and I won’t be surprised if it evolves further along the way.

 

References

  1. Haters Gonna Hate
  2. Reddit: Learning from mistakes
  3. Digg: Saying yes to NoSQL; Going steady with Cassandra
  4. Twitter @ 2009/07: Up and running with Cassandra
  5. Twitter @ 2010/03: Ryan King about Twitter and Cassandra
  6. NoSQL vs RDBMS: Let the flames begin!
  7. Brewer’s CAP theorem on distributed systems
  8. Database scalability
  9. What is scalability?
  10. Thoughts on NoSQL

Brewer’s CAP Theorem on distributed systems

Large distributed systems run into a problem which smaller systems don’t usually have to worry about. “Brewer’s CAP Theorem” [Ref 1][Ref 2][Ref 3] defines this problem in a very simple way.

It states that though it’s desirable to have Consistency, High-Availability and Partition-tolerance in every system, unfortunately no system can achieve all three at the same time.

Consistent: A fully consistent system is one where the system can guarantee that once you store a state (let’s say “x=y”) in the system, it will report the same state in every subsequent operation until the state is explicitly changed by something outside the system. [Example 1] A single MySQL database instance is automatically fully consistent since there is only one node keeping the state. [Example 2] If two MySQL servers are involved, and the system is designed in such a way that all keys from “a” to “m” are kept on server 1 and keys from “n” to “z” are kept on server 2, the system can still easily guarantee consistency. [Example 3] Let’s now set up the DBs as master-master replicas. If one of the databases gets a “row insert” request, that information has to be committed to the second system before the operation is considered complete. To require 100% consistency in such a replicated environment, communication between nodes is paramount. The overall performance of such a system can drop as the number of replicas required goes up.
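
Here is a minimal sketch of the key-range routing from [Example 2] in Python (the server names and routing function are mine): every key has exactly one home, so consistency comes for free, but as the next paragraph shows, availability suffers.

  SERVERS = {"server1": {}, "server2": {}}   # stand-ins for the two MySQL hosts

  def home(key):
      """Keys starting 'a'..'m' live on server1, 'n'..'z' on server2."""
      return "server1" if key[0].lower() <= "m" else "server2"

  def put(key, value):
      SERVERS[home(key)][key] = value        # exactly one writer per key

  def get(key):
      return SERVERS[home(key)].get(key)     # and exactly one place to read

  put("apple", 1)
  put("zebra", 2)
  print(get("apple"), get("zebra"))  # -> 1 2
  # Consistency is trivial here, but if server2 dies, every key 'n'..'z' is gone.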

Available: The databases in [Example 1] and [Example 2] are not highly available. In [Example 1], if the node goes down there would be 100% data loss. In [Example 2], if one node goes down you would have 50% data loss. [Example 3] is the only setup which solves that problem: a simple MySQL replication [multi-master mode] setup could provide 100% availability. Increasing the number of nodes with copies of the data directly increases the availability of the system. Availability is not just protection from hardware failure; replicas also help in load balancing concurrent operations, especially read operations. “Slave” MySQL instances are a perfect example of such “replicas”.

Partition-tolerance: So you got consistency and availability by replicating data. Let’s say you had the two MySQL servers from [Example 3] in two different datacenters, and you lose the network connectivity between the datacenters, making both databases incapable of synchronizing state. Would the two DBs be fully functional in such a scenario? If you do somehow manage to allow read/write operations on these two databases, it can be proved that the two servers won’t be consistent anymore. A banking application which keeps the state of your account at all times is a perfect example of where inconsistent records are bad. If I withdraw 1000 bucks in California, it should be instantly reflected in the NY branch so that the system accurately knows how much more I can withdraw at any given time. If the system fails to do this, it could cause problems which would make some customers very unhappy. If the bank decides consistency is very important and disables write operations during the outage, it will lose “availability” of the cluster since all the bank accounts at both branches will now be frozen until the network comes up again.

This gets more interesting when you realize that C.A.P. rules don’t have to be applied in an “all or nothing” fashion. Different systems can choose various levels of consistency, availability or partition tolerance to meet their business objective. Increasing the number of replicas, for example, increases high availability but it could at the same time reduce partition tolerance or consistency.
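
One standard knob for this trade-off in Dynamo-style systems such as Cassandra is the read/write quorum: with N replicas, writes acknowledged by W nodes, and reads consulting R nodes, choosing R + W > N guarantees that every read overlaps the latest write. A quick sanity check in Python (the quorum rule is general background, not something from the original post):

  def overlaps_latest_write(n, r, w):
      """Dynamo-style quorum rule: reads see the latest write iff r + w > n."""
      return r + w > n

  print(overlaps_latest_write(n=3, r=2, w=2))  # True : consistent, but less available
  print(overlaps_latest_write(n=3, r=1, w=1))  # False: fast and available, may read stale data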

When you are discussing distributed systems (or distributed storage), you should try to identify which of the three properties the system is trying to achieve. BigTable, used by Google App Engine, and HBase, which runs over Hadoop, claim to be always consistent and highly available. Amazon’s Dynamo, which is used by the S3 service, and datastores like Cassandra instead sacrifice consistency in favor of availability and partition tolerance.

The CAP theorem doesn’t just apply to databases. Even simple web applications which store state in session objects have this problem. There are many “solutions” out there which allow you to replicate session objects and make session data “highly available”, but they all suffer from the same basic problem, and it’s important for you to understand it.

Recognizing which of the “C.A.P.” rules your business really needs should be the first step in building any successful distributed, scalable, highly-available system.

AppScale, an open-source GAE implementation

If you don’t like EC2, you have the option of moving your app to a new vendor. But if you don’t like GAE (Google App Engine), there aren’t any solutions which can replace it easily.

AppScale might change that.

AppScale is an open-source implementation of the Google AppEngine (GAE) cloud computing interface from the RACELab at UC Santa Barbara. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.

The list of supported infrastructures is very impressive. However, the key, in my personal opinion, would be stability and compatibility with current GAE APIs.

Learn more about AppScale:

  1. AppScale Home page
  2. Google Code page
  3. Google Group for AppScale
  4. Demo at the Bay Area GAE Developers meeting at Googleplex (Feb 10, 2010)

Hive @Facebook

Hive is a data warehouse infrastructure built over Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

At a user group meeting, Ashish Thusoo from Facebook’s data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200GB per day in March 2008, has grown to 15TB per day today. Facebook realized early on that insights derived from simple algorithms on more data are better than insights from complex algorithms on smaller sets of data.

But the traditional approach towards ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in the size it could scale to. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive

Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn’t that great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that although partial availability, resilience and scale mattered more to them at that point than ACID guarantees, they had a hard time finding Hadoop programmers within Facebook to make use of the cluster.

It was this that eventually pushed Facebook to build a new way of querying data from Hadoop that doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries (two equivalent forms of the same aggregation; Hive’s FROM-first syntax simply names the source table up front):

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Hive’s long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that, it uses map-reduce for execution and HDFS for storage. They modeled the language on SQL, and designed it to be extensible and interoperable, and to outperform traditional processing mechanisms.
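
Conceptually, the GROUP BY query above compiles down to a map-reduce job shaped roughly like this Python sketch (a deliberate simplification of the plans Hive actually generates):

  from collections import defaultdict

  # Rough shape of the plan for:
  #   SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar
  rows = [{"foo": 1, "bar": "x"}, {"foo": 0, "bar": "x"}, {"foo": 5, "bar": "y"}]

  def mapper(row):
      if row["foo"] > 0:            # the WHERE clause is applied map-side
          yield (row["bar"], 1)     # emit (group key, 1)

  def reducer(key, values):
      return (key, sum(values))     # count(1) is just a sum of ones

  # The framework's shuffle phase groups mapper output by key:
  shuffled = defaultdict(list)
  for row in rows:
      for k, v in mapper(row):
          shuffled[k].append(v)

  print([reducer(k, vs) for k, vs in shuffled.items()])  # [('x', 1), ('y', 1)]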

How it is used

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “Ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Facebook Hive/Hadoop statistics

The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8400 cores with roughly 12.5 PB of raw storage, which translates to about 4PB of usable storage after replication (consistent with Hadoop’s default 3x replication: 12.5 PB / 3 ≈ 4.2 PB). Each node in the cluster is an 8-core server with 12TB of storage.

All in all, Facebook gets 12 TB of compressed new data and scans about 135 TB of compressed data per day. There are more than 7,500 Hive jobs which use up about 80,000 compute hours each day.

HAProxy: Load balancing

The design of any scalable web architecture would be incomplete without investigating “load balancers”. There used to be a time when selecting and installing load balancers was an art by itself. Not anymore.

A lot of organizations today use Apache web servers as a proxy server (and also as a load balancer) for the backend application clusters. Though Apache is the most popular web server in the world, it is also considered overweight if all you want to do is proxy a web application. The huge codebase Apache comes with, and the separate modules which need to be compiled and configured with it, can quickly become a liability.

HAProxy is a tiny proxying engine which doesn’t have all the bells and whistles of Apache, but is highly qualified to act as an HTTP/TCP proxy server. Here are some of the other wonderful things I liked about it:

  • Extremely tiny codebase. Just two runtime files to worry about, the binary and the configuration file.
  • Compiles in seconds. 10 seconds the last time I did it.
  • Logs to syslog by default
  • Can load balance HTTP as well as regular TCP connections, so it can easily load balance most non-HTTP applications.
  • Can do extremely detailed performance (and cookie capture) logging. It can differentiate backend processing time from the end-user request completion time. This is extremely helpful in monitoring performance of backend services.
  • It can do sticky load balancing out of the box
  • It can use application generated cookies instead of self-assigned cookies.
  • It can do health monitoring of the nodes and automatically removes them when health monitors fail
  • And it has a beautiful web interface for application admins who care about numbers.

A few other notes

  • HAProxy really doesn’t serve any files locally, so it’s definitely not a replacement for your Apache instance if you are using it to serve local files.
  • It doesn’t do SSL, so you still need an SSL engine in front of it if you need secure HTTP.
  • HAProxy is not the only Apache replacement. Varnish is a strong candidate which can also do caching (with ESI). And while you are at it, do take a look at Perlbal, which also looks interesting.

Finally, a sample configuration file with most of the features I mentioned above configured for use. This is the entire thing and should be good enough for a production deployment with minor changes.

global
        log loghost logfac info        # "loghost"/"logfac" are placeholders: your syslog host and facility (e.g. local0)
        maxconn 4096
        user webuser
        group webuser
        daemon                         # run in the background

defaults
        log     global
        stats   enable                 # built-in statistics web interface
        mode    http
        option  httplog                # detailed HTTP logs with timings and captures
        option  dontlognull
        option  httpclose
        retries 3
        option  redispatch             # on server failure, resend the request to another node
        maxconn 2000
        contimeout      5000           # connect / client / server timeouts, in milliseconds
        clitimeout      300000
        srvtimeout      300000

listen  http_proxy 0.0.0.0:8000
        option httpchk HEAD /app/health.jsp HTTP/1.0   # health monitor; failing nodes are pulled automatically
        mode http
        cookie SERVERID insert                         # self-assigned cookie for sticky load balancing
        capture cookie JSESSIONID len 50               # cookie/header capture for detailed performance logging
        capture request header Cookie len 200
        capture request header Host len 50
        capture request header Referer len 200
        capture request header User-Agent len 150
        capture request header Custom-Cookie len 15
        appsession JSESSIONID len 32 timeout 3600000   # stickiness keyed on the application's own session cookie

        balance roundrobin
        server server1_name server1:8080 weight 1 cookie server1_name_cookie check inter 60000
        server server2_name server2:8080 weight 1 cookie server2_name_cookie check inter 60000

Experimenting with SimpleDB (Flagthis.com)

A few years ago I wrote a simple online bookmarking tool called Flagthis. The tool allowed one to bookmark sites using a JavaScript bookmarklet from the bookmark tab. The problem it was trying to solve is that most links people bookmark are never used again if they are not checked out within the next few days. The tool helps the user ignore bookmarks which were not used in the last 30 days.

The initial version of this tool used a MySQL database. The original application architecture was very simple, and other than the database it could have scaled horizontally. Over the weekend I played a little with SimpleDB and was able to convert my code to use SimpleDB in a matter of hours.


Here are some things I observed during my experimentation

  1. It’s not a relational database.
  2. You can’t do joins in the database. If joins have to be done, they have to be done in the application, which can be very expensive.
  3. De-normalizing data is recommended.
  4. Schemaless: you can add new columns (which are actually just new row attributes) anytime you want.
  5. You have to create your own unique row identifiers. SimpleDB doesn’t have a concept of auto-increment.
  6. All attributes are auto-indexed. I think in Google App Engine you had to specify which columns need indexing. I’m wondering if this would increase the cost of using SimpleDB.
  7. Data is automatically replicated across Amazon’s huge SimpleDB cloud, but they only guarantee something called “eventual consistency”, which means data which is “put” into the system is not guaranteed to be available on the next “get”.
  8. I couldn’t find a GUI-based tool to browse my SimpleDB the way some S3 browsers do. I’m sure someone will come up with something soon. [Updated: Jeff blogged about some SimpleDB tools here]
  9. There are limits imposed by SimpleDB on the amount of data you can put in. Look at the tables below, and see the code sketch after them for what this looks like in practice.

 

Storage limits:

  • Domains: 100 active domains
  • Size of each domain: 10GB
  • Attributes per domain: 250,000,000
  • Attributes per item: 256 attributes
  • Size per attribute: 1,024 characters

 

Query limits:

  • Items returned in a query response: 250 items
  • Seconds a query may run: 5 seconds
  • Attribute names per query predicate: 1 attribute name
  • Comparisons per predicate: 10 operators
  • Predicates per query expression: 10 predicates
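
For a feel of what the conversion involved, here is a rough sketch using boto, the Python library for AWS (hedged: the domain and attribute names are mine, credentials are assumed to come from the environment, and the exact API can vary by boto version):

  import uuid
  import boto   # classic boto exposes SimpleDB via connect_sdb()

  conn = boto.connect_sdb()                 # reads AWS keys from the environment
  domain = conn.create_domain("flagthis")   # the "table"-like container

  # Observation 5: no auto-increment, so mint your own item identifier.
  item_id = uuid.uuid4().hex

  # Observations 3 and 4: denormalized, schemaless attributes on one item.
  domain.put_attributes(item_id, {
      "url": "http://example.com/article",
      "created": "2009-06-01T12:00:00Z",
      "tag": "scalability",
  })

  # Observation 7: an immediate get may not see the write (eventual consistency).
  item = domain.get_item(item_id)
  print(item)  # may be None right after the put; retry or tolerate staleness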

Other related discussions (do check out CouchDB)

MySQL Cluster


“Introduction to MySQL Cluster: The NDB storage engine (MySQL Cluster) is a high-availability storage engine for MySQL. It provides synchronous replication between storage nodes, with many MySQL servers having a consistent view of the database. In 4.1 and 5.0 it’s a main-memory database, but in 5.1 non-indexed attributes can be stored on disk. NDB also provides a lot of determinism in system resource usage. I’ll talk a bit about that.”
