It's totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say "Erlang has X") does a lot of things out of the box that we would otherwise have had to build from scratch or piece together from existing libraries. Its dynamic type system and pattern matching (à la Haskell and ML) tend to make Erlang code even more concise than Python or Ruby, two languages known for doing a lot in few lines of code.
The single largest advantage to us of using Erlang has to be its built-in support for concurrency. Erlang models concurrent tasks as "processes" that can communicate with one another only via message passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. This alone makes an entire class of concurrency-related bugs impossible. While it doesn't completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. It's certainly possible to implement the actor model in most, if not all, of the alternative environments I mentioned, and such implementations exist, but they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.
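To make the model concrete, here is a sketch in Python of the same idea: an "actor" that owns its state and is reachable only through its mailbox. The class and message names are my own invention for illustration, not an Erlang API.

```python
import queue
import threading

# Minimal actor sketch: each actor owns a private mailbox, and only the
# actor's own thread ever touches its state, so shared-state races
# cannot occur -- the Erlang-style guarantee.
class Actor:
    def __init__(self):
        self.mailbox = queue.Queue()
        self.state = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:              # poison pill: shut the actor down
                return
            tag, payload = msg           # crude stand-in for pattern matching
            if tag == "add":
                self.state += payload
            elif tag == "get":
                payload.put(self.state)  # reply on a channel the caller sent

    def send(self, msg):
        self.mailbox.put(msg)

counter = Actor()
for _ in range(1000):
    counter.send(("add", 1))
reply = queue.Queue()
counter.send(("get", reply))
result = reply.get()   # mailbox ordering: all prior "add" messages ran first
counter.send(None)
print(result)          # 1000
```

Because every mutation goes through the mailbox, no locks are needed on `state` itself; that is the class of bugs the model removes.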
Ted Dziuba has a post titled "I can't wait for NoSQL to die". The basic argument he makes is that one has to be at Google's size to really benefit from NoSQL. I think he is missing the point.
Here are my observations.
This is similar to the argument the traditional DB vendors made when companies started switching away from the likes of Oracle and DB2 to MySQL. The difference between then and now is that back then it was large established database vendors against the smaller (open-source) ones; now it's RDBMS versus non-RDBMS datastores.
Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is that NoSQL datastores have no pre-defined schemas. That doesn't mean developers don't have to think about the data structure before using a NoSQL solution, but it does give them the opportunity to add new columns that weren't thought of at design time, with little or no impact on the applications using it. You can add and remove columns on the fly on most RDBMSs as well, but those changes are usually considered significant. Also keep in mind that NoSQL datastores can add columns at the row level, while RDBMS solutions can only do it at the table level.
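A toy sketch of that row-level difference, with a plain Python dict standing in for the datastore (hypothetical data):

```python
# Each "row" carries its own attribute set, so a new column is just a
# new key on one row -- no schema change, no migration.
rows = {
    "user:1": {"name": "alice", "email": "a@example.com"},
    "user:2": {"name": "bob"},
}

# NoSQL-style: add the column to a single row only.
rows["user:2"]["last_login"] = "2010-03-01"

# RDBMS-style: an ALTER TABLE effectively touches every row at once.
for row in rows.values():
    row.setdefault("last_login", None)

print(rows["user:1"]["last_login"], rows["user:2"]["last_login"])
```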
Scalability: There are basically two ways to scale any web application.
The first way is to build the app and leave the scalability issues for later (letting the DBAs figure them out). This is an expensive, iterative process that takes time to perfect. The issues around scalability and availability can be so complex that one may not be able to predict all of them until the system is used in production.
The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays off over time.
NoSQL is the third way of doing it.
It restricts programmers to only those operations and data structures which can scale.
And programmers who figure out how to work within those restrictions have found that they buy significantly higher horizontal scalability than a traditional RDBMS can offer.
Because the database is architected before the product launches, it also reduces outages and post-deployment migrations.
High Availability: NoSQL is not just about scalability. It's also about high availability at a lower cost.
While Ted did mention that some operations in Cassandra require a restart, he forgot to mention that it doesn't require all the nodes to be restarted at the same time. A Cassandra datastore continues to be available even when many of its nodes are down. This is a common theme across most NoSQL datastores. [CASSANDRA-44]
High availability over long distances with flaky network connections is not trivial to implement with traditional RDBMS-based databases.
You donâ€™t have to be Google to see benefits of using NoSQL.
If you are using S3 or SimpleDB on AWS, or the datastore on Google's App Engine, you are already using NoSQL. Many smaller startups are finding AWS/GAE to be cheaper than hosting their own servers.
One can still choose an RDBMS solution like RDS, but then one doesn't get the high availability and scalability that S3/SimpleDB offer out of the box.
While scalability to terabytes may not be a requirement for many smaller organizations, high availability is absolutely essential for most organizations today. RDBMS-based solutions can provide it, but setting up multi-master replication across two datacenters is non-trivial.
Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will have success cutting over from the RDBMS to the non-RDBMS world in one weekend. The reports of websites switching over to NoSQL overnight are sometimes grossly exaggerated. Most of these companies have been working on the move for months, if not years, and they run extensive scalability, performance, availability and disaster-recovery tests before putting it in production.
RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon, especially in organizations which are already using it. In fact, most NoSQL datastores still haven't figured out how to implement the level of security traditional RDBMSs provide. I think that's the core reason why Google still uses an RDBMS for some of its operational needs.
Finally, it's my personal opinion that "cloud computing" and the commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for developers to build datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, the "NoSQL" movement is definitely here to stay, and I won't be surprised if it evolves further along the way.
Large distributed systems run into a problem which smaller systems don't usually have to worry about. "Brewer's CAP theorem" [Ref 1] [Ref 2] [Ref 3] defines this problem in a very simple way.
It states that though it's desirable to have consistency, high availability and partition tolerance in every system, unfortunately no system can achieve all three at the same time.
Consistent: A fully consistent system is one which can guarantee that once you store a state (let's say "x=y") in the system, it will report the same state in every subsequent operation until the state is explicitly changed by something outside the system. [Example 1] A single MySQL database instance is automatically fully consistent, since there is only one node keeping the state. [Example 2] If two MySQL servers are involved, and the system is designed such that all keys from "a" to "m" are kept on server 1 and keys from "n" to "z" are kept on server 2, then the system can still easily guarantee consistency. Now let's set up the two DBs as master-master replicas [Example 3]. If one of the databases accepts a "row insert" request, that information has to be committed to the second system before the operation is considered complete. To require 100% consistency in such a replicated environment, communication between nodes is paramount. The overall performance of such a system can drop as the number of replicas required goes up.
Available: The databases in [Example 1] and [Example 2] are not highly available. In [Example 1], if the node goes down there would be 100% data loss. In [Example 2], if one node goes down you would have 50% data loss. [Example 3] is the only setup which solves that problem: a simple MySQL replication [multi-master mode] setup could provide 100% availability. Increasing the number of nodes with copies of the data directly increases the availability of the system. Availability is not just protection from hardware failure; replicas also help in load balancing concurrent operations, especially read operations. "Slave" MySQL instances are a perfect example of such replicas.
Partition-tolerance: So you got consistency and availability by replicating data. Now let's say you had the two MySQL servers from [Example 3] in two different datacenters, and you lose the network connectivity between the datacenters, making both databases incapable of synchronizing state. Would the two DBs be fully functional in such a scenario? If you somehow do manage to allow read/write operations on both databases, it can be shown that the two servers won't be consistent anymore. A banking application which keeps the state of your account at all times is a perfect example of where inconsistent records are bad. If I withdraw 1000 bucks in California, it should be instantly reflected at the NY branch so that the system accurately knows how much more I can withdraw at any given time. If the system fails to do this, it could cause problems which would make some customers very unhappy. If the bank decides consistency is very important and disables write operations during the outage, it loses availability, since all the bank accounts at both branches will be frozen until the network comes up again.
This gets more interesting when you realize that C.A.P. doesn't have to be applied in an all-or-nothing fashion. Different systems can choose various levels of consistency, availability or partition tolerance to meet their business objectives. Increasing the number of replicas, for example, increases availability, but at the same time it can reduce partition tolerance or consistency.
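One common way such knobs are exposed, in the spirit of Dynamo-style systems, is quorum tuning: with N replicas, a write waits for W acknowledgements and a read consults R replicas, and choosing R + W > N guarantees the read set overlaps the latest write set. A hedged Python sketch (no failures or versioning modeled, names are mine):

```python
N, W, R = 3, 2, 2            # R + W > N, so reads must overlap writes

replicas = [dict() for _ in range(N)]

def write(key, value):
    # a write is acknowledged once W replicas have it; the rest stay stale
    for replica in replicas[:W]:
        replica[key] = value

def read(key):
    # consult the R replicas "farthest" from the writer to show that the
    # R + W > N overlap still reaches the newest value
    values = [r.get(key) for r in replicas[N - R:]]
    return next((v for v in values if v is not None), None)

write("x", "y")
print(read("x"))   # 'y' -- replica 1 sits in both the write and read sets
```

Dropping W to 1 makes writes faster and more available but lets a read miss the newest value; that is exactly the tradeoff dial the theorem describes.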
When you are discussing a distributed system (or distributed storage), you should try to identify which of the three properties it is trying to achieve. BigTable, used by Google App Engine, and HBase, which runs over Hadoop, claim to be always consistent and highly available. Amazon's Dynamo, which is used by the S3 service, and datastores like Cassandra instead sacrifice consistency in favor of availability and partition tolerance.
The CAP theorem doesn't just apply to databases. Even simple web applications which store state in session objects have this problem. There are many "solutions" out there which allow you to replicate session objects and make session data "highly available", but they all face the same basic tradeoff, and it's important for you to understand it.
Recognizing which of the â€œC.A.P.â€ rules your business really needs should be the first step in building any successful distributed, scalable, highly-available system.
AppScale is an open-source implementation of the Google AppEngine (GAE) cloud computing interface from the RACELab at UC Santa Barbara. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.
The list of supported infrastructures is very impressive. However the key, in my personal opinion, would be stability and compatibility with current GAE APIs.
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for its data processing needs.
Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200GB per day in March 2008, has now grown to 15TB per day. Facebook realized early on that insights derived from simple algorithms on more data beat insights from complex algorithms on smaller sets of data.
But the traditional approach of doing ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in how far it could scale. That is when they started experimenting with Hadoop.
How Hadoop gave birth to Hive
Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn't great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that although at that point partial availability, resilience and scale were more important than ACID, they had a hard time finding Hadoop programmers within Facebook to make use of the cluster.
It was this that eventually pushed Facebook to build a new way of querying data from Hadoop which doesn't require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let's look at a couple of examples of Hive queries.
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
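The "plug in your own mapper" path mentioned earlier works by streaming rows through an external script (Hive's TRANSFORM clause). Below is a hedged Python sketch of such a script, reimplementing the filter-and-emit step of the queries above on tab-separated input; the two-column layout is an assumption for illustration:

```python
def mapper(lines):
    # assume each input line is "foo<TAB>bar"; emit "bar<TAB>1" for
    # rows with foo > 0, mirroring the WHERE a.foo > 0 filter above
    for line in lines:
        foo, bar = line.rstrip("\n").split("\t")
        if int(foo) > 0:
            yield bar + "\t1"

# in production the script would read sys.stdin and print each output
# row; here we demo it on a few fake rows instead:
for out in mapper(["3\tx\n", "0\ty\n", "1\tx\n"]):
    print(out)
```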
Hive's long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that, it uses MapReduce for execution and HDFS for storage. The team modeled the language on SQL and designed it to be extensible and interoperable, and to outperform traditional processing mechanisms.
How it is used
Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for "ad hoc analysis" which is free for all (or most) Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented things.
Facebook Hive/Hadoop statistics
The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8400 cores with roughly 12.5PB of raw storage, which translates to 4PB of usable storage after replication. Each node in the cluster is an 8-core server with 12TB of storage.
All in all, Facebook takes in 12TB of compressed new data and scans about 135TB of compressed data per day. There are more than 7500 Hive jobs which use up about 80,000 compute-hours each day.
Designing any scalable web architecture would be incomplete without investigating "load balancers". There used to be a time when selecting and installing a load balancer was an art in itself. Not anymore.
A lot of organizations today use the Apache web server as a proxy server (and also as a load balancer) for their backend application clusters. Though Apache is the most popular web server in the world, it is also considered overweight if all you want to do is proxy a web application. The huge codebase Apache comes with, and the separate modules which need to be compiled and configured with it, can quickly become a liability.
HAProxy is a tiny proxying engine which doesn't have all the bells and whistles of Apache, but is highly qualified to act as an HTTP/TCP proxy server. Here are some of the other things I liked about it:
Extremely tiny codebase. Just two runtime files to worry about: the binary and the configuration file.
Compiles in seconds (10 seconds the last time I did it).
Logs to syslog by default.
Can load balance HTTP as well as regular TCP connections, so it can easily load balance most non-HTTP applications.
Can do extremely detailed performance (and cookie-capture) logging, and can differentiate backend processing time from end-user request completion time. This is extremely helpful when monitoring the performance of backend services.
It can do sticky load balancing out of the box.
It can use application-generated cookies instead of self-assigned cookies.
It can do health monitoring of the nodes and automatically remove them when health checks fail.
And it has a beautiful web interface for application admins who care about numbers.
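For flavor, a minimal configuration exercising a few of these features might look roughly like this. Addresses, names and cookie values are placeholders, and directives have shifted between HAProxy versions, so treat it as a sketch rather than a drop-in file:

```
global
    log 127.0.0.1 local0                       # logs to syslog

defaults
    mode http
    balance roundrobin

listen webfarm 0.0.0.0:80
    cookie SERVERID insert indirect            # sticky load balancing
    stats uri /haproxy-stats                   # the web interface
    server app1 10.0.0.1:8080 cookie app1 check   # "check" = health monitoring
    server app2 10.0.0.2:8080 cookie app2 check
```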
A few other notes
HAProxy doesn't serve any files locally, so it's definitely not a replacement for your Apache instance if you are using it to serve local files.
It doesn't do SSL, so you still need an SSL terminator in front of it if you need secure HTTP.
HAProxy is not the only Apache replacement. Varnish is a strong candidate which can also do caching (with ESI). And while you are at it, take a look at Perlbal, which also looks interesting.
The initial version of this tool used a MySQL database. The original application architecture was very simple, and other than the database it could have scaled horizontally. Over the weekend I played a little with SimpleDB and was able to convert my code to use it in a matter of hours.
Here are some things I observed during my experimentation:
It's not a relational database.
You can't do joins in the database. If joins have to be done, they have to be done in the application, which can be very expensive.
De-normalizing data is recommended.
Schemaless: You can add new columns (which are actually just new row attributes) anytime you want.
You have to create your own unique row identifiers; SimpleDB has no concept of auto-increment.
All attributes are auto-indexed. I think in Google App Engine you have to specify which columns need indexing; I'm wondering if this increases the cost of using SimpleDB.
Data is automatically replicated across Amazon's huge SimpleDB cloud, but they only guarantee something called "eventual consistency", which means data which is "put" into the system is not guaranteed to be available on the very next "get".
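A common client-side accommodation is to retry the read briefly before treating the item as missing. A hedged Python sketch; `fetch` here is a stand-in for whatever SimpleDB client call you use, not a real API:

```python
import time

def read_with_retry(fetch, key, attempts=5, delay=0.2):
    # retry the "get" a few times to ride out replication lag
    for _ in range(attempts):
        value = fetch(key)
        if value is not None:
            return value
        time.sleep(delay)
    return None

# demo: a fake store whose replicas "catch up" after two failed reads
calls = {"n": 0}
def flaky_fetch(key):
    calls["n"] += 1
    return "hello" if calls["n"] > 2 else None

result = read_with_retry(flaky_fetch, "greeting", delay=0)
print(result)   # 'hello' on the third attempt
```

Note this only papers over the lag for reads; it does nothing for conflicting concurrent writes.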
I couldn't find a GUI-based tool to browse my SimpleDB data the way some S3 browsers do. I'm sure someone will come up with something soon. [Updated: Jeff blogged about some SimpleDB tools here]
There are limits imposed by SimpleDB on the amount of data you can put in. Look at the tables below.
“Introduction to MySQL Cluster The NDB storage engine (MySQL Cluster) is a high-availability storage engine for MySQL. It provides synchronous replication between storage nodes and many mysql servers having a consistent view of the database. In 4.1 and 5.0 it’s a main memory database, but in 5.1 non-indexed attributes can be stored on disk. NDB also provides a lot of determinism in system resource usage. I’ll talk a bit about that.”