Showing posts from February, 2010

Scalability links for Feb 28th 2010

State of current NoSQL databases: A very detailed post about many NoSQL solutions. A lot of work went into this one.

Truth about joins: Google App Engine datastore's limitation of not allowing joins might soon be a thing of the past. Simple joins may now be possible on GAE if you are using Java. It's still beta, but the fact that this is being tested is very encouraging.

Should you switch to NoSQL too?

Notes on MongoDB: A very nice summary of MongoDB.

Redis real-life examples [ More here ]: I've been seeing a lot of discussions around Redis lately. Here are some use cases I've gathered from a couple of posts. Haven't yet seen it being used by a large organization.
- Who's Online
- Track user locations with Node.js and Redis
- Remote Dictionary server
- Twitter clone using Redis
- Sikwamic: Simple Key-value with comet

Replication technologies

Using Varnish to assist with AB testing – Testing

Cassandra as a communication medium – A service Registry and Discovery tool

A few weeks ago, while I was mulling over what kind of service registry/discovery system to use for a scalable application deployment platform, I realized that for mid-size organizations with a complex set of services, building one from scratch may be the only option. I also found out that many AWS/EC2 customers have already been using S3 and SimpleDB to publish/discover services. That discussion eventually led me to investigate Cassandra as the service registry datastore in an enterprise network. Here are some of the observations I made as I played with Cassandra for this purpose. I welcome feedback from readers if you think I'm doing something wrong or if you think I can improve the design further. The biggest issue I noticed with Cassandra was the absence of an inverted index, which can be worked around as I have blogged here. I later realized there is something called Lucandra as well, which I need to look at at some point. The keyspace structure I used was very sim

Talk on “database scalability”

This is a very interesting talk by Jonathan Ellis on database scalability. He designed and implemented multi-petabyte storage for Mozy and is currently the project chair for Apache Cassandra. What every developer should know about database scalability, PyCon 2010. View more presentations from jbellis.
- Scalability is not about improving latency, but about increasing throughput
- But overall performance shouldn't degrade
- Throw hardware, not people, at the problem
- Traditional databases use b-tree indexes, but that requires the entire index to be in memory at the same place
- Easy bandaid #1: SSD storage is better for b-tree indexes which need to hit disk
- Easy bandaid #2: Buy a faster server every 2 years, as long as your userbase doesn't grow faster than Moore's law
- Easy bandaid #3: Use caching to handle hotspots
- (Distributed) Memcache server failures can change where hashed keys are kept
- Consistent hashing solves the problem by mapping keys to
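The consistent-hashing point from the talk can be sketched briefly. This is a minimal illustration (not code from the talk): servers are hashed onto a ring, each key is owned by the first server clockwise from the key's hash, and adding a server only remaps the keys that fall into its arc, unlike modulo hashing where most keys move.

```python
import bisect
import hashlib

# Minimal consistent-hashing sketch. Virtual nodes ("replicas")
# spread each server over many points on the ring to smooth the load.


def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, replicas=100):
        self.ring = []  # sorted list of (ring point, server)
        for server in servers:
            for i in range(replicas):
                self.ring.append((_hash(f"{server}:{i}"), server))
        self.ring.sort()
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # first ring point clockwise from the key's hash (wraps around)
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

The key property: when a server is added, every key either stays where it was or moves to the new server; nothing shuffles between the old servers.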

Scalable logging using Syslog

Syslog is a commonly used transport mechanism for system logs, but people sometimes forget it can be used for a lot of other purposes as well. Take, for example, the interesting challenge of aggregating web server logs from 100 different servers onto one server and then figuring out how to merge them. If you have built your own tool to do this, you will have figured out by now how expensive it is to poll all the servers and how out-of-date the logs can get by the time you process them. If you are not inserting them into some kind of datastore which sorts the rows by timestamp, you also have to take up the challenge of building a merge-sort script. There is nothing which stops applications from using syslog as well. If your apps are in Java, you should try out the Syslog appender for log4j [ Ref 1 ] [ Ref 2 ]. Not only do you get central logging, you also get to see a real-time "tail -f" of events as they happen in a merged file. If there are issues anywhere in your netwo
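The same idea works outside Java too. As a sketch, Python's standard library can ship application logs to a syslog server over UDP; this assumes a syslog daemon listening on localhost:514 (the hostname, app name and format here are illustrative):

```python
import logging
import logging.handlers

# Send application log lines to a syslog daemon over UDP.
# "localhost" stands in for your central aggregation host.
logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

handler = logging.handlers.SysLogHandler(address=("localhost", 514))
handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("user login succeeded")
```

Point every server's handler at the same central host and the aggregation, merging and real-time "tail -f" come for free from the syslog daemon.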

SimpleDB now allows you to tweak consistency levels

We discussed Brewer's Theorem a few days ago and how it's challenging to obtain Consistency, Availability and Partition tolerance in any distributed system. We also discussed that many of the distributed datastores allow CAP to be tweaked to attain certain operational goals. Amazon SimpleDB, which was released as an "Eventually Consistent" datastore, today launched a few features to do just that. Consistent reads: Select and GetAttributes requests now include an optional Boolean flag "ConsistentRead" which requests the datastore to return consistent results only. If you have noticed scenarios where a read right after a write returned an old value, that shouldn't happen anymore. Conditional put/puts, delete/deletes: By providing "conditions" in the form of a key/value pair, SimpleDB can now conditionally execute or discard an operation. This might look like a minor feature, but it can go a long way in providing reliable datastore operations. Even though SimpleDB now
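What a conditional put buys you is easiest to see in a sketch. This is plain Python standing in for SimpleDB's behavior, not its actual API; the write only succeeds if the item's current value for the condition key matches what you expect, which gives you a compare-and-set primitive:

```python
# Sketch of conditional-put semantics: apply the write only if the
# item currently matches an expected (key, value) condition.
# Plain dicts stand in for SimpleDB; names are illustrative.

store = {}  # item name -> {attribute: value}


def conditional_put(item_name, attrs, expected=None):
    item = store.setdefault(item_name, {})
    if expected is not None:
        key, value = expected
        if item.get(key) != value:
            return False  # condition failed; nothing is written
    item.update(attrs)
    return True
```

With this primitive, two concurrent writers racing to update the same item can no longer silently clobber each other: the loser's condition fails and it can re-read and retry.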

NoSQL in the Twitter world

NoSQL solutions have one thing in common: they are generally designed for horizontal scalability. So it's no wonder that a lot of applications in the "twitter" world have picked NoSQL-based datastores for their persistence layer. Here is a collection of these apps from the MyNoSQL blog.
- Twitter uses Cassandra
- MusicTweets used Redis [ Ref ] – the site is dead, but you can still read about it
- Tstore uses CouchDB
- Retwis uses CouchDB
- Retwis-RB uses Redis and Sinatra?? – No idea what Sinatra is. Will have to look into it. [ Update: Sinatra is not a DB store ]
- Floxee uses MongoDB
- Twidoop uses Hadoop
- Swordfish, built on top of Tokyo Cabinet, comes with a twitter clone app
- Tweetarium uses Tokyo Cabinet

References: NoSQL Twitter Applications, More NoSQL Twitter apps. Do you know of any more?

Eventual consistency is just caching ?

So there is someone who thinks "eventual consistency is just caching". Though I liked the idea of discussing this, I don't agree with Udi's views. A "cache" is generally used to store data which is more expensive to obtain from the primary location. For example, caching MySQL queries is ideal for queries which could take more than a fraction of a second to execute. Another example is caching queries to S3, SimpleDB or Google's datastore, which could cost money and introduce network latency into the mix. Though most applications are built to use such caches, they are also designed to stay responsive in the absence of the caching layer. The most important difference between a "cache" and a "datastore" is that the dataflow is generally from the "datastore" to the "cache" rather than the other way round. Though one could queue data in the "cache" first and then update the datastore later (for performance reasons), that is not the way one should use it. If you are using "ca
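That dataflow difference is easier to see in code. Here is a minimal cache-aside sketch, with plain Python dicts standing in for memcached and the primary datastore (names are illustrative): reads populate the cache from the datastore, and writes go to the datastore and merely invalidate the cache, never the other way round.

```python
# Cache-aside dataflow: reads flow datastore -> cache; the cache is
# never the system of record. Plain dicts stand in for memcached
# and the primary database here.

datastore = {}  # system of record
cache = {}      # disposable copy


def get(key):
    if key in cache:                # cheap path
        return cache[key]
    value = datastore.get(key)      # expensive / primary path
    if value is not None:
        cache[key] = value          # populate the cache from the datastore
    return value


def put(key, value):
    datastore[key] = value          # write the primary first
    cache.pop(key, None)            # invalidate; don't write the cache
```

Note that wiping `cache` entirely at any moment costs only latency, not data, which is exactly the property an eventually consistent datastore does not have.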

Scalability updates for Feb 18, 2010

Some interesting links for today:
- A very good post about the need for an event-driven Cloud API model for monitoring. I think it's a matter of time before this happens. Just like feed crawlers are embracing event-driven publication notification using protocols like Pubsubhubbub, we need something similar to SNMP traps for the cloud notification world.
- Translate SQL to MongoDB MapReduce
- Real-time web for web developers: an example of how the problem of polling a huge number of websites for updates was transformed by simply using an event-driven push model.
- Logging: unsexy, important and now usable
- Comparing Pig Latin and SQL for constructing data processing pipelines
- Cassandra backend for Lucene? This seems to solve the problem of building a reverse index on Cassandra which I previously blogged about.
- Cloud MR: a Map/Reduce framework over Amazon's S3/SQS/EC2 services
- Interesting NoSQL categorization
- Writing a twitter service on App Engine

More on Amazon S3 versioning (webinar)

If you missed the AWS S3 versioning webcast, I have a copy of the video here. And here are the highlights:
- You can enable and disable this at the bucket level
- They don't think there is a performance penalty for turning on versioning (but it was kind of obvious S3 would be doing slightly extra work to figure out which is the latest version of any object you have)
- There isn't any additional cost for using versioning, but you have to pay for the extra copy of each object
- MFA (multi-factor authentication) to delete objects is not mandatory when versioning is turned on; it needs to be turned on separately. This was slightly confusing in the original email I got from AWS.
- If you are planning to use this, please watch the video. There is a part where they explain what happens if you disable versioning after using the feature. This is something you might like to know about.
- They use a GUID for versioning of each object
- You can iterate over objects and figure out how many ver

Brewer's CAP Theorem on distributed systems

Large distributed systems run into a problem which smaller systems don't usually have to worry about. "Brewer's CAP Theorem" [ Ref 1 ] [ Ref 2 ] [ Ref 3 ] defines this problem in a very simple way. It states that though it's desirable to have Consistency, High Availability and Partition tolerance in every system, unfortunately no system can achieve all three at the same time. Consistent: A fully consistent system is one where the system can guarantee that once you store a state (let's say "x=y") in the system, it will report the same state in every subsequent operation until the state is explicitly changed by something outside the system. [ Example 1 ] A single MySQL database instance is automatically fully consistent since there is only one node keeping the state. [ Example 2 ] If two MySQL servers are involved, and if the system is designed in such a way that all keys from "a" to "m" are kept on server 1, and keys "n" to "z" are kept on server

Scaling updates for Feb 10, 2010

Lots of interesting updates today, but I would first like to mention the fantastic work the Cloud Computing group at UCSB is doing to make the App Engine framework more open. They have done significant work at making AppScale "work" with different kinds of data sources including HBase, Cassandra, Voldemort, MongoDB, Hypertable, MySQL and MemcacheDB. AppScale is actively looking for folks interested in working with them to make it stable and production ready. GAE 1.3.1 released: I think the biggest news about this release is the fact that the 1000 row limit has now been removed. You still have to deal with the 30 second processing limit per HTTP request, but at least the row limit is not there anymore. They have also introduced support for automatic transparent datastore API retries for most operations. This should dramatically increase the reliability of datastore queries, and it reduces the amount of work developers have to do to build this auto-retry logic. Elastic search is a l

Versioning data in S3 on AWS

One of the problems with Amazon's S3 was the inability to take a "snapshot" of the state of S3 at any given moment. This is one of the most important DR (disaster recovery) steps of any major upgrade which could potentially corrupt data during a release. Until now, applications using S3 would have had to manage versioning of data themselves, but Amazon has launched a versioning feature built into S3 itself to do this particular task. In addition, they have made it possible to require that delete operations on versioned data be done using MFA (multi-factor authentication). Versioning allows you to preserve, retrieve, and restore every version of every object in an Amazon S3 bucket. Once you enable versioning for a bucket, Amazon S3 preserves existing objects any time you perform a PUT, POST, COPY, or DELETE operation on them. By default, GET requests will retrieve the most recently written version. Older versions of an overwritten or deleted object can be retrieve
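The semantics are easy to picture with a sketch. This is plain Python standing in for a versioned bucket, not the S3 API itself (version ids here are simple list indexes rather than the GUIDs S3 uses): every PUT appends a new version, a plain GET returns the newest one, and older versions stay retrievable.

```python
# Sketch of S3-style versioning semantics with a plain dict:
# every PUT appends a new version; GET returns the latest unless a
# specific version is requested. Names and ids are illustrative.

bucket = {}  # key -> list of object bodies, oldest first


def put(key, body):
    bucket.setdefault(key, []).append(body)
    return len(bucket[key]) - 1  # stand-in for S3's version id


def get(key, version=None):
    versions = bucket[key]
    return versions[-1] if version is None else versions[version]
```

This is what makes the "snapshot before a risky release" workflow possible: a buggy deploy that overwrites objects only adds new versions on top of the old ones.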

Cassandra : inverted index

Cassandra is the only NoSQL datastore I'm aware of which is scalable, distributed, self-replicating, eventually consistent and schema-less: a key-value store running on Java with no single point of failure. HBase could also match most of these requirements, but Cassandra is easier to manage due to its tiny footprint. The one thing Cassandra doesn't do today is index columns. Let's take a specific example to explain the problem. Let's say there are 100 rows in the datastore, with 5 columns each. If you want to find the row which says "Service=app2", you will have to iterate one row at a time, which is like a full database scan. In a 100 row datastore where only one row has that particular column, it would take on average about 50 rows before you find your data. While I'm sure there is a good reason why this doesn't exist yet, the application inserting the data could build such an inverted index itself even today. Here is an example of how a table of i
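Here is a hedged sketch of what "the application builds the index itself" means. Plain Python dicts stand in for the two Cassandra column families (the names are illustrative): alongside every row insert, the app also writes a reverse mapping from each column=value pair back to the row keys, so a lookup like Service=app2 becomes a single index read instead of a scan.

```python
# Application-maintained inverted index: the forward "rows" mapping
# holds the data, and "inverted" maps (column, value) pairs back to
# row keys. Both writes happen together at insert time.

rows = {}      # row key -> {column: value}
inverted = {}  # (column, value) -> set of row keys


def insert(row_key, columns):
    rows[row_key] = columns
    for column, value in columns.items():
        inverted.setdefault((column, value), set()).add(row_key)


def find(column, value):
    # one index lookup instead of scanning every row
    return inverted.get((column, value), set())
```

The cost is the usual one for a secondary index: every insert does extra writes, and deletes/updates have to clean up the old index entries, which the application must also handle itself.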

DealNews: Scaling for Traffic Spikes

Last year DealNews unexpectedly got listed on a high-traffic front page for a couple of hours. No matter how optimistic one is, unexpected events like these can take down a regular website with almost no effort at all. What is your plan if you get slashdotted? Are you OK with a short outage? What is the acceptable level of service for your website anyway? One way to handle such unexpected traffic is having multiple layers of cache. The database query cache is one; generating and caching dynamic content is another (maybe using a cronjob). Tools like memcached, Varnish and Squid can all help to reduce the load on application servers. Proxy servers (or webservers) in front of application servers play a special role at DealNews. They understood the limitations of the application servers they were using, and the fact that slow client connections mean longer-lasting TCP sessions to the application servers. Proxy servers, like Varnish, could off-load that job and take care of conten

Scaling deployments

Most of the newer, successful web startups have one thing in common: they release smaller changes more often. Being in operations, I am often surprised how these organizations manage such a feat without breaking their website. Here are some notes from someone at Flickr about how they do it. The two most important parts of this talk are the observation that Dev, QA and Operations teams have to slightly blend into each other to achieve deployments at such a velocity, and the fact that they are not afraid to break the website by deploying code from trunk.
- Don't be afraid to do releases
- Automate infrastructure (hardware/OS and app deployment)
- Share the version control system
- Enable one-step build
- Enable one-step build and deploy
- Do small frequent changes
- Use feature flags (branching without source code branching)
- Always ship trunk
- Do private betas
- Share metrics
- Provide applications the ability to talk back (IM/IRC) – build logs, deploy logs, alert m
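The feature-flag idea from the list above can be sketched in a few lines. This is a minimal illustration, not code from the talk; the flag names and the in-memory flag store are assumptions (in practice the flags would live in a config file or database the app re-reads), and a stable hash keeps a given user's experience consistent across requests during a percentage rollout.

```python
import zlib

# Feature flags: new code ships dark on trunk and is switched on at
# runtime, per flag and per percentage of users, instead of living
# on a source branch. The dict below stands in for a config store.

flags = {
    "new_photo_page": {"enabled": True, "percent": 10},
    "beta_search":    {"enabled": False, "percent": 0},
}


def is_enabled(flag, user_id):
    cfg = flags.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    # stable per-user bucket in 0..99, so rollouts are sticky
    bucket = zlib.crc32(f"{flag}:{user_id}".encode()) % 100
    return bucket < cfg["percent"]
```

Turning a bad feature off becomes a config change instead of a rollback, which is a large part of why shipping trunk is survivable.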

Scaling PHP : HipHop and Quercus

While PHP is very popular, it unfortunately doesn't perform as well as some of its competitors. One of the ways to make things faster is to write PHP extensions in C++. In this post we will describe two different ways developers can solve this problem; the mileage you get from either model may vary. Since Facebook is mostly running PHP, it noticed this problem pretty early, but instead of asking its developers to move from PHP to C++, one of their developers hacked up a solution to transform PHP code into C++. Yesterday, Facebook announced they are opening up HipHop, a source code transformer which changes PHP code into more optimized C++ code and uses g++ to compile it. With some minor sacrifices (no eval support) they noticed they were able to get a 50% performance improvement. And since they serve 400 billion page views every month, that kind of saving can free up a lot of servers. More info on HipHop PHP:
- HipHop: What it means
- HipHop: What you need

Cloud : Agility vs Security

Networking devices on the edges have become smarter over time. So have the firewalls and switches used internally within networks. Whether we like it or not, web applications have over time grown to depend on them. It's impossible to build a flawless product, which is why it's standard practice to disable all unused services on a server. Most organizations today try to follow the n-tier approach to create different logical security zones, with the core asset inside the most secure zone. The objective is to make it difficult for an attacker to get to the core asset without breaching multiple sets of firewalls. Doing frequent system patches, auditing file system permissions and setting up intrusion detection (host or network based) are some of the other mundane ways of keeping web applications safe from attacks. Though the cloud has made deployment of on-demand infrastructure simpler, it's hard to build a walled garden around a customer's cluster of servers on the cloud in an efficien

Evaluating Cloud Computing


Windows Azure

Windows Azure is an application platform provided by Microsoft to allow others to run applications on Microsoft's "cloud" infrastructure. It's finally open for business (as of Feb 1, 2010). Below are some links about Azure for those who are still catching up. Wikipedia: Windows Azure has three core components: Compute, Storage and Fabric. As the names suggest, Compute provides a computation environment with Web Role and Worker Role, while Storage focuses on providing scalable storage (Blobs, Tables, Queues) for large scale needs. The hosting environment of Windows Azure is called the Fabric Controller, which pools individual systems into a network that automatically manages resources, load balancing, geo-replication and application lifecycle without requiring the hosted apps to explicitly deal with those requirements. [3] In addition, it also provides other services that most applications require, such as the Windows Azure Storage Service that provides applications with