Latest Publications

Automated, faster, repeatable, scalable deployments

While efficient automated deployment tools like Puppet and Capistrano are a big step in the right direction, its not the complete solution for an automated deployment process. This post will explore some of the less discussed issues which are as important for automated, fast, repeatable scalable deployments. 

Rapid Build and Integration with tests

  • Use Source control to build an audit trail: Put everything possible in it, including configurations and deployment scripts.
  • Continuous Builds triggered by code check-ins can detect and report problems early image
    • Use tools which provide targeted feedback about build failures. It reduces noise and improves over all quality faster
    • Faster the build happens after a check-in, better are the chances for bugs to get fixed quickly. Delays can be costly since broken builds could impact other developers as well
    • Build smaller components (fail fast)
  • Continuous integration tests of all components can detect errors which may not be caught a build time.

Automated database changes

Can database changes be automated ? This is probably one of the most interesting challenges for automation, especially if the app requires data migrations which can’t be rolled back. While it would be nice to have only incremental changes introduced into each deployment (which are guaranteed to be forward and backward compatible), there might be some need for non-trivial changes once in a while. As long as there is a process to separate the trivial from non-trivial changes, it might be possible to automate most of the database changes through an automation process.

Tracking which migrations have been applied and which are pending is a very application specific problem for which there are no silver bullets.

 

Configuration management

 
Environment-specific properties

Its not abnormal to have different sets of configuration for dev and production. But creating different build packages for different target environments is not the right solution. If you need to change properties between environments pick a better way to do it.

  • Either externalize the configuration properties to a file/directory location outside your app folder, such that repeated deployments don’t overwrite properties.
  • Or, update the right properties automatically during deployment using a deployment framework which is capable of that.
Pushing at deployment time or pulling at run time

In some cases pulling new configuration files dynamically after application startup might make more sense. This is especially true for applications on an infrastructure like AWS/EC2. If applications were already deployed on the base OS image, then it will come up automatically when the system boots up. Some folks keep only minimal information in the base OS image, and use a datastore like S3 to download the latest configuration from. In a private network where using S3 is not possible, you could replace it with some kind of shared store like SVN/NFS/FTP/SCP/HTTPetc.

Deployment frameworks

 
3rd Party frameworks
  • Fabric – Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
  • Puppet -  Put simply, Puppet is a system for automating system administration tasks.
  • Capistrano – It is designed with repeatability in mind, letting you easily and reliably automate tasks that used to require login after login and a small army of custom shell scripts.  ( also check out webistrano )
  • Bcfg2 – Bcfg2 helps system administrators produce a consistent, reproducible, and verifiable description of their environment, and offers visualization and reporting tools to aid in day-to-day administrative tasks.
  • Chef – Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.
  • Slack – slack is an evolution from the usual "put files in some central directory" that is fairly common practice.
  • Kokki – System configuration management framework influenced by Chef
Custom or Mixed frameworks

The tools listed above are not the only set of tools available. Simple bash/sh scripts, ant scripts, even tools like cruisecontrol and hudson can be used for automated deployments. Here are some other interesting observations 

  • Building huge monolithically applications are thing of the past. Understanding how to break them up into self-contained, less inter-dependent components is the challenge.
  • If all of your servers get the same exact copy of application and configuration, then you don’t need to worry about configuration management. Just find a tool which deploys files fast.
  • If your deployments have a lot of inter-dependencies between components then choose a tool which gives you a visual interface of the deployment process if required.
  • Don’t be shy to write wrapper scripts to automate more tasks.
Push/Pull/P2P Frameworks

Grig has an interesting post about Push vs Pull where he lists the pros/cons of both the systems. What he forgot to mention is P2P which is the way twitter is going for its deployment. P2P has advantages from both Push and Pull architecture but comes with its own set of challenges. I haven’t seen an opensource tool using P2P yet, but I’m sure its not too far out.

Outage windows

Though deployments are easier with long outage windows, thats something hard to come by. In an ideal world one would have a parallel set of servers which one could cut over to with a flip of a switch. Unfortunately if user data is involved this is almost impossible to do. The next best alternative is to do “rolling updates” in small batches of servers. The reason this could be challenging is because the deployment tool needs to make sure the app really has completed initialization before it moves on to the next set of servers.

This can be further complicated by the fact that at times there are version dependencies between different applications. In such cases there needs to be a robust infrastructure to facilitate discovery of the right version of applications.

Conclusion

Deployment automation, in my personal opinion, is about the process, not the tool. If you have any interesting observations, ideas or comments, please feel free to write to me or leave a comment on this blog.

Disaster Recovery: Impressive RPO and RTO objectives set by Google Apps Operations

Unless you are running a fly by night shop, DR (Disaster recovery) should be one of the top issues for your operations team. In a “Scalable architecture” world, the complexity of DR can become a disaster in itself. 

Yesterday Google Announced that it now finally has DR plan for Google Apps. While this is nice, one should always take such messages with a pinch of salt, until they prove it that they can do it. Look at the DR plan for Google App engine which was also there, but still suffered more than 2 hour outage because of incomplete documentation, insufficient training and probably lack of someone to make a quick decisive decision at the time of failure.

But back to Google Apps for now. These guys are planning for an RPO of 0 seconds, which means multiple datacenters will always be in consistent state all the time.  And they want a RTO to be instant failover as well ! This is an incredible DR plan, and requires technical expertise in all 7 layers of OSI Model to achieve it.

In larger businesses, companies will add a storage area network (SAN), which is a consolidated place for all storage. SANs are expensive, and even then, you’re out of luck if your data center goes down. So the largest enterprises will build an entirely new data center somewhere else, with another set of identical mail servers, another SAN and more people to staff them.

But if, heaven forbid, disaster strikes both your data centers, you’re toast (check out this customer’s experience with a fire). So big companies will often build the second data center far away, in a different ‘threat zone’, which creates even more management headaches. Next they need to ensure the primary SAN talks to the backup SAN, so they have to implement robust bandwidth to handle terabytes of data flying back and forth without crippling their network. There are other backup options as well, but the story’s the same: as redundancy increases, cost and complexity multiplies.

How do you know if your disaster recovery solution is as strong as you need it to be? It’s usually measured in two ways: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). RPO is how much data you’re willing to lose when things go wrong, and RTO is how long you’re willing to go without service after a disaster.

For a large enterprise running SANs, the RTO and RPO targets are an hour or less: the more you pay, the lower the numbers. That can mean a large company spending the big bucks is willing to lose all the email sent to them for up to an hour after the system goes down, and go without access to email for an hour as well. Enterprises without SANs may be literally trucking tapes back and forth between data centers, so as you can imagine their RPOs and RTOs can stretch into days. As for small businesses, often they just have to start over.

For Google Apps customers, our RPO design target is zero, and our RTO design target is instant failover. We do this through live or synchronous replication: every action you take in Gmail is simultaneously replicated in two data centers at once, so that if one data center fails, we nearly instantly transfer your data over to the other one that’s also been reflecting your actions.

This is one of the most ambitious DR plan I’ve ever read off which involves such a huge customer base.They not only have to replicate all the user data into multiple data centers, they have to do it synchronously (or almost synchronously),  across a huge distance (latency can slow down synchronous operations) without impacting users. And to top it all, they have to do a complete site failover if the primary datacenter goes down.

I am impressed, but don’t mind learning more on how they do it.

The Reddit problem: Learning from mistakes

Reddit has a very interesting post about what not to do when trying to build a scalable system. While the error is tragic, I think its an excellent design mistakes to learn from.

Though the post lacked detailed technical report, we might be able to recreate what happened. They mentioned they are using MemcacheDB datastore, with 2GB RAM per node, to keep some data which they need very often.

Dont be confused between MemcacheDB and memcached. While memcached is a distributed cache engine, MemcacheDB is actually a persistent datastore. And because they both use the same protocol, applications often use memcached libraries to connect to MemcacheDB datastore. redditheader

Both memcached and MemcacheDB rely on the clients to figure out how the keys are distributed across multiple nodes. Reddit chose MD5 to generate the keys for their key/value pairs. The algorithm Reddit used to identify which node in the cluster a key should go to could have been dependent on the number of nodes in the system. For example one popular way to identify which node a key should be on would be to use the “modulo” function. For example key “k” could be stored on the node “n” where “n=k modulo 3”. [ If k=101, then n=2 ]

image

Though MemcacheDB uses BDB to persist data, it seems like they heavily relied on keeping all the data in RAM. And at some point they might have hit the upper limit on what could be cached in RAM which caused disk i/o which resulted in slower response times. In a scalable architecture one should have been able to add new nodes and the system should have been able to scale.

Unfortunately though this algorithm works beautifully during regular operation, it fails as soon as you add or remove a node (when you change n). At that point you can’t guarantee that all the data you previously stored on node n would still be on the same node.

And while this algorithm may still be ok for “memcached” cache clusters, its really bad for MemcacheDB which requires “consistent hashing”.

[RedditArchDiagramWhiteBG.png]

Reddit today announced that they have increased RAM on these MemcacheDB servers from 2GB to 6GB, which allows 94% of their DB to be kept in memory. But they have realized their mistake (they probably figured this out long time back) and are thinking about how to fix it. The simplest solution of adding a few nodes requires re-hashing their keys which would take days according to their estimate. And of course just adding nodes without using some kind of “consistent hashing” is still not a scalable solution.

I personally learnt two things

  • Dont mix MemcacheDB and memcached. They are not designed to solve the same problem.
  • Don’t just simply replace memcached with MemcacheDB without thinking twice

There are many different products out there today which do a better job at scaling, so I won’t be surprised if they abandon MemcacheDB completely as well.

Scalability links for Feb 28th 2010

Cassandra as a communication medium – A service Registry and Discovery tool

Few weeks ago while I was mulling over what kind of service registry/discovery system to use for a scalable application deployment platform, I realized that for mid-size organizations with complex set of services, building one from scratch may be the only option.

I also found out that many AWS/EC2 customers have already been using S3 and SimpleDB to  publish/discover services. That discussion eventually led me to investigate Cassandra as the service registry datastore in an enterprise network.

Here are some of the observations I made as I played with column_orientedCassandra for this purpose.  I welcome feedback from readers if you think I’m doing something wrong or if you think I can improve the design further.

  • The biggest issue I noticed with Cassandra was the absence of inverted index which could be worked around as I have blogged here. I later realized there is something called Lucandra  as well which I need to look at, at some point.
  • The keyspace structure I used was very simple… ( I skipped some configuration lines to keep it simple )

<Keyspace Name="devkeyspace">
    <ColumnFamily  CompareWith="UTF8Type" Name="forward"  />
    <ColumnFamily  CompareWith="UTF8Type" Name="reverse"  />
</Keyspace>

  • Using an “OrderPreservingPartitioner” seemed important to do “range scans”.  Order Preserving partitioner keeps objects with similar looking keys together to allow bulk reads and writes. By default Cassandra randomly distributes the objects across the cluster which works well if you only have a few nodes.
  • I eventually plan to use this application across two datacenters. The best way to mirror data across datacenters in Cassandra is by using “RackAwareStrategy”. If you select this option, it tells Cassandra to try to pick replicas of each token from different datacenters/racks. The default algorithm uses IP addresses to determine if two nodes are part of the same rack/datacenter, but there are other interesting ways to do it as well.
  • Some of the APIs changed significantly between the versions I was playing with. Cassandra developers will remind you that this is expected in a product which is still at 0.5 version. What amazes me, however, is the fact that Facebook, Digg and now Twitter have been using this product in production without bringing down everything.
  • I was eventually able to build a thin java webapp to front-end Cassandra, which provided the REST/json interface for registry/discovery service. This is also the app which managed the inverted indexes.
    • Direct Cassandra access from remote services was disabled for security/stability reasons.
    • The app used DNS to loadbalance queries across multiple servers.
  • My initial performance tests on this cluster performed miserably because I forgot that all of my requests were hitting the same node. The right way to tests Cassandra’s capacity is by loadbalancing requests across all Cassandra nodes.
    • Also realized, that by default, the logging mode was set to “DEBUG” which is very verbose. Shutting that down seemed to speed up response times as well.
  • Playing with different consistency levels for reading and writing was also an interesting experience, especially when I started killing nodes just to see the app break. This is what tweeking CAP is all about.
  • Due to an interesting problem related to “eventual consistency”, Cassandra doesn’t completely delete data which was marked deletion or was intentionally changed. In the default configuration that data is kept around for 10 days before its completely removed from the system.
  • Some documentation on the core operational aspects of Cassandra exist, but it would be nice if there were more.

Cassandra was designed as a scalable,highly available datastore. But because of its interesting self-healing and “RackAware” features, it can become an interesting communication medium as well.

Talk on “database scalability”

This is a very interesting talk by Jonathan Ellis on database scalability. He designed and implemented multi-petabyte storage for Mozy and is currently the project chair for Apache Cassandra.

  • Scalability is not improving latency, but increasing throughput
  • But overall performance shouldn’t degrade
  • Throw hardware, not people at the problem
  • Traditional databases use b-tree indexes. But requires the entire index to be in-memory at the same place.
  • Easy bandaid #1– SSD storage is better for b-tree indexes which need to hit disk
  • Easy bandaid #2 – Buy faster server every 2 years. As long as your userbase doesn’t grow faster that Moore’s law
  • Easy bandaid #3 – Use caching to handle hotspots (Distributed)
  • Memcache server failures can change where hashing keys are kept
  • Consistent hashing solves the problem by mapping keys to tokens. The tokens can move around to more or less server. Apps would be able to figure out which keys are where.

Scalable logging using Syslog

Syslog is a commonly used transport mechanism for system logs. But people sometimes forget it could be used for a lot of other purposes as well.

Take, for example, the interesting challenge of aggregating web server logs from 100 different servers into one server and syslogthen figuring out how to merge them. If you have built your own tool to do this, you would have figured out  by now how expensive it is to poll all the servers and how out-of-date these logs could get by the time you process it. If you are not inserting them into some kind of datastore which sorts the rows by timestamp, you now also have to take up the challenge of building merge-sort script.

There is nothing which stops applications from using syslog as well. If your apps are in Java, you should try out Syslog appender for log4j [Ref 1] [Ref 2]. Not only do you get central logging, you also get get to see real-time “tail -f” of events as they happen in a merged file. If there are issues anywhere in your network, you have just one place to look at. If your logging volume is high, you would have to use other tools (or build your own) to do log analysis.

Here are some things you might have to think about if you plan to use syslog for your environment.

  1. Setup different syslog servers for each of your datacenters using split DNS or by use different hostnames.
  2. Try not to send logs across WAN links
  3. Rotate logs on a nightly basis, or depending on the log volume
  4. Reduce amount of logging (don’t do “debug” in production for example)
  5. Write tools to detect change in logging volume in dev/qa environment. If you follow good logging practice, you should be able to identify components which are responsible for the increase very quickly.
  6. Identify log patterns which could be causes of concerns and setup some kind of alerting using your regular monitoring service (nagios for example). Don’t be afraid to use 3rd party tools which do this very well.
  7. Syslog over UDP is non-blocking, but the syslog server can overloaded if logging volume is not controlled. The most expensive part of logging is disk i/o. If you notice high i/o
  8. UDP doesn’t guarantee that every log event will make it to the syslog server. Find out if that level of uncertainty in logging is ok for your environment.

Other interesting observations

  1. The amount of changes required in a java app which is already using log4j to log to a syslog server is trivial
  2. Logging to local files can be disabled, which means you don’t have to worry about disk storage on each server..
  3. If you are using or want to use tools like splunk or hadoop/hbase for log analysis, syslog is probably the easiest way to get there.
  4. You can always loadbalance syslog servers by using DNS loadbalancing.
  5. Apache webservers can’t do syslog out of the box, but you can still make it happen
  6. I personally like haproxy more and it does do syslog out of the box.
  7. If you want to log events from startup/shutdown scripts, you can use the “logger” *nix command to send events to the syslog server.

How is log aggregated in your environment ?

References

SimpleDB now allows you to tweak consistency levels

We discussed Brewer’s Theorm a few days ago and how its challenging to obtain Consistency, Availability and Partition tolerance in any distributed system. We also discussed that many of the Amazon Web Servicesdistributed datastores allow CAP to be tweaked to attain certain operational goals.

Amazon SimpleDB, which was released as an “Eventually Consistent” datastore,  today launched a few features to do just that.

  • Consistent reads: Select and GetAttributes request now include an optional Boolean flag “ConsistentRead” which requests datastore to return consistent results only. If you have noticed scenarios where read right after a write returned an old value, it shouldn’t happen anymore.
  • Conditional put/puts, delete/deletes : By providing “conditions” in the form of a key/value pair SimpleDB can now conditionally execute/discard an operation. This might look like a minor feature, but can go a long way in providing reliable datastore operations.

Even though SimpleDB now enables operations that support a stronger consistency model, under the covers SimpleDB remains the same highly-scalable, highly-available, and highly durable structured data store. Even under extreme failure scenarios, such as complete datacenter failures, SimpleDB is architected to continue to operate reliably. However when one of these extreme failure conditions occurs it may be that the stronger consistency options are briefly not available while the software reorganizes itself to ensure that it can provide strong consistency. Under those conditions the default, eventually consistent read will remain available to use.

References

NoSQL in the Twitter world

NoSQL solutions have one thing in common. They are generally designed for horizontal scalability. So its no wonder that lot of applications in the “twitter” world have picked NoSQL based datastores for their Twitter.compersistence layer. Here is a collection of these apps from MyNoSQL blog.

  1. Twitter uses Cassandra
  2. MusicTweets used Redis [ Ref ] – The site is dead, but you can still read about it
  3. Tstore uses CouchDB
  4. Retwis uses CouchDB
  5. Retwis-RB uses Redis and Sinatra ??  – No idea what sinatra is. Will have to look into it. [ Update: Sinatra is not a DB store ]
  6. Floxee uses MongoDB
  7. Twidoop uses Hadoop
  8. Swordfish built on top of Tokyo Cabinet comes with a twitter clone app with it.
  9. Tweetarium uses Tokyo Cabinet

References

Do you know of any more ?

Eventual consistency is just caching ?

So there is someone who thinks “eventual consistency is just caching”.  Though I liked the idea of discussing this, I don’t agree with Udi’s views on this.

“Cache” is generally used to store data which is more expensive to obtain from the primary location. For example, caching mysql queries is ideal for queries which could take more than fraction of a second to execute. Another example is caching queries to S3, SimpleDB or Google’s datastore which could cost money and introduce network latency into the mix. Though most applications are built to use such caches, they are also designed to be responsive in absence of caching layer.

The most important difference between “cache” and a “datastore” is that the dataflow is generally from “datastore” to “cache” rather than the other way round. Though one could queue data on “cache” first and then update datastore later (for performance reasons) that is not the way one should use it. If you are using “cache” to queue data for slower storage, you are using the wrong product. There are better “queuing” solutions  (activemq for example) that can do it for you in a more reliable way.

In most “eventually consistent” systems, there is no concept of primary and secondary nodes. Most nodes on such systems are considered equal and have similar performance characteristics.

Since “caching” solutions are designed for speed, they generally don’t have a concept of “replicas” or allow persistence to disk. Synchronizing between replica’s or to a disk can be expensive and be counter productive which is why its rare to find them on “caching” products. But many “eventually consistent” systems do provide a way for developers to request the level of “consistency” (or disk persistence) desired.

Do you have an opinion on this ? Please share examples if you have seen “caching” layer being used as an “eventually consistent datastore”.

Update: Udi mentioned on twitter that “write through caches” are eventually consistent. Sure, they are if you are talking about a caching layer on top of a persistent layer. I think there is an argument which could be made that “caches” are eventually consistent, but the reverse may not be true which is what his original post mentioned.