Latest Publications

Sawzall and the PIG

When I heard interesting uses cases of how “Sawzall” is used to hack huge amounts of log data within Google I was thinking about two things.

  • Apache PIG, which is “a platform for analyzing large data sets that consists of a high-level language Pig for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.”
  • CEP (Complex event processing) – consists in processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time. [ Also look at esper ]

Google has opened parts of this framework in a project called “Szl

Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google’s logs processing language of choice and is used for various other data analysis tasks across the company.

Instead of specifying how to process the entire data set, a Sawzall program describes the processing steps for a single data record independent of others. It also provides statements for emitting extracted intermediate results to predefined containers that aggregate data over all records. The separation between per-record processing and aggregation enables parallelization. Multiple records can be processed in parallel by different runs of the same program, possibly distributed across many machines. The language does not specify a particular implementation for aggregation, but a number of aggregators are supplied. Aggregation within a single execution is automatic. Aggregation of results from multiple executions is not automatic but an example program is supplied.

Here is a quick example of how it could be used…

  topwords: table top(3) of word: string weight count: int;
   fields: array of bytes = splitcsvline(input);
   w: string = string(fields[0]);
   c: int = int(string(fields[1]), 10);
   if (c != 0) {
   emit topwords <- w weight c;
   }

Given the input:

  abc,1
   def,2
   ghi,3
   def,4
   jkl,5

The program (using –table_output) emits:

  topwords[] = def, 6, 0
   topwords[] = jkl, 5, 0
   topwords[] = ghi, 3, 0

Presentation: “OrientDB, the database of the web”

I knew there was something called “OrientDB”, but didn’t know much about it until I went through these slides. Here is what I learned in one sentence. Its a easy to install NoSQL(schemaless) datastore, with absolutely no configuration required, supports ACID transactions, it can be used as a document store, a graph store and a key value store, it can be queried using SQL-like and JSON syntax, supports indexing and triggers and its been benchmarked to do 150000 inserts using commodity hardware.  That’s a lot of features.

Cloud economics: Not really black and white..

While some of the interest in moving towards public cloud is based on sound economics, there is a small segment of this movement purely due to the “herd mentality”.

The slide on the right is from a Microsoft publication shows that larger networks may be less economical on the cloud (at least today).

Richard Farley, has been discussing this very topic for few months now. He observed that a medium sized organization which already has a decent IT infrastructure including a dedicated IT staff to support it has a significantly smaller overhead than what cloud vendors might make it look like.

Here is a small snippet from his blog. If you are not afraid to get dirty with numbers read the rest here.

Now, we know we need 300 virtual servers, each of which consumes 12.5% of a physical host.  This means we need a total of 37.5 physical hosts.  Our vendor tells us these servers can be had for $7k each including tax and delivery with the cabinet.  We can’t buy a half server, and want to have an extra server on hand in case one breaks.  This brings our total to 39 at a cost of $273k.  Adding in the cost of the cabinet, we’re up to $300k.

There are several non-capital costs we now have to factor in.  Your vendor will provide warranty, support and on-site hardware replacement service for the cabinet and servers for $15k per year.  Figure you will need to allocate around 5% of the time of one of your sys admins to deal with hardware issues (i.e., coordinating repairs with the vendor) at a cost of around $8k per year in salary and benefits.  Figure power and cooling for the cabinet will also cost $12k per year.  In total, your non-capital yearly costs add up to $35k.

One thing which posts doesn’t clearly articulate, is the fact that while long term infrastructure is cheaper to host in private cloud, its may still be more economical to use public cloud for short term resource intensive projects.

Cassandra: What is HintedHandoff ?

Nate has a very good post about how Cassandra is different from a lot of other distributed data-stores. In particular he explains that every node in a Cassandra cluster are identical to every other node. After using cassandra and a few months I can tell you for a fact that its true. It does come at a price though. Because its so decentralized, if you want to make a schema change, for example, configuration files of all of the nodes in the cluster need to be updated all at the same time. Some of these problems will go away when 0.70 finally comes out.

While is true that Cassandra doesn’t have a concept of single “master” server, each node participating in the cluster do actually act as masters and slaves of parts of the entire key range. The actual % size of the key range owned by an individual node depends on the replication factor, on how one picks the keys and what partitioner algorithm was selected.

The fun starts when a node, which could be the master for a range of keys, goes down. This is how Nate explains the process..

Though the node is the "primary" for a portion of the data in the cluster, the number of copies of the data kept on other nodes in the cluster is configurable. When a node goes down, the other nodes containing copies, referred to as "replicas", continue to service read requests and will even accept writes for the down node. When the node returns, these queued up writes are sent from the replicas to bring the node back up to date

And this is from the Cassandra wiki

If a node which should receive a write is down, Cassandra will write a hint to a live replica node indicating that the write needs to be replayed to the unavailable node. If no live replica nodes exist for this key, and ConsistencyLevel.ANY was specified, the coordinating node will write the hint locally. Cassandra uses hinted handoff as a way to (1) reduce the time required for a temporarily failed node to become consistent again with live ones and (2) provide extreme write availability when consistency is not required.

A hinted write is NOT sufficient to count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. Take the simple example of a cluster of two nodes, A and B, and a replication factor of 1 (each key is stored on one node). Suppose node A is down while we write key K to it with ConsistencyLevel.ONE. Then we must fail the write: recall from the API page that "if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write."

Thus if we write a hint to B and call the write good because it is written "somewhere," there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to him. Historically, only the lowest ConsistencyLevel of ZERO would accept writes in this situation; for 0.6, we added ConsistencyLevel.ANY, meaning, "wait for a write to succeed anywhere, even a hinted write that isn’t immediately readable."

Mike Perham has a related post on the same topic. He goes further and explains that because there could be scenarios where writes are not immediately visible due to a disabled master node, its possible that master could get out of sync with the slaves in the confusion. There is a process called “anti-entropy” which cassandra uses to detect and resolve such issues. Here is how he explains

The final trick up Cassandra’s proverbial sleeve is anti-entropy. AE explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff don’t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle Tree to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.

OpenTSDB – Distributed time series database

Ever since I saw a demo of this tool, I’ve been on the edge, waiting for it to be opensourced so that I could use it.  The problem its trying to solve is a real pain-point which most webops folks would understand.

Yesterday folks at stumbleupon finally opened it up. Its released under LGPLv3 license. You can find the source here and the documentation here.

At StumbleUpon, we have found this system tremendously helpful to:

  • Get real-time state information about our infrastructure and services.
  • Understand outages or how complex systems interact together.
  • Measure SLAs (availability, latency, etc.)
  • Tune our applications and databases for maximum performance.
  • Do capacity planning.

AWS cloudfront grows up… a little. Now allows Custom origins.

 Cloudfront has come a long way from its humble beginnings. Here is what Jeff had to say when he announced that its out of “beta”….

    1. First, we’ve removed the beta tag from CloudFront and it is now in full production. During the beta period we listened to our customers and added a number of important features including Invalidation, a default root object, HTTPS access, private content, streamed content, private streamed content,AWS Management Console support, request logging, and additional edge locations. We’ve also reduced our prices.
    2. There’s now an SLA (Service Level Agreement) for CloudFront. If availability of your content drops below 99.9% in any given month, you can apply for a service credit equal to 10% of your monthly bill. If the availability drops below 99% you can apply for a service credit equal to 25% of your monthly bill.

While all this is a big step forward, its probably not enough for the more advanced CDN users to switch over yet.

Here are a couple of issues which stuck out in the Developer Guide.

  • Query parameters are not used to generate cache key. So while it looks like it can pull content from an elastic loadbalancer, it still acts like a giant S3 accelerator.
  • Doesn’t support HTTP/1.1 yet. So if you have multiple domains on the same IP, this solution isn’t for you.

Building your first cloud application on AWS

Building your first web application on AWS is like shopping for a car at pepboys, part by part. While manuals to build one might be on aisle 5, the experience of having built one already is harder to buy.

Here are some interesting logistical questions, which I don’t think get enough attention, when people discuss issues around building a new AWS based service.

  1. Picking the right Linux distribution: Switching OS distribution may not be too simple if your applications need custom scripts. Picking and sticking with a single distribution will save a lot of lost time.
  2. Automated server builds: There are many ways to skin this cat. Chef, Puppet, Cfengine are all good… Whats important is to pick one early in the game.
  3. Multi-Availability Zone support: Find out if multi availability zone support is important. This can impact over all architecture of the solution and tools used to implement the solution.
  4. Data consistency requirements: Similar to the Multi-AZ support question, its important to understand the data consistency tolerance of the application before one starts designing the application.
  5. Datastore: There are different kinds of datastores available as part of AWS itself (SimpleDB, S3 and RDS). If you are planning to keep your options open about moving out of AWS at some point, you should think about picking a datastore which you could move out with you with little effort. There are many NoSQL and RDBMS solutions to choose from.
  6. Backups: While some think its a waste of time to think about backups too early, I suspect those who don’t will be spending way too much time later. The long term backup strategy is integral part of disaster recovery planning, without which you shouldn’t think of going live.
  7. Integration with external data sources:  If this application is part of a larger cluster of application which is running somewhere else, think about how data would be sent back and forth. There are lots of different options depending on how much data is involved (or how important protection of that data is)
  8. Monitoring/Alerting: Most standard out of the box monitoring tools can’t handle dynamic infrastructure very well. There are, however,  plugins available for many existing monitoring solutions which can handle the dynamic nature of infrastructure. You could also choose to use one of the 3rd party monitoring services if you’d rather pay someone else to do it.
  9. Security: You should be shocked to see this on #9 on my list. If your service involves user data, or some other kind of intellectual property, build multi-tiered architecture to segment different parts of your application from targeted attacks. Security is also very important while picking the right caching and web server technologies.
  10. Development: Figure out how developers would use AWS. Would they share the same AWS account, share parts of the infrastructure, share datastore, etc. How would the developer resources be monitored so that unintentional uses of excessive resources could be flagged for alerting.

Are there other subtle issues which I should have listed here ? Let me know.

 

Riak MapReduce: A story in Three Acts

 

Rapid prototyping with solr

Extreme prototyping with Solr by Eric Hatcher

At ApacheCon this week I presented “Rapid Prototyping with Solr”.  This is the third time I’ve given a presentation with the same title.  In the spirit of the rapid prototyping theme, each time I’ve created a new prototype just a day or so prior to presenting it.  At Lucene EuroCon the prototype used attendee data, a treemap visualization, and a cute little Solr-powered “app” for picking attendees at random for the conference giveaways.  For a recent Lucid webinar the prototype was more general purpose, bringing in and making searchable rich documents and faceting on file types with a pie chart visualization.

This time around, the data set I chose was Data.gov’s catalog of datasets, which fit with the ApacheCon open source aura, and Lucid Imagination’s support of Open Source for America, which helps to advocate for open source in the US Federal Government.  The prototype built includes faceting browsing, query term suggest, hit highlighting, result clustering, spell checking, document detail, and a bonus Venn diagram visualization.

Shipping Trunk : For web applications

I had briefly blogged about this presentation before from Velocity 2010. I wish they had released the video for this session. I went through this slide deck again today to see if Paul mentioned any of the problems organization like ours are dealing with in its transition from quarterly releases to weekly/continuous releases.

One of the key observations Paul made during his talk is that most organizations still treat web applications as desktop software and have very strict quality controls which may not be as necessary since releasing changes for web app in a SAAS (Software as a service) is much more cheaper than for releasing patches for traditional desktop software.

Here are some of the other points he made. For really detailed info check out the [slides]

  • Deploy frequently, facilitating rapid product iteration
  • Avoid large merges and the associated integration testing
  • Easily perform A/B testing of functionality
  • Run QA and beta testing on production hardware
  • Launch big features without worrying about your infrastructure
  • Provide all the switches your operations team needs to manage the deployed system

Slides: Always ship Trunk: Managing Change In Complex Websites