December 20, 2010

Switching roles: next stop Google

January 2011 will start a little differently for me after 10 years. I’ve accepted a position in the Google Apps Enterprise group and will be joining them early next month.

Other than the fun stuff I do outside my regular job, I’ve been in IT-related roles for as long as I can remember. And while IT has been very challenging and is an exciting field to be in, I feel that it’s time for a little exploration.

I will deeply miss all of my friends at Ingenuity, some of whom I’ve worked with for over 10 years, but I’m ready for my next challenge.

My scalable web architecture blog and this personal blog will continue to stay up, but I’m not sure at this point how my new job will impact the frequency at which I post here.

December 18, 2010

S4: Distributed Stream Computing Platform

A few weeks ago I mentioned Yahoo! Labs was working on something called S4 for real-time data analysis. Yesterday they released an 8-page paper with a detailed description of how and why they built it. The abstract from the paper is included below.

It’s interesting to note that the authors compared S4 with MapReduce and explained that MapReduce is too heavily optimized for batch processing to be a good fit for real-time computation. They also made the architectural decision not to build a system that does both offline (batch) and real-time processing, since they feared such a system would end up being good at neither.

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.


Authors: Neumeyer L, Robbins B, Nair A, Kesari A
Source: Yahoo! Labs
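
To make the "keyed events routed with affinity to PEs" idea from the abstract a little more concrete, here is a rough Python sketch. It is not S4 code and does not use the S4 API; the class names (WordCountPE, Node) and the route function are made up for illustration, and the hash-based placement is just an assumption about one simple way to get affinity.

```python
import zlib


class WordCountPE:
    """Hypothetical processing element: one instance exists per distinct key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        # Consume the event, update local state, and emit a derived event
        # that some downstream PE could consume in turn.
        self.count += 1
        return [("count", self.key, self.count)]


class Node:
    """Hypothetical processing node that lazily creates one PE per key."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.pes = {}

    def dispatch(self, key, event):
        pe = self.pes.get(key)
        if pe is None:
            pe = self.pes[key] = self.pe_class(key)
        return pe.process(event)


def route(nodes, key, event):
    # Affinity: a stable hash of the key always picks the same node,
    # so every event for a given key reaches the same PE instance.
    index = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return nodes[index].dispatch(key, event)


nodes = [Node(WordCountPE) for _ in range(3)]
outputs = [route(nodes, word, {"word": word}) for word in "a b a c a b".split()]
```

Because each PE only ever sees the events for its own key, it can keep its state locally without any locking, which is roughly the Actors-style encapsulation the abstract refers to.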

December 07, 2010

REST APIs for cloud management and the launch

I found the top two stories on scalebig last night to be interesting enough for me to dig a little deeper. The one which surprised me the most was William Vambenepe’s post about why he thinks REST APIs don’t matter in the context of cloud management. While REST might be ideal for many different things, including web-based applications which are accessed mostly by browsers, Amazon chose to avoid REST for most of its infrastructure management APIs.

Has this lack of RESTfulness stopped anyone from using it? Has it limited the scale of systems deployed on AWS? Does it limit the flexibility of the Cloud offering and somehow force people to consume more resources than they need? Has it made the Amazon Cloud less secure? Has it restricted the scope of platforms and languages from which the API can be invoked? Does it require more experienced engineers than competing solutions?

I don’t see any sign that the answer is “yes” to any of these questions. Considering the scale of the service, it would be a multi-million dollars blunder if indeed one of them had a positive answer.

Here’s a rule of thumb. If most invocations of your API come via libraries for object-oriented languages that more or less map each HTTP request to a method call, it probably doesn’t matter very much how RESTful your API is.

The Rackspace people are technically right when they point out the benefits of their API compared to Amazon’s. But it’s a rounding error compared to the innovation, pragmatism and frequency of iteration that distinguishes the services provided by Amazon. It’s the content that matters.

And the other big news was of course the launch of a new cloud datastore by salesforce at database.com. Interestingly, they decided to brand it with its own website instead of making it part of their existing set of services. It’s possible they did it to distance this new service from the impression that it’s only useful for applications which need other salesforce services. For more in-depth technical information continue reading here.

The infrastructure promises automatic tuning, upgrades, backups and replication to remote data centers, and automatic creation of sandboxes for development, test and training. It also offers enterprise search services, allowing developers to access a full-text search engine that respects enterprise security rules.

In terms of pricing, access will be free for 3 users, and up to 100,000 records and 50,000 transactions per month. The platform will cost $10 per month for each set of 100,000 records beyond that and another $10 per month for each set of 150,000 transactions beyond that benchmark. The enterprise-level services will be an additional $10 per user per month and will include user identity, authentication and row-level security access controls.
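
To get a feel for what those numbers add up to, here is a small Python sketch of the pricing as quoted above. Two things are my own assumptions and not stated in the quote: a partial set of records or transactions is billed as a full set, and the enterprise fee is simply $10 times the number of users who get it. The quote also doesn't say what happens beyond 3 users on the free tier, so that is not modeled. The function name is made up.

```python
FREE_RECORDS = 100_000
FREE_TRANSACTIONS = 50_000
RECORD_SET_SIZE = 100_000        # $10 per extra set of records
TRANSACTION_SET_SIZE = 150_000   # $10 per extra set of transactions
PRICE_PER_UNIT = 10              # dollars per month


def ceil_div(amount, size):
    # Integer ceiling division; assumes partial sets are billed as full sets.
    return -(-amount // size)


def monthly_cost(records, transactions, enterprise_users=0):
    """Estimated monthly bill in dollars under the assumptions above."""
    extra_records = max(0, records - FREE_RECORDS)
    extra_transactions = max(0, transactions - FREE_TRANSACTIONS)
    record_sets = ceil_div(extra_records, RECORD_SET_SIZE)
    transaction_sets = ceil_div(extra_transactions, TRANSACTION_SET_SIZE)
    return PRICE_PER_UNIT * (record_sets + transaction_sets + enterprise_users)
```

For example, under these assumptions 300,000 records and 350,000 transactions with 5 enterprise users would come to 2 record sets, 2 transaction sets and 5 user fees, or $90 a month.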

Other references: Dreamforce 2010 – Launch Interview with Eric Stahl

December 06, 2010

Providing Dynamic DNS over “Amazon Route 53” (a hackathon)

In hindsight, yesterday’s “Route 53” announcement was not completely unexpected. Amazon is an IaaS provider and it’s in their own interest to automate infrastructure as much as possible. After tackling monitoring and CloudFront features, DNS was one of the more obvious targets for improvement.

So when I was trying to pick a challenge for this morning’s hackathon, I picked one around the “Amazon Route 53” service. At the end of the day I had an almost functional public dynamic DNS service using “Route 53” as the DNS service and Twitter’s OAuth service for authentication. The final hack is up here. You are most welcome to play with it and/or use it.

After the initial creation of a user in the system (with a little help from Twitter’s OAuth), the end user is free to use the browser-based web application, or the lynx or curl based REST interface, to create, update and delete host records. The current version only supports “A” records, but it would be expanded to other record types if there is enough interest.

A one-line script in cron is all you need to make sure all your DHCP-based services are always publicly addressable over the internet.

The current version of flagthis provides addresses in the following format.

Here are some of my other observations, as I tried to use Route 53 for the first time today.

  • I was forced to install some CPAN modules on my ec2 instance which sucked up a lot of time.
  • There are two scripts you really need from Amazon’s code library to learn the basics of the service “” and “”. The first one helps you create the XML file which in turn creates new “zones”. And the second script is what gets used to upload and execute new requests using XML files.
  • “” requires your secret and key in a file in home directory with 600 as the permission of the file.
  • Once the basic commands started working, I focused on more complex requests like “update” and “delete”
  • That’s when I figured out that there is no concept of “update” in Route 53. If you want to update a record you have to send out a “delete” and a “create” request. To avoid disrupting a live system, they recommend that the deletion and creation happen in the same request (which acts like an atomic transaction).
  • Another thing I realized is that the only way to delete an old record is to have complete information about that record from when it was created. For example, let’s say you created an entry “” A with 1400 as the TTL; you then have to send all of this information again at deletion time (including the TTL) to delete it. This was slightly frustrating.
  • At the end of the day I couldn’t figure out a quick way to list all the hosts in a zone, or search for details of a particular record in a zone. This is next on my list. The way I got around this step was by using “dig” to get all that information over the DNS protocol instead of the HTTP-based API.
  • Every change request submitted has a 10 to 15 character request ID associated with it. A client can poll the status of this request ID anytime. Since I was playing with a small number of changes, most of them completed within a second of submission. While the change is happening, the status of the request is set to “PENDING” and it switches to “INSYNC” as soon as the change is complete.
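
The delete-plus-create pattern from the list above can be sketched roughly like this in Python. This is not the code from my hack, just an illustration of building a single change batch that removes the old “A” record (with its full original details, including the TTL) and creates the new one. The XML element names and the 2010-10-01 namespace are what I believe the API expects, but treat them as assumptions and check the Route 53 documentation; the function names and the example host and addresses are made up. Signing and sending the request is left out.

```python
import time
import xml.etree.ElementTree as ET

# Assumed namespace for the 2010-10-01 API version; verify against the docs.
ROUTE53_NS = "https://route53.amazonaws.com/doc/2010-10-01/"


def _add_change(changes, action, name, ttl, ip):
    # One <Change> entry: an action plus the complete record it applies to.
    change = ET.SubElement(changes, "Change")
    ET.SubElement(change, "Action").text = action
    rrset = ET.SubElement(change, "ResourceRecordSet")
    ET.SubElement(rrset, "Name").text = name
    ET.SubElement(rrset, "Type").text = "A"
    ET.SubElement(rrset, "TTL").text = str(ttl)
    records = ET.SubElement(rrset, "ResourceRecords")
    record = ET.SubElement(records, "ResourceRecord")
    ET.SubElement(record, "Value").text = ip


def build_update_request(name, old_ttl, old_ip, new_ttl, new_ip):
    """Build one change batch that deletes the old record and creates the new one."""
    root = ET.Element("ChangeResourceRecordSetsRequest", xmlns=ROUTE53_NS)
    batch = ET.SubElement(root, "ChangeBatch")
    changes = ET.SubElement(batch, "Changes")
    # The DELETE must repeat the old record exactly, including its TTL.
    _add_change(changes, "DELETE", name, old_ttl, old_ip)
    _add_change(changes, "CREATE", name, new_ttl, new_ip)
    return ET.tostring(root, encoding="unicode")


def wait_until_insync(fetch_status, attempts=10, delay=1.0):
    """Poll a caller-supplied status function until it reports INSYNC."""
    for _ in range(attempts):
        if fetch_status() == "INSYNC":
            return True
        time.sleep(delay)
    return False


request_xml = build_update_request(
    "host.example.com.", 1400, "192.0.2.10", 1400, "192.0.2.20"
)
```

The fetch_status argument stands in for whatever call you use to look up the change request ID; keeping it as a plain function makes the polling loop easy to test without touching the real service.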

Folks who have been using S3/SimpleDB as a datastore for a service registry should strongly consider using “Route 53” for the same thing. And if you are still running your own DNS servers on EC2, I think it’s time to ask yourself whether it’s time to move on.

December 05, 2010

Amazon Route 53 : Programmable DNS is finally here

Managing DNS has been considered an art by many. If you manage your own DNS records, and run your own external DNS servers, I’m sure you have some stories to share. Unfortunately, unlike most other infrastructure on the internet, DNS screw-ups can get very costly, especially because caching policies tend to keep your mistakes alive long after you have rolled back your changes.

The unforgiving nature of DNS has pushed most people, except a few hardcore sysadmins, to avoid DNS hell and choose a managed service to do it for them. Domain name registrars like network solutions, mydomain and godaddy already provide these DNS services, but I can’t recall any of them providing APIs to make these changes automatically. DynDNS does provide an API to change DNS mappings, but it costs 15 bucks a year for a single host. There might be others which I’m not aware of, but the bottom line is that there is no standard, and it’s not cheap.

Customers on AWS today unfortunately have the same problem. And not surprisingly they too prefer to use 3rd party service providers to monitor, set up and manage DNS records for them. Today AWS is announcing a new service, “Amazon Route 53”, which technically isn’t a significant breakthrough. But considering the number of users already on AWS and the demand for such a service, this could be one of the biggest game-changing events in the DNS world in the last decade.

The service is pretty cheap, about 12 bucks a year, and gives you a complete set of APIs to create, delete, modify, maintain and query DNS records on this new service.

To make the transition simple they even have migration tools from BIND to Amazon Route 53 ready to go. Here are pointers to some more documentation if you want to get down and dirty with it.

But stop thinking of this as a simple DNS service. I can see a whole range of interesting applications being built over this service in the next few months. The simplest application which I think would be built over this is something like a “Dynamic DNS service”, which would be cheap to build with Route 53 doing most of the grunt work for you.

Here is how Jeff Barr introduced this service

Today we are introducing Amazon Route 53, a programmable Domain Name Service. You can now create, modify, and delete DNS zone files for any domain that you own. You can do all of this under full program control—you can easily add and modify DNS entries in response to changing circumstances. For example, you could create a new sub-domain for each new customer of a Software as a Service (SaaS) application. DNS queries for information within your domains will be routed to a global network of 16 edge locations tuned for high availability and high performance.

Route 53 introduces a new concept called a Hosted Zone. A Hosted Zone is equivalent to a DNS zone file. It begins with the customary SOA (Start of Authority) record and can contain other records such as A (IPV4 address), AAAA (IPV6 address), CNAME (canonical name), MX (mail exchanger), NS (name server), and SPF (Sender Policy Framework). You have full control over the set of records in each Hosted Zone.

Here is some more info from Werner Vogels

Amazon Route 53

Amazon Route 53 is a new service in the Amazon Web Services suite that manages DNS names and answers DNS queries. Route 53 provides Authoritative DNS functionality implemented using a world-wide network of highly-available DNS servers. Amazon Route 53 sets itself apart from other DNS services that are being offered in several ways:

A familiar cloud business model: A complete self-service environment with no sales people in the loop. No upfront commitments are necessary and you only pay for what you have used. The pricing is transparent and no bundling is required and no overage fees are charged.

Very fast update propagation times: One of the difficulties with many of the existing DNS services are the very long update propagation times, sometimes it may even take up to 24 hours before updates are received at all replicas. Modern systems require much faster update propagation to for example deal with outages. We have designed Route 53 to propagate updates very quickly and give the customer the tools to find out when all changes have been propagated.

Low-latency query resolution: The query resolution functionality of Route 53 is based on anycast, which will route the request automatically to the DNS server that is the closest. This achieves very low-latency for queries which is crucial for the overall performance of internet applications. Anycast is also very robust in the presence of network or server failures as requests are automatically routed to the next closest server.

No lock-in. While we have made sure that Route 53 works really well with other Amazon services such as Amazon EC2 and Amazon S3, it is not restricted to using it within AWS. You can use Route 53 with any of the resources and entities that you want to control, whether they are in the cloud or on premise.

We chose the name "Route 53" as a play on the fact that DNS servers respond to queries on port 53. But in the future we plan for Route 53 to also give you greater control over the final aspect of distributed system naming, the route your users take to reach an endpoint. If you want to learn more about Route 53 visit and read the blog post at the AWS Developer weblog.

Here are some of the other comments on this service

December 04, 2010

Scalability links for December 4th

December 02, 2010

AWS CloudWatch is now really open for business

In a surprise move Amazon today released a bunch of new features for its CloudWatch service, some of which, until now, were provided by third party service providers.

  • Basic Monitoring of Amazon EC2 instances at 5-minute intervals at no additional charge.
  • Elastic Load Balancer Health Checks - Auto Scaling can now be instructed to automatically replace instances that have been deemed unhealthy by an Elastic Load Balancer.
  • Alarms - You can now monitor Amazon CloudWatch metrics, with notification to the Amazon SNS topic of your choice when the metric falls outside of a defined range.
  • Auto Scaling Suspend/Resume - You can now push a "big red button" in order to prevent scaling activities from being initiated.
  • Auto Scaling Follow the Line - You can now use scheduled actions to perform scaling operations at particular points in time, creating a time-based scaling plan.
  • Auto Scaling Policies - You now have more fine-grained control over the modifications to the size of your AutoScaling groups.
  • VPC and HPC Support - You can now use AutoScaling with Amazon EC2 instances that are running within your Virtual Private Cloud or as Cluster Compute instances.

Kafka : A high-throughput distributed messaging system.

Found an interesting new open source project which I hadn’t heard about before. Kafka is a messaging system used by LinkedIn as the foundation of their activity stream processing.
Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.

  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.

  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.

  • Support for parallel data load into Hadoop.

Kafka is aimed at providing a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.

The use for activity stream processing makes Kafka comparable to Facebook's Scribe or Cloudera's Flume, though the architecture and primitives are very different for these systems and make Kafka more comparable to a traditional messaging system. See our design page for more details.
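
To illustrate what "partitioning messages while maintaining per-partition ordering" means in practice, here is a toy Python sketch. It is not Kafka and does not use Kafka's API; the class and function names are invented, the in-memory lists stand in for the on-disk logs, and hashing the key to pick a partition is my assumption about one simple partitioning scheme.

```python
import zlib


class PartitionedLog:
    """Toy stand-in for a topic split into append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash: all messages with the same key land in one partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, message):
        partition = self.partition_for(key)
        self.partitions[partition].append(message)
        # Return where the message landed: (partition, offset).
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset, max_messages=100):
        # Consumers track their own offset and read forward from it.
        return self.partitions[partition][offset:offset + max_messages]


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over a group of consumer machines."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


log = PartitionedLog(4)
for i in range(5):
    log.publish("user-1", f"view-{i}")
user_partition = log.partition_for("user-1")
```

Ordering is only guaranteed inside a partition, so anything that must be seen in order (say, one user's activity) needs to share a key. Since each partition has exactly one owner in the assignment, consumption can be spread over several machines without two of them reading the same messages.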

The unbiased private vs AWS ROI worksheet

One of my problems with most cloud ROI worksheets is that they are heavily weighted towards use cases where resource usage is very bursty. But what if your resource requirements aren’t bursty? And what if you have a use case where you have to maintain a small IT team to manage some on-site resources due to compliance and other issues?


In his latest post, Richard shares his worksheet for everyone to play with.