December 20, 2010

Switching roles: next stop Google

January 2011 will start a little differently for me after 10 years. I’ve accepted a position in the Google Apps Enterprise group and will be joining them early next month.

Other than the fun stuff I do outside my regular job, I’ve been in IT-related roles for as long as I can remember. And while IT has been a very challenging and exciting field to be in, I feel that it’s time for a little exploration.

I will deeply miss all of my friends at Ingenuity, some of whom I’ve worked with for over 10 years, but I’m ready for my next challenge.

My scalable web architecture blog and this personal blog will continue to stay up, but I’m not sure at this point how my new job will impact the frequency at which I post here.

December 18, 2010

S4: Distributed Stream Computing Platform

A few weeks ago I mentioned Yahoo! Labs was working on something called S4 for real-time data analysis. Yesterday they released an 8-page paper with a detailed description of how and why they built it. Here is the abstract from the paper.

It’s interesting to note that the authors compared S4 with MapReduce and explained that MapReduce is too optimized for batch processing to be a good fit for real-time computation. They also made an architectural decision not to build a system that does both offline (batch) processing and real-time processing, since they feared such a system would end up being good at neither.

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.

Code: http://s4.io/

Authors: Neumeyer L, Robbins B, Nair A, Kesari A
Source: Yahoo! Labs
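
To make the PE idea in the abstract a little more concrete, here is a tiny conceptual sketch in Python. This is not the real S4 API (which is Java and Spring based); the class and method names are made up purely to illustrate keyed affinity and the emit/publish flow.

class WordCountPE(object):
    """Toy processing element (PE): bound to one key value, keeps private state, emits results."""
    def __init__(self, key, emit):
        self.key = key        # the key value this PE instance has affinity for
        self.count = 0
        self.emit = emit      # callback used to emit events downstream / publish results

    def process(self, event):
        self.count += 1
        self.emit("counts", {"word": self.key, "count": self.count})

class Dispatcher(object):
    """Routes each keyed event to the PE instance owning that key (keyed affinity)."""
    def __init__(self, pe_factory):
        self.pe_factory = pe_factory
        self.pes = {}

    def dispatch(self, key, event):
        if key not in self.pes:
            self.pes[key] = self.pe_factory(key, self.publish)
        self.pes[key].process(event)

    def publish(self, stream, event):
        print(stream, event)  # stand-in for "emit to another PE" or "publish results"

dispatcher = Dispatcher(WordCountPE)
for word in ["cats", "dogs", "cats"]:
    dispatcher.dispatch(word, {"word": word})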

December 07, 2010

REST APIs for cloud management and the Database.com launch

I found the top two stories on scalebig last night to be interesting enough for me to dig a little deeper. The one which surprised me the most was William Vambenepe’s post about why he thinks that REST APIs don’t matter in the context of cloud management. While REST might be ideal for many different things, including web-based applications accessed mostly by browsers, Amazon chose to avoid REST for most of its infrastructure management APIs.

Has this lack of REStfulness stopped anyone from using it? Has it limited the scale of systems deployed on AWS? Does it limit the flexibility of the Cloud offering and somehow force people to consume more resources than they need? Has it made the Amazon Cloud less secure? Has it restricted the scope of platforms and languages from which the API can be invoked? Does it require more experienced engineers than competing solutions?

I don’t see any sign that the answer is “yes” to any of these questions. Considering the scale of the service, it would be a multi-million dollar blunder if indeed one of them had a positive answer.

Here’s a rule of thumb. If most invocations of your API come via libraries for object-oriented languages that more or less map each HTTP request to a method call, it probably doesn’t matter very much how RESTful your API is.

The Rackspace people are technically right when they point out the benefits of their API compared to Amazon’s. But it’s a rounding error compared to the innovation, pragmatism and frequency of iteration that distinguishes the services provided by Amazon. It’s the content that matters.

And the other big news was of course the launch of a new cloud datastore by Salesforce at Database.com. Interestingly, they decided to brand it with its own website instead of making it part of their existing set of services. It’s possible they did it to distance this new service from the impression that it’s only useful for applications which need other Salesforce services. For more in-depth technical information continue reading here.

The infrastructure promises automatic tuning, upgrades, backups and replication to remote data centers, and automatic creation of sandboxes for development, test and training. Database.com offers enterprise search services, allowing developers to access a full-text search engine that respects enterprise security rules

In terms of pricing, Database.com access will be free for 3 users, and up to 100,000 records and 50,000 transactions per month. The platform will cost $10 per month for each set of 100,000 records beyond that and another $10 per month for each set of 150,000 transactions beyond that benchmark. The enterprise-level services will be an additional $10 per user per month and will include user identity, authentication and row-level security access controls.
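
As a sanity check on the tiered pricing, here is a rough calculator. It assumes the extra record and transaction blocks are billed per started block, which is my reading of the announcement rather than an official formula:

import math

FREE_RECORDS = 100000
FREE_TRANSACTIONS = 50000

def database_com_monthly_cost(records, transactions, enterprise_users=0):
    """Rough estimate of the announced Database.com pricing (assumptions noted above)."""
    extra_record_blocks = max(0, math.ceil((records - FREE_RECORDS) / 100000.0))
    extra_txn_blocks = max(0, math.ceil((transactions - FREE_TRANSACTIONS) / 150000.0))
    return 10 * extra_record_blocks + 10 * extra_txn_blocks + 10 * enterprise_users

# e.g. 450k records and 200k transactions/month, no enterprise features:
print(database_com_monthly_cost(450000, 200000))   # 4 record blocks + 1 txn block -> $50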

Other references: Dreamforce 2010 – Database.com Launch Interview with Eric Stahl Read more

December 06, 2010

Providing Dynamic DNS over “Amazon Route 53” ( a hackathon )

In hindsight, yesterday’s “Route 53” announcement was not completely unexpected. Amazon is an IaaS provider and it’s in their own interest to automate infrastructure as much as possible. After tackling monitoring and Cloudfront features, DNS was one of the more obvious targets for improvement.

So when I was trying to pick a challenge for this morning’s hackathon, I picked one around the “Amazon Route 53” service. At the end of the day I had an almost functional public dynamic DNS service using “Route 53” as the DNS service and Twitter’s OAuth service for authentication. The final hack is up here: http://www.flagthis.com/. You are most welcome to play with it and/or use it.

After the initial creation of a user in the system (with a little help from Twitter’s OAuth), the end user is free to use the browser-based web application or a lynx/curl-based REST interface to add/create/update host records. The current version only supports “A” records, but it could be expanded to other record types if there is enough interest.

A one-line script in cron is all you need to make sure all your DHCP-based services are always publicly addressable over the internet.

The current version of flagthis provides addresses in the following format. HOSTNAME.TWITTERHANDLE.dynamic.flagthis.com

Here are some of my other observations, as I tried to use Route 53 for the first time today.

  • I was forced to install some CPAN modules on my EC2 instance, which sucked up a lot of time.
  • There are two scripts you really need from Amazon’s code library to learn the basics of the service: “route53zone.pl” and “dnscurl.pl”. The first one helps you create the XML file which in turn creates new “zones”, and the second is what gets used to upload and execute new requests using XML files.
  • “dnscurl.pl” requires your secret and key in a file in your home directory with 600 permissions.
  • Once the basic commands started working, I focused on more complex requests like “update” and “delete”.
  • That’s when I figured out that there is no concept of “update” in Route 53. If you want to update a record you have to send out a “delete” and a “create” request. To avoid disrupting a live system, they recommend the deletion and creation happen in the same request, which acts like an atomic transaction (a sketch of such a request body follows this list).
  • Another thing I realized is that the only way to delete an old record is to have the complete information about that record from when it was created. For example, let’s say you created an entry “xyz.flagthis.com” A 10.10.10.10 with 1400 as the TTL; then you have to send all of this information again at deletion time (including the TTL) to delete it. This was slightly frustrating.
  • At the end of the day I couldn’t figure out a quick way to list all the hosts in a zone, or search for details of a particular record in a zone. This is next on my list. The way I got around it for now is by using “dig” to get that information over the DNS protocol instead of the HTTP-based API.
  • Every change request submitted has a 10 to 15 character request ID associated with it. A client can poll the status of this request ID anytime. Since I was playing with a small number of changes, most of them completed within a second of submission. While the change is happening, the status of the request is set to “PENDING”, and it switches to “INSYNC” as soon as the change is complete.
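
Since the atomic delete-plus-create trick is the part that tripped me up, here is a rough sketch of the request body you would feed to dnscurl.pl. The XML layout follows my reading of the 2010-10-01 Route 53 API docs; treat the element names as something to double-check against Amazon’s documentation rather than gospel.

RECORD_TEMPLATE = """      <Change>
        <Action>%(action)s</Action>
        <ResourceRecordSet>
          <Name>%(name)s</Name>
          <Type>A</Type>
          <TTL>%(ttl)d</TTL>
          <ResourceRecords>
            <ResourceRecord><Value>%(value)s</Value></ResourceRecord>
          </ResourceRecords>
        </ResourceRecordSet>
      </Change>"""

def update_a_record(name, old_value, old_ttl, new_value, new_ttl):
    """Build one ChangeResourceRecordSets body that deletes the old record and
    creates the new one inside a single (atomic) change batch."""
    changes = [
        RECORD_TEMPLATE % dict(action="DELETE", name=name, value=old_value, ttl=old_ttl),
        RECORD_TEMPLATE % dict(action="CREATE", name=name, value=new_value, ttl=new_ttl),
    ]
    return """<?xml version="1.0" encoding="UTF-8"?>
<ChangeResourceRecordSetsRequest xmlns="https://route53.amazonaws.com/doc/2010-10-01/">
  <ChangeBatch>
    <Changes>
%s
    </Changes>
  </ChangeBatch>
</ChangeResourceRecordSetsRequest>""" % "\n".join(changes)

# Write the body to a file and POST it against your hosted zone with dnscurl.pl:
open("update.xml", "w").write(
    update_a_record("xyz.flagthis.com.", "10.10.10.10", 1400, "10.10.10.11", 300))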

Folks who have been using S3/SimpleDB as a datastore for service registries should strongly consider using “Route 53” for the same thing. And if you are still running your own DNS servers on EC2, I think it’s time to ask yourself whether you should move on.

December 05, 2010

Amazon Route 53 : Programmable DNS is finally here

Managing DNS has been considered an art by many. If you manage your own DNS records and run your own external DNS servers, I’m sure you have some stories to share. Unfortunately, unlike with most other infrastructure on the internet, DNS screw-ups can get very costly, especially because caching policies tend to keep your mistakes alive long after you have rolled back your changes.

The unforgiving nature of DNS has pushed most, except a few hardcore sys-admins, to avoid DNS hell and choose a managed service to do it for them. Domain name registrars like Network Solutions, MyDomain and GoDaddy already provide these DNS services, but I can’t recall any of them providing APIs to make these changes automatically. DynDNS does provide an API to change DNS mappings, but it costs 15 bucks a year for a single host. There might be others which I’m not aware of, but the bottom line is that there is no standard, and it’s not cheap.

Customers on AWS today unfortunately have the same problem, and not surprisingly they too prefer to use 3rd-party service providers to monitor, set up and manage DNS records for them. Today AWS is announcing a new service, “Amazon Route 53”, which, technically, isn’t a significant breakthrough. But considering the number of users already on AWS and the demand for such a service, this could be one of the biggest game-changing events in the DNS world in the last decade.

The service is pretty cheap, about 12 bucks a year, and gives a complete set of APIs to create, delete, modify, maintain and query DNS records.

To make the transition simple they even have migration tools from BIND to Amazon Route 53 ready to go. Here are pointers to some more documentation if you want to get down and dirty with it.

But stop thinking of this as a simple DNS service. I can see a whole range of interesting applications being built over this service in the next few months. The simplest application which I think will be built over this is a “dynamic DNS” service, which would be cheap to build with Route 53 doing most of the grunt work for you.

Here is how Jeff Barr introduced this service

Today we are introducing Amazon Route 53, a programmable Domain Name Service. You can now create, modify, and delete DNS zone files for any domain that you own. You can do all of this under full program control—you can easily add and modify DNS entries in response to changing circumstances. For example, you could create a new sub-domain for each new customer of a Software as a Service (SaaS) application. DNS queries for information within your domains will be routed to a global network of 16 edge locations tuned for high availability and high performance.

Route 53 introduces a new concept called a Hosted Zone. A Hosted Zone is equivalent to a DNS zone file. It begins with the customary SOA (Start of Authority) record and can contain other records such as A (IPV4 address), AAAA (IPV6 address), CNAME (canonical name), MX (mail exchanger), NS (name server), and SPF (Sender Policy Framework). You have full control over the set of records in each Hosted Zone.

Here is some more info from Werner Vogels

Amazon Route 53

Amazon Route 53 is a new service in the Amazon Web Services suite that manages DNS names and answers DNS queries. Route 53 provides Authoritative DNS functionality implemented using a world-wide network of highly-available DNS servers. Amazon Route 53 sets itself apart from other DNS services that are being offered in several ways:

A familiar cloud business model: A complete self-service environment with no sales people in the loop. No upfront commitments are necessary and you only pay for what you have used. The pricing is transparent and no bundling is required and no overage fees are charged.

Very fast update propagation times: One of the difficulties with many of the existing DNS services are the very long update propagation times, sometimes it may even take up to 24 hours before updates are received at all replicas. Modern systems require much faster update propagation to for example deal with outages. We have designed Route 53 to propagate updates very quickly and give the customer the tools to find out when all changes have been propagated.

Low-latency query resolution The query resolution functionality of Route 53 is based on anycast, which will route the request automatically to the DNS server that is the closest. This achieves very low-latency for queries which is crucial for the overall performance of internet applications. Anycast is also very robust in the presence of network or server failures as requests are automatically routed to the next closest server.

No lock-in. While we have made sure that Route 53 works really well with other Amazon services such as Amazon EC2 and Amazon S3, it is not restricted to using it within AWS. You can use Route 53 with any of the resources and entities that you want to control, whether they are in the cloud or on premise.

We chose the name "Route 53" as a play on the fact that DNS servers respond to queries on port 53. But in the future we plan for Route 53 to also give you greater control over the final aspect of distributed system naming, the route your users take to reach an endpoint. If you want to learn more about Route 53 visit http://aws.amazon.com/route53 and read the blog post at the AWS Developer weblog.

Here are some of the other comments on this service

December 04, 2010

Scalability links for December 4th

Scalability links for December 4th:


December 02, 2010

AWS Cloudwatch is now really open for business

In a surprise move, Amazon today released a bunch of new features for its Cloudwatch service, some of which, till now, were provided by third-party service providers.

  • Basic Monitoring of Amazon EC2 instances at 5-minute intervals at no additional charge.
  • Elastic Load Balancer Health Checks - Auto Scaling can now be instructed to automatically replace instances that have been deemed unhealthy by an Elastic Load Balancer.
  • Alarms - You can now monitor Amazon CloudWatch metrics, with notification to the Amazon SNS topic of your choice when the metric falls outside of a defined range.
  • Auto Scaling Suspend/Resume - You can now push a "big red button" in order to prevent scaling activities from being initiated.
  • Auto Scaling Follow the Line - You can now use scheduled actions to perform scaling operations at particular points in time, creating a time-based scaling plan.
  • Auto Scaling Policies - You now have more fine-grained control over the modifications to the size of your AutoScaling groups.
  • VPC and HPC Support - You can now use AutoScaling with Amazon EC2 instances that are running within your Virtual Private Cloud or as Cluster Compute instances.

Kafka : A high-throughput distributed messaging system.

Found an interesting new open source project which I hadn’t heard about before. Kafka is a messaging system used by LinkedIn to serve as the foundation of their activity stream processing.
Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:

  • Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.

  • High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.

  • Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.

  • Support for parallel data load into Hadoop.


Kafka is aimed at providing a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) are a key ingredient in many of the social feature on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution to providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.

The use for activity stream processing makes Kafka comparable to Facebook's Scribe or Cloudera's Flume, though the architecture and primitives are very different for these systems and make Kafka more comparable to a traditional messaging system. See our design page for more details.
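
The per-partition ordering contract is the part that makes this more than plain log shipping, and it is easy to picture with a toy sketch. This is not Kafka’s actual API, just an illustration of how hashing a message key to a partition preserves ordering per key while allowing parallel consumption:

from collections import defaultdict
import zlib

NUM_PARTITIONS = 4

def partition_for(key):
    """Deterministically map a message key to a partition."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

partitions = defaultdict(list)   # stand-in for per-partition append-only logs

def publish(key, message):
    partitions[partition_for(key)].append((key, message))

# All of user42's events land in one partition, preserving their relative order,
# while other users' partitions can be consumed in parallel by other consumers.
for event in ["login", "search", "click", "logout"]:
    publish("user42", event)
publish("user7", "login")

print(partitions[partition_for("user42")])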

The unbiased private vs AWS ROI worksheet

One of my problems with most cloud ROI worksheets is that they are heavily weighted for use cases where resource usage is very bursty. But what if your resource requirements aren’t bursty? And what if you have a use case where you have to maintain a small IT team to manage some on-site resources due to compliance and other issues?

In his latest post, Richard shares his worksheet for everyone to play with.

November 28, 2010

Google App Engine 1.4.0 pre-release is out

The complete announcement is here, but here are the changes for the Java SDK. The two big changes I liked are the new “Always On” feature and the fact that the Task Queue feature has graduated out of beta/testing.

  • The Always On feature allows applications to pay and keep 3 instances of
    their application always running, which can significantly reduce application latency.
  • Developers can now enable Warmup Requests. By specifying a handler in an app's appengine-web.xml, App Engine will attempt to send a Warmup Request to initialize new instances before a user interacts with them. This can reduce the latency an end-user sees when your application initializes.
  • The Channel API is now available for all users.
  • Task Queue has been officially released, and is no longer an experimental feature. The API import paths that use 'labs' have been deprecated. Task queue storage will count towards an application's overall storage quota, and will thus be charged for.
  • The deadline for Task Queue and Cron requests has been raised to 10 minutes.  Datastore and API deadlines within those requests remain unchanged.
  • For the Task Queue, developers can specify task retry-parameters in their queue.xml.
  • Metadata Queries on the datastore for datastore kinds, namespaces, and entity  properties are available.
  • URL Fetch allowed response size has been increased, up to 32 MB. Request
    size is still limited to 1 MB.
  • The Admin Console Blacklist page lists the top blacklist rejected visitors.
  • The automatic image thumbnailing service supports arbitrary crop sizes up to 1600px.
  • Overall average instance latency in the Admin Console is now a weighted  average over QPS per instance.
  • Added a low-level AsyncDatastoreService for making calls to the datastore asynchronously.
  • Added a getBodyAsBytes() method to QueueStateInfo.TaskStateInfo, this returns the body of the task state as a pure byte-string.
  • The whitelist has been updated to include all classes from javax.xml.soap.
  • Fixed an issue sending email to multiple recipients. http://code.google.com/p/googleappengine/issues/detail?id=1623

November 27, 2010

How to setup Amazon Cloudfront ( learning with experimentation )

I have some experience with Akamai’s WAA (Web Application Accelerator) service, which I’ve been using in my professional capacity for a few years now, and I’ve been curious about how Cloudfront compares with it. Until a few weeks ago, Cloudfront didn’t have a key feature which I think was critical for it to win over traditional CDN customers. “Custom origin” is an amazing new feature which I finally got to test last night, and here are my notes for those who are curious as well.

The test application I tried to convert was my news aggregator portal http://www.scalebig.com/. The application consists of a rapidly changing front page (a few times a day), a collection of old pages archived in a subdirectory, and other webpage elements like headers, footers, images, style sheets, etc.

  • While Amazon Cloudfront does have a presence on the AWS management console, it only supports S3 buckets as origins there.
  • Since my application didn’t have any components which require server-side processing, I put the whole website in an S3 bucket and tried to use S3 as the origin.
  • When I initially set it up, I ended up with multiple URLs which I had to understand
    • S3 URL – This is the unique URL to your S3 bucket. All requests to this URL go to Amazon’s S3 server cluster, and unless your objects are marked as private, anyone can get these objects. The object could be a movie, an image, or even an HTML file.
    • Cloudfront URL – This is the unique Cloudfront URL which maps to your S3 resource through the Cloudfront network. For all practical purposes it’s the same as the first one, except that requests go through the CDN service.
    • Your own domain name – This is the actual URL which end users will see, which will be a CNAME to the cloudfront URL.
  • So in my case, I configured the DNS entry for www.scalebig.com to point to the DNS entry the Cloudfront service created for me (dbnqedizktbfa.cloudfront.net).
  • The first thing which broke: I forgot that this is just an S3 bucket, so it can’t handle things like server-parsed HTML to dynamically append headers/footers. I also realized that I couldn’t control cache policies, set up expiries, etc. But the worst problem was that if you went to “http://www.scalebig.com/” it would throw an error; it was expecting a file name, so http://www.scalebig.com/index.html would have worked.
  • In short, I realized that my idea of using S3 as a webserver was full of holes.
  • When I started digging for options to enable “custom origin”, I realized that those options do not exist in the AWS management console! I was instead directed to some third-party applications (most of them commercial products, except two).
  • I finally created the cloudfront configuration using Cloudberry S3 Explorer PRO which allowed me to point Cloudfront to a custom domain name (instead of an S3 resource).
  • In my case my server was running on EC2 with a public reserved IP. I’m not yet using AWS ELB (Elastic Load Balancer).
  • Once I got that working, which literally worked out of the box, the next challenge was getting the cache controls and expiries set up. If they are set incorrectly, they may stop users from getting the latest content. I set up the policies using “.htaccess”. Below I’ve attached part of the .htaccess I have for the /index.html page, which is updated many times a day. There is a similar .htaccess for the rest of the website which sets a much longer expiry.
  • Finally, I realized that I might have to invalidate parts of the caches at times (due to a bug, for example). Cloudberry and the AWS management console didn’t have any option available, but apparently “boto” has some APIs which can work with the Amazon Cloudfront APIs to do this.

# turn on the module for this directory
ExpiresActive on
# set default
ExpiresDefault "access plus 1 hours"
ExpiresByType image/jpg "access plus 1 hours"
ExpiresByType image/gif "access plus 1 hours"
ExpiresByType image/jpeg "access plus 1 hours"
ExpiresByType image/png "access plus 1 hours"
ExpiresByType text/css "access plus 1 hours"
ExpiresByType text/javascript "access plus 1 hours"
ExpiresByType application/javascript "access plus 1 hours"
ExpiresByType application/x-javascript "access plus 1 hours"
ExpiresByType application/x-shockwave-flash "access plus 1 hours"

Header set Cache-Control "max-age=3600"

AddOutputFilterByType DEFLATE text/html text/plain text/xml application/javascript text/javascript  application/x-javascript text/css

Here is how I would summarize the current state of Amazon cloudfront.

  • It’s definitely ready for static websites which don’t have any server-side execution code.
  • Cloudfront only accepts GET and HEAD requests.
  • Cloudfront ignores cookies, so the server can’t set any. (Browser-based cookie management will still work, which could be used to keep in-browser session data.)
  • If you do want to use server-side code, use iframes, JSONP, JavaScript widgets or some other mechanism to execute code from a different domain name (one which is not on Cloudfront).
  • While Cloudfront can log access logs to an S3 bucket of your choice, I’d recommend using something like Google Analytics for log analysis.
  • I’d recommend buying one of the commercial third-party products if you want to use custom origins, and reading more about the protocols/APIs before you fully trust a production service to Cloudfront.
  • I wish Cloudfront would start supporting something like ESI, which could effectively make an S3 bucket a full-fledged webserver without needing a running EC2 instance all the time.
  • Overall Cloudfront has a very long way to go, in terms of features, to be treated as a competitor to Akamai’s current range of services.
  • And if you look at Akamai’s current worldwide presence, Cloudfront is just a tiny blip. [ Cloudfront edge locations ]
  • But I suspect that Cloudfront’s continuous evolution is being watched by many and the next set of features could change the balance.

I’m planning to leave http://www.scalebig.com/ on Cloudfront for some time to learn a little more about its operational issues. If you have been using Cloudfront, please feel free to leave comments about what important features you think are still missing.

November 23, 2010

Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in recent weeks. Part of that is due to an ongoing effort to expand their business, for which they seem to be hiring like crazy. Here is yet another interesting deck of slides which covers both Dev and Ops.

One of the most interesting decks of slides I’ve seen in the recent past.

November 22, 2010

The Cloud: Watch your step ( Google App engine limitations )

Any blog which promotes the concept of cloud infrastructure would be doing an injustice if it doesn’t provide references to implementations where it failed horribly. Here is an excellent post by Carlos Ble where he lists all the problems he faced on Google App Engine (Python). He lists 13 different limitations, most of which are well-known facts, and then some more frustrating reasons why he had to dump the solution and look for an alternative.

The tone is understandable, and while it might look like App Engine bashing, I see it as a great story which others could learn from.

For us, GAE has been a failure like Wave or Buzz were but this time, we have paid it with our money. I've been too stubborn just because this great company was behind the platform but I've learned an important lesson: good companies make mistakes too. I didn't do enough spikes before developing actual features. I should have performed more proofs of concept before investing so much money. I was blind.

Cloud is not for everyone or for all problems. While some of these technologies take away your growing pain points, they assume you are ok with some of the limitations. If you were surprised by these limitations after you are neck deep in coding, then you didn’t do your homework.

Here are the issues he pointed out. I haven’t used Google App Engine lately, but my understanding is that the App Engine team has solved, or is on the path to solving (or reducing the pain of), some of these issues.

  • Requires Python 2.5
  • Can’t use HTTPS
  • 30 seconds to run
  • URL fetch gets only 5 seconds
  • Can’t use python libraries compiled in C
  • No “LIKE” operators in datastore
  • Can’t join tables
  • “Too many indexes”
  • Only 1000 records at a time returned
  • Datastore and memcache can fail at times
  • Max memcache size is 1MB

November 16, 2010

Sawzall and the PIG

When I heard interesting use cases of how “Sawzall” is used to hack on huge amounts of log data within Google, I was thinking about two things.

  • Apache PIG, which is “a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.”
  • CEP (Complex event processing) - consists of processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time. [ Also look at esper ]

Google has open-sourced parts of this framework in a project called “Szl”.

Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google's logs processing language of choice and is used for various other data analysis tasks across the company.

Instead of specifying how to process the entire data set, a Sawzall program describes the processing steps for a single data record independent of others. It also provides statements for emitting extracted intermediate results to predefined containers that aggregate data over all records. The separation between per-record processing and aggregation enables parallelization. Multiple records can be processed in parallel by different runs of the same program, possibly distributed across many machines. The language does not specify a particular implementation for aggregation, but a number of aggregators are supplied. Aggregation within a single execution is automatic. Aggregation of results from multiple executions is not automatic but an example program is supplied.

Here is a quick example of how it could be used…

  # keep the top 3 words, weighted by their total count across all input records
  topwords: table top(3) of word: string weight count: int;
  fields: array of bytes = splitcsvline(input);
  w: string = string(fields[0]);
  c: int = int(string(fields[1]), 10);
  if (c != 0) {
    emit topwords <- w weight c;  # per-record emit; aggregation happens in the table
  }


Given the input:

  abc,1
  def,2
  ghi,3
  def,4
  jkl,5

The program (using --table_output) emits:

  topwords[] = def, 6, 0
  topwords[] = jkl, 5, 0
  topwords[] = ghi, 3, 0

Presentation: “OrientDB, the database of the web”

I knew there was something called “OrientDB”, but didn’t know much about it until I went through these slides. Here is what I learned in one sentence: it’s an easy-to-install NoSQL (schemaless) datastore with absolutely no configuration required, it supports ACID transactions, it can be used as a document store, a graph store and a key-value store, it can be queried using SQL-like and JSON syntax, it supports indexing and triggers, and it has been benchmarked at 150,000 inserts per second on commodity hardware. That’s a lot of features.

November 12, 2010

Cloud economics: Not really black and white..

While some of the interest in moving towards the public cloud is based on sound economics, a segment of this movement is driven purely by herd mentality.

The slide on the right, from a Microsoft publication, shows that larger networks may be less economical on the cloud (at least today).

Richard Farley has been discussing this very topic for a few months now. He observes that a medium-sized organization which already has decent IT infrastructure, including a dedicated IT staff to support it, has a significantly smaller overhead than cloud vendors might make it look.

Here is a small snippet from his blog. If you are not afraid to get dirty with numbers read the rest here.

Now, we know we need 300 virtual servers, each of which consumes 12.5% of a physical host.  This means we need a total of 37.5 physical hosts.  Our vendor tells us these servers can be had for $7k each including tax and delivery with the cabinet.  We can’t buy a half server, and want to have an extra server on hand in case one breaks.  This brings our total to 39 at a cost of $273k.  Adding in the cost of the cabinet, we’re up to $300k.

There are several non-capital costs we now have to factor in.  Your vendor will provide warranty, support and on-site hardware replacement service for the cabinet and servers for $15k per year.  Figure you will need to allocate around 5% of the time of one of your sys admins to deal with hardware issues (i.e., coordinating repairs with the vendor) at a cost of around $8k per year in salary and benefits.  Figure power and cooling for the cabinet will also cost $12k per year.  In total, your non-capital yearly costs add up to $35k.
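
To make it easier to plug in your own numbers, here is the arithmetic from the quoted example as a small script. The figures (the $7k servers, the $15k support contract, and so on) are Richard’s assumptions, and the $27k cabinet price is simply backed out of his $300k total:

import math

def private_cloud_costs(virtual_servers=300, host_fraction=0.125,
                        server_price=7000, cabinet_price=27000,
                        support=15000, admin_time=8000, power_cooling=12000):
    """Reproduce the capital and yearly figures from Richard's example."""
    hosts_needed = virtual_servers * host_fraction          # 37.5 physical hosts
    hosts_bought = int(math.ceil(hosts_needed)) + 1          # round up, plus one spare -> 39
    capital = hosts_bought * server_price + cabinet_price    # 39 * $7k + cabinet = $300k
    yearly = support + admin_time + power_cooling            # $35k per year
    return capital, yearly

print(private_cloud_costs())   # -> (300000, 35000)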

One thing which the post doesn’t clearly articulate is that while long-term infrastructure is cheaper to host in a private cloud, it may still be more economical to use the public cloud for short-term, resource-intensive projects.

November 11, 2010

Cassandra: What is HintedHandoff ?

Nate has a very good post about how Cassandra is different from a lot of other distributed datastores. In particular he explains that every node in a Cassandra cluster is identical to every other node. After using Cassandra for a few months I can tell you for a fact that it’s true. It does come at a price, though. Because it’s so decentralized, if you want to make a schema change, for example, the configuration files of all of the nodes in the cluster need to be updated at the same time. Some of these problems will go away when 0.7.0 finally comes out.

While it is true that Cassandra doesn’t have the concept of a single “master” server, each node participating in the cluster does act as a master and a slave for parts of the entire key range. The actual share of the key range owned by an individual node depends on the replication factor, on how one picks the keys, and on what partitioner algorithm was selected.

The fun starts when a node, which could be the master for a range of keys, goes down. This is how Nate explains the process..

Though the node is the "primary" for a portion of the data in the cluster, the number of copies of the data kept on other nodes in the cluster is configurable. When a node goes down, the other nodes containing copies, referred to as "replicas", continue to service read requests and will even accept writes for the down node. When the node returns, these queued up writes are sent from the replicas to bring the node back up to date

And this is from the Cassandra wiki

If a node which should receive a write is down, Cassandra will write a hint to a live replica node indicating that the write needs to be replayed to the unavailable node. If no live replica nodes exist for this key, and ConsistencyLevel.ANY was specified, the coordinating node will write the hint locally. Cassandra uses hinted handoff as a way to (1) reduce the time required for a temporarily failed node to become consistent again with live ones and (2) provide extreme write availability when consistency is not required.

A hinted write is NOT sufficient to count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. Take the simple example of a cluster of two nodes, A and B, and a replication factor of 1 (each key is stored on one node). Suppose node A is down while we write key K to it with ConsistencyLevel.ONE. Then we must fail the write: recall from the API page that "if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write."

Thus if we write a hint to B and call the write good because it is written "somewhere," there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to him. Historically, only the lowest ConsistencyLevel of ZERO would accept writes in this situation; for 0.6, we added ConsistencyLevel.ANY, meaning, "wait for a write to succeed anywhere, even a hinted write that isn't immediately readable."
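
If it helps to see the mechanics, here is a stripped-down sketch of the hinted handoff idea. This is not Cassandra’s actual code path, just the shape of it: the coordinator writes to live replicas, queues a hint for a dead one, and the hint only gets replayed (and only then becomes readable) when that node comes back:

class Cluster(object):
    """Toy hinted-handoff model: writes go to live replicas, hints queue up for dead ones."""
    def __init__(self, nodes):
        self.up = dict((n, True) for n in nodes)
        self.data = dict((n, {}) for n in nodes)    # per-node key/value store
        self.hints = dict((n, []) for n in nodes)   # writes queued on behalf of down nodes

    def write(self, replicas, key, value):
        acked = 0
        for node in replicas:
            if self.up[node]:
                self.data[node][key] = value
                acked += 1                          # only real replica writes count
            else:
                # a live node keeps the hint; it does NOT count towards ONE/QUORUM/ALL
                self.hints[node].append((key, value))
        return acked

    def recover(self, node):
        self.up[node] = True
        for key, value in self.hints[node]:         # replay the queued writes
            self.data[node][key] = value
        self.hints[node] = []

cluster = Cluster(["A", "B"])
cluster.up["A"] = False
print(cluster.write(["A"], "k1", "v1"))   # 0 acks: would fail at ConsistencyLevel.ONE
cluster.recover("A")
print(cluster.data["A"])                  # {'k1': 'v1'} once the hint is replayed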

Mike Perham has a related post on the same topic. He goes further and explains that because there could be scenarios where writes are not immediately visible due to a downed master node, it’s possible for the master to get out of sync with the slaves in the confusion. There is a process called “anti-entropy” which Cassandra uses to detect and resolve such issues. Here is how he explains it:

The final trick up Cassandra’s proverbial sleeve is anti-entropy. AE explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff don’t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle Tree to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.
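
And for the Merkle tree part, here is a tiny illustration of why it makes repair cheap (again a conceptual sketch, not Cassandra’s implementation, and it assumes a power-of-two number of ranges to stay short): if the roots match, the replicas agree and nothing is exchanged; if they differ, only the mismatching ranges need repair.

import hashlib

def h(value):
    return hashlib.md5(value.encode("utf-8")).hexdigest()

def merkle_levels(leaves):
    """Build the tree bottom-up: index 0 is the leaf hashes, the last level is the root."""
    level = [h(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

node_a = merkle_levels(["k1=v1", "k2=v2", "k3=v3", "k4=v4"])
node_b = merkle_levels(["k1=v1", "k2=STALE", "k3=v3", "k4=v4"])

# A real implementation recurses only into subtrees whose hashes disagree; here we just
# gate on the root and then compare the leaf level to show which range is out of sync.
if node_a[-1] != node_b[-1]:
    out_of_sync = [i for i, (x, y) in enumerate(zip(node_a[0], node_b[0])) if x != y]
    print(out_of_sync)   # -> [1]: only the second key range needs repair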

November 08, 2010

OpenTSDB – Distributed time series database

Ever since I saw a demo of this tool, I’ve been on the edge of my seat, waiting for it to be open-sourced so that I could use it. The problem it’s trying to solve is a real pain point which most webops folks would understand.

Yesterday the folks at StumbleUpon finally opened it up. It’s released under the LGPLv3 license. You can find the source here and the documentation here.

At StumbleUpon, we have found this system tremendously helpful to:

  • Get real-time state information about our infrastructure and services.
  • Understand outages or how complex systems interact together.
  • Measure SLAs (availability, latency, etc.)
  • Tune our applications and databases for maximum performance.
  • Do capacity planning.

AWS cloudfront grows up… a little. Now allows Custom origins.

Cloudfront has come a long way from its humble beginnings. Here is what Jeff had to say when he announced that it’s out of “beta”:

    1. First, we've removed the beta tag from CloudFront and it is now in full production. During the beta period we listened to our customers and added a number of important features including Invalidation, a default root object, HTTPS access, private content, streamed content, private streamed content, AWS Management Console support, request logging, and additional edge locations. We've also reduced our prices.
    2. There's now an SLA (Service Level Agreement) for CloudFront. If availability of your content drops below 99.9% in any given month, you can apply for a service credit equal to 10% of your monthly bill. If the availability drops below 99% you can apply for a service credit equal to 25% of your monthly bill.

While all this is a big step forward, it’s probably not enough for the more advanced CDN users to switch over yet.

Here are a couple of issues which stuck out in the Developer Guide.

  • Query parameters are not used to generate the cache key. So while it looks like it can pull content from an Elastic Load Balancer, it still acts like a giant S3 accelerator.
  • Doesn’t support HTTP/1.1 yet. So if you have multiple domains on the same IP, this solution isn’t for you.

November 07, 2010

Building your first cloud application on AWS

Building your first web application on AWS is like shopping for a car at Pep Boys, part by part. While the manuals to build one might be on aisle 5, the experience of having built one already is harder to buy.

Here are some interesting logistical questions which I don’t think get enough attention when people discuss building a new AWS-based service.

  1. Picking the right Linux distribution: Switching OS distribution may not be too simple if your applications need custom scripts. Picking and sticking with a single distribution will save a lot of lost time.
  2. Automated server builds: There are many ways to skin this cat. Chef, Puppet, Cfengine are all good... What's important is to pick one early in the game.
  3. Multi-Availability Zone support: Find out if multi-availability-zone support is important. This can impact the overall architecture of the solution and the tools used to implement it.
  4. Data consistency requirements: Similar to the Multi-AZ support question, it's important to understand the data consistency tolerance of the application before one starts designing it.
  5. Datastore: There are different kinds of datastores available as part of AWS itself (SimpleDB, S3 and RDS). If you are planning to keep your options open about moving out of AWS at some point, you should think about picking a datastore which you could move out with you with little effort. There are many NoSQL and RDBMS solutions to choose from.
  6. Backups: While some think it's a waste of time to think about backups too early, I suspect those who don't will spend way too much time on them later. The long-term backup strategy is an integral part of disaster recovery planning, without which you shouldn't think of going live.
  7. Integration with external data sources: If this application is part of a larger cluster of applications running somewhere else, think about how data would be sent back and forth. There are lots of different options depending on how much data is involved (and how important protecting that data is).
  8. Monitoring/Alerting: Most standard out-of-the-box monitoring tools can't handle dynamic infrastructure very well. There are, however, plugins available for many existing monitoring solutions which can handle the dynamic nature of the infrastructure. You could also choose to use one of the 3rd-party monitoring services if you'd rather pay someone else to do it.
  9. Security: You should be shocked to see this at #9 on my list. If your service involves user data, or some other kind of intellectual property, build a multi-tiered architecture to segment different parts of your application from targeted attacks. Security is also very important while picking the right caching and web server technologies.
  10. Development: Figure out how developers would use AWS. Would they share the same AWS account, share parts of the infrastructure, share the datastore, etc.? How would developer resources be monitored so that unintentional use of excessive resources could be flagged for alerting?

Are there other subtle issues which I should have listed here? Let me know.

November 05, 2010

Rapid prototyping with solr

Rapid Prototyping with Solr, by Erik Hatcher

At ApacheCon this week I presented “Rapid Prototyping with Solr”.  This is the third time I’ve given a presentation with the same title.  In the spirit of the rapid prototyping theme, each time I’ve created a new prototype just a day or so prior to presenting it.  At Lucene EuroCon the prototype used attendee data, a treemap visualization, and a cute little Solr-powered “app” for picking attendees at random for the conference giveaways.  For a recent Lucid webinar the prototype was more general purpose, bringing in and making searchable rich documents and faceting on file types with a pie chart visualization.

This time around, the data set I chose was Data.gov’s catalog of datasets, which fit with the ApacheCon open source aura, and Lucid Imagination’s support of Open Source for America, which helps to advocate for open source in the US Federal Government.  The prototype built includes faceting browsing, query term suggest, hit highlighting, result clustering, spell checking, document detail, and a bonus Venn diagram visualization.

November 04, 2010

Shipping Trunk : For web applications

I had briefly blogged about this presentation from Velocity 2010 before. I wish they had released the video for this session. I went through this slide deck again today to see if Paul mentioned any of the problems organizations like ours are dealing with in the transition from quarterly releases to weekly/continuous releases.

One of the key observations Paul made during his talk is that most organizations still treat web applications like desktop software and have very strict quality controls which may not be as necessary, since releasing changes for a web app delivered as SaaS (software as a service) is much cheaper than releasing patches for traditional desktop software.

Here are some of the other points he made (a minimal feature-flag sketch follows below). For really detailed info check out the [slides]

  • Deploy frequently, facilitating rapid product iteration
  • Avoid large merges and the associated integration testing
  • Easily perform A/B testing of functionality
  • Run QA and beta testing on production hardware
  • Launch big features without worrying about your infrastructure
  • Provide all the switches your operations team needs to manage the deployed system

Slides: Always ship Trunk: Managing Change In Complex Websites
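
The “switches for your operations team” point boils down to cheap runtime feature flags. Here is a minimal sketch of the idea; the config format and names are made up, and the tooling described in the talk is certainly more involved:

import json
import zlib

class FeatureFlags(object):
    """Runtime on/off switches plus simple percentage rollouts, loaded from a config file."""
    def __init__(self, path):
        with open(path) as f:
            self.flags = json.load(f)   # e.g. {"new_search": {"enabled": true, "percent": 10}}

    def enabled(self, name, user_id=None):
        flag = self.flags.get(name, {})
        if not flag.get("enabled", False):
            return False
        percent = flag.get("percent", 100)
        if user_id is None:
            return True
        # deterministic per-user bucket so a given user sees a consistent experience
        bucket = zlib.crc32(("%s:%s" % (name, user_id)).encode("utf-8")) % 100
        return bucket < percent

# In request-handling code (names are illustrative):
#   flags = FeatureFlags("/etc/myapp/flags.json")
#   if flags.enabled("new_search", user_id=current_user.id):
#       render_new_search()
#   else:
#       render_old_search()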

November 03, 2010

Real-Time MapReduce using S4

While trying to figure out how to do real-time log analysis in my own organization, I realized that most map-reduce frameworks are designed to run as batch jobs in a time-delayed manner rather than be instantaneous like a SQL query to a MySQL DB. There are some frameworks bucking that trend. Yahoo! Labs recently announced that their “Advertising Sciences” group has built a general-purpose, real-time, distributed, fault-tolerant, scalable, event-driven, expandable platform called “S4” which allows programmers to easily implement applications for processing continuous unbounded streams of data.

S4 clusters are built using low-cost commoditized hardware, and leverage many technologies from Yahoo!’s Hadoop project. S4 is written in Java and uses the Spring Framework to build a software component architecture. Over a dozen pluggable modules have been created so far.

Why do we need a real-time map-reduce framework?
Applications such as personalization, user feedback, malicious traffic detection, and real-time search require both very fast response and scalability. In S4 we abstract the input data as streams of key-value pairs that arrive asynchronously and are dispatched intelligently to processing nodes that produce data sets of output key-value pairs. In search, for example, the output data sets are made available to the serving system before a user executes her next search query. We use this rapid feedback to adapt the search models based on user intent

Read more: Original post from Yahoo! Labs

Storage options on app engine

For those who think Google App Engine only has one kind of datastore, the one built around “bigtable”, think again. Nick Johnson goes into the details of all the other options available, with their pros and cons, in his post (a minimal cache-aside sketch follows below).
App Engine provides more data storage mechanisms than is apparent at first glance. All of them have different tradeoffs, so it's likely that one - or more - of them will suit your application well. Often, the ideal solution involves a combination, such as the datastore and memcache, or local files and instance memory.

Storage options he lists..

  • Datastore
  • Memcache
  • Instance memory
  • Local Files

Read more: Original post from Nick
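
Here is the minimal cache-aside sketch I promised above for the datastore-plus-memcache combination Nick mentions. UserProfile is a made-up model, but memcache.get/set and Model.get_by_key_name are the standard App Engine Python APIs:

from google.appengine.api import memcache
from google.appengine.ext import db

class UserProfile(db.Model):          # hypothetical model, purely for illustration
    nickname = db.StringProperty()

def get_profile(user_id):
    """Check memcache first, fall back to the datastore, then repopulate the cache."""
    cache_key = "profile:%s" % user_id
    profile = memcache.get(cache_key)
    if profile is None:
        profile = UserProfile.get_by_key_name(user_id)   # datastore read
        if profile is not None:
            memcache.set(cache_key, profile, time=300)   # cache for 5 minutes
    return profile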

November 01, 2010

Is auto-sharding ready for auto-pilot ?

James Golick makes a point which a lot of people miss. He doesn’t believe the auto-sharding features NoSQL datastores provide are ready for full auto-pilot yet, and argues that good developers have to think about sharding as part of the design, regardless of which datastore they pick.

If you take at face value the marketing materials of many NoSQL database vendors, you'd think that with a horizontally scalable data store, operations engineering simply isn't necessary. Recent high profile outages suggest otherwise.

MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren't there.

At the very least one has to understand how auto-sharding in a NoSQL datastore works, and how easy it is to set up, maintain, back up and restore. “Rebalancing” can be an expensive operation, and if shards are separated by distance or high latency, some designs might be better than others.
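
One way to get a feel for what rebalancing costs is a toy consistent-hashing ring: adding a node should only move a fraction of the keys, and anything cruder moves nearly all of them. This is a generic sketch, not the scheme any particular datastore uses:

import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    """Minimal consistent-hash ring (no virtual nodes, no replication)."""
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [h for h, _ in self.ring]
        i = bisect.bisect(hashes, _hash(key)) % len(self.ring)
        return self.ring[i][1]

keys = ["key%d" % i for i in range(10000)]
before = Ring(["node1", "node2", "node3"])
after = Ring(["node1", "node2", "node3", "node4"])
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print("%.0f%% of keys moved when adding one node" % (100.0 * moved / len(keys)))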

October 31, 2010

Why Membase Uses Erlang

It’s totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say “Erlang has X”) does out of the box a lot of things we would have had to either build from scratch or attempt to piece together existing libraries to do. Its dynamic type system and pattern matching (à la Haskell and ML) make Erlang code tend to be even more concise than Python and Ruby, two languages known for their ability to do a lot in few lines of code.

The single largest advantage to us of using Erlang has got to be its built-in support for concurrency. Erlang models concurrent tasks as “processes” that can communicate with one another only via message-passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. This alone makes an entire class of concurrency-related bugs completely impossible. While it doesn’t completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. While it’s certainly possible to implement the actor model in most if not all of the alternative environments I mentioned, and in fact such implementations exist, they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.

The complete original post can be found here: blog.membase.com
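
For readers who haven’t touched Erlang, the actor idea in the quote roughly means “each concurrent task owns its state and is only reachable through its mailbox”. Here is a very rough Python approximation for flavour; Erlang processes are far lighter than OS threads, so don’t read too much into it:

import threading
import queue

class Actor(threading.Thread):
    """A thread that owns its state and is only reachable through its mailbox."""
    def __init__(self):
        threading.Thread.__init__(self)
        self.mailbox = queue.Queue()

    def send(self, message):
        self.mailbox.put(message)      # the only way to interact with this actor

    def run(self):
        handled = 0                    # private state: no locks needed, nobody else sees it
        while True:
            message = self.mailbox.get()
            if message == "stop":
                break
            handled += 1
            print("got %r (message #%d)" % (message, handled))

counter = Actor()
counter.start()
counter.send("hello")
counter.send("world")
counter.send("stop")
counter.join()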

October 30, 2010

Cassandra's future @facebook and links to other NoSQL slides

I heard an unconfirmed rumor that Facebook is moving away from Cassandra. Not sure why, or to what, but rumors like this are a concern regardless. After Twitter's backing off and Digg's troubles, which were indirectly linked to either Cassandra's maturity as a production solution or their understanding of Cassandra's capabilities, it would raise even more eyebrows if Facebook really did abandon Cassandra. Cassandra was created at Facebook, which open-sourced it, but my understanding today is that most of the development on the open-sourced Cassandra happens outside its walls. Rackspace is a big sponsor (maybe not the largest anymore) of the open source project, and Riptano, which has built a whole company around it, has done a tremendous job of promoting it.

Scalability links for October 30th:

October 27, 2010

Scalability links for October 28th

Scalability links for October 28th:

October 21, 2010

Scalability links for October 21st

Scalability links for October 21st:

October 18, 2010

Scalability links for October 18th

Scalability links for October 18th:

October 17, 2010

Scaling Graphite by using Cfmap as the data transport

Graphite is an extremely promising system and resource graphing tool which tries to take RRD to the next level. Here are some of the most interesting features of graphite which I liked.

  • Updates can happen to records in the past (RRD doesn’t allow this, I think)
  • Creation of new datasets is trivial with whisper/carbon (it’s part of the Graphite framework)
  • Graphite allows federated storage (multiple servers across multiple datacenters for example)

Monitoring and graphing resources across data centers is tricky, however, especially because WAN links cannot be trusted. While losing some data may be OK for some folks, it may not be acceptable for others. Graphite tries to solve this problem by providing an option to federate data across multiple servers, which could each be kept in separate data centers.

Another way to solve this problem is by using a data transport which is resilient to network failures.

Since Cfmap (thanks to Cassandra) is a distributed, eventually consistent state repository, it could easily be extended to act as an eventually consistent data queue for tools like Graphite. Take a look at an example of a server record here (on the right). With some minor modifications, we were able to log and publish all changes to system attributes using an API like this. And with the right script running on the Graphite server, the import of these stats into carbon (a component of Graphite) became a trivial task.

With Cfmap’s easy REST interface, adding new stats to graphite becomes as simple as registering the stats to cfmap from anywhere in the network. [ sample script ]
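
The “right script running on the graphite server” can be as small as something that speaks carbon’s plaintext protocol: one “metric value timestamp” line per data point sent to TCP port 2003. The cfmap attribute names below are made up for illustration; the carbon side is standard:

import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

def send_to_carbon(metrics):
    """metrics: iterable of (metric_path, value) pairs pulled out of cfmap."""
    now = int(time.time())
    lines = ["%s %s %d" % (path, value, now) for path, value in metrics]
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    sock.sendall(("\n".join(lines) + "\n").encode("utf-8"))
    sock.close()

# e.g. attributes scraped from a cfmap record (hypothetical metric names):
send_to_carbon([
    ("hosts.anorien.loadavg1m", 0),
    ("hosts.anorien.freemem_mb", 3),
])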

Graphite is not in use in our corporate network today, but I’m extremely excited about the possibilities and will be actively looking at how else we could use it.

[ Take a look at RabbitMQ integration with Graphite for another interesting way of working with graphite ]

October 06, 2010

Cfmap: Publishing, discovering and dashboarding infrastructure state

Dynamic infrastructure can be challenging if apps and scripts can’t keep up with it. At Ingenuity we observed this problem when we started moving towards virtualization and SOA (service-oriented architecture). Remembering server names became impractical, and error-free manual configuration changes became impossible.

While there are some tools which solve parts of this specific problem, we couldn’t find any open source tool which could be used to both publish and discover the state of a system in a distributed, scalable and fault-tolerant way. ZooKeeper, which comes pretty close to what we needed, is a fully consistent system which was not designed to be used across multiple data centers over high-latency, unstable network connections. We wanted a system which could not only stay up during network outages, but also sync up state from the different data centers when they reconnect.

We built a few different tools to solve our scalability problems, one of which is a tool called Cfmap, which we are open-sourcing today to help others facing the same problem.

So what is cfmap?

Built over Cassandra, cfmap is designed to be a scalable, eventually consistent and fault-tolerant repository of state information. It provides a set of REST APIs and UIs to publish and discover the state of an entity or a group of entities with great ease. The APIs are so simple that you would most probably end up writing your own custom agents for your various servers and processes rather than use the agent which comes bundled with the tool.

We have been using cfmap internally for a few months and the results are promising. Here is an example of what cfmap’s dashboard looks like on our network (I’ve changed some names to protect the actual resource names). Here is another dashboard which is running out in the public and which you can use today as a demo.

[ dashboard screenshot ]

Cfmap provides the ability to quickly drill down to a filtered set of servers or apps, and to export them quickly into a JSON or a shell-greppable format. The two export formats available today make dashboarding and scripting a trivial task.

The image above shows a small set of applications from our dev cluster, sorted by the time the apps were deployed. In addition to showing the host names, the status of the apps, and version information, it also lists the time when each app sent its last heartbeat. What is not visible here is that it also keeps track of certain changes in a “log” which can be used to understand the historical changes of a particular record over time.

While the REST interface is easy to use, you could also choose to use “cfquery”, the command-line tool which comes with Cfmap. Cfquery can be used to both publish and search records. Let’s look at some examples.

Here is an example of how one could extract a list of all the hosts in cfmap.
rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view | grep ":host=" | cut -d':' -f2
host=team50
host=ip-10-205-15-124
host=torque
host=anorien

Here is a more elaborate example which shows how cfmap output could be used as part of other scripts. In this case, the query just specifies the host “anorien”. The result is a dump of all the properties set by that host. A few extra commands can quickly help you extract specific properties, which can then be used as a data source for other tools (like monitoring).

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien"

52cb892bc339f286bacbcfe9a8c8b4a6:port=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freeswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:host=anorien
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg5m=0
52cb892bc339f286bacbcfe9a8c8b4a6:cfqversion=1.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_estconn=4
52cb892bc339f286bacbcfe9a8c8b4a6:type=host
52cb892bc339f286bacbcfe9a8c8b4a6:deployed_date=1286217400
52cb892bc339f286bacbcfe9a8c8b4a6:version=2.6.32-00007-g56678ec
52cb892bc339f286bacbcfe9a8c8b4a6:ip=127.0.0.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_pscount=101
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg15m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavgentities=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freemem=3
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg1m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
52cb892bc339f286bacbcfe9a8c8b4a6:appname=os
52cb892bc339f286bacbcfe9a8c8b4a6:checked=1286331427

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem | cut -d'=' -f2
501

A few other interesting features:


  • Schema-less design – cfmap provides a simple schema-less datastore which could be used for other purposes as well. Please note that since it was designed to maintain “state” (instead of a simple datastore API), it has a few reserved keywords which have a special meaning.

  • Low overhead to add/delete cfmap nodes – Since its built over cassandra, adding new nodes is as simple as adding new cassandra servers.

  • Configurable - The recommended way of setting up cfmap for production use would be to host cfmap (which comes with a bundled version of Cassandra) on 3 or more servers. Then put them all under a single DNS entry (round robin) and let DNS load balancing take care of the rest.

    • If you want even more redundancy, set up something like HAProxy on each of the nodes, which can also monitor and redirect traffic to alternate cfmap nodes when failures (or GCs) happen.

    • The default setup doesn’t enforce consistency during reads or writes, to facilitate smooth operation even during massive network or system failures. But if you wish, you could tweak the consistency and replication requirements based on your needs.




Cfmap is still a very early prototype, but we welcome others to play with it.