November 28, 2010

Google App Engine 1.4.0 pre-release is out

The complete announcement is here, but here are the changes for the Java SDK. The two big changes I liked are the new “Always On” feature and the fact that the Task Queue has graduated out of beta/testing.

  • The Always On feature lets developers pay to keep three instances of their application always running, which can significantly reduce application latency.
  • Developers can now enable Warmup Requests. By specifying a handler in an app's appengine-web.xml, App Engine will attempt to send a Warmup Request to initialize new instances before a user interacts with them. This can reduce the latency an end-user sees when your application starts up.
  • The Channel API is now available for all users.
  • Task Queue has been officially released, and is no longer an experimental feature. The API import paths that use 'labs' have been deprecated. Task queue storage will count towards an application's overall storage quota, and will thus be charged for.
  • The deadline for Task Queue and Cron requests has been raised to 10 minutes. Datastore and API deadlines within those requests remain unchanged.
  • For the Task Queue, developers can specify task retry parameters in their queue.xml.
  • Metadata Queries on the datastore for datastore kinds, namespaces, and entity properties are available.
  • URL Fetch allowed response size has been increased, up to 32 MB. Request
    size is still limited to 1 MB.
  • The Admin Console Blacklist page lists the top visitors rejected by the blacklist.
  • The automatic image thumbnailing service supports arbitrary crop sizes up to 1600px.
  • Overall average instance latency in the Admin Console is now a weighted average over QPS per instance.
  • Added a low-level AsyncDatastoreService for making calls to the datastore asynchronously.
  • Added a getBodyAsBytes() method to QueueStateInfo.TaskStateInfo, which returns the body of the task state as a pure byte-string.
  • The whitelist has been updated to include all classes from javax.xml.soap.
  • Fixed an issue sending email to multiple recipients. http://code.google.com/p/googleappengine/issues/detail?id=1623

November 27, 2010

How to set up Amazon Cloudfront (learning by experimentation)

I have some experience with Akamai’s WAA (Web Application Accelerator) service, which I’ve been using in my professional capacity for a few years now, and I’ve been curious about how Cloudfront compares with it. Until a few weeks ago, Cloudfront was missing a key feature which I think is critical for it to win over traditional CDN customers. “Custom origin” is an amazing new feature which I finally got to test last night, and here are my notes for those who are curious as well.

The test application I tried to convert was my news aggregator portal http://www.scalebig.com/. The application consists of a rapidly changing front page (it changes a few times a day), a collection of old pages archived in a subdirectory, and some other webpage elements like headers, footers, images, style sheets, etc.

  • While Amazon Cloudfront does have a presence on the AWS management console, the console only supports S3 buckets as origins.
  • Since my application didn’t have any components which require server-side processing, I put the whole website in an S3 bucket and tried to use S3 as the origin.
  • When I initially set it up, I ended up with multiple URLs which I had to understand:
    • S3 URL – This is the unique URL of your S3 bucket. All requests to this URL go to Amazon’s S3 server cluster, and unless your objects are marked as private, anyone can get them. The object could be a movie, an image, or even an HTML file.
    • Cloudfront URL – This is the unique Cloudfront URL which maps to your S3 resource through the Cloudfront network. For all practical purposes it’s the same as the first one, except that requests go through the CDN service.
    • Your own domain name – This is the actual URL which end users will see; it will be a CNAME to the Cloudfront URL.
  • So in my case, I configured the DNS entry for www.scalebig.com to point to the DNS entry the Cloudfront service created for me (dbnqedizktbfa.cloudfront.net).
  • The first thing that broke: I had forgotten that this is just an S3 bucket, so it can’t handle things like server-parsed HTML to dynamically append headers/footers. I also realized that it can’t control cache policies, set up expiry, etc. But the worst problem was that if you went to “http://www.scalebig.com/” it would throw an error; it was expecting a file name, so http://www.scalebig.com/index.html would have worked.
  • In short, I realized that my idea of using S3 as a webserver was full of holes.
  • When I started digging for options to enable “custom origin”, I realized that those options do not exist on the AWS management console! I was instead directed to some third-party applications (most of them commercial products, except two).
  • I finally created the Cloudfront configuration using Cloudberry S3 Explorer PRO, which allowed me to point Cloudfront at a custom domain name (instead of an S3 resource).
  • In my case the server was running on EC2 with a reserved public IP. I’m not yet using AWS ELB (Elastic Load Balancer).
  • Once I got that working, which literally worked out of the box, the next challenge was to get cache controls and expiries set up. If they are set incorrectly, they may stop users from getting the latest content. I set up the policies using “.htaccess”. Below I’ve attached the part of the .htaccess I have for the /index.html page, which is updated many times a day. There is a similar .htaccess for the rest of the website which sets a much longer expiry.
  • Finally, I realized that I might have to invalidate parts of the cache at times (for example, due to a bug). Cloudberry and the AWS management console didn’t have any option available, but apparently “boto” has some APIs which can work with the Amazon Cloudfront APIs to do this (see the sketch after the .htaccess snippet below).

# turn on the module for this directory
ExpiresActive on
# set default
ExpiresDefault "access plus 1 hours"
ExpiresByType image/jpg "access plus 1 hours"
ExpiresByType image/gif "access plus 1 hours"
ExpiresByType image/jpeg "access plus 1 hours"
ExpiresByType image/png "access plus 1 hours"
ExpiresByType text/css "access plus 1 hours"
ExpiresByType text/javascript "access plus 1 hours"
ExpiresByType application/javascript "access plus 1 hours"
ExpiresByType application/x-javascript "access plus 1 hours"
ExpiresByType application/x-shockwave-flash "access plus 1 hours"

Header set Cache-Control "max-age=3600"

AddOutputFilterByType DEFLATE text/html text/plain text/xml application/javascript text/javascript  application/x-javascript text/css
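
On that invalidation point: here is a minimal, untested sketch of what an invalidation call through boto might look like. The distribution id and paths are placeholders, and treat the exact method name as my assumption from reading boto’s CloudFront module rather than something I have run in production.

import boto

# Placeholder credentials and distribution id -- replace with your own.
conn = boto.connect_cloudfront('<aws-access-key>', '<aws-secret-key>')

# Ask Cloudfront to drop specific objects from its caches,
# e.g. after pushing a broken index.html.
paths = ['/index.html', '/css/style.css']
request = conn.create_invalidation_request('EDFDVBDEXAMPLE', paths)
print(request.id)   # keep the id around if you want to check on it later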

Here is how I would summarize the current state of Amazon cloudfront.

  • It’s definitely ready for static websites which don’t have any server-side execution code.
  • Cloudfront only accepts GET and HEAD requests.
  • Cloudfront ignores cookies, so the server can’t set any. (Browser-based cookie management will still work, which could be used to keep in-browser session data.)
  • If you do want to use server-side code, use iframes, JSONP, JavaScript widgets or some other mechanism to execute code from a different domain name (one which is not on Cloudfront).
  • While Cloudfront can log access logs to an S3 bucket of your choice, I’d recommend using something like Google Analytics for log analysis.
  • I’d recommend buying one of the commercial third-party products if you want to use Custom Origin, and I’d recommend reading more about the protocols/APIs before you fully trust a production service to Cloudfront.
  • I wish Cloudfront would start supporting something like ESI, which could effectively make an S3 bucket a full-fledged webserver without the need to keep an EC2 instance running all the time.
  • Overall, Cloudfront has a very long way to go, in the number of features, before it can be treated as a competitor to Akamai’s current range of services.
  • And if you look at Akamai’s current worldwide presence, Cloudfront is just a tiny blip. [ Cloudfront edge locations ]
  • But I suspect that Cloudfront’s continuous evolution is being watched by many, and the next set of features could change the balance.

I’m planning to leave http://www.scalebig.com/ on Cloudfront for some time to learn a little more about its operational issues. If you have been using Cloudfront, please feel free to leave comments about what important features you think are still missing.

November 23, 2010

Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in recent weeks, partly due to an ongoing effort to expand their business, for which they seem to be hiring like crazy. Here is yet another interesting deck of slides, which covers both Dev and Ops.

It is one of the most interesting decks of slides I’ve seen in the recent past.

November 22, 2010

The Cloud: Watch your step (Google App Engine limitations)

Any blog which promotes the concept of cloud infrastructure would be doing an injustice if it didn’t also point to implementations where it failed horribly. Here is an excellent post by Carlos Ble where he lists all the problems he faced on Google App Engine (Python). He lists 13 different limitations, most of which are very well-known facts, and then lists some more frustrating reasons why he had to dump the solution and look for an alternative.

The tone of voice is understandable, and while it might look like App-Engine-bashing, I see it as a great story which others could learn from.

For us, GAE has been a failure like Wave or Buzz were but this time, we have paid it with our money. I've been too stubborn just because this great company was behind the platform but I've learned an important lesson: good companies make mistakes too. I didn't do enough spikes before developing actual features. I should have performed more proofs of concept before investing so much money. I was blind.

The cloud is not for everyone or for every problem. While some of these technologies take away your growing pains, they assume you are OK with some of their limitations. If you were surprised by these limitations only after you were neck-deep in coding, then you didn’t do your homework.

Here are the issues he pointed out. I haven’t used Google App Engine lately, but my understanding is that the App Engine team has solved, or is on the path to solving (or at least reducing the pain of), some of these issues.

  • Requires Python 2.5
  • Can’t use HTTPS
  • 30 seconds to run
  • URL fetch gets only 5 seconds
  • Can’t use python libraries compiled in C
  • No “LIKE” operators in datastore
  • Can’t join tables
  • “Too many indexes”
  • Only 1000 records at a time returned
  • Datastore and memcache can fail at times
  • Max memcache size is 1MB

November 16, 2010

Sawzall and the PIG

When I heard interesting use cases of how “Sawzall” is used to hack through huge amounts of log data within Google, I was thinking about two things.

  • Apache Pig, which is “a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.”
  • CEP (Complex Event Processing), which consists of processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time. [ Also look at Esper ]

Google has open-sourced parts of this framework in a project called “szl”.

Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google's logs processing language of choice and is used for various other data analysis tasks across the company.

Instead of specifying how to process the entire data set, a Sawzall program describes the processing steps for a single data record independent of others. It also provides statements for emitting extracted intermediate results to predefined containers that aggregate data over all records. The separation between per-record processing and aggregation enables parallelization. Multiple records can be processed in parallel by different runs of the same program, possibly distributed across many machines. The language does not specify a particular implementation for aggregation, but a number of aggregators are supplied. Aggregation within a single execution is automatic. Aggregation of results from multiple executions is not automatic but an example program is supplied.

Here is a quick example of how it could be used…

  topwords: table top(3) of word: string weight count: int;
  fields: array of bytes = splitcsvline(input);
  w: string = string(fields[0]);
  c: int = int(string(fields[1]), 10);
  if (c != 0) {
    emit topwords <- w weight c;
  }


Given the input:

  abc,1
  def,2
  ghi,3
  def,4
  jkl,5

The program (using --table_output) emits:

  topwords[] = def, 6, 0
  topwords[] = jkl, 5, 0
  topwords[] = ghi, 3, 0
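
For comparison, here is a rough Python sketch of the same per-record-processing-plus-aggregation split. This is not szl or Pig code, just an illustration of the idea, with a Counter standing in for Sawzall's top(3) table:

import csv
from collections import Counter

# Per-record processing: each line is handled independently of the others,
# which is what makes the work easy to spread across many machines.
def process_record(line, counts):
    fields = next(csv.reader([line]))
    word, count = fields[0], int(fields[1])
    if count != 0:
        counts[word] += count      # the equivalent of "emit topwords <- w weight c"

# Aggregation: collect the emitted values over all records.
counts = Counter()
for line in ["abc,1", "def,2", "ghi,3", "def,4", "jkl,5"]:
    process_record(line, counts)

print(counts.most_common(3))       # [('def', 6), ('jkl', 5), ('ghi', 3)]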

Presentation: “OrientDB, the database of the web”

I knew there was something called “OrientDB”, but didn’t know much about it until I went through these slides. Here is what I learned, in one sentence: it’s an easy-to-install, schemaless NoSQL datastore with absolutely no configuration required; it supports ACID transactions; it can be used as a document store, a graph store and a key-value store; it can be queried using SQL-like and JSON syntax; it supports indexing and triggers; and it has been benchmarked at 150,000 inserts per second on commodity hardware. That’s a lot of features.

November 12, 2010

Cloud economics: Not really black and white..

While some of the interest in moving towards the public cloud is based on sound economics, a small segment of this movement is purely due to “herd mentality”.

The slide on the right, from a Microsoft publication, shows that larger networks may be less economical on the cloud (at least today).

Richard Farley has been discussing this very topic for a few months now. He observed that a medium-sized organization which already has a decent IT infrastructure, including a dedicated IT staff to support it, has a significantly smaller overhead than cloud vendors might have you believe.

Here is a small snippet from his blog. If you are not afraid to get dirty with numbers read the rest here.

Now, we know we need 300 virtual servers, each of which consumes 12.5% of a physical host.  This means we need a total of 37.5 physical hosts.  Our vendor tells us these servers can be had for $7k each including tax and delivery with the cabinet.  We can’t buy a half server, and want to have an extra server on hand in case one breaks.  This brings our total to 39 at a cost of $273k.  Adding in the cost of the cabinet, we’re up to $300k.

There are several non-capital costs we now have to factor in.  Your vendor will provide warranty, support and on-site hardware replacement service for the cabinet and servers for $15k per year.  Figure you will need to allocate around 5% of the time of one of your sys admins to deal with hardware issues (i.e., coordinating repairs with the vendor) at a cost of around $8k per year in salary and benefits.  Figure power and cooling for the cabinet will also cost $12k per year.  In total, your non-capital yearly costs add up to $35k.
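
To make the arithmetic in that snippet explicit, here is a small back-of-the-envelope sketch. The figures are his; the roughly $27k cabinet cost is only implied by the difference between the $273k server total and the $300k overall figure:

import math

vms = 300                                    # virtual servers needed
vm_share = 0.125                             # each VM consumes 12.5% of a physical host
hosts = int(math.ceil(vms * vm_share)) + 1   # 37.5 -> 38 (no half servers), plus 1 spare = 39

server_capex = hosts * 7000                  # $7k per server -> $273,000
total_capex = 300000                         # his figure once the cabinet is added in

yearly_opex = 15000 + 8000 + 12000           # vendor support + sysadmin time + power/cooling
print(hosts, server_capex, yearly_opex)      # 39 273000 35000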

One thing the post doesn’t clearly articulate is that while long-term infrastructure is cheaper to host in a private cloud, it may still be more economical to use the public cloud for short-term, resource-intensive projects.

November 11, 2010

Cassandra: What is HintedHandoff ?

Nate has a very good post about how Cassandra is different from a lot of other distributed datastores. In particular, he explains that every node in a Cassandra cluster is identical to every other node. After using Cassandra for a few months I can tell you for a fact that it’s true. It does come at a price, though. Because it is so decentralized, if you want to make a schema change, for example, the configuration files of all the nodes in the cluster need to be updated at the same time. Some of these problems will go away when 0.7 finally comes out.

While it is true that Cassandra doesn’t have the concept of a single “master” server, each node participating in the cluster actually acts as a master and a slave for parts of the entire key range. The percentage of the key range owned by an individual node depends on the replication factor, on how one picks the keys, and on which partitioner algorithm was selected.

The fun starts when a node, which could be the master for a range of keys, goes down. This is how Nate explains the process:

Though the node is the "primary" for a portion of the data in the cluster, the number of copies of the data kept on other nodes in the cluster is configurable. When a node goes down, the other nodes containing copies, referred to as "replicas", continue to service read requests and will even accept writes for the down node. When the node returns, these queued up writes are sent from the replicas to bring the node back up to date

And this is from the Cassandra wiki

If a node which should receive a write is down, Cassandra will write a hint to a live replica node indicating that the write needs to be replayed to the unavailable node. If no live replica nodes exist for this key, and ConsistencyLevel.ANY was specified, the coordinating node will write the hint locally. Cassandra uses hinted handoff as a way to (1) reduce the time required for a temporarily failed node to become consistent again with live ones and (2) provide extreme write availability when consistency is not required.

A hinted write is NOT sufficient to count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. Take the simple example of a cluster of two nodes, A and B, and a replication factor of 1 (each key is stored on one node). Suppose node A is down while we write key K to it with ConsistencyLevel.ONE. Then we must fail the write: recall from the API page that "if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write."

Thus if we write a hint to B and call the write good because it is written "somewhere," there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to him. Historically, only the lowest ConsistencyLevel of ZERO would accept writes in this situation; for 0.6, we added ConsistencyLevel.ANY, meaning, "wait for a write to succeed anywhere, even a hinted write that isn't immediately readable."
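
The “W + R > ReplicationFactor” rule quoted above is easy to sanity-check with a couple of lines. This is just a trivial sketch of the inequality, not Cassandra code:

def strongly_consistent(w, r, replication_factor):
    # W writes + R reads overlapping on at least one replica means readers
    # always see the most recent write.
    return w + r > replication_factor

print(strongly_consistent(2, 2, 3))   # QUORUM writes + QUORUM reads at RF=3 -> True
print(strongly_consistent(1, 1, 3))   # ONE + ONE at RF=3 -> False, eventual consistency only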

Mike Perham has a related post on the same topic. He goes further and explains that because there could be scenarios where writes are not immediately visible due to a disabled master node, it’s possible that the master could get out of sync with the slaves in the confusion. There is a process called “anti-entropy” which Cassandra uses to detect and resolve such issues. Here is how he explains it:

The final trick up Cassandra’s proverbial sleeve is anti-entropy. AE explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff don’t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle Tree to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.
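
To make the Merkle tree idea concrete, here is a toy sketch of how two replicas can compare a whole key range cheaply by comparing root hashes, and only dig deeper (and stream data) when the roots differ. This is purely illustrative and not how Cassandra’s implementation is actually structured:

import hashlib

def h(data):
    return hashlib.sha1(data).digest()

def merkle_root(leaves):
    # Repeatedly hash pairs of child hashes until a single root remains.
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Each replica hashes its copy of the same key range.
replica_a = [h(b"k1=v1"), h(b"k2=v2"), h(b"k3=v3"), h(b"k4=v4")]
replica_b = [h(b"k1=v1"), h(b"k2=STALE"), h(b"k3=v3"), h(b"k4=v4")]

# Equal roots -> the range is in sync; different roots -> walk down the tree
# to find the divergent sub-range and repair just that part.
print(merkle_root(replica_a) == merkle_root(replica_b))   # False -> repair needed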

November 08, 2010

OpenTSDB – Distributed time series database

Ever since I saw a demo of this tool, I’ve been waiting on the edge of my seat for it to be open-sourced so that I could use it. The problem it’s trying to solve is a real pain point which most webops folks would understand.

Yesterday the folks at StumbleUpon finally opened it up. It’s released under the LGPLv3 license. You can find the source here and the documentation here.

At StumbleUpon, we have found this system tremendously helpful to:

  • Get real-time state information about our infrastructure and services.
  • Understand outages or how complex systems interact together.
  • Measure SLAs (availability, latency, etc.)
  • Tune our applications and databases for maximum performance.
  • Do capacity planning.
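
For the curious, getting a data point into OpenTSDB is, as I understand the docs, just a one-line telnet-style “put” command sent to a TSD. A rough Python sketch, assuming a TSD listening on localhost:4242 and using a made-up metric name and host tag:

import socket
import time

def send_metric(metric, value, tags):
    # put <metric> <unix-timestamp> <value> <tag1=val1> [<tag2=val2> ...]
    line = "put %s %d %s %s\n" % (
        metric, int(time.time()), value,
        " ".join("%s=%s" % kv for kv in tags.items()))
    sock = socket.create_connection(("localhost", 4242))
    sock.sendall(line.encode("ascii"))
    sock.close()

send_metric("proc.loadavg.1min", 0.36, {"host": "web01"})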

AWS cloudfront grows up… a little. Now allows Custom origins.

Cloudfront has come a long way from its humble beginnings. Here is what Jeff had to say when he announced that it’s out of “beta”…

    1. First, we've removed the beta tag from CloudFront and it is now in full production. During the beta period we listened to our customers and added a number of important features including Invalidation, a default root object, HTTPS access, private content, streamed content, private streamed content, AWS Management Console support, request logging, and additional edge locations. We've also reduced our prices.
    2. There's now an SLA (Service Level Agreement) for CloudFront. If availability of your content drops below 99.9% in any given month, you can apply for a service credit equal to 10% of your monthly bill. If the availability drops below 99% you can apply for a service credit equal to 25% of your monthly bill.

While all this is a big step forward, it’s probably not enough for the more advanced CDN users to switch over yet.

Here are a couple of issues which stuck out in the Developer Guide.

  • Query parameters are not used to generate the cache key. So while it looks like it can pull content from an Elastic Load Balancer, it still acts like a giant S3 accelerator (see the sketch after this list).
  • Doesn’t support HTTP/1.1 yet. So if you host multiple domains on the same IP, this solution isn’t for you.
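
On the first point: since the query string is ignored, the usual workaround is to version the object path itself, so a changed file gets a brand-new, never-cached URL. A small sketch of what I mean, with a made-up helper and file layout:

import hashlib
import os

def versioned_url(base_url, local_path):
    # Embed a short content hash in the filename, e.g. style.css -> style.3e2a1b9c.css,
    # so there is never a need to invalidate the old object.
    with open(local_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()[:8]
    name, ext = os.path.splitext(os.path.basename(local_path))
    return "%s/%s.%s%s" % (base_url, name, digest, ext)

# print(versioned_url("http://www.scalebig.com/css", "css/style.css"))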

November 07, 2010

Building your first cloud application on AWS

Building your first web application on AWS is like shopping for a car at Pep Boys, part by part. While the manuals to build one might be in aisle 5, the experience of having built one already is harder to buy.

Here are some interesting logistical questions which I don’t think get enough attention when people discuss building a new AWS-based service.

  1. Picking the right Linux distribution: Switching OS distributions may not be simple if your applications need custom scripts. Picking a single distribution and sticking with it will save a lot of lost time.
  2. Automated server builds: There are many ways to skin this cat. Chef, Puppet, and Cfengine are all good. What’s important is to pick one early in the game.
  3. Multi-Availability-Zone support: Find out if multi-availability-zone support is important. This can impact the overall architecture of the solution and the tools used to implement it.
  4. Data consistency requirements: Similar to the Multi-AZ question, it’s important to understand the data consistency tolerance of the application before one starts designing it.
  5. Datastore: There are different kinds of datastores available as part of AWS itself (SimpleDB, S3 and RDS). If you are planning to keep your options open about moving out of AWS at some point, you should think about picking a datastore which you could take with you with little effort. There are many NoSQL and RDBMS solutions to choose from.
  6. Backups: While some think it’s a waste of time to think about backups too early, I suspect those who don’t will be spending way too much time on them later. The long-term backup strategy is an integral part of disaster recovery planning, without which you shouldn’t think of going live.
  7. Integration with external data sources: If this application is part of a larger cluster of applications running somewhere else, think about how data would be sent back and forth. There are lots of different options depending on how much data is involved (and how important protection of that data is).
  8. Monitoring/Alerting: Most standard out-of-the-box monitoring tools can’t handle dynamic infrastructure very well. There are, however, plugins available for many existing monitoring solutions which can handle the dynamic nature of the infrastructure. You could also choose to use one of the 3rd-party monitoring services if you’d rather pay someone else to do it.
  9. Security: You should be shocked to see this at #9 on my list. If your service involves user data, or some other kind of intellectual property, build a multi-tiered architecture to segment different parts of your application from targeted attacks. Security is also very important while picking the right caching and web server technologies.
  10. Development: Figure out how developers would use AWS. Would they share the same AWS account, share parts of the infrastructure, share the datastore, etc.? How would developer resources be monitored so that unintentional use of excessive resources could be flagged for alerting? (See the sketch after this list.)
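
On the last point, one low-tech approach is to tag every developer-launched instance so that runaway usage can be spotted and attributed. A rough sketch using boto; the AMI id, keypair, and tag names are placeholders, and treat the exact calls as my assumptions to verify against boto’s documentation:

import boto

conn = boto.connect_ec2('<aws-access-key>', '<aws-secret-key>')

# Launch a small dev instance and tag it with an owner and environment.
reservation = conn.run_instances('ami-12345678', instance_type='m1.small',
                                 key_name='dev-keypair')
instance = reservation.instances[0]
conn.create_tags([instance.id], {'owner': 'alice', 'env': 'dev', 'project': 'prototype'})

# A nightly job can then list instances and flag anything tagged env=dev
# that has been running for more than a day.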

Are there other subtle issues which I should have listed here? Let me know.


November 05, 2010

Rapid prototyping with Solr

Extreme prototyping with Solr, by Erik Hatcher

At ApacheCon this week I presented “Rapid Prototyping with Solr”.  This is the third time I’ve given a presentation with the same title.  In the spirit of the rapid prototyping theme, each time I’ve created a new prototype just a day or so prior to presenting it.  At Lucene EuroCon the prototype used attendee data, a treemap visualization, and a cute little Solr-powered “app” for picking attendees at random for the conference giveaways.  For a recent Lucid webinar the prototype was more general purpose, bringing in and making searchable rich documents and faceting on file types with a pie chart visualization.

This time around, the data set I chose was Data.gov’s catalog of datasets, which fit with the ApacheCon open source aura, and Lucid Imagination’s support of Open Source for America, which helps to advocate for open source in the US Federal Government.  The prototype built includes faceting browsing, query term suggest, hit highlighting, result clustering, spell checking, document detail, and a bonus Venn diagram visualization.

November 04, 2010

Shipping Trunk: For web applications

I had briefly blogged about this presentation from Velocity 2010 before. I wish they had released the video for this session. I went through the slide deck again today to see if Paul mentioned any of the problems organizations like ours are dealing with in their transition from quarterly releases to weekly/continuous releases.

One of the key observations Paul made during his talk is that most organizations still treat web applications like desktop software and have very strict quality controls, which may not be as necessary since releasing changes for a web app delivered as SaaS (software as a service) is much cheaper than releasing patches for traditional desktop software.

Here are some of the other points he made. For really detailed info check out the [slides]

  • Deploy frequently, facilitating rapid product iteration
  • Avoid large merges and the associated integration testing
  • Easily perform A/B testing of functionality
  • Run QA and beta testing on production hardware
  • Launch big features without worrying about your infrastructure
  • Provide all the switches your operations team needs to manage the deployed system
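
A minimal sketch of the kind of runtime switch the last point implies. The flag names and functions here are made up; the point is just that the new code ships dark in trunk and is turned on from configuration, not from a branch or a redeploy:

def old_checkout(user):
    return "old checkout for %s" % user

def new_checkout(user):
    return "new checkout for %s" % user

# In practice this would be loaded from a config file or an admin console,
# so operations can flip a flag without pushing code.
FEATURES = {
    "new_checkout_flow": False,   # staff first, then 1% of users, then everyone
}

def feature_enabled(name, user=None):
    # A percentage rollout or A/B test would also look at the user here.
    return FEATURES.get(name, False)

def checkout(user):
    if feature_enabled("new_checkout_flow", user):
        return new_checkout(user)
    return old_checkout(user)

print(checkout("alice"))   # stays on the old path until the flag is flipped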

Slides: Always ship Trunk: Managing Change In Complex Websites

November 03, 2010

Real-Time MapReduce using S4

While trying to figure out how to do real-time log analysis in my own organization, I realized that most map-reduce frameworks are designed to run as batch jobs in a time-delayed manner rather than being instantaneous like a SQL query to a MySQL DB. There are some frameworks which are bucking the trend. Yahoo! Labs recently announced that their “Advertising Sciences” group has built a general-purpose, real-time, distributed, fault-tolerant, scalable, event-driven, expandable platform called “S4” which allows programmers to easily implement applications for processing continuous unbounded streams of data.

S4 clusters are built using low-cost commoditized hardware, and leverage many technologies from Yahoo!’s Hadoop project. S4 is written in Java and uses the Spring Framework to build a software component architecture. Over a dozen pluggable modules have been created so far.

Why do we need a real-time map-reduce framework?
Applications such as personalization, user feedback, malicious traffic detection, and real-time search require both very fast response and scalability. In S4 we abstract the input data as streams of key-value pairs that arrive asynchronously and are dispatched intelligently to processing nodes that produce data sets of output key-value pairs. In search, for example, the output data sets are made available to the serving system before a user executes her next search query. We use this rapid feedback to adapt the search models based on user intent
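
To make the keyed-stream idea concrete, here is a toy sketch of events being routed by key to a stateful per-key processor. This is only an illustration of the concept, not the S4 API; in S4 the hash of the key picks a processing node in the cluster, whereas here it just picks an in-process object:

from collections import defaultdict

class ClickCounter(object):
    # A stand-in for a "processing element": it keeps per-key state and could
    # emit derived events downstream.
    def __init__(self):
        self.count = 0

    def process(self, key, value):
        self.count += 1
        if self.count % 2 == 0:
            print("user %s reached %d clicks" % (key, self.count))

processing_elements = defaultdict(ClickCounter)

def dispatch(event):
    key, value = event
    processing_elements[key].process(key, value)

for event in [("user42", "click"), ("user7", "click"), ("user42", "click"), ("user42", "click")]:
    dispatch(event)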

Read more: Original post from Yahoo! Labs

Storage options on app engine

For those who think Google App Engine only has one kind of datastore, the one built around “Bigtable”, think again. Nick Johnson goes into the details of all the other options available, with their pros and cons, in his post.

App Engine provides more data storage mechanisms than is apparent at first glance. All of them have different tradeoffs, so it's likely that one - or more - of them will suit your application well. Often, the ideal solution involves a combination, such as the datastore and memcache, or local files and instance memory.

Storage options he lists:

  • Datastore
  • Memcache
  • Instance memory
  • Local Files
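
As a tiny illustration of combining two of these, here is the classic read-through cache pattern with the datastore behind memcache (App Engine Python SDK; the model and key names are made up):

from google.appengine.api import memcache
from google.appengine.ext import db

class Counter(db.Model):
    count = db.IntegerProperty(default=0)

def get_count(name):
    # Serve from memcache when possible, fall back to the datastore,
    # and repopulate the cache on a miss.
    value = memcache.get(name)
    if value is None:
        counter = Counter.get_by_key_name(name)
        value = counter.count if counter else 0
        memcache.set(name, value, time=60)   # cache for a minute
    return value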
Read more: Original post from Nick

November 01, 2010

Is auto-sharding ready for auto-pilot?

James Golick makes a point which a lot of people miss. He doesn’t believe the auto-sharding features NoSQL datastores provide are ready for full auto-pilot yet, and he argues that good developers have to think about sharding as part of the design and architecture, regardless of which datastore you pick.

If you take at face value the marketing materials of many NoSQL database vendors, you'd think that with a horizontally scalable data store, operations engineering simply isn't necessary. Recent high profile outages suggest otherwise.

MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren't there.

At the very least, one has to understand how auto-sharding in a NoSQL datastore works, and how easy it is to set up, maintain, back up, and restore. “Rebalancing” can be an expensive operation, and if shards are separated by distance or high latency, some designs might be better than others.
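
To see why, here is a toy consistent-hash ring, the kind of mechanism many auto-sharding implementations use under the hood. Adding a node only remaps the keys that land between it and its neighbour on the ring, but every one of those keys still has to be physically copied over, which is what makes rebalancing expensive. This is a generic illustration, not the partitioner of any particular datastore:

import bisect
import hashlib

class Ring(object):
    def __init__(self, nodes):
        # Place each node on the ring at the position given by its hash.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.hashes = [h for h, _ in self.ring]

    def _hash(self, key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first node clockwise from its hash position.
        i = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])

keys = ["user:%d" % i for i in range(1000)]
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print("keys that must move to the new node: %d of %d" % (moved, len(keys)))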