AWS CloudWatch is now really open for business

In a surprise move, Amazon today released a bunch of new features for its CloudWatch service, some of which were, until now, provided by third-party service providers.

  • Basic Monitoring of Amazon EC2 instances at 5-minute intervals at no additional charge.
  • Elastic Load Balancer Health Checks – Auto Scaling can now be instructed to automatically replace instances that have been deemed unhealthy by an Elastic Load Balancer.
  • Alarms – You can now monitor Amazon CloudWatch metrics, with notification to the Amazon SNS topic of your choice when the metric falls outside of a defined range.
  • Auto Scaling Suspend/Resume – You can now push a "big red button" in order to prevent scaling activities from being initiated.
  • Auto Scaling Follow the Line – You can now use scheduled actions to perform scaling operations at particular points in time, creating a time-based scaling plan.
  • Auto Scaling Policies – You now have more fine-grained control over the modifications to the size of your AutoScaling groups.
  • VPC and HPC Support – You can now use AutoScaling with Amazon EC2 instances that are running within your Virtual Private Cloud or as Cluster Compute instances.
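
The alarm behaviour in the list above (notify when a metric falls outside a defined range) can be sketched in a few lines. This is only an illustration of the idea, not the CloudWatch API: the RangeAlarm class, the thresholds, and the sample datapoints are all made up, and the notify callback stands in for publishing to an SNS topic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RangeAlarm:
    """Hypothetical stand-in for an alarm; not the CloudWatch API."""
    name: str
    low: float
    high: float
    notify: Callable[[str], None]  # stand-in for publishing to an SNS topic

    def evaluate(self, value: float) -> bool:
        """Return True (and notify) when value lies outside [low, high]."""
        if value < self.low or value > self.high:
            self.notify(f"ALARM {self.name}: {value} outside [{self.low}, {self.high}]")
            return True
        return False


# Collect notifications in a list instead of sending them anywhere.
sent: List[str] = []
cpu_alarm = RangeAlarm("cpu-utilization", low=10.0, high=80.0, notify=sent.append)

# Made-up datapoints, one per 5-minute monitoring interval.
for sample in (35.0, 92.5, 50.0):
    cpu_alarm.evaluate(sample)
```

Only the 92.5 sample triggers a notification here; the other two are inside the configured range.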

The Cloud: Watch your step (Google App Engine limitations)

Any blog which promotes the concept of cloud infrastructure would be doing an injustice if it didn’t provide references to implementations where it failed horribly. Here is an excellent post by Carlos Ble where he lists all the problems he faced on Google App Engine (Python). He lists 13 different limitations, most of which are well-known facts, and then gives some more frustrating reasons why he had to dump the solution and look for an alternative.

The tone is understandable, and while it might look like App Engine bashing, I see it as a great story which others could learn from.

For us, GAE has been a failure like Wave or Buzz were but this time, we have paid it with our money. I’ve been too stubborn just because this great company was behind the platform but I’ve learned an important lesson: good companies make mistakes too. I didn’t do enough spikes before developing actual features. I should have performed more proofs of concept before investing so much money. I was blind.

The cloud is not for everyone or for all problems. While some of these technologies take away your growing pains, they assume you are OK with some of the limitations. If you were surprised by these limitations once you were neck-deep in coding, then you didn’t do your homework.

Here are the issues he pointed out. I haven’t used Google App Engine lately, but my understanding is that the App Engine team has solved, or is on the path to solving (or reducing the pain of), some of these issues.

  • Requires Python 2.5
  • Can’t use HTTPS
  • 30 seconds to run
  • URL fetch gets only 5 seconds
  • Can’t use Python libraries compiled in C
  • No “LIKE” operators in datastore
  • Can’t join tables
  • “Too many indexes”
  • Only 1000 records at a time returned
  • Datastore and memcache can fail at times
  • Max memcache size is 1MB
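
To make one of these limits concrete: a common way to live with a per-entry size cap such as the 1MB memcache limit is to split a large value into chunks. The sketch below is not App Engine code. It uses a plain dict as a stand-in for the cache, and the helper names (set_large, get_large) and the chunk size are assumptions made for illustration.

```python
from typing import Dict, Optional

# Assumed per-entry cap: stay a little under the 1MB limit mentioned above.
CHUNK_SIZE = 1_000_000


def set_large(cache: Dict[str, object], key: str, data: bytes) -> None:
    """Store data as several entries, each no larger than CHUNK_SIZE."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    cache[f"{key}:count"] = len(chunks)
    for index, chunk in enumerate(chunks):
        cache[f"{key}:{index}"] = chunk


def get_large(cache: Dict[str, object], key: str) -> Optional[bytes]:
    """Reassemble the value; return None if any piece is missing."""
    count = cache.get(f"{key}:count")
    if count is None:
        return None
    parts = []
    for index in range(count):
        chunk = cache.get(f"{key}:{index}")
        if chunk is None:
            return None  # one chunk was evicted, so treat the whole value as a miss
        parts.append(chunk)
    return b"".join(parts)
```

Because a cache can drop any entry at any time, the reader has to treat a single missing chunk as a miss for the whole value, which is what get_large does.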

DealNews: Scaling for Traffic Spikes

Last year dealnews.com unexpectedly got listed on the front page of a high-traffic site for a couple of hours. No matter how optimistic one is, unexpected events like these can take down a regular website with almost no effort at all. What is your plan if you get slashdotted? Are you OK with a short outage? What is the acceptable level of service for your website anyway?

One way to handle such unexpected traffic is to have multiple layers of cache. A database query cache is one layer; generating and caching dynamic content ahead of time (maybe using a cron job) is another. Tools like memcached, Varnish, and Squid can all help reduce the load on application servers.
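
As a rough sketch of what "multiple layers of cache" means in code, here is a lookup that tries the fastest layer first and only falls back to generating the content when every layer misses. The LayeredCache class and the render_page function are hypothetical, and plain dicts stand in for real tiers such as an in-process cache and memcached.

```python
from typing import Callable, Dict, List


class LayeredCache:
    """Hypothetical multi-layer cache; each layer is a plain dict here."""

    def __init__(self, layers: List[Dict[str, str]], origin: Callable[[str], str]):
        self.layers = layers  # ordered fastest first
        self.origin = origin  # expensive fallback, e.g. rendering from the database

    def get(self, key: str) -> str:
        for depth, layer in enumerate(self.layers):
            if key in layer:
                value = layer[key]
                # Copy the value into the faster layers that missed.
                for faster in self.layers[:depth]:
                    faster[key] = value
                return value
        # Every layer missed: generate once and fill all layers.
        value = self.origin(key)
        for layer in self.layers:
            layer[key] = value
        return value


origin_calls: List[str] = []


def render_page(key: str) -> str:
    """Stand-in for the expensive work done by the application server."""
    origin_calls.append(key)
    return f"<html>{key}</html>"


local: Dict[str, str] = {}
shared: Dict[str, str] = {}
cache = LayeredCache([local, shared], render_page)
first = cache.get("/deals/today")   # misses everywhere, rendered once
second = cache.get("/deals/today")  # served from the fastest layer
```

If the local layer is emptied (say, after a process restart), the next lookup is still answered from the shared layer without touching the origin.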

Proxy servers (or web servers) in front of application servers play a special role at Dealnews. They understood the limitations of the application servers they were using, and the fact that slow client connections mean longer-lasting TCP sessions to the application servers. Proxy servers like Varnish can offload that job and take care of content delivery without keeping application servers busy. In addition, Varnish acts as a content cache, which further reduces load on the application servers.

Dealnews’ content is extremely dynamic, which is why it uses a very low TTL of 5 minutes for most of its pages. That may not look like a lot, but at thousands of pages per second such a cache can do miracles. While caching is great, one thing every heavily loaded website has to figure out is how to avoid the “cache stampede” when the TTL expires. A “cache stampede” is what happens when hundreds of requests for the same resource hit the server at the same time, forcing the web server to forward all of them to the app server and the database server because the cached copy has expired.
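
One generic way to blunt a stampede is to let only one request per key regenerate the expired entry while the others wait for its result. The sketch below illustrates that idea with per-key locks; it is not how Dealnews does it, and the class name, the slow_render function, and the timings are all assumptions.

```python
import threading
import time
from typing import Callable, Dict, List, Optional, Tuple


class StampedeGuardedCache:
    """Hypothetical TTL cache where one caller per key does the regeneration."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects _key_locks

    def _fresh(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]
        return None

    def get(self, key: str, regenerate: Callable[[str], str]) -> str:
        value = self._fresh(key)
        if value is not None:
            return value
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refilled the entry while we waited.
            value = self._fresh(key)
            if value is None:
                value = regenerate(key)
                self._entries[key] = (time.monotonic() + self.ttl, value)
            return value


calls: List[str] = []


def slow_render(key: str) -> str:
    """Stand-in for an expensive page build that hits the app and database servers."""
    calls.append(key)
    time.sleep(0.05)
    return f"page for {key}"


cache = StampedeGuardedCache(ttl_seconds=300)  # the 5-minute TTL mentioned above
threads = [
    threading.Thread(target=cache.get, args=("/front", slow_render))
    for _ in range(100)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All 100 simulated requests arrive while the entry is missing, yet slow_render runs a single time; the other 99 block briefly and then read the refreshed entry.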

Dealnews solves this problem by separating content generation from content delivery. They run a process which converts data from more than 300 tables of normalized data into 30 tables of highly redundant, de-normalized data. This data is kept in such a way that the application servers only need to make queries using primary keys or unique keys. With such a design, a cluster of MySQL DB servers shouldn’t have any problem handling thousands of queries per second from the front-end application servers.
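
A tiny version of that split, using SQLite from the Python standard library instead of MySQL so it runs anywhere: a generation step performs the joins once and writes a flat table, and the delivery step does nothing but a primary-key lookup. The table names, column names, and sample rows are invented for illustration and have nothing to do with the real Dealnews schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vendors    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE deals      (id INTEGER PRIMARY KEY, title TEXT,
                             vendor_id INTEGER, category_id INTEGER);
    -- Read-optimized, redundant copy used by the front end.
    CREATE TABLE deal_pages (deal_id INTEGER PRIMARY KEY, title TEXT,
                             vendor_name TEXT, category_name TEXT);
    INSERT INTO vendors    VALUES (1, 'ExampleStore');
    INSERT INTO categories VALUES (1, 'Laptops');
    INSERT INTO deals      VALUES (100, '15-inch laptop', 1, 1);
""")


def regenerate_deal_pages(db: sqlite3.Connection) -> None:
    """Content generation: run the joins once, away from the request path."""
    with db:  # commit on success, roll back on error
        db.execute("DELETE FROM deal_pages")
        db.execute("""
            INSERT INTO deal_pages (deal_id, title, vendor_name, category_name)
            SELECT d.id, d.title, v.name, c.name
            FROM deals d
            JOIN vendors v    ON v.id = d.vendor_id
            JOIN categories c ON c.id = d.category_id
        """)


def fetch_deal_page(db: sqlite3.Connection, deal_id: int):
    """Content delivery: a single primary-key lookup, no joins."""
    return db.execute(
        "SELECT title, vendor_name, category_name FROM deal_pages WHERE deal_id = ?",
        (deal_id,),
    ).fetchone()


regenerate_deal_pages(conn)
page = fetch_deal_page(conn, 100)
```

The trade-off is the usual one for de-normalized data: reads become cheap and predictable, while the generation process has to be rerun (or run incrementally) whenever the normalized source changes.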

Twitter drives a lot of traffic, and since a lot of that data is redundant, it relies heavily on caches. So heavily, in fact, that the site could go down completely if a few memcached servers go down. Dealnews explicitly tested their application with the memcached servers disabled to see what the worst-case scenario was for reinitializing the cache. They then optimized their app to the point where the response time only doubled, from about 0.75 seconds to 1.5 seconds per page, without memcached servers.
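
The "what happens with the cache switched off" test is easier to run if the page path treats the cache as optional. Here is a minimal sketch of that, assuming a hypothetical FlakyCache class whose up flag simulates the memcached tier being disabled; none of these names come from the talk.

```python
from typing import Callable, Dict, List, Optional


class CacheUnavailable(Exception):
    """Raised by the stand-in cache when the cache tier is down."""


class FlakyCache:
    """Hypothetical cache client; set up = False to simulate an outage."""

    def __init__(self) -> None:
        self.up = True
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.up:
            raise CacheUnavailable(key)
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.up:
            raise CacheUnavailable(key)
        self._data[key] = value


def get_page(cache: FlakyCache, key: str, render: Callable[[str], str]) -> str:
    """Serve from cache when possible; keep working (more slowly) when it is down."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return render(key)  # cache tier down: build the page directly
    page = render(key)
    try:
        cache.set(key, page)
    except CacheUnavailable:
        pass  # failing to store must not fail the request
    return page


render_calls: List[str] = []


def render(key: str) -> str:
    render_calls.append(key)
    return f"rendered {key}"


cache = FlakyCache()
get_page(cache, "/front", render)  # miss: renders and stores
get_page(cache, "/front", render)  # hit: no render
cache.up = False                   # simulate disabling the cache servers
page_without_cache = get_page(cache, "/front", render)
```

With the cache down every request pays the full rendering cost, which is the worst case worth measuring before it happens in production.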

Handling 3rd party content can be tricky. Dealnews treats 3rd party content as a lower-class citizen. They not only load 3rd party content at the very end of the page, they also try to use iframes wherever possible to keep loading of those objects from loading of

If you are interested in the video recording or the slides from the talk, click on the following links.