AWS CloudWatch is now really open for business

In a surprise move, Amazon today released a bunch of new features for its CloudWatch service, some of which, until now, were provided by third-party service providers.

  • Basic Monitoring of Amazon EC2 instances at 5-minute intervals at no additional charge.
  • Elastic Load Balancer Health Checks – Auto Scaling can now be instructed to automatically replace instances that have been deemed unhealthy by an Elastic Load Balancer.
  • Alarms – You can now monitor Amazon CloudWatch metrics, with notification to the Amazon SNS topic of your choice when a metric falls outside of a defined range (see the sketch after this list).
  • Auto Scaling Suspend/Resume – You can now push a "big red button" in order to prevent scaling activities from being initiated.
  • Auto Scaling Follow the Line – You can now use scheduled actions to perform scaling operations at particular points in time, creating a time-based scaling plan.
  • Auto Scaling Policies – You now have more fine-grained control over the modifications to the size of your Auto Scaling groups.
  • VPC and HPC Support – You can now use Auto Scaling with Amazon EC2 instances that are running within your Virtual Private Cloud or as Cluster Compute instances.
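
To make the alarm feature concrete, here is a minimal sketch using boto3 (a later-generation AWS SDK than what shipped alongside this announcement); the alarm name, instance ID, and SNS topic ARN are made-up placeholders:

```python
# Hypothetical alarm: notify ops when average CPU stays above 80%
# for two consecutive 5-minute periods.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="web01-high-cpu",                  # placeholder name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-12345678"}],  # placeholder
    Statistic="Average",
    Period=300,                                  # the 5-minute basic interval
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # your SNS topic
)
```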

Scaling Graphite by using Cfmap as the data transport

Graphite is an extremely promising system- and resource-graphing tool which tries to take RRD to the next level. Here are some of the Graphite features I liked most.

  • Updates can happen to records in the past (RRD doesn’t allow this, I think)
  • Creation of new datasets is trivial with whisper/carbon (they’re part of the Graphite framework); see the sketch after this list
  • Graphite allows federated storage (multiple servers across multiple datacenters for example)
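
For example, carbon accepts datapoints over a simple plaintext protocol (one `metric-path value timestamp` line per datapoint, on TCP port 2003 by default), which is what makes both new datasets and past updates so easy. A minimal sketch, with a placeholder hostname:

```python
import socket
import time

def send_metric(path, value, timestamp=None,
                host="graphite.example.com", port=2003):
    """Send one datapoint to carbon; new metric paths are created automatically."""
    ts = int(timestamp if timestamp is not None else time.time())
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(("%s %s %d\n" % (path, value, ts)).encode("ascii"))
    finally:
        sock.close()

send_metric("servers.web01.load_avg", 0.42)                      # new dataset
send_metric("servers.web01.load_avg", 0.37, time.time() - 3600)  # update the past
```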

Monitoring and graphing resources across datacenters is tricky, however, especially because WAN links cannot be trusted. While losing some data may be acceptable to some folks, it may not be to others. Graphite tries to solve this problem by providing an option to federate data across multiple servers, each of which could be kept in a separate datacenter.

Another way to solve this problem is by using a data transport which is resilient to network failures.

Since Cfmap (thanks to Cassandra) is a distributed, eventually consistent state repository, it could easily be extended to act as an eventually consistent data queue for tools like Graphite. With some minor modifications, we were able to log and publish all changes to system attributes using an API like this. And with the right script running on the Graphite server, importing these stats into carbon (a component of Graphite) became a trivial task.
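
As a rough sketch of what such an import script could look like (the Cfmap endpoint and JSON layout below are assumptions for illustration, not Cfmap’s actual API):

```python
import json
import socket
import time
import urllib.request

CFMAP_URL = "http://cfmap.example.com/api/servers"  # hypothetical endpoint
CARBON = ("graphite.example.com", 2003)             # carbon's plaintext listener

def poll_and_forward():
    # Assumed response shape: a list of {"name": ..., "attributes": {...}} records.
    records = json.load(urllib.request.urlopen(CFMAP_URL))
    now = int(time.time())
    sock = socket.create_connection(CARBON)
    try:
        for rec in records:
            for stat, value in rec["attributes"].items():
                line = "servers.%s.%s %s %d\n" % (rec["name"], stat, value, now)
                sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```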


With Cfmap’s easy REST interface, adding new stats to Graphite becomes as simple as registering them with Cfmap from anywhere in the network. [ sample script ]
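
In the same hypothetical spirit, registering a stat could be a single REST call; the URL scheme and payload below are illustrative guesses, not Cfmap’s documented interface:

```python
import json
import urllib.request

def register_stat(server, stat, value,
                  base_url="http://cfmap.example.com/api/servers"):  # hypothetical
    """Publish a single attribute for a server record via a REST PUT."""
    req = urllib.request.Request(
        "%s/%s/attributes" % (base_url, server),
        data=json.dumps({stat: value}).encode("ascii"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

register_stat("web01", "load_avg", 0.42)
```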

Graphite is not in use on our corporate network today, but I’m extremely excited about the possibilities and will be actively looking at how else we could use it.

[ Take a look at RabbitMQ integration with Graphite for another interesting way of working with graphite ]

Monitoring large-scale application clusters

Most software engineering organizations build applications with some hooks in place to allow functional tests. Some organizations continuously build and test all software automatically at check-in. And then there are those who have learnt from mistakes, and have built a suite of tests which get triggered at startup to look for problems which could indicate a failed initialization.

The next step in building a scalable web application is creating some form of self-monitoring logic (sometimes called a watchdog) which periodically tests the application (or monitors its performance statistics) for problems worth escalating to the operations team.
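
A minimal sketch of such a watchdog, with placeholder checks and a stubbed-out escalation hook standing in for application-specific logic:

```python
import threading
import time

def escalate(name, error):
    # Placeholder: page the operations team, publish to an alerting topic, etc.
    print("WATCHDOG: check %r failed: %s" % (name, error))

def watchdog(checks, interval=60):
    """Run every registered self-test periodically and escalate any failure."""
    while True:
        for name, check in checks.items():
            try:
                check()          # a check signals trouble by raising
            except Exception as exc:
                escalate(name, exc)
        time.sleep(interval)

checks = {
    "db_ping": lambda: None,      # e.g. raise if a test query times out
    "queue_depth": lambda: None,  # e.g. raise if a work queue is backing up
}
threading.Thread(target=watchdog, args=(checks,), daemon=True).start()
```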

Arnon has a couple of interesting posts on the topic which I came across today. He summarized the whole suggestion using three acronyms, and I’m going to add one more to make it complete:

  1. BBIT: Build-time Built-in Tests
  2. PBIT: Power-on Built-in Tests
  3. CBIT: Continuous Built-in Tests
  4. IBIT: Initiated Built-in Tests
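
One illustrative way to wire these categories together is a small registry that tags each self-test with the phase it belongs to; everything here is a made-up sketch, not something from Arnon’s posts:

```python
PHASES = ("BBIT", "PBIT", "CBIT", "IBIT")
TESTS = {phase: [] for phase in PHASES}

def built_in_test(phase):
    """Decorator that registers a self-test under one of the four phases."""
    def register(fn):
        TESTS[phase].append(fn)
        return fn
    return register

@built_in_test("PBIT")
def config_is_sane():
    # Runs once at power-on/startup to catch a failed initialization.
    pass  # placeholder check

def run_phase(phase):
    for test in TESTS[phase]:
        test()

run_phase("PBIT")  # at startup; CBIT could be driven by the watchdog loop
                   # above, and IBIT by an operator-initiated request
```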

The organization I work for runs a relatively small number of servers, but one thing we learnt early on was that waiting for customers to report problems is the worst way to find out about issues. Setting up monitoring using something like Nagios/OpenView/curl scripts/etc. may work, but it’s not very scalable. Besides, black-box testing without insight into the application may not be sufficient or ideal for more complex applications.

To increase automation, reduce dependence on central monitoring infrastructure, and speed up failure detection (and prediction), it might be important to have some tests in each of the four zones listed above.