January 17, 2010

Architecting for the Cloud: Best practices

Amazon has published another “Best practices” document. This one covers the almost the entire collection of services. Its biased towards AWS (obviously), but its still one of the best description summary of the various services amazon offers today.


Just the diagram above tells a lot about how the various AWS services interact with each other. Here is another small section from the document.

AWS specific tactics to automate your infrastructure

  1. Define Auto-scaling groups for different clusters using the Amazon Auto-scaling feature in Amazon EC2.
  2. Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new AMIs dynamically using the Auto-scaling service) or send notifications.
  3. Store and retrieve machine configuration information dynamically: Utilize Amazon SimpleDB to fetch config data during boot-time of an instance (eg. database connection strings). SimpleDB may also be used to store information about an instance such as its IP address, machine name and role.
  4. Design a build process such that it dumps the latest builds to a bucket in Amazon S3; download the latest version of an application from during system startup.
  5. Invest in building resource management tools (Automated scripts, pre-configured images) or Use smart open source configuration management tools like Chef16, Puppet17, CFEngine 18or Genome19.
  6. Bundle Just Enough Operating System (JeOS20) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data21 and instance metadata after launch.
  7. Reduce bundling and launch time by booting from Amazon EBS volumes22 and attaching multiple Amazon EBS volumes to an instance. Create snapshots of common volumes and share snapshots23 among accounts wherever appropriate.
  8. Application components should not assume health or location of hardware it is running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically failover and start a new clone in case of a failure.

Monitoring large-scale application clusters

Most software engineering organizations build applications with some hooks in place to allow functional tests. Some organizations continuously build and test all software automatically at check-in. And then there are those who have learnt from mistakes, and have built a suite of tests which get triggered at startup to look for problems which could indicate a failed initialization.

The next step in building a scalable web application, is creating some form of self-monitoring logic (sometimes called a watchdog) which could periodically test itself (or monitor performance statistics) for problems worth escalating to operations team.

Arnon has a couple of (1)interesting  (2)posts on the topic which I came across today. He summarized the whole suggestion using 3 acronymns. And I’m going to add one more to make it complete

  1. BBIT : Build time Built in Tests.
  2. PBIT: Power-on Built in Tests
  3. CBIT: Continuous Built in Tests
  4. IBIT: Initiated Built in Tests

The organization I work for is relatively small in the number of servers but one thing we learnt early on was that waiting for customers to report problems was the worst way to find out about issues. Setting up monitoring using something like Nagios/Openview/curl_scripts/etc may work, but its not very scalable. Besides blackbox testing without insight into the application may not be sufficient/ideal for more complex applications.

To increase automation, reduce dependence on central monitoring infrastructure and to speed up failure detection (prediction), it might be important to have some tests in all of the 4 zones listed above.