Versioning data in S3 on AWS

One of the problems with Amazon’s S3 was the inability to take a “snapshot” of the state of S3 at any given moment. Such a snapshot is one of the most important DR (disaster recovery) steps before any major upgrade that could potentially corrupt data during a release. Until now, applications using S3 had to manage versioning of data themselves, but Amazon has now launched a versioning feature built into S3 itself to do this particular task. In addition, delete operations on versioned data can be set to require MFA (multi-factor authentication).

Versioning allows you to preserve, retrieve, and restore every version of every object in an Amazon S3 bucket. Once you enable Versioning for a bucket, Amazon S3 preserves existing objects any time you perform a PUT, POST, COPY, or DELETE operation on them. By default, GET requests will retrieve the most recently written version. Older versions of an overwritten or deleted object can be retrieved by specifying a version in the request.

The way the AWS Blog describes the feature, it looks like a new version is created every time an object is modified, so each object in S3 could have a different number of copies depending on how many times it was modified.
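
To make that behaviour concrete, here is a minimal in-memory sketch of the semantics described above. It is a toy model, not the S3 API: the class name `VersionedBucket`, its methods and the `v1`, `v2` style IDs are all made up for illustration (real S3 version IDs are opaque strings).

```python
import itertools

# Sentinel stored in the history when a key is deleted (hypothetical helper).
_DELETE_MARKER = object()


class VersionedBucket:
    """Toy model of a versioned bucket: every write adds a version."""

    def __init__(self):
        self._history = {}              # key -> list of (version_id, payload)
        self._ids = itertools.count(1)  # source of made-up version IDs

    def _append(self, key, payload):
        version_id = f"v{next(self._ids)}"
        self._history.setdefault(key, []).append((version_id, payload))
        return version_id

    def put(self, key, data):
        """Store data as a new version; older versions are preserved."""
        return self._append(key, data)

    def delete(self, key):
        """Record a delete marker instead of erasing the history."""
        return self._append(key, _DELETE_MARKER)

    def get(self, key, version_id=None):
        """Return the latest version, or a specific one if version_id is given."""
        history = self._history.get(key, [])
        if version_id is None:
            candidates = history[-1:]
        else:
            candidates = [entry for entry in history if entry[0] == version_id]
        if not candidates or candidates[0][1] is _DELETE_MARKER:
            raise KeyError(key)
        return candidates[0][1]

    def version_count(self, key):
        """Number of stored versions for key, delete markers included."""
        return len(self._history.get(key, []))
```

A plain `get` after a `delete` fails, but asking for an older version ID still returns the data, which is exactly what makes the feature useful as a safety net during a risky release.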

This kind of reminds me of an SVN/CVS-like version control system, and I wonder how long it will take for someone to build a source code versioning system on S3.

BTW, data requests to a versioned object are priced the same way as regular data, which basically means you are getting the feature itself for free; you pay only the normal rates for each version stored and transferred.


Architecting for the Cloud: Best practices

Amazon has published another “Best practices” document. This one covers almost the entire collection of services. It’s biased towards AWS (obviously), but it’s still one of the best summaries of the various services Amazon offers today.


Just the diagram above tells a lot about how the various AWS services interact with each other. Here is another small section from the document.

AWS specific tactics to automate your infrastructure

  1. Define Auto-scaling groups for different clusters using the Amazon Auto-scaling feature in Amazon EC2.
  2. Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new AMIs dynamically using the Auto-scaling service) or send notifications.
  3. Store and retrieve machine configuration information dynamically: Utilize Amazon SimpleDB to fetch config data during boot-time of an instance (eg. database connection strings). SimpleDB may also be used to store information about an instance such as its IP address, machine name and role.
  4. Design a build process such that it dumps the latest builds to a bucket in Amazon S3; download the latest version of an application from that bucket during system startup.
  5. Invest in building resource management tools (automated scripts, pre-configured images) or use smart open source configuration management tools like Chef, Puppet, CFEngine or Genome.
  6. Bundle Just Enough Operating System (JeOS) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data and instance metadata after launch.
  7. Reduce bundling and launch time by booting from Amazon EBS volumes and attaching multiple Amazon EBS volumes to an instance. Create snapshots of common volumes and share snapshots among accounts wherever appropriate.
  8. Application components should not assume the health or location of the hardware they are running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically fail over and start a new clone in case of a failure.
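
As a rough illustration of tactic 6, the sketch below reads the user data passed at launch and turns it into a config dictionary. The metadata URL is the standard EC2 one, but the `key=value` line format and the function names are my own assumptions; user data is just an opaque blob and you can put anything in it. On instances that require session tokens for metadata access, the plain GET here would need an extra step.

```python
import urllib.request

# Standard EC2 endpoint for the user data supplied at launch time.
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url=USER_DATA_URL, timeout=2.0):
    """Return the raw user data as text (only reachable from inside an instance)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_user_data(text):
    """Parse assumed 'key=value' lines; blank lines and '#' comments are skipped."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # ignore lines that are not key=value pairs
        config[key.strip()] = value.strip()
    return config
```

A boot script could then call `parse_user_data(fetch_user_data())` and use values such as a role name or a build bucket to decide what to download and start.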

Private clouds: By Amazon

A few days ago I blogged about how VMware is going to do a huge push into “private clouds” around the VMworld 2009 conference. But little did we know that Amazon had something up its sleeve as well. It announced it today.

AWS now supports the creation of a Virtual Private Cloud with private address space (including RFC 1918 ranges), which can be locked down by a VPN connection so that only your organization can reach it. You still get most of the benefit of Amazon’s cheap hardware pricing, but you get to lock down the infrastructure for security reasons.
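
If you plan to extend your office network into such a private cloud, the first practical question is which address block to use. The sketch below uses Python's standard `ipaddress` module to check whether a candidate block sits inside the RFC 1918 private ranges and whether it collides with networks you already use. The helper names and example blocks are mine, not part of any AWS API.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(cidr):
    """True if the whole IPv4 block lies inside one of the RFC 1918 ranges."""
    network = ipaddress.ip_network(cidr)
    if network.version != 4:
        return False
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


def overlaps_any(cidr, existing_cidrs):
    """True if the block overlaps any network already in use."""
    network = ipaddress.ip_network(cidr)
    return any(
        network.overlaps(ipaddress.ip_network(other)) for other in existing_cidrs
    )
```

For example, `is_rfc1918("10.0.0.0/16")` holds, while a block that overlaps an existing office subnet would be a poor choice for anything you intend to reach over the VPN.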

Regardless of how you see it, this is huge for IT and the developer community. Some may love it, and I’m sure some will be pretty angry at Amazon for trying to commoditize security and making it look as if network security were as simple as that.

With VMware’s announcements next week, there is no doubt in my mind that over the next year at least there will be a significant push towards “private clouds”.

Cloud architecture: Notes from an Amazon talk


Some notes from a talk I attended. I didn’t get time to write them up in detail, but hey, something is better than nothing… right?

Design for failure

        – handle failure
            – use elastic ip addresses
            – use multiple amazon ec2 availability zones
            – create multiple database slaves across multiple zones
            – use real-time monitoring (amazon cloudwatch)
            – use amazon EBS for persistent file system
                – snapshot database to s3 (from ebs)
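
A small sketch of what "handle failure" can mean for the database slaves spread across zones: prefer a healthy slave in your own zone, and fall back to any healthy one elsewhere. The `Replica` type, the names and the zone labels are hypothetical; in practice the health flag would come from monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    zone: str
    healthy: bool


def pick_replica(replicas, preferred_zone):
    """Pick a healthy replica, preferring the caller's own zone."""
    healthy = [replica for replica in replicas if replica.healthy]
    if not healthy:
        raise RuntimeError("no healthy replica in any zone")
    same_zone = [replica for replica in healthy if replica.zone == preferred_zone]
    # Fall back to another zone only when the preferred zone has nothing healthy.
    return (same_zone or healthy)[0]
```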

Loose coupling sets you free

        – independent components
        – design everything as a blackbox
        – de-coupling for hybrid models
        – load-balance clusters
        – use SQS as buffers to queue messages. Allows elasticity
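
The buffering idea can be sketched with Python's standard `queue` module standing in for SQS (an assumption made so the example runs anywhere; it is not the SQS API). Producers only talk to the queue, and the number of workers can change without the producer knowing, which is where the elasticity comes from.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_pipeline(jobs, handler, worker_count=2):
    """Push jobs through a buffer and process them with a pool of workers."""
    buffer = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = buffer.get()
            if job is _STOP:
                return
            outcome = handler(job)
            with results_lock:
                results.append(outcome)

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in workers:
        thread.start()
    for job in jobs:
        buffer.put(job)           # the producer never calls a worker directly
    for _ in workers:
        buffer.put(_STOP)         # one stop signal per worker
    for thread in workers:
        thread.join()
    return results
```

With more than one worker the results come back in no particular order, which is the usual trade-off of putting a queue between components.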

Design for dynamism

        – build for changes in infrastructure 
            – Don’t assume health or fixed location of components
            – Use designs that are resilient to reboot and re-launch
            – Bootstrap your instances
            – Enable dynamic configuration
                – Enable Self discovery
                    (puppet, chef, ?)
            – Free auto-scaling features (by triggers)
            – Use Elastic loadbalancing on multiple layers
            – Use configurations in SimpleDB to bootstrap instances
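
Scaling "by triggers" boils down to a rule like the one sketched below: look at recent CPU readings and decide how many instances you want. The thresholds, limits and the function name are assumptions for illustration; the real service lets you configure its own triggers.

```python
def desired_capacity(current, cpu_samples, low=30.0, high=70.0,
                     minimum=1, maximum=10):
    """Return the instance count suggested by average CPU utilisation (percent)."""
    if not cpu_samples:
        return current  # no data, so change nothing
    average = sum(cpu_samples) / len(cpu_samples)
    if average > high:
        target = current + 1   # scale out by one instance
    elif average < low:
        target = current - 1   # scale in by one instance
    else:
        target = current
    # Never go below the minimum or above the maximum group size.
    return max(minimum, min(maximum, target))
```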

Build security in every layer

        – Physical is free
        – network is easy
            – Can configure app to talk to only web and db layer… etc. Everything can be automated.
        – The rest can be added
            – Create distinct Security Groups for each Amazon EC2 cluster
            – Use group-based rules for controlling access between layers
            – Restrict external access to specific IP ranges
            – Encrypt data "at-rest" in Amazon S3
            – Encrypt data "in-transit" (SSL)
            – Consider encrypted file systems in EC2 for sensitive data
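
Group-based rules between layers are easy to reason about once written down as data. The sketch below models a hypothetical three-tier setup: each group lists who may reach it and on which port, where a source is either another group name or an IP range. The group names, ports and ranges are made up, and this is a model of the idea rather than the EC2 API.

```python
import ipaddress

# Hypothetical rules: target group -> list of (allowed source, port).
RULES = {
    "web": [("0.0.0.0/0", 443), ("198.51.100.0/24", 22)],
    "app": [("web", 8080)],
    "db": [("app", 3306)],
}


def is_allowed(rules, target_group, port, source_group=None, source_ip=None):
    """True if traffic to target_group on port is permitted from the source."""
    for allowed_source, allowed_port in rules.get(target_group, []):
        if allowed_port != port:
            continue
        if source_group is not None and allowed_source == source_group:
            return True
        # Entries containing '/' are treated as IP ranges in CIDR notation.
        if source_ip is not None and "/" in allowed_source:
            network = ipaddress.ip_network(allowed_source)
            if ipaddress.ip_address(source_ip) in network:
                return True
    return False
```

Here the db layer only accepts connections from the app group, and SSH to the web layer is limited to one external range, matching the bullets above.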

Don’t fear constraints

        – More RAM ?
            Distribute load across machines. Shared distributed cache
        – Better IOPS on my database ?
            Multiple read-only replicas / sharding / DB clustering
        – Your server has better config ?
            Implement elasticity
        – Static IP ?
            Boot script for software reconfiguration from SimpleDB
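
For the sharding and distributed cache answers, the simplest scheme is to hash each key and map it to one of the machines. The sketch below is one such approach under my own assumptions (node names and the function name are hypothetical); it is not something the talk prescribed.

```python
import hashlib


def shard_for(key, shards):
    """Map a string key to one of the shards using a stable hash."""
    if not shards:
        raise ValueError("at least one shard is required")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

The same key always lands on the same shard, but note that plain modulo hashing moves most keys when the number of shards changes; consistent hashing is the usual alternative when machines come and go often.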


Leverage aws storage solutions

        – Amazon S3: for large static objects (what’s the maximum size per object?)
        – Amazon CloudFront: content distribution
        – Amazon SimpleDB: simple data indexing/querying
        – Amazon EC2 local disk drive: transient data
        – Amazon EBS: RDBMS persistent storage + S3 Snapshots