More on Amazon S3 versioning (webinar)

If you missed the AWS S3 versioning webcast, I have a copy of the video here. Here are the highlights:


  • You can enable and disable versioning at the bucket level.
  • They don't think there is a performance penalty for turning on versioning (though it seems obvious that S3 has to do slightly more work to figure out which is the latest version of any object you have).
  • There isn't any additional cost for the versioning feature itself, but you do have to pay for the storage of each extra copy of an object.
  • MFA (multi-factor authentication) for deleting objects is not mandatory when versioning is turned on; it has to be enabled separately. This was slightly confusing in the original email I got from AWS.
  • If you are planning to use this, please watch the video. There is a part where they explain what happens if you disable versioning after using the feature, which is something you will want to know about.
  • They use a GUID as the version ID for each object.
  • You can iterate over objects and figure out how many versions you have for each object, but currently it's not possible to find all objects which have versions older than a given date. This is important if you are planning to do garbage collection (cleaning up older copies of data) at a later time; see the sketch after this list.
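For illustration, here is a minimal sketch of that kind of version enumeration using boto3, a current AWS SDK for Python that post-dates this post (the bucket name is a placeholder):

```python
import boto3
from collections import defaultdict

s3 = boto3.client("s3")

# Count how many versions exist for each key in a bucket.
# Note: there is no server-side filter for "versions older than date X",
# so the client has to walk the full version listing itself.
version_counts = defaultdict(int)
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket="my-example-bucket"):  # hypothetical bucket
    for version in page.get("Versions", []):
        version_counts[version["Key"]] += 1

for key, count in sorted(version_counts.items()):
    print(f"{key}: {count} version(s)")
```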


Scaling updates for Feb 10, 2010

Lots of interesting updates today.

But I would like to first mention the fantastic work the cloud computing group at UCSB is doing to make the App Engine framework more open. They have done significant work making AppScale "work" with different kinds of data sources including HBase, Cassandra, Voldemort, MongoDB, Hypertable, MySQL and MemcacheDB. AppScale is actively looking for folks interested in working with them to make it stable and production ready.

  • GAE 1.3.1 released: I think the biggest news about this release is the fact that the 1,000-row limit has now been removed. You still have to deal with the 30-second processing limit per HTTP request, but at least the row limit is gone. They have also introduced support for automatic, transparent datastore API retries for most operations. This should dramatically increase the reliability of datastore queries and reduce the amount of work developers have to do to build this auto-retry logic themselves (a generic sketch of that kind of logic follows this list).
  • ElasticSearch is a Lucene-based indexing product which seems to do what Solr does, with the difference that it can scale across multiple servers. Very interesting product. I'm going to try this out soon.
  • MemcacheDB: A distributed key-value store which is designed to be persistent. It uses the memcached protocol, but it's actually a datastore (built on Berkeley DB) rather than a cache.
  • Nasuni seems to have come up with NAS software which uses cloud storage as the persistent datastore. It can cache data locally for faster access to frequently used files.
  • The folks at Flickr have two interesting posts you should glance over. “Using, Abusing and Scaling MySQL at Flickr” seems to be the first in a series of posts about how Flickr scales using MySQL. The next one in the series is “Ticket Servers: Distributed Unique Primary Keys on the Cheap”.
  • Finally, a fireside chat by Mike Schroepfer, VP of Engineering, about Scaling Facebook.
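For context, the auto-retry logic developers previously had to hand-roll looks roughly like this generic sketch. This is not the GAE API; the exception type and query call are hypothetical placeholders.

```python
import random
import time


class TransientDatastoreError(Exception):
    """Placeholder for whatever transient error the datastore client raises."""


def with_retries(operation, attempts=4, base_delay=0.1):
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDatastoreError:
            if attempt == attempts - 1:
                raise
            # Back off exponentially, with a little jitter, before retrying.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.05))


# Hypothetical usage:
# results = with_retries(lambda: run_query("SELECT * FROM Greeting"))
```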

Versioning data in S3 on AWS

One of the problems with Amazon's S3 was the inability to take a "snapshot" of the state of S3 at any given moment. This is one of the most important DR (disaster recovery) steps of any major upgrade which could potentially corrupt data during a release. Until now, applications using S3 had to manage versioning of data themselves, but Amazon has launched a versioning feature built into S3 itself to do this particular task. In addition, they have added the option to require that delete operations on versioned data can only be done using MFA (multi-factor authentication).

Versioning allows you to preserve, retrieve, and restore every version of every object in an Amazon S3 bucket. Once you enable Versioning for a bucket, Amazon S3 preserves existing objects any time you perform a PUT, POST, COPY, or DELETE operation on them. By default, GET requests will retrieve the most recently written version. Older versions of an overwritten or deleted object can be retrieved by specifying a version in the request.
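As a rough illustration of that behavior using boto3, a later AWS SDK for Python (bucket and key names are placeholders): a plain GET returns the latest version, while passing an explicit version ID retrieves an older copy.

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "reports/daily.csv"  # hypothetical names

# Default GET: returns the most recently written version.
latest = s3.get_object(Bucket=bucket, Key=key)

# To read an older copy, pick a non-latest version ID from the listing
# and pass it explicitly.
versions = s3.list_object_versions(Bucket=bucket, Prefix=key)["Versions"]
older = next(v for v in versions if not v["IsLatest"])
previous = s3.get_object(Bucket=bucket, Key=key, VersionId=older["VersionId"])

print(latest["VersionId"], previous["VersionId"])
```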

The way the AWS Blog describes the feature, it looks like a new version is created every time an object is modified, so each object in S3 could have a different number of copies depending on the number of times it was modified.

This reminds me of SVN/CVS-like version control systems, and I wonder how long it will take for someone to build a source code versioning system on top of S3.

BTW, data requests to a versioned object are priced the same way as regular requests, which basically means you are getting this feature for free (apart from the storage for the extra copies).


Cloud: Agility vs Security

Networking devices on the edges have become smarter over time. So have the firewalls and switches used internally within the networks. Whether we like it or not, web applications over time have grown to depend on them.

It's impossible to build a flawless product, which is why it's standard practice to disable all unused services on a server. Most organizations today try to follow an n-tier approach, creating different logical security zones with the core asset inside the most secure zone. The objective is to make it difficult for an attacker to get to the core asset without breaching multiple sets of firewalls.

Frequent system patching, auditing file system permissions and setting up intrusion detection (host- or network-based) are some of the other mundane ways of keeping web applications safe from attacks.

Though the cloud has made deployment of on-demand infrastructure simpler, it's hard to build a walled garden around a customer's cluster of servers on the cloud in an efficient way anymore. The absence of such walled gardens and logical security zones means there are more points of entry into the infrastructure which could be exploited. If you replace 10 powerful internal servers with 100 small servers on the cloud, all of a sudden you might have to worry about protecting 100 individual servers instead of a couple of edge devices. In the worst case, one weak server in the cluster could expose the entire cluster to an attacker. Here are a few other things to think about…

  • Host-based firewalls should allow only traffic which is required/expected
  • Non-essential services should be shut off on the server
  • Some kind of intrusion detection might be important to have
  • Keys/passwords should be changed periodically
  • System patches (updated OS images) need to be applied periodically
  • Authenticate/authorize all inter-server communication (one simple approach is sketched after this list)
  • Maintain an audit trail for all changes to images/servers if possible
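On the inter-server authentication point, here is a minimal sketch of one common approach: signing each request with a shared secret using HMAC. The header names and secret handling are hypothetical; in practice you would combine this with TLS and proper key rotation.

```python
import hashlib
import hmac
import time

SHARED_SECRET = b"rotate-me-periodically"  # hypothetical secret, stored securely in practice


def sign_request(body: bytes) -> dict:
    """Produce headers a calling server attaches to an outgoing request."""
    timestamp = str(int(time.time()))
    digest = hmac.new(SHARED_SECRET, timestamp.encode() + body, hashlib.sha256).hexdigest()
    return {"X-Timestamp": timestamp, "X-Signature": digest}


def verify_request(body: bytes, headers: dict, max_skew: int = 300) -> bool:
    """Receiving server checks the signature and rejects stale requests."""
    timestamp = headers.get("X-Timestamp", "")
    if not timestamp or abs(time.time() - int(timestamp)) > max_skew:
        return False
    expected = hmac.new(SHARED_SECRET, timestamp.encode() + body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers.get("X-Signature", ""))


if __name__ == "__main__":
    payload = b'{"action": "rebalance"}'
    headers = sign_request(payload)
    print(verify_request(payload, headers))  # True
```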

An organization which is completely on the cloud may not have an IT department in its current form, but it might still have an operations team which sets the security policies, updates OS images, manages billing, monitors system health (and IDS) and trains developers to do things the right way.

If your infrastructure is on the cloud, do write back with a note about what you do to protect your applications.

Image source: AMagill

Windows Azure

Windows Azure is an application platform provided by Microsoft to allow others to run applications on Microsoft's "cloud" infrastructure. It's finally open for business (as of Feb 1, 2010). Below are some links about Azure for those who are still catching up.

Wikipedia: Windows Azure has three core components: Compute, Storage and Fabric. As the names suggest, Compute provides a computation environment with Web Roles and Worker Roles, while Storage focuses on providing scalable storage (Blobs, Tables, Queues) for large-scale needs.

The hosting environment of Windows Azure is called the Fabric Controller – which pools individual systems into a network that automatically manages resources, load balancing, geo-replication and application lifecycle without requiring the hosted apps to explicitly deal with those requirements.[3] In addition, it also provides other services that most applications require — such as the Windows Azure Storage Service that provides applications with the capability to store unstructured data such as binary large objects, queues and non-relational tables.[3] Applications can also use other services that are a part of the Azure Services Platform.

The real concerns about Cloud infrastructure (as it is today)

While "private clouds may not be the future" they are definitely needed today. Here are some of the top issues bothering organizations which have been thinking about moving to the cloud. Some of these issues are based on Craig Bolding's talk, "Guide to cloud security".

  • Unlike your own data center, you will never know what the cloud vendors are running, how they back up, or what their DR plans are. They will say you shouldn't care, but do you remember what happened to the T-Mobile customers on Danger?
  • Uptime, availability and responsiveness are less predictable than in a self-hosted environment. In most cases the cloud vendors may not even choose to let customers know about major maintenance if they don't anticipate any issues. Organizations which manage their own infrastructure would always try to avoid making two major, interdependent changes at the same time.
  • Multi-Tenancy means you may have to worry about a noisy neighbor.
  • Multi-tenancy could also lead to interesting issues which were never thought about before. What if there were a way to do an "injection attack"? Depending on how multi-tenancy is implemented, you could potentially touch other customers' data.
  • Infrastructure and platform lock-in issues are worrying for many organizations who are thinking long term. Most cloud vendors don’t really have a long history to show their track record.
  • Change control and detailed change logs are missing.
  • Individual customers don't have much decision-making power over what a vendor should do next. In a privately hosted environment the stakeholders are asked before something is done, but in a larger infrastructure you are a small fish in a huge pond.
  • Most cloud vendors have multiple layers of cloud infrastructure dependent on each other. It's hard to understand how issues around one type of cloud could impact the others. This is especially true from a security viewpoint: a bad flaw in a lower layer of the architecture could impact all the other platforms built over it.
  • Moving applications to cloud means dealing with a different style of programming designed for horizontal scalability, data consistency issues, health monitoring, load balancing, managing state, etc.
  • Identity management is still in its early stages. Integration with corporate identity management infrastructure would be important to make life easy for individuals from large organizations using external clouds.
  • Who takes care of scrubbing disks when data is moved around? What about data on backup tapes? This is very important for applications handling highly sensitive data.
  • Just like credit card fraud, one has to worry about CPU-time fraud. Is the current billing and reporting good enough to help large organizations figure out what is real and what could be fraud? They need a real-time fraud detection mechanism. And what about loss of service due to DoS attacks? Who pays for that?
  • Need a better mechanism to bill large corporations.
  • On the non-technical side, there are a lot of questions related to SLAs, compliance issues, terms of service, legal issues around cross-border services, and even questions about whether law enforcement has a different set of rules when search and seizure is required.
  • Not too far from being another form of “outsourcing”.

Photo credit: akakumo

AppScale, an OpenSource GAE implementation

If you don't like EC2, you have the option to move your app to a new vendor. But if you don't like GAE (Google App Engine), there aren't any solutions which can replace GAE easily.

AppScale might change that.

AppScale is an open-source implementation of the Google AppEngine (GAE) cloud computing interface from the RACELab at UC Santa Barbara. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.

The list of supported infrastructures is very impressive. However, the key, in my personal opinion, will be stability and compatibility with the current GAE APIs.

Learn more about AppScale:

  1. AppScale Home page
  2. Google Code page
  3. Google Group for AppScale
  4. Demo at Bay Area GAE Developers meeting at Googleplex (Feb 10, 2010)

Private clouds not the future?

James Hamilton is one of the leaders in this industry and has written a very thought-provoking post about private clouds not being the future. This is what he said about private clouds compared to existing non-cloud solutions.

  • A fix, not the future (reference to an InformationWeek post)
  • Runs at lower utilization levels
  • Consumes more power
  • Less efficient environmentally
  • Runs at higher costs

Though I agree with most of his comments, I'm not convinced by the generalization of the conclusions. In particular: what is the maximum number of servers one needs to own before outsourcing becomes a liability? I suspect this is not a very high number today, but it will grow over time.

Hardware costs: The scale at which Amazon buys infrastructure is just mind-boggling, but organizations buying in bulk could get pretty good deals from those same vendors as well. It's not clear to me how many servers one has to buy to get the kind of discounts Amazon gets.

Utilization levels: Cloud providers optimize utilization by making sure all the servers are being used all the time. It's also important to remember that, because they are trying to maximize utilization, they don't buy servers for all of their customers up front when those customers sign up.

At scale, with high customer diversity, a wonderful property emerges: non-correlated peaks. Whereas each company has to provision to support their peak workload, when running in a shared cloud the peaks and valleys smooth. The retail market peaks in November, taxation in April, some financial business peak on quarter ends and many of these workloads have many cycles overlaid some daily, some weekly, some yearly and some event specific. For example, the death of Michael Jackson drove heavy workloads in some domains but had zero impact in others.
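A toy illustration of that effect, with made-up monthly demand numbers: provisioning each workload separately means buying for the sum of the individual peaks, while a shared pool only has to cover the peak of the combined load.

```python
# Hypothetical monthly server demand for three customers whose peaks do not coincide.
retail   = [20, 20, 20, 20, 20, 20, 20, 20, 30, 40, 90, 70]  # peaks in November
taxation = [30, 40, 60, 90, 20, 20, 20, 20, 20, 20, 20, 20]  # peaks in April
finance  = [30, 30, 70, 30, 30, 70, 30, 30, 70, 30, 30, 70]  # quarter-end peaks

# Provisioning separately: each customer buys for their own peak.
separate_capacity = max(retail) + max(taxation) + max(finance)

# Shared cloud: capacity only has to cover the peak of the combined demand.
combined = [r + t + f for r, t, f in zip(retail, taxation, finance)]
shared_capacity = max(combined)

print(separate_capacity)  # 250 servers if everyone provisions alone
print(shared_capacity)    # noticeably fewer, because the peaks don't line up
```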

This is something which bothers enterprise IT departments everywhere when they are building private clouds. Can they get away with buying fewer servers than the organization really needs, and at times say "no" to some departments when they run out of computing power? It's hard to beat the scale of shared clouds.

The other reason why utilization levels are low in private clouds is that most organizations don't have computationally intensive batch jobs which could run while the servers are otherwise idle. On Amazon one could even bid for a lower price on unused EC2 resources.

This is a tough problem and I don’t think private clouds can outperform shared clouds.

Power usage: Inefficient cooling and power-conversion losses can quickly make hosting infrastructure more expensive. Having domain experts definitely helps, and that's not something smaller organizations can easily afford.

Platform: There aren't any stable, proven internal cloud infrastructure platforms which come cheap. VMware's ROI calculator might claim it's cheap, but I'm not convinced yet. The Xen/KVM options look very stable, but they don't come with decent management tools. In short, there is a lot of work which needs to be done just to pick a platform.

A private Hadoop cluster is still cloud infrastructure. A lot of organizations are now switching to similar batch-processing clouds which can be shared for different kinds of jobs. And there are still others who could decide to invest in smarter deployment and automation scripts to fully utilize their private infrastructure without using virtualization.

Overhead of the shared cloud: The larger an organization is, the more difficult it is for it to migrate to a shared cloud. In fact, migrating an existing live RDBMS-based application to the cloud would be impossible without significant resources to re-architect the whole application and datastore. These organizations also have extensive, well-tested security policies and guidelines in place, all of which would have to be thrown to the dogs if they have to put their data on a public network over which they have no control. But I do believe this is a temporary problem which will be resolved over time in favor of shared clouds.

Cost: Cloud infrastructure providers are not non-profit organizations. Though they are here to make money, they will still be significantly cheaper for many. But do your homework and make sure you and your management team are OK with giving up infrastructure control for the cost savings.

That being said, here are my predictions for next couple of years.

  1. Expect to see more non-virtualized application clouds in the enterprise.
  2. Expect the shared cloud providers to get even more cost-effective over time as competition increases.
  3. Expect more open-source initiatives to build tools which manage private cloud infrastructures.
  4. Expect more interesting tools which give end users the ability to visualize the actual cost of the resources they are using. Making costs more transparent could guide developers to design smarter applications.

Cassandra for service registry/discovery service

My last post was about my struggle to find a good distributed ESB/Service-discovery solution built over open source tools which was simple to use and maintain. Thanks to reader comments (Dan especially) and some other email exchanges, it seems like building a custom solution is unavoidable if I really want to keep things simple.

Dan suggested that I could use DNS to find seed locations for the config store, which would work very well in a distributed network. If security weren't a concern, this seed location could have been on S3 or SimpleDB, but the requirement that it be secured on internal infrastructure forced me to investigate simple replicated/eventually-consistent databases which could be hosted internally in different data centers with little or no long-term administration cost.
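As a small sketch of the DNS idea (the hostname is hypothetical): publish the seed nodes behind one internal DNS record and have every service resolve it at startup to discover the config-store hosts.

```python
import socket

SEED_DNS_NAME = "config-seeds.internal.example.com"  # hypothetical internal record


def discover_seed_hosts(name=SEED_DNS_NAME):
    """Resolve the DNS name to the list of seed node IPs.

    Operations can add or remove seed nodes by editing the DNS record,
    without redeploying any of the services that depend on it.
    """
    _, _, addresses = socket.gethostbyname_ex(name)
    return addresses


if __name__ == "__main__":
    print(discover_seed_hosts())
```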

My search led me to investigate a few different NoSQL options.

But the one I finally settled on as a possible candidate was Cassandra. Since our application platform is based on Java, Cassandra, unlike some of the others, was simple to install and set up. The fact that Facebook uses it to store 50 TB of data across 150 servers helped convince us it was stable as well.

The documentation on this project isn't as extensive as I would have liked, but I did get it running pretty fast. Building a service registry/discovery service on top of it is what's next on my mind.

More on Cassandra

If you are interested in learning more about Cassandra, I recommend listening to this talk by Avinash Lakshman (Facebook) and reading a few of the other posts listed here.

Cassandra: Articles

  • Cassandra — Getting Started: Cassandra data model from a Java perspective

  • Using Cassandra’s Thrift interface with Ruby

  • Cassandra and Thrift on OS X: one, two, three

  • Looking to the Future with Cassandra: how Digg migrated their friends+diggs data set to Cassandra from MySQL

  • Building Scalable Databases: Denormalization, the NoSQL Movement and Digg

  • WTF is a SuperColumn? An Introduction to the Cassandra Data Model

  • Meet Scalandra: Scala wrapper for Cassandra

  • Cassandra and Ruby: A Love Affair? – Engine Yard’s walk-through of the Cassandra gem

  • Up and Running with Cassandra: featuring data model examples of a Twitter clone and a multi-user blog, and ruby client code

  • Facebook Engineering notes and Cassandra introduction and LADIS 2009 paper

  • ArchitectureInternals

  • ArchitectureGossip

Cassandra: Presentations

  • Cassandra in Production at Digg from NoSQL East 09

  • Introduction to Cassandra at OSCON 09

  • What Every Developer Should Know About Database Scalability: presentation on RDBMS vs. Dynamo, BigTable, and Cassandra

  • IBM Research’s scalable mail storage on Cassandra

  • NOSQL Video and NOSQL Slides: More on Cassandra internals from Avinash Lakshman.

  • Video of a presentation about Cassandra at Facebook: covers the data model of Facebook’s inbox search and a lot of implementation details. Prashant Malik and Avinash Lakshman presenting.

  • Cassandra presentation at sigmod: mostly the same slides as above

If any of you have worked with Cassandra, please let me know how that has been working out for you.