Is Yahoo launching a cloud storage solution: MObStor?

While the rest of the world is busy with Microsoft and Google, Yahoo might be preparing to launch MObStor, which they tout as the “Unstructured Storage for the Internet”.

While comparing MObStor to the various cloud storage solutions already available, Navneet Joneja, Sr. Product Manager, mentions Facebook’s Haystack to describe MObStor’s architectural design. He also points out that while Facebook’s Haystack was optimized to store photographs, MObStor was optimized for a diverse set of use cases.

It’s a REST-based, browser-accessible API with a simple security model and content-agnostic storage features. The focus of this service seems to be fast, reliable, secure storage, with the option of allowing customers to layer additional services on top of the core service. It claims to be optimized for high performance and high availability (who doesn’t?).

Here is more from the Yahoo Developer Network Blog:

Facebook’s Haystack is based on commodity storage. While MObStor does support commodity storage, it doesn’t require it. Instead, we have a storage-layer abstraction we call the ObjectStore. The ObjectStore encapsulates the key storage operations we need to perform, and allows us to have many underlying physical object stores. This allows us to mix, for example, filer-based storage with commodity storage. The upper layers have the routing intelligence that determines which ObjectStore a given piece of data is stored in. However, like Haystack, we do support high request rates using our own optimized ObjectStore written to run on commodity hardware – with one important difference. While Haystack identifies every object using a 64-bit photo key, all objects in MObStor are accessible through logical (i.e., client-supplied) URLs, not object IDs.

In MObStor, the storage layer maintains the mapping between logical URLs and physical storage, and can use any means to do so – the implementation is encapsulated within the storage layer. Needless to say, this operation is a potential performance bottleneck, so we’ve carefully optimized the algorithms used and the hardware that they run on.
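To make the quoted design a little more concrete, here is a minimal sketch of a storage-layer abstraction with a routing layer on top. It is not Yahoo's code: the class names (`ObjectStore`, `InMemoryStore`, `Router`), the `put`/`get` methods, and the prefix-based routing rule are all assumptions made for illustration. The only ideas taken from the quote are that objects are addressed by client-supplied logical URLs and that an upper layer decides which physical store holds them.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical storage-layer interface: opaque bytes keyed by logical URL."""

    @abstractmethod
    def put(self, url, data):
        ...

    @abstractmethod
    def get(self, url):
        ...


class InMemoryStore(ObjectStore):
    """Stand-in for one physical backend (filer-based, commodity, and so on)."""

    def __init__(self, name):
        self.name = name
        self._blobs = {}  # logical URL -> bytes

    def put(self, url, data):
        self._blobs[url] = data

    def get(self, url):
        return self._blobs[url]


class Router:
    """Upper layer: picks the ObjectStore for a logical URL (assumed: by prefix)."""

    def __init__(self, default):
        self._default = default
        self._routes = []  # list of (prefix, store)

    def add_route(self, prefix, store):
        self._routes.append((prefix, store))
        # Longest prefix first, so the most specific route wins.
        self._routes.sort(key=lambda route: len(route[0]), reverse=True)

    def store_for(self, url):
        for prefix, store in self._routes:
            if url.startswith(prefix):
                return store
        return self._default

    def put(self, url, data):
        self.store_for(url).put(url, data)

    def get(self, url):
        return self.store_for(url).get(url)


if __name__ == "__main__":
    filer = InMemoryStore("filer")
    commodity = InMemoryStore("commodity")
    router = Router(default=commodity)
    router.add_route("/photos/", filer)
    router.put("/photos/cat.jpg", b"jpeg-bytes")
    print(router.store_for("/photos/cat.jpg").name)
```

Clients only ever see the logical URL; swapping or mixing backends is a change to the routing table, not to the callers.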

Now, with Amazon, Google, Microsoft and Yahoo in the picture, the last shoe might finally drop.

Cloud architecture: Notes from an Amazon talk


Some notes from a talk I attended. I didn’t get time to write them up in detail. But hey, something is better than nothing… right?

Design for failure

        – handle failure
            – use Elastic IP addresses
            – use multiple Amazon EC2 availability zones
            – create multiple database slaves across multiple zones
            – use real-time monitoring (Amazon CloudWatch)
            – use Amazon EBS for persistent file system
                – snapshot database to S3 (from EBS)
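A rough sketch of what "handle failure" across zones can look like in application code: try each replica in turn and fall through to the next zone when one is unreachable. The function name `read_with_failover`, the zone labels, and the fake replica callables are assumptions for illustration; a real setup would call an actual database client here.

```python
def read_with_failover(replicas, query):
    """Try each (zone, fetch) pair in order; return the first successful result.

    `replicas` is a list of (zone_name, callable) pairs. Each callable takes
    the query and either returns a result or raises ConnectionError.
    """
    errors = []
    for zone, fetch in replicas:
        try:
            return zone, fetch(query)
        except ConnectionError as exc:
            # Remember the failure and move on to the next zone.
            errors.append((zone, str(exc)))
    raise RuntimeError("all replicas failed: %r" % (errors,))


def broken_replica(query):
    # Simulates a replica in a zone that is currently down.
    raise ConnectionError("zone unreachable")


def healthy_replica(query):
    # Simulates a working read replica.
    return "result for " + query


if __name__ == "__main__":
    replicas = [("zone-a", broken_replica), ("zone-b", healthy_replica)]
    print(read_with_failover(replicas, "SELECT 1"))
```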

Loose coupling sets you free

        – independent components
        – design everything as a blackbox
        – de-coupling for hybrid models
        – load-balance clusters
        – use SQS as buffers to queue messages. Allows elasticity
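The queue-as-buffer idea can be sketched without any AWS dependency. Below, Python's standard `queue.Queue` stands in for SQS (that substitution is an assumption for illustration, and the function names are made up). The point is that the producer never talks to a worker directly, so the number of workers can change without the producer knowing.

```python
import queue
import threading


def producer(buf, jobs):
    """Push jobs into the buffer; knows nothing about who consumes them."""
    for job in jobs:
        buf.put(job)


def worker(buf, results):
    """Pull jobs until a None sentinel arrives, then stop."""
    while True:
        job = buf.get()
        if job is None:
            break
        results.append(job.upper())  # placeholder for real work


def run(jobs, n_workers):
    """Process all jobs with n_workers consumers and return sorted results."""
    buf = queue.Queue()
    results = []
    workers = [
        threading.Thread(target=worker, args=(buf, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    producer(buf, jobs)
    for _ in workers:
        buf.put(None)  # one stop signal per worker
    for w in workers:
        w.join()
    return sorted(results)


if __name__ == "__main__":
    print(run(["resize", "encode", "upload"], n_workers=2))
```

Scaling out is then just a matter of changing `n_workers` (or, with a real queue service, launching more consumer instances).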

Design for dynamism

        – build for changes in infrastructure 
            – Don’t assume health or fixed location of components
            – Use designs that are resilient to reboot and re-launch
            – Bootstrap your instances
            – Enable dynamic configuration
                – Enable Self discovery
                    (Puppet, Chef, ?)
            – Free auto-scaling features (by triggers)
            – Use Elastic Load Balancing on multiple layers
            – Use configurations in SimpleDB to bootstrap instances
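As a sketch of bootstrapping from stored configuration: an instance learns its role at boot, looks up that role's settings in a central store, and layers them over shared defaults. Here a plain dictionary stands in for a SimpleDB domain, and the role names, keys, and the `bootstrap` function are all hypothetical.

```python
# Stand-in for a central configuration store such as a SimpleDB domain.
CONFIG_STORE = {
    "defaults": {"log_level": "info", "port": "8080"},
    "web": {"packages": "nginx", "upstream_role": "app"},
    "app": {"packages": "python3", "port": "9000"},
}


def bootstrap(role, store=CONFIG_STORE):
    """Return the effective config for an instance of the given role.

    Role-specific values override the shared defaults, so a freshly
    launched instance needs to know only its role.
    """
    if role == "defaults" or role not in store:
        raise KeyError("unknown role: %s" % role)
    config = dict(store["defaults"])
    config.update(store[role])
    config["role"] = role
    return config


if __name__ == "__main__":
    print(bootstrap("app"))
```

Because nothing is baked into the machine image, a rebooted or relaunched instance picks up the current configuration the same way a new one does.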

Build security in every layer

        – Physical is free
        – network is easy
            – Can configure app to talk to only web and db layer… etc. Everything can be automated.
        – The rest can be added
            – Create distinct Security Groups for each Amazon EC2 cluster
            – Use group-based rules for controlling access between layers
            – Restrict external access to specific IP ranges
            – Encrypt data "at-rest" in Amazon S3
            – Encrypt data "in-transit" (SSL)
            – Consider encrypted file systems in EC2 for sensitive data
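The group-based rules above can be modelled as a small table: each layer accepts traffic only from a named group or from a specific IP range. This is a toy model for reasoning about the rules, not the EC2 API; the group names, ports, and the `is_allowed` helper are assumptions, and the IP range is a documentation-only example block.

```python
import ipaddress

# (destination group, port, allowed source: a group name or a CIDR range)
RULES = [
    ("web", 443, "203.0.113.0/24"),  # external access limited to one range
    ("app", 8080, "web"),            # only the web layer may reach app
    ("db", 3306, "app"),             # only the app layer may reach db
]


def is_allowed(dest_group, port, source_group=None, source_ip=None):
    """Return True if any rule permits this source to reach dest_group:port."""
    for group, rule_port, source in RULES:
        if group != dest_group or rule_port != port:
            continue
        if source_group is not None and source == source_group:
            return True
        if source_ip is not None and "/" in source:
            if ipaddress.ip_address(source_ip) in ipaddress.ip_network(source):
                return True
    return False


if __name__ == "__main__":
    print(is_allowed("db", 3306, source_group="app"))
    print(is_allowed("db", 3306, source_group="web"))
```

Anything not explicitly listed is denied, which is the useful default when each cluster gets its own group.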

Don’t fear constraints

        – More RAM ?
            Distribute load across machines. Shared distributed cache
        – Better IOPS on my database ?
            Multiple read-only / sharding / DB clustering
        – Your server has better config ?
            Implement elasticity
        – Static IP ?
            Boot script for software reconfiguration from SimpleDB
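Of the options above, sharding is the easiest to show in a few lines. A minimal sketch, assuming simple hash-based placement: the shard names and the `shard_for` helper are hypothetical, and a stable hash (SHA-256 here) is used so that every machine maps the same key to the same shard.

```python
import hashlib

# Hypothetical shard names; in practice these would be database endpoints.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]


def shard_for(key, shards=SHARDS):
    """Map a key to one shard using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for user in ("user:1", "user:2", "user:3"):
        print(user, "->", shard_for(user))
```

One caveat with plain modulo placement: changing the number of shards remaps most keys, so it suits a fixed shard count better than a frequently resized one.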


Leverage AWS storage solutions

        – Amazon S3: for large static objects (what’s the maximum size per object?)
        – Amazon CloudFront: content distribution
        – Amazon SimpleDB: simple data indexing/querying
        – Amazon EC2 local disk drive: transient data
        – Amazon EBS: RDBMS persistent storage + S3 Snapshots