The real concerns about Cloud infrastructure (as it is today)

While “private clouds may not be the future”, they are definitely needed today. Here are some of the top issues bothering organizations that have been thinking about going into the cloud. Some of these issues are based on Craig Bolding’s talk, “Guide to cloud security”.

  • Unlike your own data center, you will never know what the cloud vendors are running, how they back up, or what their DR plans are. They will say you shouldn’t care, but do you remember what happened to the T-Mobile customers on Danger?
  • Uptime, availability, and responsiveness are less predictable than in a self-hosted environment. In most cases the cloud vendors may not even choose to let customers know about major maintenance if they don’t anticipate any issues. Organizations that manage their own infrastructure will always try to avoid making two major changes with interdependencies at the same time.
  • Multi-tenancy means you may have to worry about a noisy neighbor.
  • Multi-tenancy could also lead to interesting issues which were never thought about before. What if there were a way to do an “injection attack”? Depending on how multi-tenancy is implemented, you could potentially touch other customers’ data.
  • Infrastructure and platform lock-in issues are worrying for many organizations that are thinking long term. Most cloud vendors don’t have a long enough history to show a track record.
  • Change control and detailed change logs are missing.
  • Individual customers don’t have much decision-making power over what a vendor should do next. In a privately hosted environment the stakeholders are asked before something is done, but on a large shared infrastructure, you are a small fish in a huge pond.
  • Most cloud vendors have multiple layers of cloud infrastructure dependent on each other. It’s hard to understand how issues in one layer could impact the others. This is especially true from a security viewpoint: a bad flaw in a lower layer of the architecture could impact all the platforms built over it.
  • Moving applications to the cloud means dealing with a different style of programming: designing for horizontal scalability, data consistency issues, health monitoring, load balancing, managing state, etc.
  • Identity management is still in its early stages. Integration with corporate identity management infrastructure will be important to make external clouds easy to use for individuals from large organizations.
  • Who takes care of scrubbing disks when data is moved around? What about data on backup tapes? This is very important for applications handling highly sensitive data.
  • Just like credit card fraud, one has to worry about CPU-time fraud. Is the current billing and reporting good enough to help large organizations figure out what is real usage and what could be fraud? They need a real-time fraud detection mechanism. And what about loss of service due to DoS attacks? Who pays for that?
  • Need a better mechanism to bill large corporations.
  • On the non-technical side, there are a lot of questions related to SLAs, compliance issues, terms of service, legal issues around cross-border services, and even questions about whether law enforcement has a different set of rules when search and seizure is required.
  • Not too far from being another form of “outsourcing”.


Private clouds not the future?

James Hamilton is one of the leaders in this industry and has written a very thought-provoking post about private clouds not being the future. This is what he said about private clouds compared to existing non-cloud solutions:

  • A fix, not the future (a reference to an InformationWeek post)
  • Runs at lower utilization levels
  • Consumes more power
  • Less efficient environmentally
  • Runs at higher costs

Though I agree with most of his comments, I’m not convinced by the generalization of his conclusions. In particular: what is the maximum number of servers one needs to own, beyond which outsourcing becomes a liability? I suspect this number is not very high today, but it will grow over time.

Hardware costs: The scale at which Amazon buys infrastructure is just mind-boggling, but organizations buying in bulk could get a pretty good deal from those same vendors as well. It’s not clear to me how many servers one has to buy to get discounts like Amazon’s.

Utilization levels: Cloud providers optimize utilization by making sure all the servers are being used all the time. It’s also important to remember that, because they are trying to maximize utilization, they don’t buy all the servers for all of their customers up front when those customers sign up.

At scale, with high customer diversity, a wonderful property emerges: non-correlated peaks. Whereas each company has to provision to support its peak workload, when running in a shared cloud the peaks and valleys smooth out. The retail market peaks in November, taxation in April, some financial businesses peak at quarter end, and many of these workloads have many cycles overlaid: some daily, some weekly, some yearly, and some event-specific. For example, the death of Michael Jackson drove heavy workloads in some domains but had zero impact on others.
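
A toy simulation makes the effect concrete. This is my own sketch (in Python), not something from the post above: give each of fifty hypothetical tenants a demand curve that peaks at a random hour, then compare provisioning every tenant for its own peak against provisioning once for the peak of the combined load.

import random

random.seed(42)
N, HOURS = 50, 24

def tenant_demand():
    # Each tenant idles at a small baseline and spikes at one random hour.
    peak_hour = random.randrange(HOURS)
    base, peak = random.uniform(1, 3), random.uniform(8, 12)
    return [peak if h == peak_hour else base for h in range(HOURS)]

tenants = [tenant_demand() for _ in range(N)]

# Provisioning separately: every tenant buys capacity for its own peak.
separate = sum(max(d) for d in tenants)

# Shared cloud: provision once, for the peak of the summed demand.
combined = [sum(d[h] for d in tenants) for h in range(HOURS)]
shared = max(combined)

print("capacity, provisioned per tenant: %.0f" % separate)
print("capacity, provisioned shared:     %.0f" % shared)

With uncorrelated peaks the shared figure comes out far below the sum of the individual peaks, which is exactly the property described above.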

This is something that bothers enterprise IT departments everywhere when they are building private clouds. Can they get away with buying fewer servers than the organization really needs, and at times say “no” to some departments when computing power runs out? It’s hard to beat the scale of shared clouds.

The other reason why utilization levels are low in private clouds is that most organizations don’t have computationally intensive batch jobs which could take advantage of servers while they would otherwise sit idle. On Amazon one can even bid a lower price for unused EC2 resources (spot instances).
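
As an aside, here is a minimal sketch of placing such a bid with the boto Python library. The region, bid price, and AMI id are placeholder assumptions, not recommendations:

import boto.ec2

# Credentials come from the usual boto configuration or environment.
conn = boto.ec2.connect_to_region("us-east-1")

# Bid $0.05/hour for a small instance; the request is only fulfilled while
# the market ("spot") price stays at or below the bid.
requests = conn.request_spot_instances(
    price="0.05",
    image_id="ami-12345678",     # hypothetical AMI id
    count=1,
    instance_type="m1.small",
)
print("spot request submitted: %s" % requests[0].id)

If the spot price rises above the bid, the instance is reclaimed, which is why this model suits interruptible batch work rather than live serving.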

This is a tough problem and I don’t think private clouds can outperform shared clouds.

Power usage: Inefficient cooling and power-conversion losses can quickly make hosting infrastructure more expensive. Having domain experts can definitely help, but that’s not something smaller organizations can afford.

Platform: There isn’t any stable, proven internal cloud infrastructure platform that comes cheap. VMware’s ROI calculator might claim it’s cheap, but I’m not convinced yet. The Xen/KVM options look very stable, but they don’t come with decent management tools. In short, a lot of work needs to be done just to pick a platform.

A private Hadoop cluster is still cloud infrastructure. A lot of organizations are now switching to similar batch-processing clouds which can be shared by different kinds of jobs. And there are still others who may decide to invest in smarter deployment and automation scripts to fully utilize their private infrastructure without using virtualization.

Overhead of the shared cloud: The larger an organization is, the more difficult it is for it to migrate to a shared cloud. In fact, migrating an existing live RDBMS-based application to the cloud would be impossible without significant resources to re-architect the whole application and datastore. These organizations also have extensive, well-tested security policies and guidelines in place, all of which would have to be thrown to the dogs if they put their data on a public network over which they have no control. But I do believe this is a temporary problem which will be resolved over time in favor of shared clouds.

Cost: Cloud infrastructure providers are not non-profit organizations. Though they are here to make money, they will still be significantly cheaper for many. But do your homework and make sure you and your management team are OK with giving up infrastructure control for the cost savings.

That being said, here are my predictions for the next couple of years.

  1. Expect to see more non-virtualized application clouds in the enterprise.
  2. Expect the shared cloud providers to get even more cost-effective over time as competition increases.
  3. Expect more open-source initiatives to build tools which manage private cloud infrastructure.
  4. Expect more interesting tools which give end users the ability to visualize the actual cost of the resources they are using. Making the cost more transparent could guide developers to design smarter applications.

HAProxy: Load balancing

Designing any scalable web architecture would be incomplete without investigating “load balancers”. There used to be a time when selecting and installing load balancers was an art in itself. Not anymore.

A lot of organizations today use the Apache web server as a proxy server (and also as a load balancer) for their backend application clusters. Though Apache is the most popular web server in the world, it is also considered overweight if all you want to do is proxy a web application. The huge codebase Apache comes with, and the separate modules which need to be compiled and configured with it, can soon become a liability.

HAProxy is a tiny proxying engine which doesn’t have all the bells and whistles of Apache, but is highly qualified to act as an HTTP/TCP proxy server. Here are some of the other wonderful things I liked about it:

  • Extremely tiny codebase. Just two runtime files to worry about: the binary and the configuration file.
  • Compiles in seconds (10 seconds the last time I did it).
  • Logs to syslog by default.
  • Can load balance HTTP as well as regular TCP connections, so it can easily load balance most non-HTTP applications too.
  • Can do extremely detailed performance (and cookie-capture) logging. It can differentiate backend processing time from the end-user request completion time, which is extremely helpful for monitoring the performance of backend services (see the log-parsing sketch after this list).
  • It can do sticky load balancing out of the box.
  • It can use application-generated cookies instead of self-assigned cookies.
  • It can do health monitoring of the nodes and automatically remove them when the health checks fail.
  • And it has a beautiful web interface for application admins who care about numbers.
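
To illustrate the logging point above, here is a small Python sketch that pulls the timer field out of an httplog line. The sample line is hypothetical and field positions vary with the HAProxy version and options, but the five slash-separated timers (request/queue/connect/server-response/total, in milliseconds) are where the backend-versus-total distinction shows up:

import re

# A hypothetical httplog line; real lines carry more fields.
line = ('haproxy[1234]: 10.0.0.1:4181 [09/Jul/2009:10:00:00.123] '
        'http_proxy http_proxy/server1_name 10/0/30/69/109 200 2750 '
        '- - ---- 1/1/1/1/0 0/0 "GET /app/page.jsp HTTP/1.1"')

# The timer field is Tq/Tw/Tc/Tr/Tt; -1 marks a phase that never completed.
m = re.search(r'(-?\d+)/(-?\d+)/(-?\d+)/(-?\d+)/(\d+)', line)
if m:
    tq, tw, tc, tr, tt = (int(x) for x in m.groups())
    print("backend response time: %d ms" % tr)
    print("total time seen by the user: %d ms" % tt)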

A few other notes

  • HAProxy really doesn’t serve any files locally. So it’s definitely not a replacement for your Apache instance if you are using it to serve local files.
  • It doesn’t do SSL, so you still need an SSL engine in front of it if you need secure HTTP.
  • HAProxy is not the only Apache replacement. Varnish is a strong candidate which can also do caching (with ESI). And while you are at it, do take a look at Perlbal, which also looks interesting.


Finally, here is a sample configuration file with most of the features I mentioned above configured for use. This is the entire thing, and it should be good enough for a production deployment with minor changes.

global
        # Send logs to the syslog host "loghost" using facility "logfac" at level "info"
        log loghost logfac info
        # Hard limit on concurrent connections
        maxconn 4096
        # Drop privileges after startup, then run in the background
        user webuser
        group webuser
        daemon

defaults
        # Inherit the global logging settings
        log     global
        # Enable the built-in statistics web interface
        stats   enable
        mode    http
        # Detailed HTTP-level logging, skipping connections that carry no data
        option  httplog
        option  dontlognull
        # Close connections after each request/response exchange
        option  httpclose
        # Retry failed connections, redispatching to another server if needed
        retries 3
        option  redispatch
        maxconn 2000
        # Timeouts in milliseconds: connect, client inactivity, server inactivity
        contimeout      5000
        clitimeout      300000
        srvtimeout      300000

listen  http_proxy 0.0.0.0:8000
        # Health check: a HEAD request against the application's health page
        option httpchk HEAD /app/health.jsp HTTP/1.0
        mode http
        # Sticky sessions: insert our own SERVERID cookie...
        cookie SERVERID insert
        # ...and capture the application cookie and request headers in the logs
        capture cookie JSESSIONID len 50
        capture request header Cookie len 200
        capture request header Host len 50
        capture request header Referer len 200
        capture request header User-Agent len 150
        capture request header Custom-Cookie len 15
        # Stick on the application-generated JSESSIONID instead of a self-assigned cookie
        appsession JSESSIONID len 32 timeout 3600000

        # Round-robin across the backends, health-checking each every 60 seconds
        balance roundrobin
        server server1_name server1:8080 weight 1 cookie server1_name_cookie check inter 60000
        server server2_name server2:8080 weight 1 cookie server2_name_cookie check inter 60000

Heroku platform for scalable web applications

I’m so locked up in my own Java world that I didn’t realize something this cool existed in the Ruby world.

Heroku is the instant ruby platform. Deploy any ruby app instantly with a simple and familiar git push. Take advantage of advanced features like HTTP caching, memcached, rack middleware, and instant scaling built into every app. Never think about hosting or servers again.

From a layman’s point of view, Heroku looks like a Ruby version of GAE (Google App Engine). It has some of the same features as GAE. But unlike GAE, Heroku actually talks about its architecture in great detail.

They use Nginx as the front-end HTTP reverse proxy server, with Varnish for caching right behind it. They wrote their own custom software to “route” requests between the web frontend and the backend services. The actual user code runs on the “dyno grid”, where each dyno looks like a self-contained Ruby instance running the user’s code (a “compiled slug”).

There can be multiple “dynos” on the same server, and a user application may use multiple “dynos” on the same or different servers. Since each “dyno” comes preconfigured with the user’s database and cache connection information, there is absolutely nothing else (configuration-wise) a “compiled slug” needs to do its job.
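
As an illustration of what that preconfiguration can look like from the app’s side, here is a sketch (in Python, for consistency with the other sketches, even though Heroku is a Ruby platform). The variable names are my assumption about a common convention, not Heroku’s documented interface:

import os

# A platform can hand connection details to app processes via environment
# variables; the names below are hypothetical.
db_url = os.environ.get("DATABASE_URL", "postgres://localhost/dev")
cache_servers = os.environ.get("MEMCACHE_SERVERS", "127.0.0.1:11211").split(",")

print("connecting to database at %s" % db_url)
print("cache pool: %s" % cache_servers)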


The “routing mesh” tracks detailed performance data for each of the apps and load balances as required. An unresponsive “dyno” is marked and replaced automatically. Based on the documentation, they can initialize a new dyno in about 2 seconds.
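
The mark-and-replace pattern itself is simple to sketch. This is a toy illustration of the general technique, not Heroku’s actual routing mesh; the addresses and the /health path are made up:

import random
import time
import urllib.request

dynos = ["http://10.0.0.1:5001", "http://10.0.0.2:5002"]

def healthy(url, timeout=1.0):
    # One probe: any HTTP answer counts as alive, an error or timeout as dead.
    try:
        urllib.request.urlopen(url + "/health", timeout=timeout)
        return True
    except Exception:
        return False

def start_new_dyno():
    # Stand-in for "boot a fresh process from the compiled slug".
    return "http://10.0.0.%d:%d" % (random.randint(3, 250), random.randint(5000, 6000))

# Probe a few rounds; a real mesh would run this loop forever.
for _ in range(3):
    for i, dyno in enumerate(list(dynos)):
        if not healthy(dyno):
            dynos[i] = start_new_dyno()   # mark the dead dyno, boot a fresh one
    time.sleep(5)

print("current dyno pool: %s" % dynos)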


A dyno, in case you are curious, is a single process running your code, somewhat like a JRE container. And it looks like they put about 4 dynos on each core (CPU) they have on a server. The POSIX view of the system available to the Ruby VM is read-only, and though they don’t use OS virtualization to separate the dynos, they do separate them using Unix permissions. I guess that means each dyno has its own unique userid/groupid pair. I don’t have much experience with Ruby, but for those who care, they use plain-vanilla MRI Ruby.

Just like GAE/Python uses a stripped-down version of Django and GAE/Java uses a stripped-down version of Jetty as the app server, a dyno uses a thin version of Mongrel. It also uses Rack and Rack middleware for the app’s interaction with Mongrel and the web server.
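
Rack’s contract is small: an app is any object that responds to call and returns a status, headers, and body triple, and middleware is just an app wrapping another app. Since the other sketches here are in Python, the same idea expressed with WSGI, Python’s equivalent interface (a toy, not Heroku’s code):

from wsgiref.simple_server import make_server

# A minimal WSGI app: the Python analog of a Rack app's `call` method.
def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from the app\n"]

# Middleware wraps an app and returns another app, exactly as in Rack.
def add_header_middleware(inner):
    def wrapped(environ, start_response):
        def patched_start(status, headers):
            start_response(status, headers + [("X-Middleware", "yes")])
        return inner(environ, patched_start)
    return wrapped

if __name__ == "__main__":
    make_server("127.0.0.1", 8000, add_header_middleware(app)).serve_forever()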

Now here is another interesting implementation choice they went with. To update your app, all you have to do is push your changes using Git, and Heroku will take care of compiling your slug and deploying it for you. I wish GAE were like this.

The pricing looks slightly higher than raw EC2 costs, but you need to understand that Heroku is a platform (PaaS) and not infrastructure (IaaS). They take care of the stuff you would otherwise have to struggle with on AWS.

They also have some pretty interesting “add-ons”. The one I liked was “Websolr”, a custom implementation of the Solr full-text search engine, which is in turn based on Lucene.

I’m curious whether any of you have used Heroku; please comment on what you think of it. The devil is in the details.

Related interesting links:

  1. http://highscalability.com/heroku-simultaneously-develop-and-deploy-automatically-scalable-rails-applications-cloud
  2. http://sazbean.com/2008/05/29/interview-with-james-lindenbaum-ceo-of-heroku/
  3. http://ec2onrails.rubyforge.org/

Weekend reading material


Products/Ideas

  1. Redis (http://code.google.com/p/redis/): Redis is a key-value database. It is similar to memcached, but the dataset is not volatile, and values can be strings, exactly as in memcached, but also lists and sets with atomic operations to push/pop elements (see the sketch after this list).
  2. HBase (http://hadoop.apache.org/hbase/): HBase is the Hadoop database. It’s an open-source, distributed, column-oriented store modeled after the Google paper “Bigtable: A Distributed Storage System for Structured Data” by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
  3. Sherpa (http://research.yahoo.com/node/2139)
  4. BigTable (http://labs.google.com/papers/bigtable-osdi06.pdf)
  5. Voldemort – It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R mapper like ActiveRecord or Hibernate, this will provide horizontal scalability and much higher availability, but at great loss of convenience. For large applications under internet-type scalability pressure, a system may likely consist of a number of functionally partitioned services or APIs, which may manage storage resources across multiple data centers using storage systems which may themselves be horizontally partitioned. For applications in this space, arbitrary in-database joins are already impossible since all the data is not available in any single database. A typical pattern is to introduce a caching layer which will require hashtable semantics anyway. For these applications Voldemort offers a number of advantages.
  6. Dynamo – A highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.  To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
  7. Cassandra – Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google’s BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.
  8. Hypertable – Hypertable is an open source project based on published best practices and our own experience in solving large-scale data-intensive tasks.
  9. HDFS – The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
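
Since Redis leads the list above, here is a quick sketch of the semantics described there, using the redis-py client (assuming the redis package is installed and a redis-server is running locally):

import redis

r = redis.Redis(host="localhost", port=6379)

r.set("greeting", "hello")       # plain key/value, memcached-style
r.lpush("jobs", "job1")          # lists, with atomic push...
r.lpush("jobs", "job2")
print(r.rpop("jobs"))            # ...and pop from the other end
r.sadd("tags", "cloud")          # sets
r.sadd("tags", "nosql")
print(r.smembers("tags"))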

Blog/Posts/Links

  1. Eventually Consistent 
  2. A bunch of links at Bytepawn
  3. Fallacies of Distributed Computing