The real concerns about Cloud infrastructure (as it is today)

While “private clouds may not be the future”, they are definitely needed today. Here are some of the top issues bothering organizations that have been thinking about moving into the cloud. Some of these issues are based on Craig Bolding’s talk, “Guide to cloud security”.

  • Unlike your own data center, you will never know what the cloud vendors are running, how they back up, or what their DR plans are. They will say you shouldn’t care, but do you remember what happened to the T-Mobile customers on Danger?
  • Uptime, availability and responsiveness are less predictable than in a self-hosted environment. In most cases the cloud vendors may not even choose to let customers know about major maintenance if they don’t anticipate any issues. Organizations that manage their own infrastructure would always try to avoid making two major changes with interdependencies at the same time.
  • Multi-Tenancy means you may have to worry about a noisy neighbor.
  • Multi-tenancy could also lead to interesting issues which were never thought about before. What if there were a way to do an “injection attack”? Depending on how multi-tenancy is implemented, you could potentially touch other customers’ data.
  • Infrastructure and platform lock-in issues are worrying for many organizations who are thinking long term. Most cloud vendors don’t really have a long history to show their track record.
  • Change control and detailed change logs are missing.
  • Individual customers don’t have much decision-making power over what a vendor should do next. In a privately hosted environment the stakeholders are asked before something is done, but in a large shared infrastructure you are a small fish in a huge pond.
  • Most cloud vendors have multiple layers of cloud infrastructure dependent on each other. It’s hard to understand how issues in one type of cloud could impact others. This is especially true from a security viewpoint: a bad flaw in a lower layer of the architecture could impact all the platforms built over it.
  • Moving applications to cloud means dealing with a different style of programming designed for horizontal scalability, data consistency issues, health monitoring, load balancing, managing state, etc.
  • Identity management is still in its early stages. Integration with corporate identity management infrastructure would be important to make it easy for individuals from large organizations to use external clouds.
  • Who takes care of scrubbing disks when data is moved around? What about data on backup tapes? This is very important for applications handling highly sensitive data.
  • Just like credit card fraud, one has to worry about CPU-time fraud. Is the current billing and reporting good enough to help large organizations figure out what is real usage and what could be fraud? They need a real-time fraud detection mechanism. And what about loss of service due to DoS attacks? Who pays for that?
  • Need a better mechanism to bill large corporations.
  • On the non-technical side, there are a lot of questions related to SLAs, compliance issues, terms of service, legal issues around cross-border services, and even questions about whether law enforcement has a different set of rules when search and seizure is required.
  • Not too far from being another form of “outsourcing”.


Videos on scalable web architectures

If you are like me, you are already following all the talks and presentations published on YouTube. But if you have not been, nothing stops you from starting now. A new “Videos” page has been added to this blog to list the latest YouTube videos related to scalable web architectures.

Videos related to scalable web architectures

Please leave comments if you have a favorite online lecture/presentation which is not listed here.

Hive @Facebook

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
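
For a flavor of what “putting structure on the data” looks like, here are two statements adapted from the Hive getting-started examples; they define and load the invites table that the queries further down run against (the file path and partition value are just illustrative):

  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');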

At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200GB per day in March 2008, has now grown to 15TB per day. Facebook realized early on that insights derived from simple algorithms on more data are better than insights from complex algorithms on smaller sets of data.

But the traditional approach towards ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in the size it could scale to. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive

Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn’t great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that, although at that point partial availability, resilience and scale were more important than ACID guarantees, they had a hard time finding Hadoop programmers within Facebook to make use of the cluster.

It was this that eventually pushed Facebook to build a new way of querying data in Hadoop which doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries.

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
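
For the custom mapper/reducer case mentioned above, Hive’s TRANSFORM clause lets you stream rows through an external script. The sketch below is only illustrative: the script path and output column names are placeholders, and ADD FILE ships the script to the cluster so the tasks can find it.

  hive> ADD FILE /path/to/my_mapper.py;
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) USING 'python my_mapper.py' AS (oof, rab) WHERE a.foo > 0;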

Hive’s long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that it used map-reduce mechanisms for execution and HDFS for storage. They modeled the language on SQL, and designed it to be extensible, interoperable and able to outperform traditional processing mechanisms.

How it is used

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “Ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Facebook Hive/Hadoop statistics

The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8400 cores with roughly 12.5 PB of raw storage, which translates to about 4PB of usable storage after replication. Each node in the cluster is an 8-core server with 12TB of storage.

All in all, Facebook gets 12 TB of compressed new data and scans about 135 TB of compressed data per day. There are more than 7500 Hive jobs which use up about 80,000 compute hours each day.


HAProxy : Load balancing

Designing any scalable web architecture would be incomplete without investigating “load balancers”.  There used to be a time when selecting and installing load balancers was an art by itself. Not anymore.

A lot of organizations today use Apache web servers as a proxy server (and also as a load balancer) for their backend application clusters. Though Apache is the most popular web server in the world, it is also considered overweight if all you want to do is proxy a web application. The huge codebase which Apache comes with, and the separate modules which need to be compiled and configured with it, can quickly become a liability.

HAProxy is a tiny proxying engine which doesn’t have all the bells and whistles of Apache, but is highly qualified to act as an HTTP/TCP proxy server. Here are some of the wonderful things I liked about it:

  • Extremely tiny codebase. Just two runtime files to worry about, the binary and the configuration file.
  • Compiles in seconds. 10 seconds the last time I did it.
  • Logs to syslog by default
  • Can load balance HTTP as well as regular TCP connections. Can easily load balance most non-HTTP applications.
  • Can do extremely detailed performance (and cookie capture) logging. It can differentiate backend processing time from the end-user request completion time. This is extremely helpful in monitoring performance of backend services.
  • It can do sticky load balancing out of the box
  • It can use application generated cookies instead of self-assigned cookies.
  • It can do health monitoring of the nodes and automatically remove them when health checks fail
  • And it has a beautiful web interface for application admins who care about numbers.

A few other notes

  • HAProxy doesn’t serve any files locally, so it’s definitely not a replacement for your Apache instance if you are using it to serve local files.
  • It doesn’t do SSL, so you still need an SSL terminator in front of it if you need HTTPS (see the sketch after this list).
  • HAProxy is not the only Apache replacement. Varnish is a strong candidate which can also do caching (with ESI). And while you are at it, take a look at Perlbal, which also looked interesting.
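
Since HAProxy doesn’t terminate SSL, a common pattern is to put a small SSL terminator such as stunnel (or Nginx, or Apache with mod_proxy) in front of it. The snippet below is only a sketch; the certificate path is a placeholder, and port 8000 matches the HAProxy listener in the sample configuration further down.

# terminate HTTPS on port 443 and hand plain HTTP to HAProxy on port 8000
# (the certificate path is a placeholder for your own PEM file)
cert = /etc/stunnel/mycert.pem

[https]
accept  = 443
connect = 127.0.0.1:8000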


Finally, here is a sample configuration file with most of the features I mentioned above configured for use. This is the entire thing and should be good enough for a production deployment with minor changes.

global
        # "loghost" and "logfac" are placeholders for your syslog host and facility (e.g. local0)
        log loghost logfac info
        maxconn 4096
        user webuser 
        group webuser 
        daemon

defaults
        log     global
        stats   enable
        mode    http
        option  httplog
        option  dontlognull
        option  httpclose
        retries 3
        option  redispatch
        maxconn 2000
        contimeout      5000
        clitimeout      300000
        srvtimeout      300000

listen  http_proxy 0.0.0.0:8000
        # health check: poll each backend with a HEAD request and pull it out when the check fails
        option httpchk HEAD /app/health.jsp HTTP/1.0
        mode http
        # sticky load balancing: HAProxy inserts its own SERVERID cookie
        cookie SERVERID insert
        # detailed logging: capture the application cookie and interesting request headers
        capture cookie JSESSIONID len 50
        capture request header Cookie len 200
        capture request header Host len 50
        capture request header Referer len 200
        capture request header User-Agent len 150
        capture request header Custom-Cookie len 15
        # alternatively, stick on the application-generated JSESSIONID instead of SERVERID
        appsession JSESSIONID len 32 timeout 3600000

        balance roundrobin
        # "check inter 60000" runs the health check against each server every 60 seconds
        server server1_name server1:8080 weight 1 cookie server1_name_cookie check inter 60000
        server server2_name server2:8080 weight 1 cookie server2_name_cookie check inter 60000

ESI: Edge Side Includes

Web page caching gets tricky once personalization is involved. Let’s take Twitter’s public_timeline as an example, since it seems perfect for caching. Unfortunately, when a user is logged in, the page also shows the user’s information. Caching that particular page in its entirety on the web server, in such scenarios, may not be an option. Another scenario is where parts of a page expire faster than others (i.e. require different cache TTLs). Here again, caching the whole page doesn’t help.

Edge Side Includes (ESI) is a markup language specifically designed to help web servers assemble dynamic content at the web layer.

<esi:include src="www.foo.com"/>

The above ESI tag is similar to include tags in JSP/PHP/etc. which allow one page to refer to another page for parts of its content. By breaking the page up into smaller objects, the webserver can apply different TTL settings (and user validation) to different parts of the content, as in the sketch below. Twitter used to (and may still) use Varnish, which supports a subset of the ESI specification out of the box.
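
As an illustration, the public_timeline page mentioned above could be split into a fragment that is identical for everyone and a fragment that is per-user. The fragment URLs below are hypothetical:

<html>
  <body>
    <!-- shared by all users, cached with a long TTL -->
    <esi:include src="/fragments/public_timeline"/>

    <!-- personalized per user, short (or no) TTL -->
    <esi:include src="/fragments/current_user_box"/>
  </body>
</html>

The ESI processor (Varnish, or an Akamai edge node) assembles the two fragments on every request, so the expensive shared fragment keeps being served from cache while only the small personalized fragment is fetched fresh.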

But caching on the webserver may not be the real reason why this language was invented. ESI is also supported by Akamai (CDN) on its edge caching product. By allowing Akamai edge nodes to do the assembly close to the user, they significantly improve perceived end-user performance without giving up personalization or content freshness requirements.

Heroku platform for scalable web applications

I’m so locked up in my own Java world that I didn’t realize something as cool as this existed in the Ruby world.

Heroku is the instant ruby platform. Deploy any ruby app instantly with a simple and familiar git push. Take advantage of advanced features like HTTP caching, memcached, rack middleware, and instant scaling built into every app. Never think about hosting or servers again.

From a layman’s point of view, Heroku looks like a Ruby version of GAE (Google App Engine). It has some of the same features as GAE. But unlike GAE, Heroku actually talks about their architecture in great detail.

They use Nginx as the front-end HTTP reverse proxy server and Varnish for caching right behind Nginx. They wrote their own custom software to “route” requests between the web frontend and the backend services. The actual user code runs on the “Dyno Grid”, where each dyno looks like a self-contained Ruby instance with user code (compiled slugs).

There could be multiple “dynos” on the same server, and a user application could use multiple “dynos” on the same or different servers. Since each “dyno” comes preconfigured with the user’s database and cache connection information, there is absolutely nothing else (configuration-wise) the “compiled slugs” need to do their job.


The “routing mesh” tracks detailed performance data for each of the apps and load balances as required. An unresponsive “dyno” is marked and replaced automatically. Based on the documentation they can initialize new dynos in about 2 seconds.


A dyno, in case you are curious, is a single process running your code, somewhat like a JRE container. And it looks like they put about 4 dynos on each core (CPU) they have on a server. The POSIX view of the system available to the Ruby VM is read-only, and though they don’t use OS virtualization to separate the dynos, they do separate them using Unix permissions. I guess that means each dyno has its own unique userid/groupid pair. I don’t have much experience with Ruby, but for those who care, they use plain-vanilla MRI Ruby.

Just like GAE/Python uses a stripped-down version of Django and GAE/Java uses a stripped-down version of Jetty as their app server, a dyno uses a thin version of Mongrel. It also uses Rack / Rack middleware for the app’s interaction with Mongrel/the webserver.

Now here is another interesting implementation choice they went with. To update your app, all you have to do is check in your changes using Git, and Heroku will take care of compiling your slug and deploying it for you. I wish GAE was like this.
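
For context, the deploy workflow (as I understand it from their documentation) boils down to something like the following; “myapp” is of course a placeholder:

$ heroku create myapp        # provisions the app and adds a "heroku" git remote
$ git push heroku master     # Heroku compiles the slug and rolls it out to the dyno grid

Every subsequent push goes through the same compile-and-deploy cycle, which is what makes the Git-based workflow so attractive.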

The pricing looks slightly higher than raw EC2 costs, but you need to understand that Heroku is a platform (PAAS) and not infrastructure (IAAS). They take care of the stuff you would otherwise have to struggle with if you were on AWS.

They also have some pretty interesting “Add-ons”. The one I liked was “Websolr”, a custom implementation of the “Solr” full-text search engine, which is in turn based on Lucene.

I’m curious whether any of you have used Heroku; please comment on what you think of it. The devil is in the details.

Related interesting Links:

  1. http://highscalability.com/heroku-simultaneously-develop-and-deploy-automatically-scalable-rails-applications-cloud
  2. http://sazbean.com/2008/05/29/interview-with-james-lindenbaum-ceo-of-heroku/
  3. http://ec2onrails.rubyforge.org/

Monitoring large-scale application clusters

Most software engineering organizations build applications with some hooks in place to allow functional tests. Some organizations continuously build and test all software automatically at check-in. And then there are those who have learnt from mistakes, and have built a suite of tests which get triggered at startup to look for problems which could indicate a failed initialization.

The next step in building a scalable web application, is creating some form of self-monitoring logic (sometimes called a watchdog) which could periodically test itself (or monitor performance statistics) for problems worth escalating to operations team.

Arnon has a couple of interesting posts on the topic which I came across today. He summarized the whole suggestion using three acronyms, and I’m going to add one more to make it complete:

  1. BBIT: Build-time Built-In Tests
  2. PBIT: Power-On Built-In Tests
  3. CBIT: Continuous Built-In Tests
  4. IBIT: Initiated Built-In Tests

The organization I work for is relatively small in terms of the number of servers, but one thing we learnt early on was that waiting for customers to report problems is the worst way to find out about issues. Setting up monitoring using something like Nagios/OpenView/curl scripts/etc. may work, but it’s not very scalable. Besides, black-box testing without insight into the application may not be sufficient or ideal for more complex applications.

To increase automation, reduce dependence on central monitoring infrastructure and speed up failure detection (and prediction), it might be important to have some tests in all four of the categories listed above.
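
To make this concrete, here is a minimal sketch (in Java, since that is my world) of a continuously running built-in test exposed as an HTTP health endpoint that a load balancer or a monitoring system can poll. The two check functions are hypothetical placeholders; a real CBIT would test whatever “healthy” means for your application.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthCheckServer {

    public static void main(String[] args) throws Exception {
        // Expose /health on port 8081; HAProxy's "option httpchk" or a Nagios plugin can poll it.
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/health", exchange -> {
            boolean healthy = databaseReachable() && queueBacklogAcceptable();
            byte[] body = (healthy ? "OK" : "FAIL").getBytes(StandardCharsets.US_ASCII);
            // 200 means the node can take traffic; 503 tells the balancer to pull it out of rotation.
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    // Hypothetical checks -- replace with whatever matters for your application.
    private static boolean databaseReachable() {
        return true; // e.g. run a trivial query against the primary datasource
    }

    private static boolean queueBacklogAcceptable() {
        return true; // e.g. compare the work queue depth against a threshold
    }
}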

Understanding Cloud computing efficiency

Picking a cloud service, unfortunately, is at times far more complex than picking a brand new car. I remember how torn I was between a Honda hybrid, which came with some tax rebates and a carpool sticker, and a non-hybrid which was significantly cheaper. Understanding the short-term and long-term benefits is the key.

Today AWS is not the only game in town. There are lots of other reliable (or some flavor of reliable) options: GoGrid, Joyent, Microsoft and GoogleAppEngine are some of them.

Here are the key differences which one should understand before deciding which one to go for.

* IAAS (Infrastructure as a Service) providers like AWS (EC2) and Rackspace provide virtual infrastructure which you can manage and control. In most cases you are billed by a time unit and you have control to increase or decrease the resources available to your application. PAAS (Platform as a Service), on the other hand, only provides APIs for your application. PAAS-based infrastructure is usually billed by the number of requests or by the CPU cycles spent serving those requests. Microsoft’s Azure places itself somewhere in between these two paradigms, which makes this even more interesting.

* If your application’s resource requirements fluctuate a lot on a daily basis and you don’t want to invest in building a scalable architecture and the logic to manage/monitor the process of scaling up and down, then a PAAS-based service might help you. But if you want higher performance and more control over your code and infrastructure (and the way it scales), then IAAS is the way to go.

* If you have consistent load throughout the year, you should think about reserving resources for the longer term if possible; it could turn out to be cheaper (see the back-of-the-envelope numbers after this list). But at the same time, the more servers/resources you reserve, the more expensive it gets, and there is a point at which it might be cheaper to host the infrastructure yourself.

* If your application has short but high CPU peaks, you should look at a vendor which doesn’t impose a performance ceiling. “The BitSource” did some performance tests comparing Rackspace and Amazon EC2 which explain this problem very well.

* Finally, if you already have a large computing infrastructure within your organization and want more “long term” computing resources, then based on the studies I have seen, it’s cheaper to set up and manage new servers/storage within the organization than to outsource it to AWS/Rackspace.
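
To make the reserve-or-not trade-off concrete, here is a back-of-the-envelope comparison with made-up prices (real prices change often, so treat these purely as placeholders):

  on-demand : $0.10/hour x 8760 hours/year                 = $876/year
  reserved  : $300 one-time fee + $0.03/hour x 8760 hours  = $563/year
  break-even: $300 / ($0.10 - $0.03) = ~4300 hours, i.e. roughly 6 months of continuous use

Below that level of utilization the on-demand model wins; well above it, reservations (and eventually your own hardware) start to win.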

At the end of the day remember that vendors are there to make money as well. If you plan to make significant long term investment into cloud services, you should do some research to make sure that is really the cheapest solution.

Service registry (ESB) for scalable web applications

This blog post is the result of my futile attempts at understanding how others have solved the problem of automatic service discovery.

How do organizations which have a huge collection of custom applications design scalable web applications without having to hardcode server names and port numbers in configuration files?

I believe the terminology I’m hinting at is either a “Service Registry” or an “Enterprise Service Bus”, which is part of the whole SOA (Service Oriented Architecture) world.

The organization I work for has a limited multicast-based service announcement/discovery infrastructure, but it is not widely used across all the applications. In addition to the fact that multicast routing can become complicated (ACL management of yet another set of network addresses), it’s also not a solution that works when parts of an application reside in the cloud. Amazon’s EC2, for instance, doesn’t allow multicast traffic between its hosts.

Microsoft Azure .NET Services (part of the Azure platform) provides a service registry (or proxy) to which internal and external applications can connect to provide and consume services. The design doesn’t allow a direct connection between provider and consumer, which makes this a massive single point of failure. I agree any kind of registry has this problem, but the fact that you need this service to be up all the time for every single request makes it extremely risky.

There are at least two open source projects by the Apache foundation which touch on this topic: one is ServiceMix and the other is Synapse. I’ve also spoken to a few commercial vendors who do this, and wasn’t really convinced that it’s worth spending big bucks on.

The reason why I’m puzzled is that I don’t see a single open source project being widely used in this area. I’ve been a REST kind of guy and hate the whole SOAP world… and I was hoping there would be something simple which could be set up and used without pulling my hair out.

My contacts at some of the larger organizations lead me to believe that they all use proprietary solutions for their infrastructure. Is this really such a complex problem?

If you have an ESB in your network, please do drop in a line about it.

Scalability for dummies

Alex Barrera has a very interesting post about how frustrating it is to figure out that you have a problem and how much trouble it is to fix it after the product is live.

I am there, I am suffering the redesign phase (twice now). It’s hard, it’s lonely, it’s discouraging and frustrating, but it needs to be done. I just wrote this post so that outsiders can get a glimpse of what is it to be there and how it affects the whole company, not just the tech department. Scalability problems aren’t something you can discard as being ONLY technical, it’s roots might be technical but its effects will shake the whole company.

The post actually reminded me of this post by Marton Trencseni which talks about the phases of improvement in scalability architecture a product goes through and digs a little deeper into what could have prevented it.

For startups or for companies which are just prototyping new ideas, the goal can sometimes be just to “test the waters”, and the product owners don’t care much about allocating/reserving enough resources for engineering to build it the “right way”. And there is a good reason for that as well, since a lot of prototypes (or early products) die off soon after launch because of issues completely unrelated to scalability. It’s hard to figure out whether you want to test the idea first or devote a lot of resources to get it done the right way from day one.