January 31, 2010

The real concerns about Cloud infrastructure (as it is today)

While “private clouds may not be the future,” they are definitely needed today. Here are some of the top issues bothering organizations that have been thinking about moving into the cloud. Some of these issues were raised in Craig Bolding’s talk, “Guide to cloud security.”

  • Unlike your own data center, you will never know what the cloud vendors are running, how they back up, or what their DR plans are. They will say you shouldn’t care, but do you remember what happened to the T-Mobile customers on Danger?
  • Uptime, availability and responsiveness are less predictable than in a self-hosted environment. In most cases cloud vendors may not even let customers know about major maintenance if they don’t anticipate any issues. Organizations that manage their own infrastructure would always try to avoid making two major changes with interdependencies at the same time.
  • Multi-Tenancy means you may have to worry about a noisy neighbor.
  • Multi-tenancy could also lead to interesting issues which were never thought about before. What if there was a way to do an “injection attack”? Depending on how multi-tenancy is implemented, you could potentially touch other customers’ data.
  • Infrastructure and platform lock-in issues are worrying for many organizations who are thinking long term. Most cloud vendors don’t really have a long history to show their track record.
  • Change control and detailed change logs are missing.
  • Individual customers don’t have much say in what a vendor does next. In a privately hosted environment the stakeholders are consulted before something is done, but in a large shared infrastructure you are a small fish in a huge pond.
  • Most cloud vendors have multiple layers of cloud infrastructure dependent on each other. It’s hard to understand how issues in one type of cloud could impact the others. This is especially true from a security viewpoint: a bad flaw in a lower layer of the architecture could impact all the platforms built on top of it.
  • Moving applications to cloud means dealing with a different style of programming designed for horizontal scalability, data consistency issues, health monitoring, load balancing, managing state, etc.
  • Identity management is still in its early stages. Integration with corporate identity management infrastructure would be important to make it easy for individuals from large organizations to use external clouds.
  • Who takes care of scrubbing disks when data is moved around? What about data on backup tapes? This is very important for applications handling highly sensitive data.
  • Just like credit card fraud, one has to worry about CPU time fraud. Is the current billing and reporting good enough to help large organizations figure out what is real and what could be fraud? They need a real-time fraud detection mechanism. And what about loss of service due to DoS attacks? Who pays for that?
  • Need a better mechanism to bill large corporations.
  • On the non-technical side, there are a lot of questions related to SLAs, compliance, terms of service, legal issues around cross-border services, and even whether law enforcement has a different set of rules when search and seizure is required.
  • Not too far from being another form of “outsourcing”.

Photo credit: akakumo

Fixing GSLB (Global Server Load Balancing)

The standard DNS protocol allows DNS servers to respond with multiple addresses in the reply to a simple lookup query. This, combined with rotating the order of the records in every reply, is collectively known as the “Round Robin DNS” technique for load balancing across a set of servers.
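
To see what this looks like from the client side, here is a rough Python sketch (the hostname is just a placeholder): it asks the resolver for all the A records of a name and then cycles through whatever comes back, the way a naive client might.

import itertools
import socket

# Ask the resolver for every address behind a name (placeholder hostname).
infos = socket.getaddrinfo("www.example.com", 80, 0, socket.SOCK_STREAM)
addresses = [info[4][0] for info in infos]

# Naive client-side round robin: rotate through whatever the resolver returned.
rotation = itertools.cycle(addresses)
for _ in range(4):
    print(next(rotation))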

Though a lot of organizations use Round Robin DNS to load balance across servers in the same datacenter, some also try to use it as an HA solution by load balancing across multiple datacenters. In the event of a failure in one datacenter, such an implementation limits the impact, and with a slight change to the DNS configuration (removing the IP of the datacenter which went down) the site can become fully operational again.

It would be nicer if the DNS servers could monitor the servers and remove the ones which are inactive or throwing errors of some kind. This is what GSLBs are all about. But what they really excel at, which regular DNS servers can’t do, is figuring out (in a slightly unscientific way) where a user is located geographically, and from that, which datacenter is closest to the end user. If a customer in Asia can get to a datacenter within Asia instead of coming all the way to the US, it could save that customer at least 200ms of latency, which can significantly improve effective throughput and response times from the website.
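
To make the “closest datacenter” idea concrete, here is a rough Python sketch of the kind of lookup a GSLB performs internally. The prefix-to-datacenter table and hostnames are made up purely for illustration; real GSLBs use full GeoIP databases plus latency and health measurements.

import ipaddress

# Hypothetical mapping of client prefixes to the nearest datacenter (illustration only).
NEAREST_DC = {
    ipaddress.ip_network("203.0.113.0/24"): "asia.dc.example.com",
    ipaddress.ip_network("198.51.100.0/24"): "us.dc.example.com",
}
DEFAULT_DC = "us.dc.example.com"

def pick_datacenter(source_ip):
    """Return the datacenter considered closest to the IP that sent the DNS query."""
    addr = ipaddress.ip_address(source_ip)
    for prefix, datacenter in NEAREST_DC.items():
        if addr in prefix:
            return datacenter
    return DEFAULT_DC

# Note: source_ip is whoever sent the DNS query -- which, as described below,
# is often the resolver rather than the end user.
print(pick_datacenter("203.0.113.25"))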

Though GSLBs are very popular today among the larger service providers, there are some interesting drawbacks which can limit their usefulness. The core problem is that GSLBs use the source IP of the DNS request to figure out where the customer is located. This works beautifully if the customer’s laptop is sending the queries out directly, and in most cases will also work if the customer is using his/her ISP’s DNS server. Unfortunately, if the customer uses a free public DNS service, like the one Google provides, which recursively looks up the DNS records on the user’s behalf, then the GSLB will find the datacenters closest to the DNS server requesting the information instead of the actual end user. A similar problem exists if the user is forced to use a DNS server over a VPN link. Read this post for a better understanding of this problem (Why DNS Based GSLB doesn’t work).

A few days ago Google came out with a solution to this problem, announced here (A proposal to extend the DNS protocol). They don’t mention GSLB, but there is no doubt this will help solve the GSLB issue mentioned above. Unfortunately, I’m also sure that Google has other, more important reasons to push for this change: they are interested in location information to “provide better services” (location-aware advertising).

DNS is the system that translates an easy-to-remember name like www.google.com to a numeric address like 74.125.45.104. These are the IP addresses that computers use to communicate with one another on the Internet.

By returning different addresses to requests coming from different places, DNS can be used to load balance traffic and send users to a nearby server. For example, if you look up www.google.com from a computer in New York, it may resolve to an IP address pointing to a server in New York City. If you look up www.google.com from the Netherlands, the result could be an IP address pointing to a server in the Netherlands. Sending you to a nearby server improves speed, latency, and network utilization.

Currently, to determine your location, authoritative nameservers look at the source IP address of the incoming request, which is the IP address of your DNS resolver, rather than your IP address. This DNS resolver is often managed by your ISP or alternately is a third-party resolver like Google Public DNS. In most cases the resolver is close to its users, in which case the authoritative nameservers will be able to find the nearest server. However, some DNS resolvers serve many users over a wider area. In these cases, your lookup for www.google.com may return the IP address of a server several countries away from you. If the authoritative nameserver could detect where you were, a closer server might have been available.

Our proposed DNS protocol extension lets recursive DNS resolvers include part of your IP address in the request sent to authoritative nameservers. Only the first three octets, or top 24 bits, are sent providing enough information to the authoritative nameserver to determine your network location, without affecting your privacy.
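
To make “the first three octets” concrete, here is a tiny Python sketch (standard library only) showing the kind of truncation a resolver would apply to the client address before attaching it to the outgoing query.

import ipaddress

def client_subnet(client_ip, prefix_len=24):
    """Truncate a client IP to the prefix a resolver would forward (e.g. a /24)."""
    network = ipaddress.ip_network("%s/%d" % (client_ip, prefix_len), strict=False)
    return str(network)

print(client_subnet("198.51.100.77"))   # -> 198.51.100.0/24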

Regardless, it’s a step in the right direction and will significantly help in making web applications more available.


January 28, 2010

Cloud computing in 1963 (actually, timesharing)

Found this on Feld Thoughts. It’s not really about cloud computing, but it is about making efficient use of computational resources, which is one of the goals of today’s “cloud computing” as well.

This magnificent video from 1963, “Timesharing: A Solution to Computer Bottlenecks,” has MIT Professor Fernando Corbato explaining how timesharing works to MIT Science Reporter John Fitch (who has one of those magnificent deep reporter voices).

AppScale, an OpenSource GAE implementation

If you don’t like EC2, you have the option of moving your app to a new vendor. But if you don’t like GAE (Google App Engine), there aren’t any solutions which can replace it easily.

AppScale might change that.

AppScale is an open-source implementation of the Google AppEngine (GAE) cloud computing interface from the RACELab at UC Santa Barbara. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.

The list of supported infrastructures is very impressive. However, the key, in my personal opinion, will be stability and compatibility with the current GAE APIs.

Learn more about AppScale:

  1. AppScale Home page
  2. Google Code page
  3. Google Group for AppScale
  4. Demo at the Bay Area GAE Developers meeting at the Googleplex (Feb 10, 2010)

Videos on scalable web architectures

If you are like me, you are already following all the talks and presentations published on YouTube. But if you have not been, nothing stops you from starting now. A new “Videos” page has been added to this blog to list the latest YouTube videos related to scalable web architectures.

Videos related to scalable web architectures

Please leave comments if you have a favorite online lecture/presentation which is not listed here.

January 26, 2010

Scalability Updates for Jan 26th 2010

A few interesting updates for today

January 25, 2010

Hive @Facebook

Hive is a data warehouse infrastructure built over Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for their data processing needs.

Problem


Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200GB per day in March 2008, has grown to 15TB per day today. Facebook realized early on that insights derived from simple algorithms on more data are better than insights from a complex algorithm on a smaller set of data.

But the traditional approach of doing ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in how far it could scale. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive


Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn’t that great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that although partial availability, resilience and scale were more important than ACID guarantees at that point, they had a hard time finding Hadoop programmers within Facebook who could make use of the cluster.

It was this that eventually pushed Facebook to build a new way of querying data from Hadoop which doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries.
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Hive’s long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that it uses map-reduce for execution and HDFS for storage. They modeled the language on SQL, and designed it to be extensible, interoperable and able to outperform traditional processing mechanisms.

How it is used


Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Facebook Hive/Hadoop statistics


The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8400 cores and roughly 12.5PB of raw storage, which translates to about 4PB of usable storage after replication (HDFS defaults to 3x replication). Each node in that cluster is an 8-core server with 12TB of storage.

All in all, Facebook gets 12TB of compressed new data and scans about 135TB of compressed data per day. There are more than 7500 Hive jobs which use about 80,000 compute hours each day.







January 24, 2010

Scalability Killers (The art of scalability)

Top 10 scalability killers from The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise.


  1. Thinking Scalability is just about technology

  2. Overuse of Synchronous calls

  3. Failure to weed or seed soon enough

  4. Inappropriate use of databases

  5. Cesspools instead of swim lanes

  6. Reliance on Vertical scale

  7. Failure to Learn from History

  8. Changing Development methodologies to fix problems

  9. Too little caching, too late

  10. Overreliance on Third parties to scale

Private clouds not the future ?

James Hamilton is one of the leaders in this industry and has written a very thought-provoking post about private clouds not being the future. This is what he said about private clouds when compared to existing non-cloud solutions:

  • A fix, not the future (a reference to an InformationWeek post)
  • Runs at lower utilization levels
  • Consumes more power
  • Less efficient environmentally
  • Runs at higher costs

Though I agree with most of his comments, I’m not convinced by the generalization of the conclusions. In particular, what is the maximum number of servers one needs to own beyond which outsourcing becomes a liability? I suspect this is not a very high number today, but it will grow over time.

Hardware costs: The scale at which Amazon buys infrastructure is just mind-boggling, but organizations buying in bulk could get a pretty good deal from those same vendors as well. It’s not clear to me how many servers one has to buy to get discounts like the ones Amazon gets.

Utilization levels: Cloud providers optimize utilization by making sure all the servers are being used all the time. It’s also important to remember that because they are trying to maximize utilization, they don’t buy servers for every customer the moment that customer signs up.

At scale, with high customer diversity, a wonderful property emerges: non-correlated peaks. Whereas each company has to provision to support their peak workload, when running in a shared cloud the peaks and valleys smooth. The retail market peaks in November, taxation in April, some financial business peak on quarter ends and many of these workloads have many cycles overlaid some daily, some weekly, some yearly and some event specific. For example, the death of Michael Jackson drove heavy workloads in some domains but had zero impact in others.

This is something which bothers enterprise IT departments everywhere when they are building private clouds. Can they get away with buying fewer servers than the organization really needs, and at times say “no” to some departments when they run out of computing power? It’s hard to beat the scale of shared clouds.

The other reason utilization levels are low in private clouds is that most organizations don’t have computationally intensive batch jobs which could be run while the servers are otherwise idle. On Amazon one can even bid for a lower price on unused EC2 capacity (spot instances).

This is a tough problem and I don’t think private clouds can outperform shared clouds.

Power usage: Inefficient cooling and power-conversion losses can quickly make hosting infrastructure more expensive. Having domain experts on staff definitely helps, and that’s not something smaller organizations can usually afford.

Platform: There aren’t any stable, proven internal cloud infrastructure platforms which come cheap. VMware’s ROI calculator might claim it’s cheap, but I’m not convinced yet. The Xen/KVM options look very stable, but they don’t come with decent management tools. In short, there is a lot of work which needs to be done just to pick a platform.

A private Hadoop cluster is still cloud infrastructure. A lot of organizations are now switching to similar batch-processing clouds which can be shared across different kinds of jobs. And there are still others who could decide to invest in smarter deployment and automation scripts to fully utilize their private infrastructure without using virtualization.

Overhead of the shared cloud: The larger an organization is, the more difficult it is for it to migrate to a shared cloud. In fact, migrating an existing live RDBMS-based application over to the cloud would be impossible without significant resources to re-architect the whole application and datastore. These organizations also have extensive, well-tested security policies and guidelines in place, all of which would have to be thrown to the dogs if they had to put their data on a public network over which they have no control. But I do believe this is a temporary problem which will be resolved over time in favor of shared clouds.

Cost: Cloud infrastructure providers are not non-profit organizations. Though they are here to make money, they would still be significantly cheaper for many. But do your homework and make sure you and your management team are OK with giving up infrastructure control for the cost savings.

That being said, here are my predictions for the next couple of years.

  1. Expect to see more non-virtualized application clouds in the enterprise.
  2. Expect the shared cloud providers to get even more cost effective over time as competition increases.
  3. Expect more open-source initiatives to build tools which manage private cloud infrastructure.
  4. Expect more interesting tools which give end users the ability to visualize the actual cost of the resources they are using. Making the cost more transparent could guide developers to design smarter applications.

January 23, 2010

HAProxy : Load balancing

Designing a scalable web architecture would be incomplete without investigating load balancers. There used to be a time when selecting and installing load balancers was an art in itself. Not anymore.

A lot of organizations today use the Apache web server as a proxy (and also as a load balancer) for their backend application clusters. Though Apache is the most popular web server in the world, it is also considered overweight if all you want to do is proxy a web application. The huge codebase Apache comes with, and the separate modules which need to be compiled and configured with it, can quickly become a liability.

HAProxy is a tiny proxying engine which doesn’t have all the bells and whistles of Apache, but is highly qualified to act as an HTTP/TCP proxy server. Here are some of the other wonderful things I liked about it:

  • Extremely tiny codebase. Just two runtime files to worry about, the binary and the configuration file.
  • Compiles in seconds. 10 seconds the last time I did it.
  • Logs to syslog by default
  • Can load balance HTTP as well as regular TCP connections. Can easily load balance most non-HTTP applications.
  • Can do extremely detailed performance (and cookie capture) logging. It can differentiate backend processing time from the end-user request completion time. This is extremely helpful in monitoring performance of backend services.
  • It can do sticky load balancing out of the box
  • It can use application generated cookies instead of self-assigned cookies.
  • It can do health monitoring of the nodes and automatically remove them when the health checks fail
  • And it has a beautiful web interface for application admins who care about the numbers.

A few other notes

  • HAProxy really doesn’t serve any files locally. So it’s definitely not a replacement for your Apache instance if you are using it to serve local files.
  • It doesn’t do SSL, so you still need an SSL termination layer in front of it if you need HTTPS.
  • HAProxy is not the only Apache replacement. Varnish is a strong candidate which can also do caching (with ESI). And while you are at it, do take a look at Perlbal, which also looks interesting.


Finally, here is a sample configuration file with most of the features I mentioned above configured for use. This is the entire thing and should be good enough for a production deployment with minor changes.

global
        # Log to syslog ("loghost" and "logfac" are placeholders for your log host and facility)
        log loghost logfac info
        maxconn 4096
        user webuser
        group webuser
        daemon

defaults
        log     global
        # Enable the built-in statistics page and detailed HTTP logging
        stats   enable
        mode    http
        option  httplog
        option  dontlognull
        option  httpclose
        retries 3
        # Re-dispatch to another server if the cookie-designated one is down
        option  redispatch
        maxconn 2000
        # Connect / client / server timeouts, in milliseconds
        contimeout      5000
        clitimeout      300000
        srvtimeout      300000

listen  http_proxy 0.0.0.0:8000
        # Health check: poll each backend and pull it out of rotation when the check fails
        option httpchk HEAD /app/health.jsp HTTP/1.0
        mode http
        # Sticky load balancing via an inserted SERVERID cookie
        cookie SERVERID insert
        # Capture cookies and request headers into the logs for performance analysis
        capture cookie JSESSIONID len 50
        capture request header Cookie len 200
        capture request header Host len 50
        capture request header Referer len 200
        capture request header User-Agent len 150
        capture request header Custom-Cookie len 15
        # Stickiness keyed on the application-generated JSESSIONID cookie
        appsession JSESSIONID len 32 timeout 3600000

        balance roundrobin
        server server1_name server1:8080 weight 1 cookie server1_name_cookie check inter 60000
        server server2_name server2:8080 weight 1 cookie server2_name_cookie check inter 60000


January 20, 2010

ESI: Edge Side Includes

Web page caching gets tricky once personalization is involved. Let’s take Twitter’s public_timeline, for example, which seems perfect for caching. Unfortunately, when a user is logged in, the page also shows that user’s information, so caching the page in its entirety on the web server may not be an option. Another scenario is where parts of a page expire faster than others (they require different cache TTLs). Here again, caching the whole page doesn’t help.

Edge Side Includes (ESI) is a markup language specifically designed to help web servers assemble dynamic content at the web layer.

<esi:include src="www.foo.com"/>


The above ESI tag is similar to include tags in JSP/PHP/etc., which let one page refer to another page for parts of its content. By breaking the page up into smaller objects, the web server can apply different TTL settings (and user validation) to different parts of the content. Twitter used to (and may still) use Varnish, which supports a subset of the ESI specification out of the box.
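
Varnish (or an Akamai edge node) does the actual assembly, but the idea is easy to mimic. The toy Python sketch below only illustrates the per-fragment TTL idea; the fragment names, TTL values and markup are made up and have nothing to do with how Varnish implements ESI.

import time

_cache = {}

def cached_fragment(key, ttl, render):
    """Return a cached copy of a fragment if it is still fresh, otherwise re-render it."""
    now = time.time()
    hit = _cache.get(key)
    if hit and now - hit[0] < ttl:
        return hit[1]
    html = render()
    _cache[key] = (now, html)
    return html

def assemble_page(user):
    # The equivalent of two <esi:include> tags with different freshness rules:
    # the public timeline is shared and cached for 60 seconds, the user box is per-request.
    timeline = cached_fragment("public_timeline", 60, lambda: "<div>timeline html</div>")
    user_box = "<div>hello %s</div>" % user   # personalized, never cached here
    return timeline + user_box

print(assemble_page("alice"))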

But caching on the web server may not be the real reason why this language was invented. ESI is also supported by Akamai (a CDN) on its edge caching product. By letting Akamai edge nodes do the assembly close to the user, sites can significantly improve perceived end-user performance without giving up personalization or content freshness requirements.

January 19, 2010

Google patents MapReduce: “System and method for efficient large-scale data processing”

After filing in 2004, Google finally got its patent on “System and method for efficient large-scale data processing” approved yesterday.

Gigaom pointed out that if Google really wants to enforce it, it would have to go after many different vendors who are implementing “mapreduce” in some form in their applications and databases.

Google’s intentions about how it will use the patent are not clear, but this is what one of its spokespeople said:

Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops. While we do not comment about the use of this or any part of our portfolio, we feel that our behavior to date has been inline with our corporate values and priorities.

January 18, 2010

Heroku platform for scalable web applications

I’m so locked up in my own Java world that I didn’t realize something as cool as this existed in the Ruby world.

Heroku is the instant ruby platform. Deploy any ruby app instantly with a simple and familiar git push. Take advantage of advanced features like HTTP caching, memcached, rack middleware, and instant scaling built into every app. Never think about hosting or servers again.

From a layman’s point of view, Heroku looks like a Ruby version of GAE (Google App Engine). It has some of the same features as GAE. But unlike GAE, Heroku actually talks about its architecture in great detail.

They use Nginx as the front-end HTTP reverse proxy server and Varnish for caching right behind Nginx. They wrote their own custom software to “route” requests between the web frontend and the backend services. The actual user code runs on the “Dyno Grid”, where each dyno looks like a self-contained Ruby instance running the user’s code (compiled slugs).

There can be multiple “dynos” on the same server, and a user application can use multiple “dynos” on the same or different servers. Since each “dyno” comes preconfigured with the user’s database and cache connection information, there is absolutely nothing else (configuration-wise) a “compiled slug” needs to do its job.


The “routing mesh” tracks detailed performance data for each of the apps and load balances as required. An unresponsive “dyno” is marked and replaced automatically. Based on the documentation they can initialize new dynos in about 2 seconds.


A dyno, in case you are curious, is a single process running your code, somewhat like a JRE container. It looks like they run about 4 dynos for each core (CPU) on a server. The POSIX view of the system available to the Ruby VM is read-only, and though they don’t use OS virtualization to separate each dyno, they do separate them using Unix permissions. I guess that means each dyno has its own unique userid/groupid pair. I don’t have much experience with Ruby, but for those who care, they use plain-vanilla MRI Ruby.

Just like GAE/Python uses a stripped-down version of Django and GAE/Java uses a stripped-down version of Jetty as the app server, a dyno uses a thin version of Mongrel. It also uses Rack / Rack middleware for the app’s interaction with Mongrel and the web server.

Now here is another interesting implementation choice they went with. To update your app, all you have to do is push your changes using Git, and Heroku will take care of compiling your slug and deploying it for you. I wish GAE was like this.

The pricing looks slightly higher than raw EC2 costs, but you need to understand that Heroku is a platform (PAAS) and not infrastructure (IAAS). They take care of the stuff you would otherwise have to struggle with on AWS.

They also have some pretty interesting “add-ons”. The one I liked was “Websolr”, an implementation of the Solr full-text search engine, which is in turn based on Lucene.

I’m curious whether any of you have used Heroku; please comment on what you think of it. The devil is in the details.

Related interesting Links:

  1. http://highscalability.com/heroku-simultaneously-develop-and-deploy-automatically-scalable-rails-applications-cloud
  2. http://sazbean.com/2008/05/29/interview-with-james-lindenbaum-ceo-of-heroku/
  3. http://ec2onrails.rubyforge.org/

Dilbert and the cloud

Dilbert.com

January 17, 2010

Architecting for the Cloud: Best practices

Amazon has published another “best practices” document. This one covers almost the entire collection of services. It’s biased towards AWS (obviously), but it’s still one of the best summary descriptions of the various services Amazon offers today.

[Diagram from the AWS paper: how the various services fit together]

Just the diagram above tells a lot about how the various AWS services interact with each other. Here is another small section from the document.

AWS specific tactics to automate your infrastructure

  1. Define Auto-scaling groups for different clusters using the Amazon Auto-scaling feature in Amazon EC2.
  2. Monitor your system metrics (CPU, Memory, Disk I/O, Network I/O) using Amazon CloudWatch and take appropriate actions (launching new AMIs dynamically using the Auto-scaling service) or send notifications.
  3. Store and retrieve machine configuration information dynamically: Utilize Amazon SimpleDB to fetch config data during boot-time of an instance (e.g. database connection strings). SimpleDB may also be used to store information about an instance such as its IP address, machine name and role.
  4. Design a build process such that it dumps the latest builds to a bucket in Amazon S3; download the latest version of the application from it during system startup.
  5. Invest in building resource management tools (automated scripts, pre-configured images) or use smart open source configuration management tools like Chef, Puppet, CFEngine or Genome.
  6. Bundle Just Enough Operating System (JeOS) and your software dependencies into an Amazon Machine Image so that it is easier to manage and maintain. Pass configuration files or parameters at launch time and retrieve user data and instance metadata after launch.
  7. Reduce bundling and launch time by booting from Amazon EBS volumes and attaching multiple Amazon EBS volumes to an instance. Create snapshots of common volumes and share snapshots among accounts wherever appropriate.
  8. Application components should not assume health or location of hardware it is running on. For example, dynamically attach the IP address of a new node to the cluster. Automatically failover and start a new clone in case of a failure.
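
As a small illustration of point 6 above (retrieving instance metadata and user data after launch), here is a minimal Python sketch using only the standard library. It assumes it is running on an EC2 instance, where the well-known 169.254.169.254 metadata endpoint is reachable.

from urllib.request import urlopen

METADATA_ROOT = "http://169.254.169.254/latest"

def fetch(path):
    """Read one value from the EC2 instance metadata service (only works on EC2)."""
    with urlopen("%s/%s" % (METADATA_ROOT, path), timeout=2) as response:
        return response.read().decode().strip()

instance_id = fetch("meta-data/instance-id")
user_data = fetch("user-data")   # e.g. configuration parameters passed at launch time
print(instance_id, user_data)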

Monitoring large-scale application clusters

Most software engineering organizations build applications with some hooks in place to allow functional tests. Some organizations continuously build and test all software automatically at check-in. And then there are those who have learnt from mistakes, and have built a suite of tests which get triggered at startup to look for problems which could indicate a failed initialization.

The next step in building a scalable web application is creating some form of self-monitoring logic (sometimes called a watchdog) which periodically tests the application (or monitors performance statistics) for problems worth escalating to the operations team.

Arnon has a couple of interesting posts (1, 2) on the topic which I came across today. He summarized the whole suggestion using three acronyms, and I’m going to add one more to make it complete:

  1. BBIT: Build-time Built-in Tests
  2. PBIT: Power-on Built-in Tests
  3. CBIT: Continuous Built-in Tests
  4. IBIT: Initiated Built-in Tests

The organization I work for is relatively small in terms of the number of servers, but one thing we learnt early on was that waiting for customers to report problems is the worst way to find out about issues. Setting up monitoring using something like Nagios/OpenView/curl scripts/etc. may work, but it’s not very scalable. Besides, black-box testing without insight into the application may not be sufficient for more complex applications.

To increase automation, reduce dependence on central monitoring infrastructure, and speed up failure detection (or prediction), it might be important to have some tests in all four of the categories listed above.
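
To make the CBIT idea a little more concrete, here is a rough Python sketch of a watchdog loop; the check functions and the escalation hook are hypothetical placeholders for whatever makes sense in your application.

import threading
import time

def check_database():
    # Hypothetical self-test; replace with a real connectivity/latency probe.
    return True

def check_queue_depth():
    # Hypothetical self-test; replace with a real backlog threshold check.
    return True

CHECKS = {"database": check_database, "queue_depth": check_queue_depth}

def escalate(name):
    # Stand-in for paging or alerting the operations team.
    print("self-test failed: %s" % name)

def watchdog(interval=60):
    """Continuous built-in tests: run every check periodically and escalate failures."""
    while True:
        for name, check in CHECKS.items():
            try:
                healthy = check()
            except Exception:
                healthy = False
            if not healthy:
                escalate(name)
        time.sleep(interval)

# Run in the background alongside the application.
threading.Thread(target=watchdog, daemon=True).start()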

January 16, 2010

Understanding Cloud computing efficiency

Picking a cloud service, unfortunately, is at times far more complex than picking a brand new car. I remember how torn I was between a Honda hybrid, which came with some tax rebates and a carpool sticker, and a non-hybrid which was significantly cheaper. Understanding the short-term and long-term benefits is the key.

Today AWS is not the only game in town. There are lots of other reliable (or some flavor thereof) options: GoGrid, Joyent, Microsoft and Google App Engine are some.

Here are the key differences which one should understand before deciding which one to go for.

* IAAS (Infrastructure as a Service) providers like AWS (EC2) and Rackspace provide virtual infrastructure which you can manage and control. In most cases you are billed by a unit of time, and you have control to increase or decrease the resources available to your application. PAAS (Platform as a Service), on the other hand, only provides APIs for your application; PAAS-based infrastructure is usually billed by the number of requests or by the CPU cycles spent serving them. Microsoft’s Azure places itself somewhere in between these two paradigms, which makes this even more interesting.

* If your application’s resource requirements fluctuate a lot on a daily basis and you don’t want to invest in building a scalable architecture and the logic to manage/monitor scaling up and down, then a PAAS-based service might help you. But if you want higher performance and more control over your code and infrastructure (and the way it scales), then IAAS is the way to go.

* If you have consistent load throughout the year, you should think about reserving resources for the longer term if possible; it could turn out to be cheaper. But at the same time, the more servers/resources you reserve, the more you spend, and there is a point beyond which it might be cheaper to host the infrastructure yourself (see the break-even sketch after this list).

* If your application has short but high CPU peaks, you should look for a vendor which doesn’t impose a performance ceiling. “The BitSource” did some performance tests comparing Rackspace and Amazon EC2 which explain this problem very well.

* Finally, if you already have a large computing infrastructure within your organization and want more “long term” computing resources, then based on the studies I have seen, it’s cheaper to set up and manage new servers/storage within the organization than to outsource to AWS/Rackspace.
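
The break-even math behind the reserved-versus-on-demand point is simple enough to sanity-check yourself. The numbers below are purely illustrative placeholders, not real AWS or hosting prices; plug in current rates for your region and instance type.

HOURS_PER_YEAR = 24 * 365

# Illustrative placeholder prices -- not real rates.
ON_DEMAND_PER_HOUR = 0.10
RESERVED_UPFRONT = 300.0
RESERVED_PER_HOUR = 0.04

def yearly_on_demand(utilization):
    return ON_DEMAND_PER_HOUR * HOURS_PER_YEAR * utilization

def yearly_reserved(utilization):
    return RESERVED_UPFRONT + RESERVED_PER_HOUR * HOURS_PER_YEAR * utilization

# Reserving only pays off once the instance is busy enough of the time.
for utilization in (0.25, 0.50, 1.00):
    print(utilization, round(yearly_on_demand(utilization), 2), round(yearly_reserved(utilization), 2))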

At the end of the day, remember that vendors are there to make money as well. If you plan to make a significant long-term investment in cloud services, you should do some research to make sure it really is the cheapest solution.