Scaling Powerset using Amazon’s EC2 and S3
The first thing most doc-com companies do before going public is setup an infrastructure to provide the service. And though it might sound straight forward to most of you, it can be a very expensive affair. To come up with the right kind of infrastructure for any new service a few key architectural decisions have to be made.
- Design the infrastructure and architecture to handle the traffic peaks (not average)
- Design with long term scaling in mind
- Design power and cooling infrastructure to support the servers
- Hire Hardware+Systems+network support staff ( more if 24/7 operations are required)
- Add buffers to support failures and short term grow requirements
- Take into account lead times of ordering and procuring hardware (which can be weeks if not months)..
- And a few others.. which I won’t bore you with here.
The point is that the initial capital investment can be in millions even before the first customer starts using the service. And once the capital investment is made, it is very difficult to scale down the operations if the plans change.
Powerset Inc is a secretive search startup with ambitions of out-smarting Google in its own turf. Based in San Francisco, this search startup is working on building a better search engine using natural language processing capabilities to understand the users question a little better before answering it. And just like any other search company its technology is a CPU hungry beast just waiting to be unleashed. Powerset could have gone the way most dot-com companies have gone, but instead they decided to try out Amazon’s EC2 (Elastic Cloud Computing) and S3(Simple Storage Service) to augment their computational needs.
Powerset has been repoted to be testing a 400 server instance EC2 cluster with Hadoop running Map/Reduce and HDFS (Hadoop Distributed File system). This does get a little tricky on EC2 because of absence of persistent storage (OS is re-initialized after every reboot). So they use a combination of HDFS and remote copying process to sync the data to their local network. Since there is no charge to move data from EC2 to S3, they have been thinking about implementing a native HDFS and S3 interface to move data around within Amazon’s network itself.
EC2 is charged on a per instance per hour usage basis, which means Powerset can bring new nodes online during heavy demand and shut off unused nodes at a flick of a switch at night. Powerset guys also built their own EC2 image configured to automatically join the HDFS cluster after every boot up. In an event of a node failure, Hadoop can take care of data replication, and EC2 takes care of replacing the failed node with a new one.
Amazon EC2 costs 10 cents an hour per instance. If you have to run a 400 node cluster for 1 month thats only about 30000. Based on the performance benchmarks, it looks like the actual CPU throughput from each of the EC2 instance is roughly equivalent of 1Ghz PIII. 72 dollars a month for that kind of server is not too cheap, but just like car leasing, atleast u don’t have to pay upfront and manage it.
So lets do the math. A regular AMD 64bit dual core, 2 cpu server with about 8GB of ram costs about 10000 USD which excluds the cost of hosting, power, cooling and maintainance. Based on some comments on Amazon forum this is about 2 to 3 times faster than the EC2′s infrastructure.If you had to replace the CPU computation power of this new hardware with 8 to 12 server instances on EC2, you would have spent about 700 to 800 dollars a month. It will take a company using EC2 infrastructure close to 12 months before they would have to pay 10000 towards EC2 computational services for the same amount of computation power. And remember that 10000 amount didnâ€™t take into account colocation, power, cooling and general server administration which can be significant as well. Also remember that 12 months is actually 12 months of actual computational usage.. which could over a period of 2 to 4 years depending on how often the instances are used.
However, I also have to point out that there are a few things to look out for. The maximum physical memory available is about 1.7GB which is relatively tiny if you are used to 8 to 16 GB of ram on a 64 bit hardware. And though CPU/Memory might scale horizontaly for some applications, cross-server communication can be extreemly expensive for some. Unless your application is designed to scale horizontally with under 1.7 GB of ram, I would seriously advice you against using EC2 until you figure out how to change that.
I’ve blogged about both S3 and EC2 before and it continues to facinate me. Success of companies like Slideshare.net, and the decision of companies like Powerset to use AWS is something which I’ll watch closely over time.
Links about Hadoop, and how to use it on Amazon EC2/S3
- Meet Hadoop (part 1, part 2) (Doug Cutting and Eric Baldeschwieler, OSCON, July 25 2007)
- Hadoop Distributed File System (HDFS) (Dhruba Borthakur, June 2007)
- Introduction To Hadoop (Owen O’Malley, May 2007)
- Hadoop Map/Reduce Architecture (Owen O’Malley, July 2006)
- Scalable Computing with Hadoop (Doug Cutting, May 2006)
- Running Hadoop on Amazon EC2
- Running Hadoop using Amazon S3
- Google: MapReduce in a Week
- CS490h, Winter 2007, University of Washington ( lecture notes & labs)
Other References about Powerset and Amazon