Posts

Showing posts from November, 2010

Google App Engine 1.4.0 pre-release is out

The complete announcement is here, but here are the changes for the Java SDK. The two big changes I liked are the new “Always On” feature and the fact that the “tasks” feature has graduated out of beta/testing. The Always On feature allows applications to pay and keep 3 instances of their application always running, which can significantly reduce application latency. Developers can now enable Warmup Requests. By specifying a handler in an app's appengine-web.xml, App Engine will attempt to send a Warmup Request to initialize new instances before a user interacts with it. This can reduce the latency an end-user sees for initializing your application. The Channel API is now available for all users. Task Queue has been officially released, and is no longer an experimental feature. The API import paths that use 'labs' have been deprecated. Task queue storage will count towards an application's overall storage quota, and will thus

How to setup Amazon Cloudfront ( learning with experimentation )

I have some experience with Akamai’s WAA (Web Application Accelerator) service, which I’ve been using in my professional capacity for a few years now. And I’ve been curious about how CloudFront compares with it. Until a few weeks ago, CloudFront didn’t have a key feature which I think was critical for it to win the traditional CDN customers. “Custom origin” is an amazing new feature which I finally got to test last night, and here are my notes for those who are curious as well. The test application I tried to convert was my news aggregator portal http://www.scalebig.com/ . The application consists of a rapidly changing front page (updated a few times a day), a collection of old pages archived in a subdirectory and some other webpage elements like headers, footers, images, style-sheets etc. While Amazon CloudFront does have a presence on the AWS Management Console, it only supports S3 buckets as origins. Since my application didn’t have any components which requi

Netflix: Dev and Ops internals

I’ve seen a number of posts from Netflix folks talking about their architecture in recent weeks. Part of that is due to an ongoing effort to expand their business, for which they seem to be hiring like crazy. Here is yet another interesting deck of slides which mentions stuff across both Dev and Ops. It is one of the most interesting decks of slides I’ve seen in the recent past. View more presentations from Adrian Cockcroft.

The Cloud: Watch your step ( Google App engine limitations )

Any blog which promotes the concept of cloud infrastructure would be doing an injustice if it didn’t provide references to implementations where it failed horribly. Here is an excellent post by Carlos Ble where he lists out all the problems he faced on Google App Engine (Python). He lists 13 different limitations, most of which are very well known facts, and then lists some more frustrating reasons why he had to dump the solution and look for an alternative. The tone is understandable, and while it might look like App-Engine-bashing, I see it as a great story which others could learn from. For us, GAE has been a failure like Wave or Buzz were but this time, we have paid it with our money. I've been too stubborn just because this great company was behind the platform but I've learned an important lesson: good companies make mistakes too. I didn't do enough spikes before developing actual features. I should have performed more proofs of concept before inv

Sawzall and the PIG

When I heard interesting use cases of how “Sawzall” is used to hack huge amounts of log data within Google, I was thinking about two things. Apache PIG, which is “a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.” CEP (Complex event processing), which consists of processing many events happening across all the layers of an organization, identifying the most meaningful events within the event cloud, analyzing their impact, and taking subsequent action in real time. [Also look at esper] Google has opened parts of this framework in a project called “Szl”. Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It prov

Presentation: “OrientDB, the database of the web”

I knew there was something called “OrientDB”, but didn’t know much about it until I went through these slides. Here is what I learned in one sentence. It's an easy-to-install NoSQL (schemaless) datastore with absolutely no configuration required; it supports ACID transactions; it can be used as a document store, a graph store and a key-value store; it can be queried using SQL-like and JSON syntax; it supports indexing and triggers; and it's been benchmarked to do 150000 inserts using commodity hardware. That’s a lot of features. OrientDB the database for the Web of lvca - Snoopal

Cloud economics: Not really black and white..

While some of the interest in moving towards public cloud is based on sound economics, there is a small segment of this movement that is purely due to the “herd mentality”. The slide on the right, from a Microsoft publication, shows that larger networks may be less economical on the cloud (at least today). Richard Farley has been discussing this very topic for a few months now. He observed that a medium sized organization which already has a decent IT infrastructure, including a dedicated IT staff to support it, has a significantly smaller overhead than what cloud vendors might make it look like. Here is a small snippet from his blog. If you are not afraid to get dirty with numbers, read the rest here. Now, we know we need 300 virtual servers, each of which consumes 12.5% of a physical host.  This means we need a total of 37.5 physical hosts.  Our vendor tells us these servers can be had for $7k each including tax and delivery with the cabinet.  We can’t buy a half server, a
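The host count in the quoted snippet can be checked with a few lines of Python. This is only a sketch of the arithmetic: the excerpt is cut off before it says how the half server is handled, so rounding up to whole hosts is an assumption here, and the variable names are made up for illustration.

```python
import math

# Figures taken from the quoted snippet.
virtual_servers = 300
host_share_per_vm = 0.125    # each VM consumes 12.5% of a physical host
cost_per_host_usd = 7_000    # vendor price per server

exact_hosts = virtual_servers * host_share_per_vm

# Assumption: round up, since a fraction of a server cannot be bought.
hosts_to_buy = math.ceil(exact_hosts)
hardware_cost_usd = hosts_to_buy * cost_per_host_usd

print(exact_hosts, hosts_to_buy, hardware_cost_usd)
```

This covers only the server purchase mentioned in the excerpt; any other costs from the original post are not modeled.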

Cassandra: What is HintedHandoff ?

Nate has a very good post about how Cassandra is different from a lot of other distributed data-stores. In particular he explains that every node in a Cassandra cluster is identical to every other node. After using Cassandra for a few months I can tell you for a fact that it's true. It does come at a price though. Because it's so decentralized, if you want to make a schema change, for example, configuration files of all of the nodes in the cluster need to be updated all at the same time. Some of these problems will go away when 0.7 finally comes out. While it is true that Cassandra doesn’t have a concept of a single “master” server, each node participating in the cluster does actually act as master and slave for parts of the entire key range. The actual % size of the key range owned by an individual node depends on the replication factor, on how one picks the keys and what partitioner algorithm was selected. The fun starts when a node, which could be the master for a range of ke

OpenTSDB – Distributed time series database

Ever since I saw a demo of this tool, I’ve been on edge, waiting for it to be open sourced so that I could use it. The problem it's trying to solve is a real pain point which most webops folks would understand. Yesterday folks at StumbleUpon finally opened it up. It's released under the LGPLv3 license. You can find the source here and the documentation here. At StumbleUpon, we have found this system tremendously helpful to: Get real-time state information about our infrastructure and services. Understand outages or how complex systems interact together. Measure SLAs (availability, latency, etc.) Tune our applications and databases for maximum performance. Do capacity planning.

AWS cloudfront grows up… a little. Now allows Custom origins.

CloudFront has come a long way from its humble beginnings. Here is what Jeff had to say when he announced that it's out of “beta”: First, we've removed the beta tag from CloudFront and it is now in full production. During the beta period we listened to our customers and added a number of important features including Invalidation, a default root object, HTTPS access, private content, streamed content, private streamed content, AWS Management Console support, request logging, and additional edge locations. We've also reduced our prices. There's now an SLA (Service Level Agreement) for CloudFront. If availability of your content drops below 99.9% in any given month, you can apply for a service credit equal to 10% of your monthly bill. If the availability drops below 99% you can apply for a service credit equal to 25% of your monthly bill. While all this is a big step forward, it's probably not enough for the more advanced CDN users to
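The two SLA tiers quoted above boil down to a small lookup. The sketch below only encodes the thresholds stated in the announcement (below 99.9% gives 10%, below 99% gives 25%); the function names and the example bill are hypothetical, and the real SLA terms may carry conditions not shown in the excerpt.

```python
def service_credit_percent(availability_percent: float) -> int:
    """Credit tier, as a percent of the monthly bill, per the quoted SLA."""
    if availability_percent < 99.0:
        return 25
    if availability_percent < 99.9:
        return 10
    return 0  # availability met the 99.9% target, so no credit


def service_credit(monthly_bill: float, availability_percent: float) -> float:
    """Credit amount in the same currency as the bill."""
    return monthly_bill * service_credit_percent(availability_percent) / 100


# Hypothetical example: a 200.00 monthly bill at 99.5% availability.
print(service_credit(200.0, 99.5))
```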

Building your first cloud application on AWS

Building your first web application on AWS is like shopping for a car at Pep Boys, part by part. While manuals to build one might be on aisle 5, the experience of having built one already is harder to buy. Here are some interesting logistical questions, which I don’t think get enough attention, when people discuss issues around building a new AWS based service. Picking the right Linux distribution: Switching OS distribution may not be too simple if your applications need custom scripts. Picking and sticking with a single distribution will save a lot of lost time. Automated server builds: There are many ways to skin this cat. Chef, Puppet, Cfengine are all good... What's important is to pick one early in the game. Multi-Availability Zone support: Find out if multi availability zone support is important. This can impact the overall architecture of the solution and the tools used to implement the solution. Data consistency requirements: Similar to the Multi-AZ support q

Riak MapReduce: A story in Three Acts

  Riak MapReduce: A Story In Three Acts

Rapid prototyping with solr

Extreme prototyping with Solr by Erik Hatcher: At ApacheCon this week I presented “Rapid Prototyping with Solr”.  This is the third time I’ve given a presentation with the same title.  In the spirit of the rapid prototyping theme, each time I’ve created a new prototype just a day or so prior to presenting it.  At Lucene EuroCon the prototype used attendee data, a treemap visualization, and a cute little Solr-powered “app” for picking attendees at random for the conference giveaways.  For a recent Lucid webinar the prototype was more general purpose, bringing in and making searchable rich documents and faceting on file types with a pie chart visualization. This time around, the data set I chose was Data.gov’s catalog of datasets, which fit with the ApacheCon open source aura, and Lucid Imagination’s support of Open Source for America, which helps to advocate for open source in the US Federal Government.  The prototype built includes faceting browsing, query

Shipping Trunk : For web applications

I had briefly blogged about this presentation from Velocity 2010 before. I wish they had released the video for this session. I went through this slide deck again today to see if Paul mentioned any of the problems organizations like ours are dealing with in their transition from quarterly releases to weekly/continuous releases. One of the key observations Paul made during his talk is that most organizations still treat web applications as desktop software and have very strict quality controls, which may not be as necessary since releasing changes for a web app delivered as SaaS (Software as a service) is much cheaper than releasing patches for traditional desktop software. Here are some of the other points he made. For really detailed info check out the [slides]. Deploy frequently, facilitating rapid product iteration; Avoid large merges and the associated integration testing; Easily perform A/B testing of functionality; Run QA and beta testing on production hardware

Real-Time MapReduce using S4

While trying to figure out how to do real-time log analysis in my own organization I realized that most map-reduce frameworks are designed to run as batch jobs in a time-delayed manner rather than be instantaneous like a SQL query to a MySQL DB. There are some frameworks which are bucking the trend. Yahoo! Labs recently announced that their “Advertising Sciences” group has built a general purpose, real-time, distributed, fault-tolerant, scalable, event driven, expandable platform called “S4” which allows programmers to easily implement applications for processing continuous unbounded streams of data. S4 clusters are built using low-cost commoditized hardware, and leverage many technologies from Yahoo!’s Hadoop project. S4 is written in Java and uses the Spring Framework to build a software component architecture. Over a dozen pluggable modules have been created so far. Why do we need a real-time map-reduce framework? Applications such as personalization, user fee

Storage options on app engine

For those who think Google App Engine only has one kind of datastore, the one built around “bigtable”, think again. Nick Johnson goes into the details of all the other options available, with their pros and cons, in his post. App Engine provides more data storage mechanisms than is apparent at first glance. All of them have different tradeoffs, so it's likely that one - or more - of them will suit your application well. Often, the ideal solution involves a combination, such as the datastore and memcache, or local files and instance memory. Storage options he lists: Datastore, Memcache, Instance memory, Local files. Read more: Original post from Nick
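To make the "combination" idea concrete, here is a rough sketch of a layered lookup that tries the fastest storage first and falls back to the slower, durable one. It deliberately does not use the App Engine APIs: plain dictionaries stand in for instance memory, memcache and the datastore, and the `LayeredStore` class with its method names is hypothetical.

```python
class LayeredStore:
    """Read-through lookup: instance memory, then memcache, then datastore."""

    def __init__(self, datastore):
        self.instance_memory = {}   # stand-in for per-instance memory
        self.memcache = {}          # stand-in for the shared cache
        self.datastore = datastore  # stand-in for the durable store

    def get(self, key):
        if key in self.instance_memory:
            return self.instance_memory[key]
        if key in self.memcache:
            value = self.memcache[key]
            self.instance_memory[key] = value  # promote to the fastest tier
            return value
        value = self.datastore.get(key)
        if value is not None:
            # Populate both caches so the next read skips the datastore.
            self.memcache[key] = value
            self.instance_memory[key] = value
        return value

    def put(self, key, value):
        # Write to the durable store first, then refresh the caches.
        self.datastore[key] = value
        self.memcache[key] = value
        self.instance_memory[key] = value


store = LayeredStore({"greeting": "hello"})
print(store.get("greeting"))
```

A real application would also need cache expiry and invalidation across instances, which this sketch leaves out.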

Is auto-sharding ready for auto-pilot ?

James Golick makes a point which a lot of people miss. He doesn’t believe the auto-sharding features NoSQL datastores provide are ready for full auto-pilot yet, and he argues that good developers have to think about sharding as part of the design architecture, regardless of which datastore they pick. If you take at face value the marketing materials of many NoSQL database vendors, you'd think that with a horizontally scalable data store, operations engineering simply isn't necessary. Recent high profile outages suggest otherwise. MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren't there. At the very least one has to understand how auto-sharding in a NoSQL datastore works, how easy it is to setu