Latest Publications

Scalability links for January 19th

Scalability links for January 19th:


Beanstalk, GoGrid and HBase

Top 3 News items from scalebig

  • The big news is that Amazon has got into PAAS big time. I predicted it only a couple of days ago (I said they would launch within the next year). With Beanstalk they plan to provide containers into which users can upload code, letting AWS manage the rest of the complexity around it. They are starting with a Tomcat-based container for now and have mentioned plans to build other containers. Read more about it at “All Things Distributed”.
  • As weird as it sounds, GoGrid is building a private cloud over public infrastructure. They are doing this just to let CIOs claim that they own the servers. This allows CIOs to keep a foot in two boats at the same time. At some point, though, CIOs will have to make a call and abandon one. BTW, this is not very different from managed infrastructure, except that there is now a virtualization toolkit to manage VMs on this managed infrastructure.
  • HBase 0.90.0 has been released, with lots of interesting improvements that many people were waiting for. Alex has some observations here.

How facebook ships code

I stumbled on a fascinating post about how Facebook ships code. This level of detail is rare from an organization this big. It is a very long piece, but here are a few lines from it to entice you to click through.

From framethink

  • as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago.  Nearly doubling staff in under a year!
  • the two largest teams are Engineering and Ops, with roughly 400-500 team members each.  Between the two they make up about 50% of the company.
  • product manager to engineer ratio is roughly 1-to-7 or 1-to-10
  • all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers.  estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.
  • after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)
  • any engineer can modify any part of FB’s code base and check-in at-will
  • very engineering driven culture.  “product managers are essentially useless here.” is a quote from an engineer.  engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime.
  • during monthly cross-team meetings, the engineers are the ones who present progress reports.  product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.”  they really want engineers to publicly own products and be the main point of contact for the things they built.
  • resourcing for projects is purely voluntary.
    • a PM lobbies group of engineers, tries to get them excited about their ideas.
    • Engineers decide which ones sound interesting to work on.
    • Engineer talks to their manager, says “I’d like to work on these 5 things this week.”
    • Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
  • Engineers handle entire feature themselves — front end javascript, backend database code, and everything in between.  If they want help from a Designer (there are a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on.  Same for Architect help.  But in general, expectation is that engineers will handle everything they need themselves.

It’s logical – IAAS users will move to PAAS

Sysadmins love infrastructure control, and I have to say that there was a time when root access gave me a high. It wasn’t until I moved to a web operations team (and gave up my root access) that I realized I was more productive when I wasn’t dealing with day-to-day hardware and OS issues. After managing my own EC2/Rackspace instance for my blog for a few years, I came to another realization today: IAAS (Infrastructure as a Service) might be one of those fads which will give way to PAAS (Platform as a Service).

WordPress is an excellent blogging platform, and I manage multiple instances of it for my blogs (and one for my wife’s blog). I chose to run my own WordPress instance because I loved having the same control I used to have when I was a sysadmin. I not only wanted to run my own plugins, configure my own features and play with different kinds of caching, I also wanted to choose my own Linux distribution (Ubuntu, of course) and make it work the way I always wanted my servers to work. But when it came to patching the OS, taking backups, and updating WordPress and the zillion other plugins, I found it a little distracting, slightly frustrating and extremely time consuming.

Last week I moved one of my personal blogs to blogger.com, and it’s possible that it may not be the last one. What’s important here is not that I picked blogger.com over wordpress.com, but the fact that I’m ready to give up control to be more productive. Amazon’s AWS started off as the first IAAS service provider, but today they provide a whole lot of other managed services like Elastic MapReduce, Amazon Route 53, Amazon CloudFront and Amazon Relational Database Service, which are more PAAS than IAAS.

IAAS is a very powerful tool in the hands of a professional systems admin. But I’m willing to bet that over the next few years fewer organizations will worry about kernel versions and Linux distributions, and they will instead be happy with a simple API to upload “.war” files (if they are running Tomcat, for example) into some kind of cloud-managed Tomcat instance (like how Hadoop runs in Elastic MapReduce). Google App Engine (Java and Python) and Heroku (Ruby based, Salesforce bought them) are two examples of such services today, and I’ll be surprised if AWS doesn’t launch something (or buy someone out) within the next year to do the same.


Splunk : Fastest way to get web operations dashboard running

This is a cross-post from my personal blog.

A few weeks ago I asked a question on Quora about log aggregation. I was surprised to find that no open source solution came close to what I wanted, but I got a lot of suggestions to try out Splunk. So I did.

What I wanted was an aggregation tool which collects, displays and alerts based on events logged by the various webservers across the network, which could be in different datacenters. The organization where I set this up was generating about 300 MB of production haproxy logs per day and something around 200 MB of non-prod logs. Here is why Splunk fit very well in this organization.

1) Log aggregation across multiple servers/datacenters – The organization had already solved this problem by piping haproxy logs through syslog-ng. They used a little bit of filtering to discard logs which are not interesting for Splunk. Syslog-ng can be configured to use TCP instead of UDP to make log delivery reliable. Splunk can also run as a remote agent, but sending raw logs to it might increase the licensing costs.
2) Realtime dashboard – Splunk is a memory and CPU hog, but for smaller amounts of logs, a true realtime dashboard works beautifully. Even with multiple syslog-ng and Splunk servers involved in the log flow, I was able to see realtime graphical dashboards updated within 5 to 10 seconds of the actual requests. That’s pretty impressive, though it may not be too useful for high-volume websites. Generating dashboards on demand, rather than having them update automatically, is a more realistic use of Splunk’s resources, and this again works pretty well as long as too many people are not trying to use it at the same time.
3) Querying/Filtering/Analyzing – Splunk’s query language is very different from SQL, but there are cheatsheets available to help you create queries. The query language is very powerful and is perhaps the toughest part of the learning curve. The results from these queries can be sent to web dashboards or to alerting agents which can trigger emails/pages based on pre-defined conditions.
4) Not just for HTTP logs – It’s important to note that Splunk is not built only for HTTP logs, so it has to be trained to generate the reports you would like. Unlike something like awstats, you have to write your own queries and dashboards (which are in XML). There is extensive documentation available, and the support guys were very helpful when I called. On the other hand, if all you want is an awstats-like dashboard, you could just use Google Analytics.
5) Free/Commercial versions – While the free version can do most of the stuff, there are some key enterprise features for which I’d recommend buying the commercial version. Authentication, LDAP integration, alerting and federation are some of the features which are missing in the free edition. Oh, and phone support.

I’m still not convinced that Splunk is scalable. The biggest issue with Splunk is that the cost of maintaining it goes up with the amount of logs generated per day. Hardware and licensing costs will at some point exceed the cost of developing, architecting and setting up something like hadoop/flume/hive/opentsdb/etc in your own network. But unless you are a big shop, it might be a good idea to postpone that discussion until you really need to have it.


Scalability links for January 13th

Scalability links for January 13th:


Switching roles: next stop Google

January 2011 will start a little differently for me than the last 10 years have. I’ve accepted a position in the Google Apps Enterprise Group and will be joining them early next month.

Other than the fun stuff I do outside my regular job, I’ve been in IT-related roles for as long as I can remember. And while IT has been very challenging and is an exciting field to be in, I feel that it’s time for a little exploration.

I will deeply miss all of my friends at Ingenuity, some of whom I’ve worked with for over 10 years, but I’m ready for my next challenge.


S4: Distributed Stream Computing Platform

A few weeks ago I mentioned that Yahoo! Labs was working on something called S4 for real-time data analysis. Yesterday they released an 8-page paper with a detailed description of how and why they built it. The abstract from the paper is quoted below.

It’s interesting to note that the authors compared S4 with MapReduce and explained that MapReduce is too optimized for batch processing and isn’t the best place to do real-time computation. They also made an architectural decision not to build a system which does both offline (batch) processing and real-time processing, since they feared such a system would end up being good at neither.

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
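
To make the “keyed events routed with affinity to PEs” idea a bit more concrete, here is a minimal single-process sketch in Python. It is not S4’s actual API: the class names (WordCountPE, Router) and the event shape are made up for illustration, and there is no distribution or fault tolerance here. It only shows that every event with the same key reaches the same PE instance, which then emits a new event and contributes to a published result.

```python
class WordCountPE:
    """Toy Processing Element: one instance per key, counts events for that key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += 1
        # (1) emit an event that a downstream PE could consume
        return [("count_updated", self.key, self.count)]


class Router:
    """Routes each keyed event to the PE instance that owns that key."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, key, event):
        # Key affinity: the same key always reaches the same PE instance.
        pe = self.instances.get(key)
        if pe is None:
            pe = self.instances[key] = self.pe_class(key)
        return pe.process(event)


def run(words):
    """Feed a stream of words through the router and collect the output."""
    router = Router(WordCountPE)
    emitted = []
    for word in words:
        emitted.extend(router.dispatch(word, {"word": word}))
    # (2) publish results: here we simply return the final per-key counts
    counts = {key: pe.count for key, pe in router.instances.items()}
    return counts, emitted
```

In a real deployment the router would hash the key to pick a node in the cluster; in this sketch the dictionary lookup stands in for that step.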

Code: http://s4.io/

Authors: Neumeyer L, Robbins B, Nair A, Kesari A
Source: Yahoo! Labs


REST APIs for cloud management and the Database.com launch

I found the top two stories on scalebig last night interesting enough to dig a little deeper. The one which surprised me the most was William Vambenepe’s post about why he thinks REST APIs don’t matter in the context of cloud management. While REST might be ideal for many different things, including web-based applications which are accessed mostly by browsers, Amazon chose to avoid REST for most of its infrastructure management APIs.

Has this lack of RESTfulness stopped anyone from using it? Has it limited the scale of systems deployed on AWS? Does it limit the flexibility of the Cloud offering and somehow force people to consume more resources than they need? Has it made the Amazon Cloud less secure? Has it restricted the scope of platforms and languages from which the API can be invoked? Does it require more experienced engineers than competing solutions?

I don’t see any sign that the answer is “yes” to any of these questions. Considering the scale of the service, it would be a multi-million dollars blunder if indeed one of them had a positive answer.

Here’s a rule of thumb. If most invocations of your API come via libraries for object-oriented languages that more or less map each HTTP request to a method call, it probably doesn’t matter very much how RESTful your API is.

The Rackspace people are technically right when they point out the benefits of their API compared to Amazon’s. But it’s a rounding error compared to the innovation, pragmatism and frequency of iteration that distinguishes the services provided by Amazon. It’s the content that matters.

And the other big news was, of course, the launch of a new cloud datastore by Salesforce at Database.com. Interestingly, they decided to brand it with its own website instead of making it part of their existing set of services. It’s possible they did it to distance this new service from the impression that it’s only useful for applications which need other Salesforce services. For more in-depth technical information continue reading here.

The infrastructure promises automatic tuning, upgrades, backups and replication to remote data centers, and automatic creation of sandboxes for development, test and training. Database.com offers enterprise search services, allowing developers to access a full-text search engine that respects enterprise security rules.

In terms of pricing, Database.com access will be free for 3 users, and up to 100,000 records and 50,000 transactions per month. The platform will cost $10 per month for each set of 100,000 records beyond that and another $10 per month for each set of 150,000 transactions beyond that benchmark. The enterprise-level services will be an additional $10 per user per month and will include user identity, authentication and row-level security access controls.
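
As a rough worked example of those numbers, here is a small Python sketch. It rests on assumptions the announcement does not spell out: that a partially used set of records or transactions is billed as a full set, and that the enterprise add-on is simply $10 times the number of users who have it. The function name and parameters are made up for illustration, and pricing for users beyond the free 3 is not modelled.

```python
import math

FREE_RECORDS = 100_000        # records included in the free tier
FREE_TRANSACTIONS = 50_000    # transactions per month included in the free tier
RECORD_SET = 100_000          # $10 per additional set of this many records
TRANSACTION_SET = 150_000     # $10 per additional set of this many transactions
PRICE_PER_SET = 10            # dollars per month
ENTERPRISE_PER_USER = 10      # dollars per user per month


def monthly_cost(records, transactions, enterprise_users=0):
    """Estimated monthly bill in dollars (assumes partial sets round up)."""
    extra_records = max(0, records - FREE_RECORDS)
    extra_transactions = max(0, transactions - FREE_TRANSACTIONS)
    record_sets = math.ceil(extra_records / RECORD_SET)
    transaction_sets = math.ceil(extra_transactions / TRANSACTION_SET)
    return (PRICE_PER_SET * (record_sets + transaction_sets)
            + ENTERPRISE_PER_USER * enterprise_users)
```

For instance, 250,000 records and 400,000 transactions with 5 enterprise users works out to 2 record sets ($20), 3 transaction sets ($30) and $50 of enterprise services, so monthly_cost(250_000, 400_000, 5) returns 100 under these assumptions.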

Other references: Dreamforce 2010 – Database.com Launch Interview with Eric Stahl


Providing Dynamic DNS over “Amazon Route 53” ( a hackathon )

In hindsight, yesterday’s “Route 53” announcement was not completely unexpected. Amazon is an IAAS provider and it’s in their own interest to automate infrastructure as much as possible. After tackling monitoring and CloudFront features, DNS was one of the more obvious targets for improvement.

So when I was trying to pick a challenge for this morning’s hackathon, I picked one around the “Amazon Route 53” service. At the end of the day I had an almost functional public dynamic DNS service using “Route 53” as the DNS service and Twitter’s OAuth service for authentication. The final hack is up here: http://www.flagthis.com/. You are most welcome to play with it and/or use it.

After the initial creation of a user in the system (with a little help from Twitter’s OAuth), the end user is free to use the browser-based web application, or a lynx- or curl-based REST interface, to create, update and delete host records. The current version only supports “A” records, but it will be expanded to other record types if there is enough interest.

A one-line script in cron is all you need to make sure all your DHCP-based services are always publicly addressable over the internet.

The current version of flagthis provides addresses in the following format: HOSTNAME.TWITTERHANDLE.dynamic.flagthis.com
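
As a small illustration of that naming scheme, the sketch below builds the full name from a host label and a Twitter handle. This is not flagthis’s actual code: the function name is hypothetical, and the lowercasing, the stripping of a leading “@”, and the host-label check (letters, digits and hyphens, 1 to 63 characters) are my own assumptions. The handle is not validated here.

```python
import re

# Assumption: the host part follows the usual hostname label rule
# (letters, digits, hyphens; 1 to 63 chars; no leading or trailing hyphen).
LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")
SUFFIX = "dynamic.flagthis.com"


def dynamic_hostname(hostname, twitter_handle):
    """Return HOSTNAME.TWITTERHANDLE.dynamic.flagthis.com in lowercase."""
    host = hostname.strip().lower()
    handle = twitter_handle.strip().lstrip("@").lower()
    if not LABEL.match(host):
        raise ValueError(f"invalid host label: {hostname!r}")
    return f"{host}.{handle}.{SUFFIX}"
```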

Here are some of my other observations, as I tried to use Route 53 for the first time today.

  • I was forced to install some CPAN modules on my EC2 instance, which sucked up a lot of time.
  • There are two scripts you really need from Amazon’s code library to learn the basics of the service: “route53zone.pl” and “dnscurl.pl”. The first one helps you create the XML file which in turn creates new “zones”. The second script is what gets used to upload and execute new requests using XML files.
  • “dnscurl.pl” requires your secret and key in a file in your home directory, with 600 as the file’s permissions.
  • Once the basic commands started working, I focused on more complex requests like “update” and “delete”.
  • That’s when I figured out that there is no concept of “update” in Route 53. If you want to update a record you have to send a “delete” and a “create” request. To avoid disrupting a live system, they recommend that the deletion and creation happen in the same request (which acts like an atomic transaction).
  • Another thing I realized is that the only way to delete an old record is to have complete information about that record from when it was created. For example, let’s say you created an entry “xyz.flagthis.com” A 10.10.10.10 with 1400 as the TTL; then you have to send all of this information again at deletion time (including the TTL) to delete it. This was slightly frustrating.
  • At the end of the day I couldn’t figure out a quick way to list all the hosts in a zone, or search for details of a particular record in a zone. This is next on my list. I got around this by using “dig” to get all that information over the DNS protocol instead of the HTTP-based API.
  • Every change request submitted has a 10 to 15 character request ID associated with it. A client can poll the status of this request ID anytime. Since I was playing with a small number of changes, most of them completed within a second of submission. While the change is happening, the status of the request is set to “PENDING”, and it switches to “INSYNC” as soon as the change is complete.
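
The delete-plus-create pattern and the status polling described above can be sketched in Python as follows. Treat this as an illustration rather than a drop-in client: the XML namespace assumes the 2010-10-01 API version that was current at launch, the helper names are made up, and the sketch only builds the request body. Signing and submitting it (for example with “dnscurl.pl”) is left out, and get_status is a hypothetical callable you would supply to look up a change ID.

```python
import time
import xml.etree.ElementTree as ET

# Assumption: namespace of the 2010-10-01 Route 53 API version.
NS = "https://route53.amazonaws.com/doc/2010-10-01/"


def _change(action, name, rtype, ttl, value):
    """Build one <Change> element; DELETE must repeat the original TTL and value."""
    change = ET.Element("Change")
    ET.SubElement(change, "Action").text = action
    rrset = ET.SubElement(change, "ResourceRecordSet")
    ET.SubElement(rrset, "Name").text = name
    ET.SubElement(rrset, "Type").text = rtype
    ET.SubElement(rrset, "TTL").text = str(ttl)
    records = ET.SubElement(rrset, "ResourceRecords")
    record = ET.SubElement(records, "ResourceRecord")
    ET.SubElement(record, "Value").text = value
    return change


def update_a_record(name, old_ip, old_ttl, new_ip, new_ttl):
    """Return one change batch that deletes the old A record and creates the new one."""
    root = ET.Element("ChangeResourceRecordSetsRequest", xmlns=NS)
    batch = ET.SubElement(root, "ChangeBatch")
    changes = ET.SubElement(batch, "Changes")
    changes.append(_change("DELETE", name, "A", old_ttl, old_ip))
    changes.append(_change("CREATE", name, "A", new_ttl, new_ip))
    return ET.tostring(root, encoding="unicode")


def wait_until_insync(get_status, interval=1.0, max_polls=30):
    """Poll get_status() until it returns "INSYNC"; give up after max_polls tries."""
    for _ in range(max_polls):
        if get_status() == "INSYNC":
            return True
        time.sleep(interval)
    return False
```

Putting both changes in one batch is what makes the “update” behave like a single transaction, so there is no window in which the name fails to resolve.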

Folks who have been using S3/SimpleDB as a datastore for a service registry should strongly consider using “Route 53” for the same thing. And if you are still running your own DNS servers on EC2, I think it’s time to ask yourself whether you should move on.
