December 20, 2010
Other than the fun stuff I do outside my regular job, I've been in IT-related roles for as long as I can remember. And while IT has been a very challenging and exciting field to be in, I feel that it's time for a little exploration.
I will deeply miss all of my friends at Ingenuity, some of whom I've worked with for over 10 years... but I'm ready for my next challenge.
January 2011 will start a little differently for me after 10 long years. I've accepted a position in the Google Apps Enterprise group and will be joining them early next month.
December 18, 2010
A few weeks ago I mentioned that Yahoo! Labs was working on something called S4 for real-time data analysis. Yesterday they released an 8-page paper with a detailed description of how and why they built it. Here is the abstract from the paper.
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
It's interesting to note that the authors compared S4 with MapReduce and explained that MapReduce is too optimized for batch processing to be a good fit for real-time computation. They also made an architectural decision not to build a single system that does both offline (batch) processing and real-time processing, since they feared such a system would end up being good at neither.
December 07, 2010
I found the top two stories on scalebig last night interesting enough to dig a little deeper. The one which surprised me the most was William Vambenepe's post about why he thinks REST APIs don't matter in the context of cloud management. While REST might be ideal for many different things, including web-based applications accessed mostly by browsers, Amazon chose to avoid REST for most of its infrastructure-management APIs.
Has this lack of RESTfulness stopped anyone from using it? Has it limited the scale of systems deployed on AWS? Does it limit the flexibility of the Cloud offering and somehow force people to consume more resources than they need? Has it made the Amazon Cloud less secure? Has it restricted the scope of platforms and languages from which the API can be invoked? Does it require more experienced engineers than competing solutions?
I don't see any sign that the answer is "yes" to any of these questions. Considering the scale of the service, it would be a multi-million dollar blunder if indeed one of them had a positive answer.
Here's a rule of thumb. If most invocations of your API come via libraries for object-oriented languages that more or less map each HTTP request to a method call, it probably doesn't matter very much how RESTful your API is.
The Rackspace people are technically right when they point out the benefits of their API compared to Amazon's. But it's a rounding error compared to the innovation, pragmatism and frequency of iteration that distinguishes the services provided by Amazon. It's the content that matters.
And the other big news was of course the launch of a new cloud datastore by Salesforce at Database.com. Interestingly, they decided to brand it with its own website instead of making it part of their existing set of services. It's possible they did this to distance the new service from the impression that it's only useful for applications which need other Salesforce services. For more in-depth technical information, continue reading here.
Other references: Dreamforce 2010: Database.com Launch Interview with Eric Stahl
The infrastructure promises automatic tuning, upgrades, backups and replication to remote data centers, as well as automatic creation of sandboxes for development, test and training. Database.com offers enterprise search services, allowing developers to access a full-text search engine that respects enterprise security rules.
In terms of pricing, Database.com access will be free for 3 users, up to 100,000 records, and 50,000 transactions per month. The platform will cost $10 per month for each set of 100,000 records beyond that, and another $10 per month for each set of 150,000 transactions beyond that benchmark. The enterprise-level services will be an additional $10 per user per month and will include user identity, authentication and row-level security access controls.
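Worked out in code, the announced tiers come to something like this (a sketch only: the function name is made up, and rounding partial tiers up to full $10 blocks is my assumption, not something stated in the announcement):

```python
import math

def dbcom_monthly_cost(records, transactions):
    """Estimate Database.com monthly cost from the announced tiers.

    Free tier: 100,000 records and 50,000 transactions/month.
    Beyond that: $10 per 100,000 records, $10 per 150,000 transactions.
    """
    cost = 10 * math.ceil(max(records - 100_000, 0) / 100_000)
    cost += 10 * math.ceil(max(transactions - 50_000, 0) / 150_000)
    return cost

# e.g. 350,000 records and 200,000 transactions in a month:
print(dbcom_monthly_cost(350_000, 200_000))  # 30 + 10 = 40
```

Per-user enterprise services ($10/user/month) would be an additional line item on top of this.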
December 06, 2010
So when I was trying to pick a challenge for this morning's hackathon, I picked one around the "Amazon Route 53" service. By the end of the day I had an almost functional public dynamic DNS service, using Route 53 as the DNS service and Twitter's OAuth service for authentication. The final hack is up here: http://www.flagthis.com/. You are most welcome to play with it and/or use it.
After the initial creation of a user in the system (with a little help from Twitter's OAuth), the end user is free to use the browser-based web application, or a lynx- or curl-based REST interface, to add/create/update host records. The current version only supports "A" records, but it could be expanded to other record types if there is enough interest.
A one-line script in cron is all you need to make sure all your DHCP-based services are always publicly addressable over the internet.
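That cron entry might look something like the following (the flagthis update endpoint, its parameter names, and the external IP-echo service used here are all assumptions on my part, not the actual API):

```
# m   h  dom mon dow  command -- refresh this host's A record every 10 minutes
*/10  *  *   *   *    curl -s "http://www.flagthis.com/update?host=myhost&ip=$(curl -s http://icanhazip.com)" >/dev/null 2>&1
```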
The current version of flagthis provides addresses in the following format: HOSTNAME.TWITTERHANDLE.dynamic.flagthis.com
Here are some of my other observations, as I tried to use Route 53 for the first time today.
- I was forced to install some CPAN modules on my EC2 instance, which sucked up a lot of time.
- There are two scripts you really need from Amazon's code library to learn the basics of the service: "route53zone.pl" and "dnscurl.pl". The first one helps you create the XML file which in turn creates new "zones", and the second is what gets used to upload and execute new requests using XML files.
- "dnscurl.pl" requires your secret and key in a file in your home directory, with the file's permissions set to 600.
- Once the basic commands started working, I focused on more complex requests like "update" and "delete".
- That's when I figured out that there is no concept of an "update" in Route 53. If you want to update a record, you have to send a "delete" and a "create" request. To avoid disrupting a live system, they recommend that the deletion and creation happen in the same request (which acts like an atomic transaction).
- Another thing I realized is that the only way to delete an old record is to have complete information about it from when it was created. For example, let's say you created an entry "xyz.flagthis.com" A 10.10.10.10 with 1400 as the TTL; you then have to send all of that information again at deletion time (including the TTL) to delete it. This was slightly frustrating.
- At the end of the day I couldn't figure out a quick way to list all the hosts in a zone, or to search for the details of a particular record in a zone. This is next on my list. The way I got around this was by using "dig" to get that information over the DNS protocol instead of the HTTP-based API.
- Every change request submitted has a 10- to 15-character request ID associated with it. A client can poll the status of this request ID at any time. Since I was playing with a small number of changes, most of them completed within a second of submission. While the change is happening, the status of the request is "PENDING", and it switches to "INSYNC" as soon as the change is complete.
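For the delete-then-create dance described above, both changes can ride in a single ChangeResourceRecordSets request body, which is what makes the update atomic. A sketch of that XML (per the 2010-10-01 API; the zone and record values are the made-up ones from the example above, with 10.10.10.11 as the hypothetical new address):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ChangeResourceRecordSetsRequest xmlns="https://route53.amazonaws.com/doc/2010-10-01/">
  <ChangeBatch>
    <Comment>Atomic update: delete the old A record, create the new one</Comment>
    <Changes>
      <Change>
        <Action>DELETE</Action>
        <ResourceRecordSet>
          <Name>xyz.flagthis.com.</Name>
          <Type>A</Type>
          <TTL>1400</TTL>
          <ResourceRecords>
            <ResourceRecord><Value>10.10.10.10</Value></ResourceRecord>
          </ResourceRecords>
        </ResourceRecordSet>
      </Change>
      <Change>
        <Action>CREATE</Action>
        <ResourceRecordSet>
          <Name>xyz.flagthis.com.</Name>
          <Type>A</Type>
          <TTL>1400</TTL>
          <ResourceRecords>
            <ResourceRecord><Value>10.10.10.11</Value></ResourceRecord>
          </ResourceRecords>
        </ResourceRecordSet>
      </Change>
    </Changes>
  </ChangeBatch>
</ChangeResourceRecordSetsRequest>
```

Note how the DELETE half has to carry the full original record, TTL included.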
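Checking on a change boils down to a GET of /2010-10-01/change/&lt;request-id&gt; (via dnscurl.pl) and reading the Status element out of the response. A minimal sketch of that parsing step, run against a made-up response body shaped like what the API returns (the change ID and timestamp are invented):

```python
import xml.etree.ElementTree as ET

# Made-up GetChange response; element names and namespace follow the
# 2010-10-01 Route 53 API, but the ID and timestamp are fabricated.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<GetChangeResponse xmlns="https://route53.amazonaws.com/doc/2010-10-01/">
  <ChangeInfo>
    <Id>/change/C2682N5HXP0BZ4</Id>
    <Status>INSYNC</Status>
    <SubmittedAt>2010-12-06T18:30:00.000Z</SubmittedAt>
  </ChangeInfo>
</GetChangeResponse>"""

NS = "{https://route53.amazonaws.com/doc/2010-10-01/}"

def change_status(xml_body):
    """Return the Status element (PENDING or INSYNC) of a GetChange response."""
    root = ET.fromstring(xml_body)
    return root.find(NS + "ChangeInfo/" + NS + "Status").text

print(change_status(SAMPLE_RESPONSE))  # INSYNC
```

A polling loop would just repeat the GET until this function stops returning "PENDING".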
Folks who have been using S3/SimpleDB as a datastore for service registries should strongly consider using Route 53 for the same thing. And if you are still running your own DNS servers on EC2, I think it's time to ask yourself whether it's time to move on.
December 05, 2010
Managing DNS has long been considered an art by many. If you manage your own DNS records and run your own external DNS servers, I'm sure you have some stories to share. Unfortunately, unlike most other infrastructure on the internet, DNS screw-ups can get very costly, especially because caching policies tend to keep your mistakes alive long after you have rolled back your changes.
The unforgiving nature of DNS has pushed most, except a few hardcore sysadmins, to avoid DNS hell by choosing a managed service to do it for them. Domain name registrars like Network Solutions, MyDomain and GoDaddy already provide DNS services, but I can't recall any of them providing APIs to make these changes automatically. DynDNS does provide an API to change DNS mappings, but it costs 15 bucks a year for a single host. There might be others which I'm not aware of, but the bottom line is that there is no standard, and it's not cheap.
Customers on AWS today unfortunately have the same problem, and not surprisingly they too prefer to use third-party service providers to monitor, set up and manage DNS records for them. Today AWS is announcing a new service, "Amazon Route 53", which technically isn't a significant breakthrough. But considering the number of users already on AWS and the demand for such a service, this could be one of the biggest game-changing events in the DNS world in the last decade.
The service is pretty cheap, about 12 bucks a year, and gives you a complete set of APIs to create, delete, modify, maintain and query DNS records.
But don't think of this as just a simple DNS service. I can see a whole range of interesting applications being built on top of it in the next few months. The simplest would be something like a dynamic DNS service, which would be cheap to build with Route 53 doing most of the grunt work for you.
Here is how Jeff Barr introduced the service:
Today we are introducing Amazon Route 53, a programmable Domain Name Service. You can now create, modify, and delete DNS zone files for any domain that you own. You can do all of this under full program control: you can easily add and modify DNS entries in response to changing circumstances. For example, you could create a new sub-domain for each new customer of a Software as a Service (SaaS) application. DNS queries for information within your domains will be routed to a global network of 16 edge locations tuned for high availability and high performance.
Route 53 introduces a new concept called a Hosted Zone. A Hosted Zone is equivalent to a DNS zone file. It begins with the customary SOA (Start of Authority) record and can contain other records such as A (IPV4 address), AAAA (IPV6 address), CNAME (canonical name), MX (mail exchanger), NS (name server), and SPF (Sender Policy Framework). You have full control over the set of records in each Hosted Zone.
Here is some more info from Werner Vogels:
Amazon Route 53
Amazon Route 53 is a new service in the Amazon Web Services suite that manages DNS names and answers DNS queries. Route 53 provides Authoritative DNS functionality implemented using a world-wide network of highly-available DNS servers. Amazon Route 53 sets itself apart from other DNS services that are being offered in several ways:
A familiar cloud business model: A complete self-service environment with no sales people in the loop. No upfront commitments are necessary and you only pay for what you have used. The pricing is transparent and no bundling is required and no overage fees are charged.
Very fast update propagation times: One of the difficulties with many of the existing DNS services is their very long update propagation times; sometimes it may even take up to 24 hours before updates are received at all replicas. Modern systems require much faster update propagation, for example to deal with outages. We have designed Route 53 to propagate updates very quickly and give the customer the tools to find out when all changes have been propagated.
Low-latency query resolution: The query resolution functionality of Route 53 is based on anycast, which will route the request automatically to the DNS server that is the closest. This achieves very low-latency for queries which is crucial for the overall performance of internet applications. Anycast is also very robust in the presence of network or server failures as requests are automatically routed to the next closest server.
No lock-in. While we have made sure that Route 53 works really well with other Amazon services such as Amazon EC2 and Amazon S3, it is not restricted to using it within AWS. You can use Route 53 with any of the resources and entities that you want to control, whether they are in the cloud or on premise.
We chose the name "Route 53" as a play on the fact that DNS servers respond to queries on port 53. But in the future we plan for Route 53 to also give you greater control over the final aspect of distributed system naming, the route your users take to reach an endpoint. If you want to learn more about Route 53 visit http://aws.amazon.com/route53 and read the blog post at the AWS Developer weblog.
Here are some of the other comments on this service
December 04, 2010
Scalability links for December 4th:
- Presenting - La Brea - An interesting tool which could be used to understand how failures, latency and other annoying issues can impact an application. The tool allows one to insert system calls into an existing application without recompiling the original application.
- What's new in Cassandra 0.7: Secondary indexes - I finally see an example of the promised land!! :) Can't wait to try this out.
- NoCAP Part III GigaSpaces clustering explained.. -
- Devops - The War Is Over - if You Want It -
- Great Introductory Video on Scalability from Harvard Computer Science -
- Strategy: Google Sends Canary Requests into the Data Mine - This is another way of testing code thrown out by continuous deployments. Very nice.
- Very Low-Cost, Low-Power Servers -
- Better Workflow Management in CDH with Oozie 2 -
- Facebook at 13 Million Queries Per Second Recommends: Minimize Request Variance -
- Keeping Customers Happy - Another New Elastic Load Balancer Feature - This might look like a simple feature without much significance... but if you ask real developers who deal with http/https on a day-to-day basis, they will tell you how important it is to know how the requests come in. I'm glad AWS did something about it.
December 02, 2010
- Basic Monitoring of Amazon EC2 instances at 5-minute intervals at no additional charge.
- Elastic Load Balancer Health Checks - Auto Scaling can now be instructed to automatically replace instances that have been deemed unhealthy by an Elastic Load Balancer.
- Alarms - You can now monitor Amazon CloudWatch metrics, with notification to the Amazon SNS topic of your choice when the metric falls outside of a defined range.
- Auto Scaling Suspend/Resume - You can now push a "big red button" in order to prevent scaling activities from being initiated.
- Auto Scaling Follow the Line - You can now use scheduled actions to perform scaling operations at particular points in time, creating a time-based scaling plan.
- Auto Scaling Policies - You now have more fine-grained control over the modifications to the size of your AutoScaling groups.
- VPC and HPC Support - You can now use AutoScaling with Amazon EC2 instances that are running within your Virtual Private Cloud or as Cluster Compute instances.
Kafka is a distributed publish-subscribe messaging system. It is designed to support the following:
- Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop.
Kafka is aimed at providing a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. That kind of ad hoc solution is viable for providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
The use for activity stream processing makes Kafka comparable to Facebook's Scribe or Cloudera's Flume, though the architecture and primitives are very different for these systems and make Kafka more comparable to a traditional messaging system. See our design page for more details.
One of my problems with most cloud ROI worksheets is that they are heavily weighted toward use cases where resource usage is very bursty. But what if your resource requirements aren't bursty? And what if you have a use case where you have to maintain a small IT team to manage some on-site resources due to compliance and other issues?