Posts

Showing posts from 2009

Cassandra for service registry/discovery service

My last post was about my struggle to find a good distributed ESB/service-discovery solution built on open source tools that was simple to use and maintain. Thanks to reader comments (Dan especially) and some other email exchanges, it seems that building a custom solution is unavoidable if I really want to keep things simple. Dan suggested that I could use DNS to find seed locations for the config store, which would work very well in a distributed network. If security weren’t a concern this seed location could have been on S3 or SimpleDB, but the requirement that it be secured on internal infrastructure forced me to investigate simple replicated/eventually-consistent databases that could be hosted internally in different data centers with little or no long-term administration cost. My search led me to investigate a few different NoSQL options:
- Hadoop (HDFS)
- Dynomite (based on Amazon’s Dynamo)
- MongoDB
- CouchDB
But the one I finally settled on as a po

Service registry (ESB) for scalable web applications.

This blog post is the result of my futile attempts at understanding how others have solved the problem of automatic service discovery. How do organizations that have a huge collection of custom applications design scalable web applications without having to hardcode server names and port numbers in the configuration file? I believe the terminology I’m hinting at is either a “Service Registry” or an “Enterprise Service Bus”, which is part of the whole SOA (service-oriented architecture) world. The organization I work for has a limited multicast-based service announcement/discovery infrastructure, but it is not widely used across all the applications. In addition to the fact that multicast routing can become complicated (ACL management of yet another set of network addresses), it’s also not a solution where parts of applications could reside on the cloud. Amazon’s EC2, for instance, doesn’t allow multicast traffic between its hosts. Microsoft Azure .Net services

Amazon launches Relational Database as a service

It's hard to say why I didn't see this coming even after Amazon launched Hadoop and Hive as a service. There is a huge demand for a relational database on the cloud, and a lot of middlemen are raking in a lot of cash. Today Amazon launched something they call "RDS". The service basically provides an AWS-managed MySQL instance which includes backups and an API-based option to add more nodes. The cheapest RDS service is about 11 cents an hour, and the most expensive is a little over 3 dollars an hour. From cloudave:
Amazon RDS is nothing but a MySQL 5.1 database instance that is exclusively for a particular user and can be accessed via a single API call. The user gets all the capabilities of MySQL database with an additional ability to scale up based on the needs. It rids the customers of any need for time consuming database administration tasks. The patches are applied automatically with the database backed up automatically with a possibility for the user to set the retention

Private clouds: By Amazon

A few days ago I blogged about how VMware is going to do a huge push into “private clouds” around the VMworld 2009 conference. But little did we know that Amazon had something up its sleeve as well. It announced it today. AWS now supports the creation of a Virtual Private Cloud with private address space (including RFC 1918) which can be locked down by a VPN connection to your organization only. You still get most of the benefit of Amazon’s cheap hardware pricing, but you get to lock down the infrastructure for security reasons. Regardless of how you see it, this is huge for IT and the developer community. Some may love it, and I’m sure some will be pretty angry at Amazon for trying to commoditize security and making it look as if network security were as simple as that. With VMware’s announcements next week, there is no doubt in my mind that for the next year at least there will be a significant push towards “private clouds”.
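For readers unfamiliar with RFC 1918: it reserves three IPv4 blocks (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16) for private use. The following is a minimal Python sketch, using only the standard library, of checking whether a candidate address range falls inside one of those blocks. The function name `is_rfc1918` is made up for illustration, and nothing here calls the AWS API.

```python
import ipaddress

# The three private IPv4 blocks reserved by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(cidr):
    """Return True if the given CIDR range lies entirely inside an RFC 1918 block.

    Hypothetical helper for illustration only.
    """
    network = ipaddress.ip_network(cidr)
    if network.version != 4:
        # RFC 1918 only covers IPv4.
        return False
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)


if __name__ == "__main__":
    for candidate in ("10.0.0.0/16", "172.32.0.0/16", "192.168.1.0/24"):
        print(candidate, is_rfc1918(candidate))
```

A check like this is handy when picking an address range that should not collide with publicly routable space.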

VMware: internal + external “private” clouds

Last year at the VMworld 2008 conference, VMware discussed something called vCloud. Before VMworld 2009, they will be announcing external cloud providers around that platform, which will allow internal clouds to extend their infrastructure to external clouds. What VMware is trying to do is allow organizations to build cloud networks with the possibility of moving a few services/components to external clouds. To make this seamless, the VMware vSphere tool, which currently allows internal cloud management, will be enhanced to manage instances on the external cloud almost as if they were part of the internal cloud. In fact, if the rumors are true, they will even support vMotion across to external cloud providers (restrictions apply). VMware is getting on the cloud bandwagon in a big way… just take a look at the number of sessions they have mentioning cloud.

Scalability for dummies

Alex Barrera has a very interesting post about how frustrating it is to figure out that you have a problem and how much trouble it is to fix it after the product is live. “I am there, I am suffering the redesign phase (twice now). It’s hard, it’s lonely, it’s discouraging and frustrating, but it needs to be done. I just wrote this post so that outsiders can get a glimpse of what is it to be there and how it affects the whole company, not just the tech department. Scalability problems aren’t something you can discard as being ONLY technical, its roots might be technical but its effects will shake the whole company.” The post actually reminded me of this post by Marton Trencseni, which talks about the phases of improvement in scalability architecture a product goes through and digs a little deeper into what could have prevented it. For startups or for companies which are just prototyping new ideas, their goals can sometimes be just to “test the waters”, and the pr

Weekend reading material

Products/Ideas
- redis - http://code.google.com/p/redis/ : Redis is a key-value database. It is similar to memcached but the dataset is not volatile, and values can be strings, exactly like in memcached, but also lists and sets with atomic operations to push/pop elements.
- HBase - http://hadoop.apache.org/hbase/ : HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
- Sherpa - http://research.yahoo.com/node/2139
- BigTable - http://labs.google.com/papers/bigtable-osdi06.pdf
- voldemort - It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R mapper like active-record or hibernate this will provide horizontal scalabi

Is Yahoo launching a cloud storage solution : MObStor

While the rest of the world is busy with Microsoft and Google, Yahoo might be preparing to launch MObStor, which they tout as the “Unstructured Storage for the Internet”. While comparing MObStor to the various cloud computing storage solutions already available, Navneet Joneja, Sr. Product Manager, mentions Facebook’s Haystack to describe MObStor’s architectural design. He also points out that though Facebook’s Haystack was optimized to store photographs, MObStor was optimized for a diverse set of use cases. It’s a REST-based, browser-accessible API with a simple security model and content-agnostic storage features. The focus of this service seems to be fast, reliable, secure storage with the option of allowing customers to layer additional services on top of the core service. It claims it will be optimized for high performance and high availability (who doesn’t?). Here is more from the Yahoo Developer Network Blog: Facebook's Haystack is based on commodity storage.

CouchDB scalability issues ? (updated)

Jonathan Ellis started a storm when he posted an entry about CouchDB about 6 months ago. He questioned some of CouchDB’s claims and made an attempt to warn users who don’t understand the practical issues around CouchDB very well. After reading his post and some comments, it looked like he was specifically concerned about CouchDB’s ability to distribute/scale a growing database automatically. It’s a good read if you are curious. He has stopped accepting comments on his blog, but that shouldn’t stop you from commenting here. As Jan pointed out in the comments, Jonathan is assuming “distributed” means “auto-scaling”, which is not true. Links from the blog: Cassandra, Dynomite, Sawzall, Pig

Cloud architecture: Notes from an Amazon talk

Some notes from a talk I was at. Didn’t get time to write it in detail. But hey, something is better than nothing… right?
Design for failure
    - handle failure
        - use elastic IP addresses
        - use multiple Amazon EC2 availability zones
        - create multiple database slaves across multiple zones
        - use real-time monitoring (Amazon CloudWatch)
        - use Amazon EBS for persistent file system
            - snapshot database to S3 (from EBS)
Loose coupling sets you free
    - independent components
    - design everything as a blackbox
    - de-coupling for hybrid models
    - loadbalance-clusters
    - use SQS as buffers to queue messages. Allows elasticity
Design for dynamism
    - build for changes in infrastructure
        - Don't assume health or fixed location of components
        - Use designs that are resil
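The "use SQS as buffers" note can be illustrated with a rough Python sketch. This is not the SQS API: the standard library's in-process `queue.Queue` stands in for the hosted queue, and `run_pipeline` plus the upper-casing "work" are invented for illustration. The point is only that producers and consumers never talk to each other directly, so the number of workers can change without touching the producer.

```python
import queue
import threading


def run_pipeline(messages, workers=2):
    """Push messages through a buffer and process them with N workers.

    queue.Queue is an in-process stand-in for a hosted queue such as SQS.
    """
    buffer = queue.Queue()
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: no more work for this worker
                buffer.task_done()
                break
            with lock:
                processed.append(item.upper())  # placeholder for real work
            buffer.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    # The producer only knows about the buffer, not about the workers.
    for message in messages:
        buffer.put(message)
    for _ in threads:
        buffer.put(None)

    buffer.join()
    for t in threads:
        t.join()
    return sorted(processed)


if __name__ == "__main__":
    print(run_pipeline(["resize image", "send email", "update index"], workers=3))
```

Changing `workers` is the in-process analogue of adding or removing consumer instances behind the queue.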

Is Percentage of company Bloggers/Twitter_users inversely proportional to Company size ?

Small organizations often keep a very active online presence. For them, any news is good news. Larger organizations, however, try to do the opposite and control information. What I’ve been trying to understand is how, in spite of all that, companies like Google and Microsoft still manage to have a huge online presence. No.Of.TwitterAccounts = (Size.Of.Company)^(1/2) ? For example, today Google announced a list of all of its Twitter accounts on one page. How do they do it?
General
- twitter.com/Google - our central account
- twitter.com/Blogger - for Blogger fans
- twitter.com/GoogleCalendar - user tips & updates
- twitter.com/GoogleImages - news, tips, tricks on our visual image search
- twitter.com/GoogleNews - latest headlines via Google News
- twitter.com/GoogleReader - from our feed reader team
- twitter.com/iGoogle - news & notes from Google's personalized homepage
- twitter.com/GoogleStudents - news of interest to students using Google

Cell phone speeds, reliability in US

Novarum and PC World did a cell phone service provider test across the nation to compare the three big cell giants. I was shocked at how crappy the AT&T wireless network’s reliability is in the city where I live. No wonder people have been constantly complaining about service problems. I wish Apple had gone with Verizon for the iPhone… I used Verizon for years (before I switched to AT&T) and was pretty happy with them.

Monitoring Cloud health

Both Amazon and Google (and probably others as well) provide web pages that monitor the status of their services. The one I go to when I need to compare availability and detect service problems is Cloudstatus by Hyperic. They try to monitor most of the individual services provided by Google (Engine, Datastore, Memcache, Fetch) and Amazon (EC2, S3, SQS, SDB, FPS). On top of online graphs, you can also subscribe to Twitter status updates, which can be really helpful during a real outage.

BSET SearchEngine relevance test results

A few days ago I started a tool called BSET – Blackbox Search engine Testing tool to evaluate how good Bing really is. If you watch the stats on the page, it’s clear which search engine is being consistently picked as the winner. The results were collected from 518 unique source IP addresses (some were just NATs from larger organizations). 251 users executed just 1 query each, 111 users executed 2 queries, and the rest executed more than that. A total of 808 results were submitted just for the “standard web search” category, and of those 44% of the submissions were in favor of Google. 32% of them were for Yahoo. Only about 28% of the results went to Microsoft’s new search engine “Bing”. Between Google and Yahoo, a user is 15% more likely to pick Google than Yahoo. Between Google and Bing, a user will pick Google 21% more frequently than Bing. The results may not be staggering for folks who have been following search engine trends over the last few weeks, but for me, to see the result

Velocity 2009 : Conference presentation slides

If you are like me, and not attending Velocity 2009, you should track this page for the presentation slides from this year's conference.
- 2 Years Later, Loving and Hating the Cloud
- Death of a Web Server: Crisis in Caching
- Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site
- Hadoop Operations: Managing Big Data Clusters
- Introduction to Managed Infrastructure with Puppet
- Metrics that Matter - Approaches To Managing High Performing Websites
- Scalable Internet Architectures
- Surviving the 2008 Elections
- The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search
- Writing Efficient JavaScript
http://en.oreilly.com/velocity2009/public/schedule/proceedings

Building BlackboxSET on GAE/java

Last week I spent a few hours building a search engine testing tool called “BlackboxSET”. The purpose of the tool was to allow users to see search results from three different search providers and vote for the best set of results without knowing the source of the results. The hope was that the search engine which presents the best set of results at the top of the page would stand out. What we found was interesting. Though Google’s search scores aren’t significantly better than Yahoo’s or Bing’s, it is the current leader on BlackboxSET. But this post is about what it took me to build BlackboxSET on GAE, which as you can see is a relatively simple application. The entire app was built in a few hours of late night hacking, and I decided to use Google’s AppEngine infrastructure to learn a little more about GAE.
Primary goals
- Ability to randomly show results from the three search engines
- Persist data collected after the user votes
- Report the results using a simple p

BlackboxSET – Blackbox Search Engine Testing

The launch of Bing has shaken the Google Kingdom a little bit. I for one have been doubting my own support for Google’s search engine. And I know others who swear by Yahoo’s search engine, which is a trust I don’t share. To make such testing easier, I spent a few hours last night creating a tool which allows you to run a search against the 3 top search engines and lets you decide which one is the best. At the end of the exercise you should be able to find out if you are doing the right thing by sticking with your personal search engine. May the best search engine win.

Steps to migrate your webapp to AWS

Most web applications need at least the following services to be self-sufficient: computational power, storage, webserver/CDN, database, messaging, load balancer and monitoring. Here are the tried and tested steps as recommended by AWS folks:
- Move static web content to S3 storage first. Images, CSS stylesheets, JavaScript files, HTML, etc. can all be moved to S3. It’s easier to move some static content than others, so there might be some work required to understand how to break up web content to move parts of it into the cloud. The content on S3 can be served by the Amazon CloudFront service, which is Amazon’s CDN (content delivery network). Once you persist your data on S3 and serve it through CloudFront, your users will get those objects from the edge location closest to them.
- Move applications and webserver layer to the EC2 infrastructure. This step will require you to figure out how to automate deployments into cloud infrastructure
- Once your apps are in the cloud, you can start working on

Opera Unite: web server built in ?

There seems to be a lot of talk about the “Opera Unite” launch, and everyone is so pumped up about the new feature, a “webserver built into the web browser”. This is just like Twitter. I think it might be a great idea for a few, but for the masses it might turn out to be just over-bloated hype. Most of us who have used a recent OS have sharing features, and we have always been on the lookout for better firewalls to block them. Now here comes a browser which wants to do the same thing, and for some reason doesn’t expect firewalls to impact it? Have all the security concerns gone away all of a sudden? While the world is switching to a lighter OS and browser, Opera is trying to build a kitchen sink. That being said, I think it’s a bold step on Opera’s part, and I have to give credit for its “unique” idea, regardless of how useful I think it’s going to be.

Working with Google App engine’s datastore

I heard a great set of Google App Engine datastore-related talks at the Google I/O conference. I think this is one of the best talks I heard, and it is now on YouTube. You should watch it if you are working with or planning to work with Google App Engine in the near future. Click on this link if you can't see the embedded video.

Google wave : Let the predictions begin…

During the keynote today Google premiered something brand new called Google Wave. From the look of it, it is a next-gen SMTP+XMPP protocol which allows email-like messages, instant communication and collaboration using a fully distributed architecture (similar to SMTP). The focus was on collaboration and notification. During the whole demo I was thinking just two things: 1) Twitter is screwed! 2) Ditto Facebook! The solution proposed has a side effect of trying to solve the spamming issue as well. The key here is that they are not releasing an app which people can log in to when it’s launched… they are instead releasing a protocol and possibly a working open source server which users can deploy and get running quickly. It won’t happen overnight. But if they build this into Gmail, which has a large adoption rate, it could become the next big hot thing pretty fast. More here…. http://www.waveprotocol.org/ http://code.google.com/apis/wave/ http://wave.google.com/help/

Google app engine review (Java edition)

For the last couple of weekends I’ve been playing with Google App Engine (Java edition) and was pleasantly surprised at the direction it has taken. I was also fortunate enough to see some Google engineers talk on this subject at Google I/O, which helped me a lot to compile all this information. But before I get into the details, I’d like to warn you that I’m not a developer, let alone a Java developer. My experience with Java has been limited to prototyping ideas and wasting time (and now probably yours too). Developing on GAE isn’t very different from other Java-based development environments. I used the Eclipse plugin to build and test the GAE apps in the sandbox on my laptop. For the most part everything you did before will work, but there are limitations introduced by GAE which try to force you to write code that is scalable.
- Threads can’t be created - But one can modify the existing thread state
- Direct network connections are not allowed - URLConnection can be us

New EC2 features: Elastic Load Balancing, Auto Scaling, and Monitoring

If you have not used EC2 for some reason, chances are that those reasons don’t exist anymore. More information is available in the following places: AWS Blog, All Things Distributed, RightScale

Safari crossed 10% mark ?

Apple released some statistics to show that, thanks to the Safari 4 beta, Safari has crossed the 10% threshold for the first time. Though that might be true, I don’t see it sticking there. Safari 4 JavaScript execution was fast, but I found Chrome to be faster. I for one have already abandoned Safari on my Windows machine. Doesn’t mean I hate it… just means that I’m not convinced it’s the best yet.

FriendFeed using MySQL for Schema-less data

Bret has a nice little article talking about why most people should still stick with known, tested database engines even if the data stored is not relational. FriendFeed uses a simple table to keep attribute-value pairs and separate tables to keep indexes for each attribute which needs indexing. The design is very simple and reasonable, and makes an effective argument against using a cloud DB (or something like CouchDB) when you can get what you need from tried and tested engines.
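To make the design concrete, here is a minimal Python sketch of the idea as described above: one generic table holding each entity as an opaque bag of attributes, plus a separate table acting as the index for one attribute. Everything specific is an assumption for illustration: SQLite stands in for MySQL, the body is stored as JSON, and the table names, the `user_id` attribute and the helper functions are made up rather than taken from FriendFeed's schema.

```python
import json
import sqlite3
import uuid


def open_store():
    """Create an in-memory store: one entity table plus one index table."""
    db = sqlite3.connect(":memory:")
    # Schema-less part: the body is an opaque JSON blob of attribute/value pairs.
    db.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    # One separate table per attribute that needs to be queried (here: user_id).
    db.execute(
        "CREATE TABLE index_user_id ("
        "user_id TEXT NOT NULL, entity_id TEXT NOT NULL, "
        "PRIMARY KEY (user_id, entity_id))"
    )
    return db


def save(db, entity):
    """Insert or update an entity and keep the user_id index in sync."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    db.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(entity)),
    )
    # Drop any stale index row, then re-add one if the attribute is present.
    db.execute("DELETE FROM index_user_id WHERE entity_id = ?", (entity_id,))
    if "user_id" in entity:
        db.execute(
            "INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
            (entity["user_id"], entity_id),
        )
    db.commit()
    return entity_id


def find_by_user(db, user_id):
    """Look up entities through the index table, then decode their bodies."""
    rows = db.execute(
        "SELECT e.body FROM index_user_id i "
        "JOIN entities e ON e.id = i.entity_id "
        "WHERE i.user_id = ? ORDER BY e.id",
        (user_id,),
    )
    return [json.loads(body) for (body,) in rows]
```

Adding a new attribute needs no schema change to the entity table; making a new attribute queryable means adding one more index table alongside it.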

Experimenting with SimpleDB (XXXXXXX.com)

This summary is not available. Please click here to view the post.

Techmeme run out of news ?

A lot of us go to Techmeme for our hourly fix. But for the last few hours things haven’t been quite the same. Come to think of it, the quality of news on Techmeme could be an indicator of what’s left to come for the tech industry. The first couple of news items have nothing to do with technology in general, and the third news item is a few days old already. The three items after that are the same old news in different wrapping. Either the weekend is getting to me, or this is the lull before the storm.