June 27, 2009

Monitoring Cloud health

Both Amazon and Google (and probably others as well) provide web pages that monitor their service status. The one I go to when I need to compare availability or detect service problems is Cloudstatus by Hyperic.

It tries to monitor most of the individual services provided by Google (App Engine, Datastore, Memcache, URL Fetch) and Amazon (EC2, S3, SQS, SDB, FPS).

On top of the online graphs, you can also subscribe to Twitter status updates, which can be really helpful during a real outage.

June 25, 2009

BSET SearchEngine relevance test results

A few days ago I launched a tool called BSET (Blackbox Search Engine Testing tool) to evaluate how good Bing really is. If you watch the stats on the page, it’s clear which search engine is being consistently picked as the winner.

The results were collected from 518 unique source IP addresses (some were just NATs from larger organizations). 251 users executed just 1 query each, 111 users executed 2 queries, and the rest executed more than that.

A total of 808 results were submitted just for the “standard web search” category, and of those, 44% of the submissions were in favor of Google. 32% of them were for Yahoo. Only about 28% of the results went to Microsoft’s new search engine “Bing”.

Between Google and Yahoo, a user is 15% more likely to pick Google than Yahoo. Between Google and Bing, a user will pick Google 21% more frequently than Bing.

The results may not be staggering for folks who have been following search engine trends over the last few weeks, but for me the results of this random test are surprising, considering the amount of money Microsoft plans to pump into advertising Bing. I wish I had done this test before Bing was launched to find out how different MSN is from Bing…

So why is Google better?

Since search results were pulled using the published search APIs from the search providers, and because these search APIs may not always show the same results that users see on the real search page, it could be argued that these results may be inaccurate.

Another problem I noticed is that different search engines behave differently when there are spelling errors in the search. For example, look at the results for “steven hakwing” (I was looking for Stephen Hawking) on the 3 search engines.

Bing – Bing tells you that you could have spelt it wrong, and shows results for “steven hawking” instead.

Yahoo – Yahoo warns me that I should probably correct my spelling to “Stephen Hawking”, but shows the search results for “steven hakwing”.

Google – Google suggests that I could be looking for “Steven Hawking”, but actually shows me results for “Stephen Hawking” which is what I really wanted.

Since I didn’t use spell-suggestion APIs to correct the search terms before they were submitted, it could be argued that my tests are biased towards Google, which does auto-correction. But as an end user, I could argue that I want to see what I intended to type and not what I actually typed. I think the ability to predict what users are thinking is one of the core reasons why Google has a lead over other search engines.

And as for Bing’s cash-back plan, a friend of mine said that he’d be happy to use Bing to buy something... as soon as he figures out what he really wants on Google or Yahoo.

I welcome your comments or feedback, especially if you have ideas on how I could improve the tests.

June 20, 2009

Building BlackboxSET on GAE/java

Last week I spent a few hours building a search engine testing tool called “BlackboxSET”. The purpose of the tool was to allow users to see search results from three different search providers and vote for the best set of results without knowing the source of the results. The hope was that the search engine which presents the best set of results at the top of the page would stand out. What we found was interesting. Though Google’s search scores aren’t significantly better than Yahoo’s or Bing’s, it is the current leader on BlackboxSET.

But this post is about what it took to build BlackboxSET on GAE, which, as you can see, is a relatively simple application. The entire app was built in a few hours of late-night hacking, and I decided to use Google’s App Engine infrastructure to learn a little more about GAE.

Primary goals

  1. Ability to randomly show results from the three search engines
  2. Persist data collected after the user votes
  3. Report the results using a simple pie chart in real time if possible

Backend design

  1. Each time the user does a new search, a random sequence is generated on the server which represents the order in which the user will see the results in the browser.
  2. When the user clicks the ‘Vote’ button, the browser makes a call to the server to log the result and to retrieve the source of each set of search results.

Decisions and observations made while trying to build this on GAE

  1. Obviously using Java was not optional since I didn’t know Python.
  2. And since I haven’t played with encrypted cookies, the decision was made to persist the randomized order in the session object, which looked pretty straightforward.
  3. Since the user sessions are relatively short and since session objects in GAE/java are persisted to memcache automatically, it was decided not to interact with memcache directly. This particular feature of GAE/java is not documented clearly, and from what I’ve heard from Google engineers it’s something they don’t openly recommend relying on. But it works and I have used it in the past without any problems.
  4. When the voting results from the browser are sent to the server, the server logs them without any processing in a simple table in the datastore. The plan was to keep sufficient information in these event logs so that if the app does get hacked/gamed, the additional information will help us filter out events which should be rejected. Unfortunately, it also means that to extract anything interesting from this data, one has to spend a lot of computational resources parsing it.
  5. Google Chart API was used for graphing. This was a no-brainer. But because GAE limits the number of rows per datastore query to 1000, I had to limit the chart to look at only the last 1000 results. GAE now provides a “Task” feature which I think can be used for offline processing, but I haven’t used it yet.
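
As a rough illustration of items 4 and 5, here is how raw vote events might be reduced to the per-engine percentages a pie chart needs, looking only at the most recent 1000 rows. It is a sketch in Python rather than the app's Java, and the event schema (a dictionary with a `winner` key, oldest first) and the function name `vote_shares` are assumptions made for the example.

```python
from collections import Counter

MAX_ROWS = 1000  # the per-query row limit mentioned above


def vote_shares(events, limit=MAX_ROWS):
    """Return {engine: percentage} over the most recent `limit` events.

    `events` is assumed to be a list of dicts with a 'winner' key,
    ordered oldest first (a hypothetical schema for this sketch).
    """
    recent = events[-limit:] if limit > 0 else []
    counts = Counter(event["winner"] for event in recent)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {engine: 100.0 * n / total for engine, n in counts.items()}
```

Scanning the raw log on every request is what makes this approach expensive; the percentages are cheap to compute once the rows are in memory, but fetching them is not.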

Problems I ran into – I had designed the app to resist gaming, but was not adequately prepared for some of the other challenging problems related to horizontal scalability.

  1. The first problem was that processing 1000 rows of voting logs to generate the graph for each person was taking up to 10 to 15 seconds on GAE infrastructure. The options I had to solve this problem were either to reduce the log sample size requested from the datastore (something smaller than 1000), or to cache the results for a period of time so that not all users were impacted by the problem. I went with the second option.
  2. The second problem was sort of a showstopper. Some folks were reporting inaccurate search results… in some cases there were duplicates, with the same set of search results shown in two out of three columns. This was bad. Even weirder was the fact that it never happened when I was running the app on my desktop inside the GAE sandbox. Also mysterious was that the problems didn’t show up until the load on the app started picking up (thanks to a few folks who twittered it out).
    1. The root cause of these issues could be the way I assumed session objects are persisted and replicated in GAE/java. I assumed that when I persist an object in the app’s session object, it is synchronously replicated to memcache.
    2. I also assumed that if multiple instances of the app were brought up by GAE under heavy load, it would try to do some kind of sticky load balancing. Sticky load balancing is an expensive affair, so in hindsight I should have expected this problem. However, I didn’t know that GAE infrastructure would start load balancing across multiple instances even at 2 requests per second, which seems too low.
    3. Since the randomization data cannot be stored in a cookie (without encrypting it), I had to store it on the server. And from the point when the user is presented with a set of search results to the point when the user votes on it, it would be nice to keep the user on the same app instance. Since GAE was switching users between instances (it was load balancing based on load), I had to find a more reliable way to persist the randomization information.
    4. The solution I implemented was twofold. First, I reduced the number of interactions between the browser and the backend server from 4 to 2 HTTP requests. This effectively reduced the probability of users switching app instances during the most critical part of the app’s operation. Second, I decided not to use the session object and instead used memcache directly to make the randomization data persist a little more reliably.
    5. In hindsight, I think encrypted cookies would have been a better approach for this particular application. They completely sidestep the requirement of keeping session information on the server.
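
The fix for the first problem (caching the chart data for a period of time) can be sketched as a small time-based cache. This is an illustrative Python version, not the app's Java code; the class name `TTLCache`, the `compute` callback and the 60 second lifetime used in the usage note are all assumptions.

```python
import time


class TTLCache:
    """Hold one computed value and recompute it only after it expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._value = None
        self._expires_at = None

    def get(self, compute):
        """Return the cached value, calling `compute()` only when stale."""
        now = self.clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self.ttl
        return self._value
```

With something like `chart_cache = TTLCache(60)` and `chart_cache.get(build_chart_data)`, only one request per minute would pay for the slow datastore scan. Note that a cache held in process memory like this one is per app instance, so on a platform that runs several instances it would need to live in a shared store such as memcache to be effective across all of them.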

I’m sure this is not the end of all the problems. If there is an update, I’ll definitely post it here. If any readers are curious about anything specific, please let me know and I’ll be happy to share my experiences.

June 17, 2009

BlackboxSET – Blackbox Search Engine Testing

The launch of Bing has shaken the Google Kingdom a little bit. I for one have been doubting my own support for Google’s search engine. And I know others who swear by Yahoo’s search engine, which is a trust I don’t share. To make such testing easier, I spent a few hours last night creating a tool which allows you to run a search against the 3 top search engines and lets you decide which one is the best. At the end of the exercise you should be able to find out if you are doing the right thing by sticking with your personal search engine.

May the best search engine win.

June 16, 2009

Steps to migrate your webapp to AWS

Most web applications need at least the following services to be self-sufficient: computational power, storage, webserver/CDN, database, messaging, load balancer and monitoring.

Here are the tried and tested steps as recommended by the AWS folks.

  1. Move static web content to S3 storage first. Images, CSS stylesheets, JavaScript files, HTML, etc. can all be moved to S3. It’s easier to move some static content than others, so there might be some work required to understand how to break up web content to move parts of it into the cloud.
  2. The content on S3 can be served by Amazon CloudFront, which is Amazon’s CDN (content delivery network) service. Once your data is on S3 and served through CloudFront, your users will get those objects from the edge location closest to them.
  3. Move applications and the webserver layer to the EC2 infrastructure. This step will require you to figure out how to automate deployments into cloud infrastructure.
  4. Once your apps are in the cloud, you can start spreading them across availability zones and regions to make your infrastructure tolerant to failures of Amazon datacenters. For example, if you have apps deployed across the US and Europe and the US datacenters have problems, the European datacenters would be able to absorb the shock and keep your services available.
  5. Start using Amazon’s auto-scaling functionality to add/remove infrastructure automatically depending on the load on the system.
  6. The most complicated part might be moving your databases to the AWS cloud. If you plan to keep your databases on an RDBMS (MySQL/PostgreSQL), then you should try to use EBS (Elastic Block Store) and figure out how to take snapshots to S3. You should also try to figure out how to do DB replication across availability zones to keep your site available during single-datacenter failures.
  7. At this point, since most of your application components are in the cloud, you should be able to start using new Amazon services to make your service even better. One possible example is SQS, which allows frontend applications to queue requests for other parts of the application (or DB) for asynchronous processing.
  8. Investigate the possibility of moving more of the DB components to S3 and SimpleDB to reduce the need for an RDBMS as much as possible. S3 is ideal for storing large objects while SimpleDB is ideal for small stubs of data. A lot of applications that use these services use them together.
  9. After your apps are all configured on AWS, this would be a good time to set up monitoring. Amazon provides the CloudWatch service, which allows you to monitor your applications.
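
The queueing idea in step 7 can be illustrated with a small sketch. It does not use the SQS API at all: Python's in-process `queue.Queue` stands in for the message queue, and the function names and the placeholder "processing" (upper-casing a string) are assumptions made purely to show the shape of the pattern, where the frontend enqueues and returns while a separate worker does the slow part later.

```python
import queue
import threading


def frontend_enqueue(q, request):
    """Frontend side: hand the request off and return immediately."""
    q.put(request)


def worker(q, results):
    """Backend side: drain the queue until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            q.task_done()
            break
        results.append(item.upper())  # placeholder for the slow work
        q.task_done()


def run_demo(requests):
    """Queue some requests, process them on a worker thread, return results."""
    q = queue.Queue()
    results = []
    thread = threading.Thread(target=worker, args=(q, results))
    thread.start()
    for request in requests:
        frontend_enqueue(q, request)
    q.put(None)  # tell the worker to stop
    thread.join()
    return results
```

With a real hosted queue the producer and consumer would be separate processes or machines, which is what lets each side scale and fail independently.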

Issues to worry about. Moving to the cloud can be full of small potholes. If you understand and anticipate them, it will be easier for you to move. Here are some you should be careful about.

  1. The S3 service is “eventually consistent”, which means that data saved to S3 may not be immediately available on read. It’s also possible that if the same content is updated on two different S3 servers at the same time, one of the writes will be lost. This is not always bad, and if you understand it you will realize that there are ways around it.
  2. The load balancer service Amazon provides doesn’t support SSL.
  3. SimpleDB has a maximum size limit per row. This is why SimpleDB is better for keeping searchable metadata that references the complete data, which can be kept in S3.
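
The pattern in the last item (small searchable metadata in one store, pointing at the full object in another) can be sketched like this. The two classes are in-memory stand-ins written for this example, not the S3 or SimpleDB APIs, and the `max_value_bytes` limit is an illustrative parameter rather than a documented SimpleDB number.

```python
class BlobStore:
    """In-memory stand-in for a large-object store such as S3."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class MetadataStore:
    """In-memory stand-in for a small-record store with a size limit."""

    def __init__(self, max_value_bytes=1024):
        self.max_value_bytes = max_value_bytes  # illustrative limit only
        self._rows = {}

    def put(self, item_id, attrs):
        for name, value in attrs.items():
            if len(value.encode("utf-8")) > self.max_value_bytes:
                raise ValueError("attribute %r is too large" % name)
        self._rows[item_id] = dict(attrs)

    def get(self, item_id):
        return self._rows[item_id]

    def find(self, name, value):
        return [i for i, attrs in self._rows.items() if attrs.get(name) == value]


def save_document(blobs, meta, item_id, title, body):
    """Store the body as a blob and keep only a small pointer row."""
    blob_key = "docs/" + item_id
    blobs.put(blob_key, body)
    meta.put(item_id, {"title": title, "blob_key": blob_key,
                       "size": str(len(body))})


def load_by_title(blobs, meta, title):
    """Search the metadata, then follow each pointer to the full body."""
    return [blobs.get(meta.get(i)["blob_key"]) for i in meta.find("title", title)]
```

The search only ever touches the small rows; the large body is fetched by key once the right row has been found.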

Parts of this post were summarized from Jinesh’s talk at the “AWS Start-up Tour 2009”.

June 15, 2009

Opera Unite: web server built in?


There seems to be a lot of talk about the “Opera Unite” launch, and everyone is pumped up about the new feature: a “webserver built into the web browser”.

This is just like Twitter. I think it might be a great idea for a few, but for the masses it might turn out to be just overblown hype. Most of us who have used a recent OS already have sharing features, and we have always been on the lookout for better firewalls to block them. Now here comes a browser which wants to do the same thing, and for some reason doesn’t expect firewalls to impact it?

Have all the security concerns gone away all of a sudden? While the world is switching to a lighter OS and browser, Opera is trying to build a kitchen sink.

That being said, I think it’s a bold step on Opera’s part, and I have to give it credit for its “unique” idea, regardless of how useful I think it’s going to be.

June 14, 2009

Working with Google App engine’s datastore

I heard a great set of Google App Engine datastore talks at the Google I/O conference. One of the best talks I heard is now on YouTube. You should watch it if you are working with, or planning to work with, Google App Engine in the near future. Click on this link if you can’t see the embedded video.