Building BlackboxSET on GAE/java

Last week I spent a few hours building a search engine testing tool called “BlackboxSET”. The purpose of the tool was to allow users to see search results from three different search providers and vote for the best set of results without knowing the source of the results. The hope was that the search engine which presents best set of results on the top of the page will stand out. What we found was interesting. Though Google’s search score aren’t significantly better than Yahoo’s or Bing’s, it is the current leader imageon BlackboxSET.

But this post is about what it took me to build BlackboxSET on GAE which as you can see is a relatively simple application. The entire app was built in a few hours of late night hacking and I decided to use Google’s AppEngine infrastructure to learn a little more about GAE.

Primary goals

  1. Ability to randomly show results from the three search engines
  2. Persist data collected after the user votes
  3. Report the results using a simple pie chart in real time if possible

Backend design

  1. Each time the user does a new search, a random sequence is generated on the server which represents the order in how the user will see the results on the browser.
  2. When the user clicks on ‘Vote’ button, the browser will make a call to the server to log the result and to retrieve the source of search results from the server.

Decisions and observations made while trying to build this on GAE

  1. Obviously using Java was not optional since I didn’t know python.
  2. And since I haven’t played with encrypted cookies, the decision was made to persist the randomized order in session object which looked pretty straight forward.
  3. Since the user sessions are relatively short and since session objects in GAE/java are persisted to memcache automatically, it was decided not to interact with memcache directly. This particular feature of GAE/java is not documented clearly, and from what I’ve heard from Google Engineers its something they don’t openly recommend to rely on. But it works and I have used in the past without any problems.
  4. When the voting results from the browser are sent to the server, the server logs it without any processing in a simple table in datastore. The plan was to keep sufficient information in these event logs so that if the app does get hacked/gamed, additional information in the event logs will help us filter out events which should be rejected. It unfortunately also means that to extract anything interesting from this data, one would have to spend a lot of computational resources to parse it.
  5. Google Chart API was used for graphing. This was a no brainer. But because GAE limits on the number of rows per datastore query to 1000, I had to limit the chart API to look at only last 1000 results. GAE now provides a “Task” feature which I think can be used offline processing but haven’t used it yet.

Problems I ran into – I had designed the app to resist gaming, but was not adequately prepared for some of the other challenging problems related to horizontal scalability.

  1. The first problem was that processing 1000 rows of voting logs to generate graph for each person was taking upto 10 to 15 seconds on GAE infrastructure. The options I had to solve this problem was, to either reduce the log sample size requested from Datastore (something smaller than 1000), or to cache the results for a period of time so that not all users were impacted by the problem.  I went with the second option.
  2. The second problem was sort of a showstopper. Some folks were reporting inaccurate search results… in some cases there were duplicates with the same set of search results shown in two out of three columns. This was bad. And even more weird was the fact that it never happened when I was running the app on my desktop inside the GAE sandbox. Also mysterious was that the problems didn’t show up until the load started picking up  app (thanks to a few folks who twittered it out).
    1. The root cause of these issues could be due to the way I assumed the session objects are persisted and replicated in GAE/java. I assumed that when I persist an object in the apps session object, it is synchronously replicated to the memcache.
    2. I also assumed that if multiple instances of the app were brought up by GAE under heavy load, it will try to do some kind of sticky loadbalancing. Sticky loadbalacing is an expensive affair so on hindsight I should have expected this problem. However I didn’t know that GAE infrastructure will start loadbalancing across multiple instances even at 2 requests per second which seems too low.
    3. Since the randomization data cannot be stored in cookie (without encrypting), I had to store it on the server. And from the point when the user is presented with a set of search results, to the point when the user votes on it, it would be nice to keep the user on the same app instance. Since I GAE was switching users (was doing loadbalancing based on load)  I had to find a more reliable way to persist the randomization information.
    4. The solution I implemented was two fold. First I reduced the number of interactions between the browser and the backend server from 4 to 2 HTTP requests. This effectively reduced the probability of users switching app instances during the most critical part of the app’s operation . The second change was that I decided not to use Session object and instead used memcache directly to make this the randomization data persist a little more reliably.
    5. On hindsight, I think encrypted cookies would have been a better approach for this particular application. It completely side-steps the requirement of keeping session information on the server.

I’m sure this is not the end of all the problems. If there is an update I’ll definitely post it here. If there are any readers who are curious about anything specific please let me know and I’ll be happy to share my experiences.