Google App Engine 1.4.0 pre-release is out

The complete announcement is here, but here are the changes for the Java SDK. The two changes I liked most are the new “Always On” feature and the fact that the “tasks” feature has graduated out of beta/testing.

  • The Always On feature allows applications to pay and keep 3 instances of
    their application always running, which can significantly reduce application latency.
  • Developers can now enable Warmup Requests. By specifying a handler in an app’s appengine-web.xml, App Engine will attempt to send a Warmup Request to initialize new instances before a user interacts with it. This can reduce the latency an end-user sees for initializing your application.
  • The Channel API is now available for all users.
  • Task Queue has been officially released, and is no longer an experimental feature. The API import paths that use ‘labs’ have been deprecated. Task queue storage will count towards an application’s overall storage quota, and will thus be charged for.
  • The deadline for Task Queue and Cron requests has been raised to 10 minutes.  Datastore and API deadlines within those requests remain unchanged.
  • For the Task Queue, developers can specify task retry-parameters in their queue.xml.
  • Metadata Queries on the datastore for datastore kinds, namespaces, and entity  properties are available.
  • URL Fetch allowed response size has been increased, up to 32 MB. Request
    size is still limited to 1 MB.
  • The Admin Console Blacklist page lists the top blacklist rejected visitors.
  • The automatic image thumbnailing service supports arbitrary crop sizes up to 1600px.
  • Overall average instance latency in the Admin Console is now a weighted  average over QPS per instance.
  • Added a low-level AsyncDatastoreService for making calls to the datastore asynchronously.
  • Added a getBodyAsBytes() method to QueueStateInfo.TaskStateInfo, which returns the body of the task state as a pure byte-string.
  • The whitelist has been updated to include all classes from javax.xml.soap.
  • Fixed an issue sending email to multiple recipients. http://code.google.com/p/googleappengine/issues/detail?id=1623
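
The “weighted average over QPS per instance” item is easier to picture with numbers. The Python sketch below is only one plausible reading of that release note: each instance's average latency is weighted by the requests per second it serves. The function name and the sample figures are made up for illustration.

```python
def weighted_average_latency(instances):
    """instances: list of (qps, avg_latency_ms) pairs, one per instance."""
    total_qps = sum(qps for qps, _ in instances)
    if total_qps == 0:
        return 0.0
    # Each instance contributes in proportion to the traffic it serves.
    return sum(qps * latency for qps, latency in instances) / total_qps


# Hypothetical numbers: one busy, fast instance and one nearly idle, slow one.
instances = [(9.0, 100.0), (1.0, 1000.0)]
simple = sum(latency for _, latency in instances) / len(instances)
weighted = weighted_average_latency(instances)
print(simple, weighted)
```

With these numbers the plain average is dominated by the idle instance, while the weighted one stays close to what most requests actually experienced.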

“Chrome instant” feature could break your webapp

“Google Instant” wasn’t a groundbreaking idea by itself. We have all been using various forms of auto-complete for a while now. What makes it stand out is that, unlike all the previous kinds of auto-complete, this one is able to search the entire web archive at an amazing speed and still serve personalized, hyper-local results. You can get more information about its backend here and here.

It wasn’t surprising that Google put this feature inside Chrome itself. Take a look at this demo from Lifehacker. This is where it gets interesting…

 

At the beginning this looked very exciting. I was pleasantly surprised when Chrome brought up websites, in addition to auto-completing URLs, as I typed. The impact on the servers didn’t sink in until I was tracking down a bug in my own application, which required me to look at the Apache logs. Look at the following log snippet from Apache. Not surprisingly, I found 17 calls instead of just 1 made to my web application while I was typing the URL. All of this happened in 6 seconds, which is about the time it took me to type the URL.

[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?p HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?po HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:04 -0700] "GET /cfmap/create.jsp?por HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port= HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:05 -0700] "GET /cfmap/create.jsp?port=1 HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1& HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&a HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&ap HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:08 -0700] "GET /cfmap/create.jsp?port=1&app HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appn HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appna HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appnam HTTP/1.1" 200 88 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:09 -0700] "GET /cfmap/create.jsp?port=1&appname= HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0
[29/Sep/2010:02:39:10 -0700] "GET /cfmap/create.jsp?port=1&appname=34 HTTP/1.1" 200 60 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.17 Safari/534.7" ::  847 0

There are two issues here which made me very concerned:

  1. Volume of requests: This one is obvious. The example above is not a normal use case, since we don’t expect users to type URLs every time they use a web application. But if the app has an easy-to-use API that users can call this way, the impact of the small percentage of users who do so gets magnified many times over very quickly. It may become important to figure out how to queue requests, and how to distinguish a user who is hitting the website with 10 requests per second from one who makes 1 request. The whole problem also goes away if your app can already handle 5 to 20 times more traffic, which is probably the best solution.
  2. Robust APIs: This is a trickier one, which developers need to plan for. Let’s say there is an API like “/api/transfermoney.php?from=account1&to=account2&amount=10000”. How much money will this API transfer if you type this URL into a browser that auto-executes partial URLs?
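
To put a number on the second issue, here is a small Python sketch. It assumes the worst case: the browser fires one request per keystroke, the endpoint acts on plain GET requests, and everything up to “amount=” has already been typed. The function name is made up for illustration.

```python
def amounts_seen_while_typing(amount):
    """Return the amount parsed from each partial URL as the digits are typed."""
    digits = str(amount)
    return [int(digits[:i]) for i in range(1, len(digits) + 1)]


partial_amounts = amounts_seen_while_typing(10000)
total_transferred = sum(partial_amounts)
print(partial_amounts, total_transferred)
```

Under those assumptions the server sees five transfers instead of one. The usual defence is to keep GET requests free of side effects and accept state-changing calls only via POST.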

The last straw was that the traffic generated by this feature was often flagged by Google’s own search engine as spammy/automated. It got so bad that I had to switch to Firefox to do a simple Google search.

And here is an example of how my Google history is now polluted with things I didn’t really search for. In this example I was looking for “ohdoctah” after I heard about it on twit. The key point is that while Google might have thought about how to mine this polluted search data, other web applications might find it impossible to deal with without a significant increase in resources.

image

For now I’ve disabled the feature in the browser. I hope there is an easy solution to this problem; otherwise I don’t see this feature making it into the production version of Chrome soon.

Urs Hölzle from Google on “Speed Matters”

From Urs’ talk at the Velocity 2010 conference [ More info : Google, datacenterknowledge ]

  • Average web page – 320kb, 44 resources, 7 DNS lookups, a third of its content is not compressed
  • Aiming for 100ms page load times for Chrome
  • Chrome: HTML5, V8 JS engine, DNS prefetching, VP8 codec, open source, spurs competition
  • TCP improvements
    • Fast start (higher initial congestion window)
    • Quick loss recovery (lower retransmit timeouts)
    • Makes Google products 12% faster
    • No handshake delay (app payload in SYN packets)  [ Didn’t know this was possible !!! ]
  • DNS improvements
    • Propagate client IP in DNS requests (to allow servers to better map users to the closest servers)
  • SSL improvements
    • False start (reduce 1 round trip from handshake)
      • 10% faster (for Android implementation)
    • Snap start (zero round trip handshakes, resumes)
    • OCSP stapling (avoid inline roundtrips)
  • HTTP improvements (SPDY):
    • Header compression
    • Stream multiplexing and prioritization
    • Server push/hints
    • 25% faster
  • Test done
    • Downloaded the same “top 25” pages via HTTP and SPDY over a network simulating a 2Mbps DSL link with 0% packet loss – the number of packets fell by 40%
    • On low bandwidth links, headers are surprisingly costly. Can add 1 second of latency.
  • Public DNS:
    • reduces recursive resolve time by continuously refreshing cache
    • Increases availability through adequate provisioning
  • Broadband pilot testing going on
    • Fix the “last mile” complaint
    • Huge increase of 100x
  • More developer tools by Google
    • Page speed, speed tracer, closure compiler, Auto spriter
  • More awareness about performance
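
To see why a higher initial congestion window matters for the “average web page” above, here is a rough Python sketch. Everything beyond the 320kb page size is an assumption for illustration: a single connection, a 1460-byte segment size, no packet loss, a window that doubles every round trip, and initial windows of 3 and 10 segments (the talk notes do not give the actual values).

```python
import math


def round_trips(page_bytes, initial_cwnd, mss=1460):
    """Count round trips needed to deliver page_bytes under idealised slow start."""
    segments = math.ceil(page_bytes / mss)
    cwnd, sent, rtts = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd   # one full window is delivered per round trip
        cwnd *= 2      # slow start doubles the window each round trip
        rtts += 1
    return rtts


page = 320 * 1024  # treating "320kb" as 320 kilobytes
rtts_small = round_trips(page, 3)
rtts_large = round_trips(page, 10)
print(rtts_small, rtts_large)
```

In this idealised model the larger starting window saves two round trips on a page of that size, which is significant on a high-latency link.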

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (maybe unintentionally) at an event last year.

Instead of caching data closer to the user, it looks like Google is trying to take “the data” to the user. If you use GMail or a Google Docs service, then with this framework Google could, auto-magically, “move” one of the master copies of your data to the nearest Google data center without really having to cache anything locally. And because they are building one single datastore cluster around the world, instead of building hundreds of smaller ones for different applications, it looks like they may not need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk by Jeff Dean at a symposium held at Cornell. Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • support mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.


Fixing GSLB (Global Server load balancing)

The standard DNS protocol allows DNS servers to respond with multiple addresses in replies to simple DNS lookup queries. Together with rotating the order of the records in every reply, this is known as the “Round Robin DNS” technique for load balancing across a set of servers.
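
The rotation can be sketched in a few lines of Python. This is only a toy model of the idea, not how any particular DNS server implements it; the class name is made up and the addresses come from the documentation-only 192.0.2.0/24 range.

```python
from collections import deque


class RoundRobinRecords:
    """Hands out the same set of addresses, rotated by one on every reply."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        reply = list(self._addresses)
        self._addresses.rotate(-1)  # the next reply starts with the next address
        return reply


records = RoundRobinRecords(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first_choices = [records.answer()[0] for _ in range(4)]
print(first_choices)
```

Clients that simply use the first address in each reply end up spread evenly across the three servers.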

Though a lot of organizations use Round Robin DNS to load balance across servers in the same datacenter, some are also trying to use it as an HA solution by load balancing across multiple datacenters. With such an implementation, the impact of a failure in one of the datacenters can be limited, and with a slight change of DNS configuration (removing the IP of the datacenter which went down) the site can become fully operational again.

It would be nicer if the DNS servers could monitor and remove servers which are inactive or are throwing errors of some kind. This is what GSLBs are all about. But what they really excel at, and what regular DNS servers can’t do, is figuring out (in a slightly unscientific way) where a user is located geographically. This allows them to work out which datacenter is closest to the end user. If a customer in Asia can get to a datacenter within Asia instead of coming all the way to the US, it could save the customer at least 200ms of latency, which can significantly improve throughput and response time from the website.

Though GSLBs are very popular today among the larger service providers, there are some interesting drawbacks which can limit their usefulness. The core problem is that GSLBs use the source IP of the DNS request to figure out where the customer is located. This works beautifully if the customer’s laptop is sending these requests out directly, and in most cases it will also work if the customer is using his/her ISP’s DNS server. Unfortunately, if the customer uses a free public DNS service like the one Google provides, which recursively looks up the DNS records for the user, then the GSLB will pick the datacenter closest to the DNS server requesting the information instead of the one closest to the actual end user. A similar problem exists if the user is forced to use a DNS server over a VPN link. Read this post for a better understanding of this problem (Why DNS Based GSLB doesn’t work).

A few days ago Google came out with a solution to this problem, which was announced here (A proposal to extend DNS protocol). They don’t mention GSLB, but there is no doubt this will help solve the GSLB issue mentioned above. Unfortunately, I’m also sure that Google has other, more important reasons to push for this change. They are interested in location information to “provide better services” (location-aware advertising).

DNS is the system that translates an easy-to-remember name like www.google.com to a numeric address like 74.125.45.104. These are the IP addresses that computers use to communicate with one another on the Internet.

By returning different addresses to requests coming from different places, DNS can be used to load balance traffic and send users to a nearby server. For example, if you look up www.google.com from a computer in New York, it may resolve to an IP address pointing to a server in New York City. If you look up www.google.com from the Netherlands, the result could be an IP address pointing to a server in the Netherlands. Sending you to a nearby server improves speed, latency, and network utilization.

Currently, to determine your location, authoritative nameservers look at the source IP address of the incoming request, which is the IP address of your DNS resolver, rather than your IP address. This DNS resolver is often managed by your ISP or alternately is a third-party resolver like Google Public DNS. In most cases the resolver is close to its users, in which case the authoritative nameservers will be able to find the nearest server. However, some DNS resolvers serve many users over a wider area. In these cases, your lookup for www.google.com may return the IP address of a server several countries away from you. If the authoritative nameserver could detect where you were, a closer server might have been available.

Our proposed DNS protocol extension lets recursive DNS resolvers include part of your IP address in the request sent to authoritative nameservers. Only the first three octets, or top 24 bits, are sent providing enough information to the authoritative nameserver to determine your network location, without affecting your privacy.

Regardless, it’s a step in the right direction and would significantly help in making web applications more available.
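
The “first three octets, or top 24 bits” part of the proposal is simple to illustrate for an IPv4 address. The Python sketch below only shows the truncation a resolver would apply before forwarding; it does not build a real DNS packet, the function name is made up, and the address comes from a documentation-only range.

```python
import ipaddress


def client_subnet(client_ip, prefix_len=24):
    """Keep only the top prefix_len bits of the client's address."""
    # strict=False lets ip_network zero out the host bits for us.
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return str(network)


forwarded = client_subnet("203.0.113.77")
print(forwarded)
```

The authoritative nameserver learns which network the user is on, but not the individual host.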


AppScale, an OpenSource GAE implementation

If you don’t like EC2 you have the option to move your app to a new vendor. But if you don’t like GAE (Google App Engine) there aren’t any solutions which can replace it easily.

AppScale might change that.

AppScale is an open-source implementation of the Google AppEngine (GAE) cloud computing interface from the RACELab at UC Santa Barbara. AppScale enables execution of GAE applications on virtualized cluster systems. In particular, AppScale enables users to execute GAE applications using their own clusters with greater scalability and reliability than the GAE SDK provides. Moreover, AppScale executes automatically and transparently over cloud infrastructures such as the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Eucalyptus, the open-source implementation of the AWS interfaces.

The list of supported infrastructures is very impressive. However the key, in my personal opinion, would be stability and compatibility with current GAE APIs.

Learn more about AppScale:

  1. AppScale Home page
  2. Google Code page
  3. Google Group for AppScale
  4. Demo at Bay area GAE Developers meeting: At Googleplex ( Feb 10, 2010)

Google patents MapReduce “System and method for efficient large-scale data processing”

After filing in 2004, Google finally got its patent on “System and method for efficient large-scale data processing” approved yesterday.

Gigaom pointed out that if Google really wants to enforce it, it would have to go after many different vendors who are implementing “mapreduce” in some form in their applications and databases.

Google’s intentions for how to use it are not clear, but this is what one of its spokespersons said.

Like other responsible, innovative companies, Google files patent applications on a variety of technologies it develops. While we do not comment about the use of this or any part of our portfolio, we feel that our behavior to date has been inline with our corporate values and priorities.

Building BlackboxSET on GAE/java

Last week I spent a few hours building a search engine testing tool called “BlackboxSET”. The purpose of the tool was to allow users to see search results from three different search providers and vote for the best set of results without knowing the source of the results. The hope was that the search engine which presents the best set of results at the top of the page would stand out. What we found was interesting. Though Google’s search scores aren’t significantly better than Yahoo’s or Bing’s, it is the current leader on BlackboxSET.

But this post is about what it took me to build BlackboxSET on GAE, which, as you can see, is a relatively simple application. The entire app was built in a few hours of late night hacking, and I decided to use Google’s AppEngine infrastructure to learn a little more about GAE.

Primary goals

  1. Ability to randomly show results from the three search engines
  2. Persist data collected after the user votes
  3. Report the results using a simple pie chart in real time if possible

Backend design

  1. Each time the user does a new search, a random sequence is generated on the server which represents the order in which the user will see the results in the browser.
  2. When the user clicks on ‘Vote’ button, the browser will make a call to the server to log the result and to retrieve the source of search results from the server.

Decisions and observations made while trying to build this on GAE

  1. Obviously using Java was not optional since I didn’t know Python.
  2. And since I haven’t played with encrypted cookies, the decision was made to persist the randomized order in the session object, which looked pretty straightforward.
  3. Since the user sessions are relatively short and since session objects in GAE/java are persisted to memcache automatically, it was decided not to interact with memcache directly. This particular feature of GAE/java is not documented clearly, and from what I’ve heard from Google engineers it’s something they don’t openly recommend relying on. But it works, and I have used it in the past without any problems.
  4. When the voting results from the browser are sent to the server, the server logs them without any processing in a simple table in datastore. The plan was to keep sufficient information in these event logs so that if the app does get hacked/gamed, the additional information will help us filter out events which should be rejected. It unfortunately also means that to extract anything interesting from this data, one has to spend a lot of computational resources to parse it.
  5. Google Chart API was used for graphing. This was an easy choice. But because GAE limits the number of rows per datastore query to 1000, I had to limit the chart to look at only the last 1000 results. GAE now provides a “Task” feature which I think can be used for offline processing, but I haven’t used it yet.

Problems I ran into – I had designed the app to resist gaming, but was not adequately prepared for some of the other challenging problems related to horizontal scalability.

  1. The first problem was that processing 1000 rows of voting logs to generate the graph for each person was taking up to 10 to 15 seconds on GAE infrastructure. The options I had to solve this problem were either to reduce the log sample size requested from Datastore (something smaller than 1000), or to cache the results for a period of time so that not all users were impacted by the problem. I went with the second option.
  2. The second problem was sort of a showstopper. Some folks were reporting inaccurate search results… in some cases there were duplicates, with the same set of search results shown in two out of three columns. This was bad. Even weirder was the fact that it never happened when I was running the app on my desktop inside the GAE sandbox. Also mysterious was that the problems didn’t show up until the load started picking up on the app (thanks to a few folks who tweeted about it).
    1. The root cause of these issues could be the way I assumed session objects are persisted and replicated in GAE/java. I assumed that when I persist an object in the app’s session object, it is synchronously replicated to memcache.
    2. I also assumed that if multiple instances of the app were brought up by GAE under heavy load, it would try to do some kind of sticky load balancing. Sticky load balancing is an expensive affair, so in hindsight I should have expected this problem. However, I didn’t know that GAE infrastructure would start load balancing across multiple instances even at 2 requests per second, which seems too low.
    3. Since the randomization data cannot be stored in a cookie (without encrypting it), I had to store it on the server. And from the point when the user is presented with a set of search results to the point when the user votes on it, it would be nice to keep the user on the same app instance. Since GAE was switching users between instances (it was load balancing based on load), I had to find a more reliable way to persist the randomization information.
    4. The solution I implemented was twofold. First, I reduced the number of interactions between the browser and the backend server from 4 to 2 HTTP requests. This effectively reduced the probability of users switching app instances during the most critical part of the app’s operation. Second, I decided not to use the Session object and instead used memcache directly to make the randomization data persist a little more reliably.
    5. In hindsight, I think encrypted cookies would have been a better approach for this particular application. It completely side-steps the requirement of keeping session information on the server.
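
For what it’s worth, a related way to avoid server-side session state can be sketched with only the Python standard library. Note that this is not cookie encryption: instead of encrypting the order into the cookie, the cookie holds just a random nonce, and any app instance re-derives the order from that nonce and a server-side secret. The secret, the names and the scheme itself are assumptions for illustration, not what BlackboxSET does.

```python
import hashlib
import hmac
import random
import secrets

ENGINES = ["Google", "Yahoo", "Bing"]
SECRET_KEY = b"replace-with-a-real-server-secret"  # hypothetical shared secret


def new_nonce():
    """Random value that is safe to hand to the browser in a cookie."""
    return secrets.token_hex(16)


def order_for(nonce):
    """Derive the column order from the nonce; needs no stored session."""
    digest = hmac.new(SECRET_KEY, nonce.encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))  # seeded from the keyed hash
    order = list(ENGINES)
    rng.shuffle(order)
    return order


nonce = new_nonce()
print(order_for(nonce))
```

Because the order depends only on the nonce and the secret, it no longer matters which instance serves the search and which one receives the vote, as long as all instances share the secret and run the same code.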

I’m sure this is not the end of all the problems. If there is an update I’ll definitely post it here. If there are any readers who are curious about anything specific please let me know and I’ll be happy to share my experiences.

Google App Engine review (Java edition)

For the last couple of weekends I’ve been playing with Google App Engine (Java edition) and was pleasantly surprised at the direction it has taken. I was also fortunate enough to see some Google engineers talk on this subject at Google I/O, which helped me a lot in compiling all this information.

But before I get into the details, I’d like to warn you that I’m not a developer, let alone a Java developer. My experience with Java has been limited to prototyping ideas and wasting time (and now probably yours too).

Developing on GAE isn’t very different from other Java based development environments. I used the Eclipse plugin to build and test the GAE apps in the sandbox on my laptop. For the most part everything you did before will work, but there are limitations introduced by GAE which try to force you to write code which is scalable.

  1. Threads can’t be created – But one can modify the existing thread state
  2. Direct network connections are not allowed – URLConnection can be used instead
  3. Direct file system writes are not allowed – Use memory, memcache or datastore instead. (Apps can read files which are uploaded as part of the app)
  4. Java2D is not allowed – But certain Images API calls and software rendering are allowed
  5. Native code is not allowed – Only pure Java libraries are allowed
  6. There is a JRE class whitelist which you can refer to in order to know which classes are supported by GAE.

GAE runs inside a heavily modified version of the jetty/jasper servlet container, currently using Sun’s 1.6 JVM (client mode). Most of what you would do to build a webapp still applies, but because of the limitations on what can work on GAE, you should explicitly check that the libraries and frameworks you depend on are known to work. If you are curious whether the library/framework you use for your webapp will work in GAE, check out this page for the official list of known/working options (will it play in app engine).

Now the interesting part. Each request gets a maximum of 30 seconds in which it has to complete, or GAE will throw an exception. If you are building a web application which requires a large number of datastore operations, you have to figure out how to break requests into small chunks so that each one completes within 30 seconds. You also have to figure out how to detect failures so that clients can reissue requests which fail.
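
One common way to live with such a deadline is to process work in chunks and hand back a position the client can resume from. The Python sketch below illustrates the idea only: the function name, the 25 second budget and the plain list standing in for datastore rows are assumptions, and real GAE code would use the datastore’s own paging.

```python
import time


def process_in_chunks(items, handle, start=0, chunk_size=100,
                      budget_seconds=25.0, clock=time.monotonic):
    """Process items from `start`; return a resume position, or None when done."""
    deadline = clock() + budget_seconds
    position = start
    while position < len(items):
        if clock() >= deadline:
            return position  # out of time: the caller reissues from here
        for item in items[position:position + chunk_size]:
            handle(item)
        position += chunk_size
    return None


rows = list(range(250))
processed = []
resume_from = process_in_chunks(rows, processed.append)
print(resume_from, len(processed))
```

The time check happens only between chunks, so the chunk size has to be small enough that a single chunk comfortably fits in the margin left before the hard limit.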

But this limitation has a silver lining. Though you are limited in how long a request can take to execute, you are currently not limited in the number of simultaneous requests (you can get to 32 simultaneous threads in a free account, and can go higher if you want to pay). Theoretically you should be able to scale horizontally to as many requests per second as you want. There are a few other factors, like how you architect your data in datastore, which can still limit how many operations per second you can do. Some of the other GAE limits are listed here.

You have to use Google’s datastore APIs to persist data if you want to maximize GAE’s potential. You could still use S3, SimpleDB or your favorite cloud DB storage, but the high latency would probably kill your app first.

The Datastore on GAE is where GAE gets very interesting and departs significantly from most traditional java webapp development experiences. Here are a few quick things which took me a while to figure out.

  1. Datastore is schemaless (I’m sure you knew this already)
  2. It’s built over Google’s BigTable infrastructure. (you knew this as well…)
  3. It looks like SQL, but don’t be fooled. It’s so crippled that you won’t recognize it from two feet away. After a week of playing with GAE I know there are at least 2 to 3 ways to query this data, and the various syntaxes are confusing. (I’ll give an update once I figure this whole thing out)
  4. You can have Datastore generate keys for your entities, or you can assign it yourself. If you decide to create your own keys (which has its benefits BTW) you need to figure out how to build the keys in such a way that they don’t collide with unintentional consequences.
  5. Creation of “uniqueness” index is not supported.
  6. Nor can you do joins across tables. If you really need a join, you would have to do it at the app. I heard there are some folks coming out with libraries which can fake a relational data model over datastore… don’t have more information on it right now.
  7. The amount of datastore CPU (in addition to regular app CPU) you use is monitored. So if you create a lot of indexes, you better be ready to pay for it.
  8. Figuring out how to index your data isn’t rocket science. Single column indexes are automatically built for you. Multi-column indexes need to be configured in the app. The GAE sandbox running on your desktop/laptop does figure out which indexes you need by monitoring your queries, so you may not have to do much for the most part. When you upload the app, the config file listing which indexes are required is uploaded with it. In GAE Python, there are ways to tell Google not to index some fields.
  9. Index creation on GAE takes a long time for some reason, even for small tables. This is a known issue, but not a show stopper in my personal opinion.
  10. Figuring out how to breakup/store/normalize/denormalize your data to best use GAE’s datastore would probably be one of the most interesting challenges you would have to deal with.
  11. The problem gets trickier if you have a huge amount of data to process in each request. There are strict CPU resource timeouts which currently look slightly buggy to me (or work in a way I don’t understand yet). If a single query takes over a few seconds (5 to 10) it generally fails for me. And if the same HTTP request generates a lot of datastore queries, there is a 30 second limit on the HTTP request after which the request would be killed.
  12. From what I understand, datastore is optimized for reads, and writes are expensive. Not only do indexes have to be updated, each write needs to be written to disk before the operation is considered complete. That brings in physical limitations on how fast you can process data if you are planning to write a lot of it. Breaking data into multiple tables is probably a better way to go.
  13. There is no way to drop a table or a datastore. Currently you have to delete it 1000 rows at a time using your app. This is one of the biggest issues brought up by developers, and it’s possible it will be fixed soon.
  14. There is no way to delete an application either…
  15. There is a Python script to upload large amounts of data to the GAE datastore. Unfortunately, one needs to understand what the data model you designed for your Java app looks like in the Python world. This has been a blocker for me, but I’m sure I could have figured it out using Google Groups if I really wanted to.
  16. If I understand correctly the datastore (uses BigTable architecture) is built on top of 4 large bigtables.
  17. If I understand correctly, though GAE’s datastore architecture supports transactions, its Master-Master replication across multiple datacenters has some caveats which need to be understood. GAE engineers explained that 2 Phase commit and Paxos are better at handling data consistency across datacenters but suffer from heavy latency, because of which they are not used for GAE’s datastore currently. They hope/plan to give some kind of support for a more reliable data consistency mechanism.
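
Since the datastore cannot join across tables (point 6 above), the join has to happen in application code. A minimal Python sketch of the idea follows, with plain dictionaries standing in for entities; the kinds and field names are hypothetical.

```python
def join_in_app(votes, users):
    """Attach the user's name to each vote, like an inner join on user id."""
    users_by_id = {user["id"]: user for user in users}  # one pass to build a lookup
    joined = []
    for vote in votes:
        user = users_by_id.get(vote["user_id"])
        if user is not None:  # votes without a matching user are dropped
            joined.append({"engine": vote["engine"], "user_name": user["name"]})
    return joined


users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
votes = [
    {"user_id": 2, "engine": "Bing"},
    {"user_id": 1, "engine": "Google"},
    {"user_id": 9, "engine": "Yahoo"},
]
print(join_in_app(votes, users))
```

On the real datastore the lookup side would usually be a batch fetch by key rather than a scan of a whole table, which is one reason denormalizing the data is often the cheaper option.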

Other than the Datastore, I’d like to mention a few other key things which are important central elements of the GAE architecture.

  1. Memcache support is built in. I was able to use it within a minute of figuring out that it’s possible. Hitting datastore is expensive, and if you can get by with just using memcache, that’s what is recommended.
  2. Session persistence exists, and sessions are persisted to both memcache and datastore. However, it’s disabled by default and GAE engineers recommend staying away from it. Managing sessions is expensive, especially if you are hitting datastore very frequently.
  3. Apps can send emails (there are paid/free limits)
  4. Apps can make HTTP requests to outside world using URLConnection
  5. Apps get google authentication support out of the box. Apps don’t have to manage user information or build login application/module to create user specific content.
  6. Currently GAE doesn’t provide a way to set which datacenter (or country) to host your app from (Amazon allows users to choose US or EU). They are actively working to solve this problem.
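
The “try memcache first, fall back to datastore” advice from the first point can be sketched as follows. This is a Python illustration with an in-process dictionary standing in for memcache and a plain function standing in for the datastore query; none of the names are GAE APIs.

```python
import time


class TtlCache:
    """Tiny in-process stand-in for memcache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._entries[key] = (value, self._clock() + ttl_seconds)


def cached_fetch(cache, key, load, ttl_seconds=60):
    """Return the cached value, calling the expensive loader only on a miss."""
    value = cache.get(key)
    if value is None:
        value = load(key)
        cache.set(key, value, ttl_seconds)
    return value


calls = []

def slow_lookup(key):  # hypothetical stand-in for a datastore query
    calls.append(key)
    return key.upper()

cache = TtlCache()
first = cached_fetch(cache, "chart", slow_lookup)
second = cached_fetch(cache, "chart", slow_lookup)
print(first, second, len(calls))
```

A cache like this can lose entries at any time, so the code must always be able to rebuild the value from the datastore.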

That’s all for now. I’ll keep you updated as things move along. If you are curious about something very specific, please do leave a comment here or at the GAE Java Google group.