For the last couple of weekends I've been playing with Google App Engine (Java edition) and was pleasantly surprised at the direction it has taken. I was also fortunate enough to see some Google engineers talk on this subject at Google I/O, which helped me a lot in compiling all this information.
But before I get into the details, I'd like to warn you that I'm not a developer, let alone a Java developer. My experience with Java has been limited to prototyping ideas and wasting time (and now probably yours too).
Developing on GAE isn't very different from other Java-based development environments. I used the Eclipse plugin to build and test the GAE apps in the sandbox on my laptop. For the most part, everything you did before will work, but GAE introduces limitations that try to force you to write scalable code:
- Threads can't be created – but one can modify the existing thread state
- Direct network connections are not allowed – URLConnection can be used instead
- Direct file system writes are not allowed – use memory, memcache, or the datastore instead. (Apps can read files which are uploaded as part of the app.)
- Java2D is not allowed – but certain Images API calls and software rendering are allowed
- Native code is not allowed – only pure Java libraries are allowed
- There is a JRE class whitelist which you can refer to in order to see which classes are supported by GAE.
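To make the URLConnection point concrete, here is a minimal sketch of reading a resource through java.net.URLConnection rather than opening a socket yourself. The class name UrlFetchSketch, the fetch helper, and the 5-second timeouts are my own assumptions for illustration; nothing here is prescribed by GAE.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper: reads the whole body behind a URL into a String.
final class UrlFetchSketch {

    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // Assumed timeouts; keep them well below the per-request limit.
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);

        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            // Copy the response into memory; no file system writes involved.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toString("UTF-8");
        }
    }
}
```

The body is buffered in memory, which fits the "no file system writes" rule above, but it also means this sketch is only suitable for small responses.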
GAE runs inside a heavily modified version of the Jetty/Jasper servlet container, currently using Sun's 1.6 JVM (client mode). Most of what you would do to build a webapp still applies, but because of the limits on what can work on GAE, you should explicitly check whether the libraries and frameworks you depend on are known to work. If you are curious whether the library/framework you use for your webapp will work in GAE, check out this page for the official list of known/working options (Will it play in App Engine).
Now the interesting part. Each request gets a maximum of 30 seconds in which it has to complete, or GAE will throw an exception. If you are building a web application which requires a large number of datastore operations, you have to figure out how to break requests into small chunks so that each one completes within 30 seconds. You also have to figure out how to detect failures so that clients can reissue a request if it fails.
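One way to live within the 30-second limit is to give each request a time budget, stop when the budget runs out, and hand the client an offset to resume from. The sketch below is plain Java with no App Engine classes; ChunkedWork, processChunk, the offset convention, and the doubling "work" are all hypothetical stand-ins for whatever your requests really do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Hypothetical sketch: process as many items as fit in a time budget,
// then report where the next request should resume.
final class ChunkedWork {

    static final class Result {
        final List<Integer> processed; // output of this chunk
        final int nextOffset;          // -1 means everything is done

        Result(List<Integer> processed, int nextOffset) {
            this.processed = processed;
            this.nextOffset = nextOffset;
        }
    }

    // clock returns "now" in milliseconds; it is injectable so the
    // behaviour can be tested without waiting on a real clock.
    static Result processChunk(List<Integer> items, int offset,
                               long budgetMillis, LongSupplier clock) {
        long deadline = clock.getAsLong() + budgetMillis;
        List<Integer> done = new ArrayList<>();
        int i = offset;
        // Check the clock before each item so we stop before the limit.
        while (i < items.size() && clock.getAsLong() < deadline) {
            done.add(items.get(i) * 2); // stand-in for one datastore operation
            i++;
        }
        return new Result(done, i < items.size() ? i : -1);
    }
}
```

A client would keep calling with the returned nextOffset until it gets -1. Because the offset is the only state carried between calls, a failed call can simply be reissued with the same offset, provided the per-item work is safe to repeat. In a real app you would pass something like System::currentTimeMillis as the clock and pick a budget comfortably under 30 seconds.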
But this limitation has a silver lining. Though you are limited in how long a request can take to execute, you are currently not limited in the number of simultaneous requests (you can get to 32 simultaneous threads on a free account, and can go higher if you want to pay). Theoretically you should be able to scale horizontally to as many requests per second as you want. There are a few other factors, like how you architect your data in the datastore, which can still limit how many operations per second you can do. Some of the other GAE limits are listed here.
You have to use Google's datastore APIs to persist data if you want to maximize GAE's potential. You could still use S3, SimpleDB or your favorite cloud DB storage, but the high latency would probably kill your app first.
The Datastore is where GAE gets very interesting and departs significantly from most traditional Java webapp development experiences. Here are a few quick things which took me a while to figure out.
- Datastore is schemaless (I'm sure you knew this already)
- It's built over Google's BigTable infrastructure. (you knew this as well...)
- It looks like SQL, but don't be fooled. It's so crippled that you won't recognize it from two feet away. After a week of playing with GAE I know there are at least 2 to 3 ways to query this data, and the various syntaxes are confusing. (I'll give an update once I figure this whole thing out)
- You can have Datastore generate keys for your entities, or you can assign them yourself. If you decide to create your own keys (which has its benefits BTW) you need to figure out how to build the keys so that they don't collide, which would have unintended consequences.
- Creation of a "uniqueness" index is not supported.
- Nor can you do joins across tables. If you really need a join, you would have to do it in the app. I heard there are some folks coming out with libraries which can fake a relational data model over datastore... I don't have more information on it right now.
- The amount of datastore CPU (in addition to regular app CPU) you use is monitored. So if you create a lot of indexes, you had better be ready to pay for it.
- Figuring out how to index your data isn't rocket science. Single-column indexes are automatically built for you. Multi-column indexes need to be configured in the app. The GAE sandbox running on your desktop/laptop figures out which indexes you need by monitoring your queries, so you may not have to do much for the most part. When you upload the app, the config file listing which indexes are required is uploaded with it. In GAE Python, there are ways to tell Google not to index some fields.
- Index creation on GAE takes a long time for some reason, even for small tables. This is a known issue, but not a show stopper in my personal opinion.
- Figuring out how to break up/store/normalize/denormalize your data to best use GAE's datastore would probably be one of the most interesting challenges you would have to deal with.
- The problem gets trickier if you have a huge amount of data to process in each request. There are strict CPU resource timeouts which currently look slightly buggy to me (or work in a way I don't understand yet). If a single query takes over a few seconds (5 to 10) it generally fails for me. And if the same HTTP request generates a lot of datastore queries, there is a 30 second limit on the HTTP request after which the request would be killed.
- From what I understand, the datastore is optimized for reads, and writes are expensive. Not only do indexes have to be updated, each write needs to be written to disk before the operation is considered complete. That brings in physical limits on how fast you can process data if you are planning to write a lot of it. Breaking data into multiple tables is probably a better way to go.
- There is no way to drop a table or a datastore. Currently you have to delete it 1000 rows at a time using your app. This is one of the biggest issues brought up by developers and it's possible it will be fixed soon.
- There is no way to delete an application either...
- There is a Python script to upload large amounts of data to the GAE datastore. Unfortunately, one needs to understand what the data model you designed for the Java app looks like in the Python world. This has been a blocker for me, but I'm sure I could have figured it out using Google Groups if I really wanted to.
- If I understand correctly, the datastore (which uses the BigTable architecture) is built on top of 4 large BigTables.
- If I understand correctly, though GAE's datastore architecture supports transactions, its master-master replication across multiple datacenters has some caveats which need to be understood. GAE engineers explained that two-phase commit and Paxos are better at handling data consistency across datacenters but suffer from heavy latency, which is why they are not used for GAE's datastore currently. They hope/plan to give some kind of support for a more reliable data consistency mechanism.
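On the point about building your own keys without collisions: the classic trap is gluing parts together with plain concatenation, so that ("ab", "c") and ("a", "bc") both become "abc". A minimal sketch of one way around it is to prefix each part with its length. KeyNames and compose are names I made up for illustration, and the length-prefix format is just one possible convention, not something the datastore requires.

```java
// Hypothetical helper for composing a key name from several parts.
final class KeyNames {

    // Each part is written as "<length>:<part>", so the boundaries between
    // parts stay unambiguous even if a part contains ':' or digits.
    static String compose(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part.length()).append(':').append(part);
        }
        return sb.toString();
    }
}
```

With this scheme compose("ab", "c") gives "2:ab1:c" while compose("a", "bc") gives "1:a2:bc", so two different tuples can no longer end up as the same key name. The resulting string is what you would then use as the entity's key name.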
Other than the Datastore, I'd like to mention a few other key things which are important central elements of the GAE architecture.
- Memcache support is built in. I was able to use it within a minute of figuring out that it was possible. Hitting the datastore is expensive, and if you can get by with just using memcache, that's what is recommended.
- Session persistence exists, and sessions are persisted to both memcache and the datastore. However, it's disabled by default and GAE engineers recommend staying away from it. Managing sessions is expensive, especially if you are hitting the datastore very frequently.
- Apps can send emails (there are paid/free limits)
- Apps can make HTTP requests to the outside world using URLConnection
- Apps get Google authentication support out of the box. Apps don't have to manage user information or build a login application/module to create user-specific content.
- Currently GAE doesn't provide a way to set which datacenter (or country) to host your app from (Amazon allows users to choose US or EU). They are actively working to solve this problem.
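The "use memcache before hitting the datastore" advice above usually ends up as a read-through cache: look in the cache first, fall back to the expensive read only on a miss, and remember the answer. Here is a minimal sketch of that pattern in plain Java. To keep it self-contained, a HashMap stands in for memcache and a Function stands in for the datastore read; ReadThroughCache and its methods are hypothetical names, not GAE APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache: cache first, slow store on a miss.
final class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new HashMap<>(); // stand-in for memcache
    private final Function<K, V> loader;             // stand-in for a datastore read
    private int loads = 0;                           // how often we hit the slow path

    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            // Cache miss: do the expensive read and remember the result.
            value = loader.apply(key);
            loads++;
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Call this after writing to the backing store so stale data is not served.
    void invalidate(K key) {
        cache.remove(key);
    }

    int loads() {
        return loads;
    }
}
```

Keep in mind that a real memcache can evict entries at any time, so code built this way must always be able to fall back to the datastore, which is exactly what the miss path does.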
That's all for now, I'll keep you updated as things move along. If you are curious about something very specific, please do leave a comment here or at the GAE Java Google group.