Both Amazon and Google (and probably others as well) provide web pages which monitors its service status. The one which I go to, when I need to compare availability and to detect service problems is the one called Cloudstatus by Hyperic.
They try to monitor most of the individual services provided by Google (Engine, Datastore, Memcache, Fetch) and Amazon (EC2, S3, SQS, SDB, FPS).
On top of online graphs, you can also subscribe to twitter status updates which can be really helpful during a real outage.
Last week I spent a few hours building a search engine testing tool called “BlackboxSET”. The purpose of the tool was to allow users to see search results from three different search providers and vote for the best set of results without knowing the source of the results. The hope was that the search engine which presents best set of results on the top of the page will stand out. What we found was interesting. Though Google’s search score aren’t significantly better than Yahoo’s or Bing’s, it is the current leader on BlackboxSET.
But this post is about what it took me to build BlackboxSET on GAE which as you can see is a relatively simple application. The entire app was built in a few hours of late night hacking and I decided to use Google’s AppEngine infrastructure to learn a little more about GAE.
Primary goals
Ability to randomly show results from the three search engines
Persist data collected after the user votes
Report the results using a simple pie chart in real time if possible
Backend design
Each time the user does a new search, a random sequence is generated on the server which represents the order in how the user will see the results on the browser.
When the user clicks on ‘Vote’ button, the browser will make a call to the server to log the result and to retrieve the source of search results from the server.
Decisions and observations made while trying to build this on GAE
Obviously using Java was not optional since I didn’t know python.
And since I haven’t played with encrypted cookies, the decision was made to persist the randomized order in session object which looked pretty straight forward.
Since the user sessions are relatively short and since session objects in GAE/java are persisted to memcache automatically, it was decided not to interact with memcache directly. This particular feature of GAE/java is not documented clearly, and from what I’ve heard from Google Engineers its something they don’t openly recommend to rely on. But it works and I have used in the past without any problems.
When the voting results from the browser are sent to the server, the server logs it without any processing in a simple table in datastore. The plan was to keep sufficient information in these event logs so that if the app does get hacked/gamed, additional information in the event logs will help us filter out events which should be rejected. It unfortunately also means that to extract anything interesting from this data, one would have to spend a lot of computational resources to parse it.
Google Chart API was used for graphing. This was a no brainer. But because GAE limits on the number of rows per datastore query to 1000, I had to limit the chart API to look at only last 1000 results. GAE now provides a “Task” feature which I think can be used offline processing but haven’t used it yet.
Problems I ran into – I had designed the app to resist gaming, but was not adequately prepared for some of the other challenging problems related to horizontal scalability.
The first problem was that processing 1000 rows of voting logs to generate graph for each person was taking upto 10 to 15 seconds on GAE infrastructure. The options I had to solve this problem was, to either reduce the log sample size requested from Datastore (something smaller than 1000), or to cache the results for a period of time so that not all users were impacted by the problem. I went with the second option.
The second problem was sort of a showstopper. Some folks were reporting inaccurate search results… in some cases there were duplicates with the same set of search results shown in two out of three columns. This was bad. And even more weird was the fact that it never happened when I was running the app on my desktop inside the GAE sandbox. Also mysterious was that the problems didn’t show up until the load started picking up app (thanks to a few folks who twittered it out).
The root cause of these issues could be due to the way I assumed the session objects are persisted and replicated in GAE/java. I assumed that when I persist an object in the apps session object, it is synchronously replicated to the memcache.
I also assumed that if multiple instances of the app were brought up by GAE under heavy load, it will try to do some kind of sticky loadbalancing. Sticky loadbalacing is an expensive affair so on hindsight I should have expected this problem. However I didn’t know that GAE infrastructure will start loadbalancing across multiple instances even at 2 requests per second which seems too low.
Since the randomization data cannot be stored in cookie (without encrypting), I had to store it on the server. And from the point when the user is presented with a set of search results, to the point when the user votes on it, it would be nice to keep the user on the same app instance. Since I GAE was switching users (was doing loadbalancing based on load) I had to find a more reliable way to persist the randomization information.
The solution I implemented was two fold. First I reduced the number of interactions between the browser and the backend server from 4 to 2 HTTP requests. This effectively reduced the probability of users switching app instances during the most critical part of the app’s operation . The second change was that I decided not to use Session object and instead used memcache directly to make this the randomization data persist a little more reliably.
On hindsight, I think encrypted cookies would have been a better approach for this particular application. It completely side-steps the requirement of keeping session information on the server.
I’m sure this is not the end of all the problems. If there is an update I’ll definitely post it here. If there are any readers who are curious about anything specific please let me know and I’ll be happy to share my experiences.
Most web applications needs at least the following services to be self sufficient. Computational power, storage, webserver/cdn, database, messaging, loadbalancer and monitoring.
Here is the tried and tested steps as recommended by AWS folks
Move static web content to S3 storage first. Images, css stylesheets, javascript files, html, etc can all be moved to S3. Its easier to move some static content than others, so there might be some work required to understand how to breakup web content to move parts of it into the cloud.
The content on S3 can be served by Amazon Cloudfront service which is Amazon’s CDN(content delivery network) service. Once you persist your data on S3, your users will get those objects from the S3 servers located closest to them.
Move applications and webserver layer to the EC2 infrastructure. This step will require you to figure out how to automate deployments into cloud infrastructure
Once your apps are in the cloud, you can start working on building your availability zones to make your infrastructure tolerant to failures of Amazon datacenters. For example if you have apps deployed across US and Europe, if the US datacenters have problems, European datacenters would be able to absorb the shock and keep your services available.
Start using Amazons auto-scaling functionality to add/remove infrastructure automatically depending on the load on the system.
The most complicated part might be moving your databases to the AWS cloud. If you plan to keep your databases on RDBMS (Mysql/Postgress) then you should try to EBS (Elastic Block Storage) and figure out how to take snapshots to S3. You should also try to figure out how to do DB replication across availability zones to keep your site available during single datacenter failures.
At this point since most of your application components are in the cloud, you should be able to start using new amazon services to make your service even better. One possible example is SQS which allows frontend applications to queue requests for other parts of the application (or DB) for asynchronous processing.
Investigate the possibility of moving more of the DB components to S3 and SimpleDB to reduce the need of RDBMS as much as possible. S3 is ideal for storing large objects while SimpleDB is ideal for small stubs of data. A lot of applications using these services , use them together.
After your apps are all configured on aws, this would be a good time to setup monitoring. Amazon provides CloudWatch service which allows you to monitor your applications.
Issues to worry about. Moving to the cloud can be full of small potholes. If you understand them and anticipate them it would be easier for you to move. Here are some, you should be careful about
S3 service is “eventually consistent”. Which means that the data saved to S3 server may not be immediately available on read. Its also possible that if the same content is updated on two different S3 servers at the same time, one of the writes would be lost. This is not always bad, and if you understand it you will realize that there are ways around it.
The loadbalancer service Amazon provides doesn’t support SSL.
SimpleDB has per row max size limitation. This is why SimpleDB is better for keeping metadata which can be searched with reference to the complete data which could be kept in S3.
Parts of this post was summarized from Jinesh’s talk at the “AWS Start-up Tour 2009”.
I heard a great set of Google App engine datastore related talks at the google I/O conference. I think this is one of the best out talks I heard which is now on Youtube. You should watch it if you are working with or planning to work with Google App Engine in the near future. Click on this link if you cant see the embedded video.
For the last couple of weekends I’ve been playing with Google App Engine, (Java edition) and was pleasantly surprised at the direction it has taken. I was also fortunate enough to see some Google Engineers talk on this subject at Google I/O which helped me a lot to compile all this information.
But before I get into the details, I like to warn you that I’m not a developer, let alone a java developer. My experience with java has been limited to prototyping ideas and wasting time (and now probably yours too).
Developing on GAE isn’t very different from other Java based development environments. I used the eclipse plugin to build and test the GAE apps in the sandbox on my laptop. For most part everything you did before will work, but there are limitations introduced by GAE which tries to force you to write code which is scalable.
Threads cant be created – But one can modify the existing thread state
Direct network connections are not allowed – URLConnection can be used instead
Direct file system writes not allowed. – Use Memory, memcache, datastore instead. ( Apps can read files which are uploaded as part of the apps)
Java2D not allowed - But certain Images API, Software rendering allowed
Native Code not allowed- Only pure Java libraries are allowed
There is a JRE class whitelist which you can refer to to know which classes supported by GAE.
GAE runs inside a heavily version of jetty/jasper servlet container currently using Sun’s 1.6 JVM (client mode). Most of what you would did to build a webapp world still applies, but because of limitations of what can work on GAE, the libraries and frameworks which are known to work should be explicitly checked for. If you are curious whether the library/framework you use for your webapp will work in GAE, check out this page for the official list of known/working options (will it play in app engine).
Now the interesting part. Each request gets a maximum of 30 seconds in which it has to complete or GAE will throw an exception. If you are building a web application which requires large number of datastore operations, you have to figure out how to break requests into small chunks such that it does complete in 30 seconds. You also have to figure out how to detect failures such that clients can reissue the request if they fail.
But this limitation has a silver lining. Though you are limited by how long a request can take to execute, you are not limited by the number of simultaneous requests currently (you can get to 32 simultaneous threads in free account, and can go up higher if you want to pay). Theoretically you should be able to scale horizontally to as many requests per second as you want. There are few other factors, like how you architect your data in datastore, which can still limit how many operations per second you can do. Some of the other GAE limits are listed here.
You have to use google’s datastore api’s to persist data to maximize GAE’s potential. You could still use S3, SimpleDB or your favorite cloud DB storage, but the high latency would probably kill your app first.
The Datastore on GAE is where GAE gets very interesting and departs significantly from most traditional java webapp development experiences. Here are a few quick things which took me a while to figure out.
Datastore is schemaless (I’m sure you knew this already)
Its built over google’s BigTable infrastructure. (you knew this as well…)
It looks like SQL, but don’t be fooled. Its so crippled that you won’t recognize it from two feet away. After a week of playing with GAE I know there are at least 2 to 3 ways to query this data, and the various syntaxes are confusing. ( I’ll give an update once a figure this whole thing out)
You can have Datastore generate keys for your entities, or you can assign it yourself. If you decide to create your own keys (which has its benefits BTW) you need to figure out how to build the keys in such a way that they don’t collide with unintentional consequences.
Creation of “uniqueness” index is not supported.
Nor can you do joins across tables. If you really need a join, you would have to do it at the app. I heard there are some folks coming out with libraries which can fake a relational data model over datastore… don’t have more information on it right now.
The amount of datastore CPU (in addition to regular app CPU) you use is monitored. So if you create a lot of indexes, you better be ready to pay for it.
Figuring out how to index your data isn’t rocket science. Single column indexes are automatically built for you. Multi-column indexes need to be configured in the app. GAE sandbox running on your desktop/laptop does figure out which indexes you need by monitoring your queries, so you may not have to do much for most part. When you upload the app, the config file instructing which index are required is uploaded with it. In GAE Python, there are ways to tell google not to index some fields
Index creation on GAE takes a long time for some reason. Even for small tables. This is a known issue, but not a show stopper in my personal opinion
Figuring out how to breakup/store/normalize/denormalize your data to best use GAE’s datastore would probably be one of the most interesting challenges you would have to deal with.
The problem gets trickier if you have a huge amount of data to process in each request. There are strict CPU resource timeouts which currently look slightly buggy to me (or work in a way I don’t understand yet). If a single query takes over a few seconds (5 to 10) it generally fails for me. And if the same HTTP request generates a lot of datastore queries, there is a 30 second limit on the HTTP request after which the request would be killed.
From what I understand datastore is optimized for reads and writes are expensive. Not only do indexes have to be updated, each write needs to be written to the disk before the operation is considered complete. That brings in physical limitations of how fast you can process data if you are planning to write a lot of data. Breaking data into multiple tables is probably a better way to go
There is no way to drop a table or a datastore. You have to delete it 1000 rows at a time using you app currently. This is one of the biggest issues brought up by the developers and its possible it would be fixed soon.
There is no way to delete an application either…
There is a python script to upload large amount of data to the GAE datastore. Unfortunately, one needs to understand how the datamodel you designed for java app looks like in python world. This has been a blocker for me, but I’m sure I could have figured it out using google groups if I really wanted to.
If I understand correctly the datastore (uses BigTable architecture) is built on top of 4 large bigtables.
If I understand correctly, though GAE’s datastore architecture supports transactions, its Master-Master replication across multiple-datacenters has some caveats which needs to be understood. GAE engineers explained that 2 Phase comit and Paxos are better at handling data consistencies across datacenters but suffers from heavy latency because of which its not used for GAE’s datastore currently. They hope/plan to give some kind of support for a more reliable data consistency mechanism.
Other than the Datastore, I’d like to mention a few other key things which are important central elements of the GAE architecture.
Memcache support is built in. I was able to use it within a minute of figuring out that its possible. Hitting datastore is expensive and if you can get by with just using memcache, thats what is recommended.
Session persistence exist and its persisted to both memcache and datastore. However its disabled by default and GAE engineers recommend to stay away from it. Managing sessions is expensive, especially if you are hitting datastore very frequently.
Apps can send emails (there are paid/free limits)
Apps can make HTTP requests to outside world using URLConnection
Apps get google authentication support out of the box. Apps don’t have to manage user information or build login application/module to create user specific content.
Currently GAE doesn’t provide a way to set which datacenter (or country) to host your app from (Amazon allows users to choose US or EU). They are actively working to solve this problem.
Thats all for now, I’ll keep you updated as things move along. If you are curious about something very specific, please do leave a comment here or at the GAE java google group.
If you have not used EC2 because of some reason, chances are that those reasons don’t exist anymore. More information available in the following places.
Bret has a nice little article talking about why most people should still stick with known, tested database engines even if the data stored is not relational. Friendfeed uses a simple table to keep attribute value pairs and separate tables to keep indexes for each attribute which needs indexing.
The design is very simple and reasonable, and makes an effective argument against using cloud DB (or something like CouchDB) when you can get away with what you need with true and tested engines.
A few years ago I wrote a simple online bookmarking tool called Flagthis. The tool allowed one to bookmark sites using a javascript bookmarklet from the bookmark tab. The problem it was trying to solve is that most links people bookmark are never used again if they are not checked out within the next few days. The tool helps the user ignore bookmarks which were not used in last 30 days.
The initial version of this tool used MySQL database. The original application architecture was very simple, and other than the database it could have scaled horizontally. Over the weekend I played a little with SimpleDB and was able to convert my code to use SimpleDB in a matter of hours.
Here are some things I observed during my experimentation
Its not a relational database.
Can’t do joins in the database. If joins have to be done, it has to be done at the application which can be very expensive .
De-normalizing data is recommended.
Schemaless: You can add new columns (which are actually just new row attributes) anytime you want.
You have to create your own unique row identifiers. SimpleDB doesn’t have a concept of auto-increment
All attributes are auto-Indexed. I think in Google App Engine you had to specify which columns need indexing. I’m wondering if this would increase cost of using SimpleDB.
Data is automatically replicated across Amazon’s huge SimpleDB cloud. But they only guarantee something called “Eventually Consistent”. Which means data which is “put” into the system is not guaranteed to be available in the next “get”.
I couldn’t find a GUI based tool to browse my SimpleDB like the way some S3 browsers do. I’m sure someone will come up with something soon. [Updated: Jeff blogged about some simpleDB tools here]
There are limits imposed by SimpleDB on the amount of data you can put in. Look at the tables below.
Update as of Feb 28th 2009: Contradictory to my initial speculation, Amazon CloudFront is nothing like Akamai WAA. This is very depressing to me as an Akamai/WAA customer… I’m sure folks at Akamai don’t share this opinion. CloudFront seems to be a glorified S3 solution which is mostly used for static (non-dynamic) content.
————-
Amazon has finally opened the doors of its new CDN (Content Delivery Network) called CloudFront. But instead of building a completely new product it has interestingly expanded its S3 network to include content replication for lower latency content delivery. By not reinventing a whole new way of uploading data to the CDN network, Amazon has seriously cut down the cost for end users to try out this technology.
Most of the CDNs I’ve investigated do very well with static content which needs to be periodically refreshed somehow.
There is at least one service from Akamai called WAA – Web application accelerator which seem to understand the importance of accelerating extremely dynamic content using intelligent routing and closer points of presence to end user. WAA doesn’t put the content closer to the end user, but provides an extremely efficient conduit for this traffic where Akamai controls both ends network by placing a POP in front of the client and the server. By doing this Akamai can take control of TCP/IP window sizes within its network and provide a low latency, higher bandwidth response to the customer. In addition to all this Akamai also provides an option to cache some data ( as defined in the HTTP headers, or WAA configuration ) to be cached for a longer duration.
Though Amazon might be doing replication as well, it may be closer to the Akamai’s WAA model than what you thought. Its kind of obvious that if the data is going to change all the time, there has to be some kind of master-slave concept, and its also clear that if many people are accessing that data around the world it has to be transported through a very efficient high bandwidth network to the various Amazon Points of presence around the world. And finally just like the Akamai’s WAA model, it probably does the cache content to serve the content directly from its local cache incase the object hasn’t changed on the master since the last time it was retrieved.
A month ago I went shopping, looking for alternatives to Akamai’s WAA and didn’t find anyone. I suspect CloudFront changes that a little bit. One significant difference between Amazon and most CDNs out there including CloudFront is that there is relatively very little work which needs to be done by the developer to integrate with WAA. This is not true with most CDNs, and certainly not true for CloudFront if you are not already on S3. But it does change the dynamics of this industry.