Other than wonderful, fun-loving employees who are also "super heroes" in his eyes, he talks about how they are doubling their storage requirements on a yearly basis. They already have about 300 TB in use, and as of today all of that is on Amazon's S3. Don estimates that, for the storage they are using, they are saving about 500K per year, which is pretty big for a small operation like theirs.
The Smugmug architecture has evolved over time. Internally it can serve images in 3 different ways.
- It could "proxy" the request, where the app server would get the object from storage and serve it to the client.
- It could do a "redirect" where the app server could redirect the user to the right resource.
- And finally it could just serve the images directly from the storage using REST based APIs.
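As a rough illustration of the three modes above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `ServeMode` enum, the `Response` type, the in-memory `FAKE_STORAGE` dict and the placeholder URLs are hypothetical and are not Smugmug's actual code or the S3 API.

```python
from dataclasses import dataclass, field
from enum import Enum


class ServeMode(Enum):
    PROXY = "proxy"        # app server fetches the bytes and returns them
    REDIRECT = "redirect"  # app server answers with a redirect to storage
    DIRECT = "direct"      # page links straight to storage, app server not involved


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: bytes = b""


# Hypothetical stand-ins for the storage backend and its public URL.
FAKE_STORAGE = {"photos/1.jpg": b"jpeg-bytes"}
STORAGE_BASE_URL = "https://storage.example.com/"
APP_BASE_URL = "https://app.example.com/img/"


def image_src(key: str, mode: ServeMode) -> str:
    """URL the page embeds: only DIRECT points the browser at storage itself."""
    if mode is ServeMode.DIRECT:
        return STORAGE_BASE_URL + key
    return APP_BASE_URL + key


def handle_app_request(key: str, mode: ServeMode) -> Response:
    """What the app server does when a request for an image reaches it."""
    if mode is ServeMode.PROXY:
        return Response(200, {"Content-Type": "image/jpeg"}, FAKE_STORAGE[key])
    if mode is ServeMode.REDIRECT:
        return Response(302, {"Location": STORAGE_BASE_URL + key})
    raise ValueError("in DIRECT mode requests never reach the app server")
```

The proxy path costs the app server the most work, but it is also the only one where the app server sees every byte it hands out, which matters for protected objects.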
This flexibility allowed Smugmug to try different ways of combining internal and external (S3) storage. Interestingly, even though they wanted to get away from self-managed internal storage, they found that S3 storage wasn't very cheap once you take the bandwidth utilization into account. After trying various combinations, they managed to set up a system where they continued to use S3 for primary storage, but kept 10% of the hottest objects locally on Smugmug's own servers to minimize bandwidth utilization on the S3 service. This allowed them to get away with not buying 95% of the storage drives they were originally supposed to buy. 5% of 300 TB is still 15 TB, which is smaller but unfortunately not small enough.
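The idea of keeping the hottest objects locally in front of the primary store can be sketched as a small read-through cache. This is only an illustration: the talk summary does not say how Smugmug decides what is "hot", so the least-recently-used eviction below is an assumed policy, and `HotObjectCache` and `fetch_remote` are hypothetical names.

```python
from collections import OrderedDict


class HotObjectCache:
    """Read-through cache holding a bounded number of recently used objects."""

    def __init__(self, fetch_remote, capacity):
        self._fetch_remote = fetch_remote  # callable: key -> bytes (stand-in for primary storage)
        self._capacity = capacity
        self._items = OrderedDict()        # oldest entry first, most recent last
        self.remote_fetches = 0            # counts trips to primary storage

    def get(self, key):
        if key in self._items:
            # Local hit: no bandwidth spent on the remote store.
            self._items.move_to_end(key)
            return self._items[key]
        # Miss: fetch from primary storage and keep a local copy.
        data = self._fetch_remote(key)
        self.remote_fetches += 1
        self._items[key] = data
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used object
        return data
```

With a dict standing in for the remote store, `HotObjectCache(remote.__getitem__, capacity=2)` serves repeated reads of a popular key locally and only counts a remote fetch on a miss.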
Though storage management is easy with S3, it can at times make things difficult. Permission management was one such case: they had to sacrifice speed/performance in favour of the "proxy" mechanism, which is a more robust and reliable way of serving protected objects.
On reliability, Don mentioned that having multiple single points of failure is not really helpful if you want to provide near 100% availability. With S3 in the picture, not only do they have to worry about connectivity from the customer to Smugmug and about Smugmug's own servers, they also have to worry about connectivity to Amazon and the availability of Amazon's services round the clock. Hence they had to design their app from the ground up to handle failures gracefully. For example, a failed write to S3 is handled by recording the change locally, which is then synced asynchronously. At other times when things break, it's designed to either try again or alert the right folks so that action can be taken.
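The "record locally, sync later, retry or alert" behaviour described above might look roughly like the following sketch. All names here (`WriteBehindStore`, `remote_put`, `max_attempts`) are hypothetical, the failure is modelled as an `OSError`, and the sync step is called by hand, whereas a real system would run it from a background worker.

```python
from collections import deque


class WriteBehindStore:
    """Queue writes that fail remotely and replay them later."""

    def __init__(self, remote_put, max_attempts=3):
        self._remote_put = remote_put      # callable: (key, data) -> None, may raise OSError
        self._max_attempts = max_attempts  # assumed retry budget before alerting
        self.pending = deque()             # locally recorded changes awaiting sync
        self.alerts = []                   # keys that need human attention

    def put(self, key, data):
        try:
            self._remote_put(key, data)
        except OSError:
            # Remote write failed: record the change locally instead of losing it.
            self.pending.append((key, data, 1))

    def sync_pending(self):
        """Replay each queued write once; requeue or alert on repeated failure."""
        for _ in range(len(self.pending)):
            key, data, attempts = self.pending.popleft()
            try:
                self._remote_put(key, data)
            except OSError:
                if attempts + 1 >= self._max_attempts:
                    self.alerts.append(key)  # give up and tell someone
                else:
                    self.pending.append((key, data, attempts + 1))
```

The caller's `put` never fails because the remote store is down; the cost is that the remote copy can lag behind until `sync_pending` succeeds.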
While talking about nice-to-have features, he mentioned that S3 shouldn't be confused with a CDN. It's not a distributed caching service and it doesn't have global locations like a CDN has. Regardless, he said, S3 should probably have some limited caching or streaming ability, which would boost performance and add value to an already invaluable service.
On the topic of Smugmug's future, Don mentioned they are flirting with the possibility of using EC2 in the near future. EC2 would probably be used for image processing computation nodes. Since the EC2 servers are located in the Amazon facility, it will save them the bandwidth of transferring data from S3. And since they can turn off server instances on demand (and not pay for them while they are offline), it will probably cut down the operating costs of maintaining image processing servers as well.
At the end he mentioned a few services which he would like to see Amazon offer in the future. Two important ones which he mentioned (and which I think are critical) are some kind of database service and load balancer APIs, both of which are currently absent.
PDF Slides of the same presentation here (34 slides in this one)