How to set up Amazon Cloudfront (learning by experimentation)

I have some experience with Akamai’s WAA (Web Application Accelerator) service, which I’ve been using in my professional capacity for a few years now, and I’ve been curious about how Cloudfront compares with it. Until a few weeks ago, Cloudfront lacked a feature which I think is critical for winning over traditional CDN customers. “Custom origin” is an amazing new feature which I finally got to test last night, and here are my notes for those who are curious as well.

The test application I tried to convert was my news aggregator portal, http://www.scalebig.com/. It consists of a rapidly changing front page (updated a few times a day), a collection of old pages archived in a subdirectory, and other page elements such as headers, footers, images and style sheets.

  • While Amazon Cloudfront does have a presence on the AWS management console, the console only supports S3 buckets as origins.
  • Since my application didn’t have any components which require server-side processing, I tried to put the whole website in an S3 bucket and use S3 as the origin.
  • When I initially set it up, I ended up with multiple URLs which I had to understand:
    • S3 URL – This is the unique URL of your S3 bucket. All requests to this URL go to Amazon’s S3 server cluster, and if your objects are marked as public, anyone can get them. The object could be a movie, an image, or even an HTML file.
    • Cloudfront URL – This is the unique Cloudfront URL which maps to your S3 resource through the Cloudfront network. For all practical purposes it’s the same as the first one, except that requests go through the CDN service.
    • Your own domain name – This is the actual URL which end users will see, which will be a CNAME to the cloudfront URL.
  • So in my case, I configured the DNS entry for www.scalebig.com as a CNAME pointing to the DNS name the Cloudfront service created for me (dbnqedizktbfa.cloudfront.net).
  • The first thing which broke was that I forgot that this is just an S3 bucket, so it can’t handle things like server-parsed HTML to dynamically append headers/footers. I also realized that it can’t control cache policies, set up expiry, etc. But the worst problem was that going to “http://www.scalebig.com/” threw an error. It was expecting a file name, so http://www.scalebig.com/index.html would have worked.
  • In short, I realized that my idea of using S3 as a web server was full of holes.
  • When I started digging for options to enable “custom origin”, I realized that those options do not exist on the AWS management console! I was instead directed to some third-party applications (most of them commercial products, except two).
  • I finally created the cloudfront configuration using Cloudberry S3 Explorer PRO which allowed me to point Cloudfront to a custom domain name (instead of an S3 resource).
  • In my case my server was running on EC2 with a public reserved IP. I’m not yet using AWS ELB (Elastic Load Balancing).
  • Once I got that working, which it literally did out of the box, the next challenge was to get the cache controls and expiries working. If they are set incorrectly, users may be stopped from getting the latest content. I set up the policies using “.htaccess”. Below I’ve attached a part of the .htaccess I have for the /index.html page, which is updated many times a day. There is a similar .htaccess file for the rest of the website which sets a much longer expiry.
  • Finally, I realized that I might have to invalidate parts of the cache at times (for example, because of a bug). Cloudberry and the AWS management console didn’t have any option available, but apparently “boto” has some APIs which can work with the Amazon Cloudfront APIs to do this.
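
The invalidation mentioned in the last note can be sketched roughly as follows. This is only a sketch under a few assumptions: it uses boto3, the present-day successor to the “boto” library named above, rather than whatever API boto offered at the time, and the distribution ID in the usage comment is a made-up placeholder. The helper that builds the request body is plain Python; only invalidate() needs boto3 and AWS credentials.

```python
import time


def build_invalidation_batch(paths, caller_reference=None):
    """Build the InvalidationBatch structure passed to create_invalidation."""
    # Invalidation paths are expected to start with "/".
    normalized = [p if p.startswith("/") else "/" + p for p in paths]
    if caller_reference is None:
        # Any unique string works; a timestamp is a simple choice.
        caller_reference = "invalidate-%d" % int(time.time())
    return {
        "Paths": {"Quantity": len(normalized), "Items": normalized},
        "CallerReference": caller_reference,
    }


def invalidate(distribution_id, paths):
    """Submit the invalidation (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("cloudfront")
    return client.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch=build_invalidation_batch(paths),
    )


# Hypothetical usage; "EXAMPLEDISTID" is a placeholder, not a real distribution:
# invalidate("EXAMPLEDISTID", ["/index.html"])
```

For a page that changes several times a day, short expiry headers are probably the first tool to reach for, with invalidation kept for mistakes. The .htaccess excerpt for /index.html follows.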

# turn on the module for this directory
ExpiresActive on
# set default
ExpiresDefault "access plus 1 hours"
ExpiresByType image/jpg "access plus 1 hours"
ExpiresByType image/gif "access plus 1 hours"
ExpiresByType image/jpeg "access plus 1 hours"
ExpiresByType image/png "access plus 1 hours"
ExpiresByType text/css "access plus 1 hours"
ExpiresByType text/javascript "access plus 1 hours"
ExpiresByType application/javascript "access plus 1 hours"
ExpiresByType application/x-javascript "access plus 1 hours"
ExpiresByType application/x-shockwave-flash "access plus 1 hours"

Header set Cache-Control "max-age=3600"

AddOutputFilterByType DEFLATE text/html text/plain text/xml application/javascript text/javascript  application/x-javascript text/css
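
To make the effect of the directives above a little more concrete, here is a small sketch of how a cache might decide how long a copy of /index.html is still fresh. It only looks at the max-age value in Cache-Control and at the Age header that caches commonly add to report how long they have held a copy; real caches apply more rules than this. The function names and the header values in the usage lines are made up for illustration.

```python
import re


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control or "", re.IGNORECASE)
    return int(match.group(1)) if match else None


def remaining_freshness(headers):
    """Seconds a cached copy stays fresh, based on max-age and Age."""
    max_age = max_age_seconds(headers.get("Cache-Control"))
    if max_age is None:
        return None  # no explicit lifetime was given
    age = int(headers.get("Age", 0))
    return max(max_age - age, 0)


# A response served with the policy above, held by a cache for ten minutes:
print(remaining_freshness({"Cache-Control": "max-age=3600", "Age": "600"}))
```

With a one-hour max-age, a front page updated a few times a day can be up to an hour stale at the edge, which is the trade-off behind choosing a short expiry here and a longer one for the rest of the site.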

Here is how I would summarize the current state of Amazon Cloudfront.

  • It’s definitely ready for static websites which don’t have any server-side code.
  • Cloudfront only accepts GET and HEAD requests
  • Cloudfront ignores cookies, so the server can’t set any. (Browser-based cookie management will still work, and could be used to keep in-browser session data.)
  • If you do want to use server-side code, use iframes, JSONP, JavaScript widgets or some other mechanism to execute code from a different domain name (one which is not on Cloudfront).
  • While Cloudfront can write access logs to an S3 bucket of your choice, I’d recommend using something like Google Analytics to do log analysis.
  • I’d recommend buying one of the commercial third-party products if you want to use Custom Origin, and reading more about the protocols/APIs before you fully trust a production service to Cloudfront.
  • I wish Cloudfront would start supporting something like ESI, which could effectively make an S3 bucket a full-fledged web server without the need to keep an EC2 instance running all the time.
  • Overall Cloudfront has a very long way to go, in the number of features, to be treated as a competitor for Akamai’s current range of services.
  • And if you look at Akamai’s current world wide presence, Cloudfront is just a tiny blip.  [ Cloudfront edge locations ]
  • But I suspect that Cloudfront’s continuous evolution is being watched by many and the next set of features could change the balance.

I’m planning to leave http://www.scalebig.com/ on Cloudfront for some time to learn a little more about its operational issues. If you have been using Cloudfront, please feel free to leave comments about which important features you think are still missing.

ESI: Edge Side Includes

Web page caching gets tricky once personalization is involved. Let’s take the Twitter public_timeline as an example, which seems to be perfect for caching. Unfortunately, when a user is logged in, it also shows that user’s information. Caching that particular page in its entirety on the web server may not be an option in such scenarios. Another scenario is where parts of a page expire faster than others (they require different cache TTLs). Here again, caching the whole page doesn’t help.

Edge Side Includes (ESI) is a markup language specifically designed to help web servers assemble dynamic content at the web layer.

<esi:include src="http://www.foo.com/"/>

The above ESI tag is similar to tags in JSP/PHP/etc. which allow one page to refer to another page for parts of its content. By breaking up the page into smaller objects, the web server can apply different TTL settings (and user validation) to different parts of the content. Twitter used to (and may still) use “Varnish”, which supports a subset of the ESI specification out of the box.
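
The assembly step described above can be sketched in a few lines. This is a toy illustration, not how Varnish or Akamai implement ESI: it only understands the self-closing esi:include tag, and the names assemble and FragmentCache, the fragment paths and the TTL values are all made up for the example. Each fragment is fetched through a small cache that keeps its own TTL per fragment, so a header can be cached for an hour while a per-user fragment is fetched every time.

```python
import re
import time

# Matches only the simple self-closing form: <esi:include src="..."/>
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def assemble(template, fetch):
    """Replace each esi:include tag with the fragment returned by fetch(src)."""
    return ESI_INCLUDE.sub(lambda m: fetch(m.group(1)), template)


class FragmentCache:
    """Caches fragments individually, each with its own TTL in seconds."""

    def __init__(self, fetch, ttls, clock=time.time):
        self.fetch = fetch
        self.ttls = ttls      # src -> TTL; fragments not listed are never cached
        self.clock = clock
        self.store = {}       # src -> (expiry time, body)

    def get(self, src):
        now = self.clock()
        entry = self.store.get(src)
        if entry is not None and entry[0] > now:
            return entry[1]   # still fresh, serve the cached copy
        body = self.fetch(src)
        self.store[src] = (now + self.ttls.get(src, 0), body)
        return body


# Hypothetical fragments standing in for requests to the origin server.
fragments = {"/header": "<h1>News</h1>", "/user": "<p>Hello, guest</p>"}
cache = FragmentCache(fragments.__getitem__, {"/header": 3600})
page = '<body><esi:include src="/header"/><esi:include src="/user"/></body>'
print(assemble(page, cache.get))
```

The point of the design is that the page template and each fragment become separate cacheable objects, so the assembling layer decides freshness per fragment instead of per page.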

But caching on the webserver may not be the real reason why this language was invented. ESI is also supported by Akamai  (CDN) on its edge caching product.  By allowing Akamai edge nodes to do the assembling close to the user, they significantly improve perceived end user performance without giving up personalization or content freshness requirements.