How to setup Amazon Cloudfront ( learning with experimentation )

I have some experience with Akamai’s WAA (Web applications archive) service, which I’ve been using in my professional capacity for a few years now. And I’ve have been curious about how  cloudfront compares with it. Until a few weeks ago, Cloudfront didn’t have a key feature which I think was critical for it to win the traditional CDN customers. “Custom origin” is an amazing new feature which I finally got to test last night and here are my notes for those who are curious as well.

My test application which I tried to convert was my news aggregator portal http://www.scalebig.com/. The application consists of a rapidly changing front page (few times a day) ,  a collection of old pages archived in a sub directory and some other webpage elements like headers, footers, images, style-sheets etc.

  • While Amazon Coudfront does have a presence on AWS management console, it only supports S3 buckets as origins.
  • Since my application didn’t have any components which requires server side processing, I tried to put the whole website on an S3 bucket and tried to use S3 as the origin.
  • When I initially set it up, I ended up with multiple URLs which I had to understand
    • S3 URL – This is the unique URL to your S3 bucket. All requests to this URL will go to Amazons S3 server cluster, and if your objects are marked as private, anyone can get these objects. The object could be a movie, an image, or even an HTML file.
    • Cloudfront URL  – This is the unique Cloudfront URL which maps to your S3 resource through the cloudfront network. For all practical purposes its the same as the first one, except that this is through the CDN service.
    • Your own domain name – This is the actual URL which end users will see, which will be a CNAME to the cloudfront URL.
  • So in my case, I configured the DNS entry for www.scalebig.com to point to DNS entry Cloudfront service created for me (dbnqedizktbfa.cloudfront.net).
  • First thing which broke is that I forgot that this is just an S3 bucket, so it can’t handle things like “sparsed html” to dynamically append headers/footers. I also realized that it can’t control cache policies, setup expiry, etc. But the worst problem was that if you went to “http://www.scalebig.com/” it would throw an error. It was expecting a file name, so http://www.scalebig.com/index.html would have worked.
  • In short I realized that my idea of using S3 as a webserver full of holes.
  • When I started digging for options to enable “custom origin” I realized that those options do not exist on the AWS management console !!. I was instead directed to some third party applications to do this instead. (most of them were commercial products, except two)
  • I finally created the cloudfront configuration using Cloudberry S3 Explorer PRO which allowed me to point Cloudfront to a custom domain name (instead of an S3 resource).
  • In my case my server was running on EC2 with a public reserved IP.  I’m not yet using AWS ELB (Elastic loadbalancer).
  • Once I got that working, which literally worked out of the box, the next challenge is to setup the cache controls and expiries working. If they are set incorrectly, it may stop users from getting latest content. I setup the policies using “.htaccess”. Below I’ve attached a part of the .htaccess I have for the /index.html page which is updated many times a day. There is a similar .htaccess page for rest of the website which recommends a much longer expiry.
  • Finally I realized that it is possible that I might have to invalidate parts of the caches at times (could be due to a bug). Cloudberry and AWS management console didn’t have any option avaliable, but apparently “boto” has some APIs which can work with Amazon cloudfront APIs to do this.

# turn on the module for this directory
ExpiresActive on
# set default
ExpiresDefault "access plus 1 hours"
ExpiresByType image/jpg "access plus 1 hours"
ExpiresByType image/gif "access plus 1 hours"
ExpiresByType image/jpeg "access plus 1 hours"
ExpiresByType image/png "access plus 1 hours"
ExpiresByType text/css "access plus 1 hours"
ExpiresByType text/javascript "access plus 1 hours"
ExpiresByType application/javascript "access plus 1 hours"
ExpiresByType application/x-javascript "access plus 1 hours"
ExpiresByType application/x-shockwave-flash "access plus 1 hours"

Header set Cache-Control "max-age=3600"

AddOutputFilterByType DEFLATE text/html text/plain text/xml application/javascript text/javascript  application/x-javascript text/css

Here is how I would summarize the current state of Amazon cloudfront.

  • Its definitely ready for static websites which don’t have any server side execution code.
  • Cloudfront only accepts GET and HEAD requests
  • Cloudfront ignores cookies, so server can’t set any. (Browser based cookie management will still work, which could be used to keep in-browser session data)
  • If you do want to use serverside code, use iframes, jsonp, javascript widgets or some other mechanism to execute code from a different domain name (which is not on cloudfront).
  • While Cloudfront can log access logs to an S3 bucket of your choice, I’ll recommend using something like Google Analytics to do log analysis.
  • I’ll recommend buying one of the commercial third party products if you want to use Custom Origin and would recommend reading more about the protocols/APIs before you fully trust a production service to Cloudfront.
  • I wish Cloudfront starts supporting something like ESI, which could effectively make an S3 bucket a full fledged webserver without the need of having a running EC2 instance all the time.
  • Overall Cloudfront has a very long way to go, in the number of features, to be treated as a competitor for Akamai’s current range of services.
  • And if you look at Akamai’s current world wide presence, Cloudfront is just a tiny blip.  [ Cloudfront edge locations ]
  • But I suspect that Cloudfront’s continuous evolution is being watched by many and the next set of features could change the balance.

I’m planning to leave http://www.scalebig.com/ on Cloudfront for some time to learn a little more about its operational issues. If you have been using Cloudfront please feel free to leave comments about what important features, you think, are still missing.

More on Amazon S3 versioning (webinar)

If you missed the AWS S3 versioning webcast, I have a copy of the video here. And here are the highlights..

image

  • You can enable and disable this at the bucket level
  • They don’t think there is a performance penalty of turning versioning (but it was kind of obvious S3 would be doing slightly extra work to figure out which is the latest version of any object you have)
  • There isn’t any additional cost for using versioning. But you have to pay for extra copy of each object.
  • MFA (multi factor authentication) to delete objects is not mandatory when versioning is turned on. It needs to be turned on. This was slightly confusing in the original email I got from AWS.
  • If you are planning to use this, please watch this video. There is a part where they explain what happens if you disable versioning after using the feature. This is something you might like to know about.
  • They use GUID for versioning of each object
  • You can iterate over objects and figure out how many versions you have for each object, but currently its not possible to find all objects which have versions older than X date. This is important if you are planning to garbage collection (cleaning up older copies of data) for a later time.

More References

Versioning data in S3 on AWS

One of the problem with Amazon’s S3 was the inability to take a “snapshot” of the state of S3 at anyAmazon Web Services given moment. This is one of the most important DR (disaster recovery) steps of any major upgrade which could potentially corrupt data during a release. Until now the applications using S3 would have had to manage versioning of data, but it seems Amazon has launched a versioning feature built into S3 itself to do this particular task. In addition to that, they have made it a requirement that delete operations on versioned data can only be done using MFA (Multi factor authentication).

Versioning allows you to preserve, retrieve, and restore every version of every object in an Amazon S3 bucket. Once you enable Versioning for a bucket, Amazon S3 preserves existing objects any time you perform a PUT, POST, COPY, or DELETE operation on them. By default, GET requests will retrieve the most recently written version. Older versions of an overwritten or deleted object can be retrieved by specifying a version in the request.

The way AWS Blog describes the feature, it looks like a version would be created every time an object is modified and each object in S3 could have different number of copies depending on the number of times it was modified.

This kind of reminds me of SVN/CVS like versioning control system and I wonder how long it will take for someone to build a source code versioning system on S3.

BTW, data requests to a versioned object is priced the same way as regular data, which basically means you are getting this feature for free.

References