hatestheinternet

Jacking in to the Cloooooud

So, now that we've got our fully formed ghetto CDN running, everything's caching properly, coming from the right host, etc., there's one little problem: there's still a little bit moar faster we can add, thanks to our favourite CDN. I chose to go with CloudFront because, so long as you stick to the scrub tier, it's free for a year and, really, I don't ever see it costing that much. You're free to choose whatever you want here; it's the same principle no matter what buttons you're clicking or where you're clicking them.

When it's all said and done, we're going to have created three distributions: one for static, one for content, and one for our apex host. The reason I make three is simple: each is going to have a slightly different configuration which would otherwise affect the caching of the others.

The first thing you're going to have to do, since this isn't Web 1.0 and we're working with those new-fangled pull CDNs, is whip up some CNAMEs for your apex, static, and content hosts. This is why, if you look at the Nginx samples from the previous step, the regexes match both, for example, "org.static" and "static". Theoretically, our server doesn't need to know about origin domains, as the CDN will be sending the Host: header; matching both just makes debugging much easier. Once it's working, it's time to hit up your CDN's control panel and find out why I serve from three different hosts.

Content distribution networks, for the most part, have edge servers (or "points of presence") all over the world and generally provide a mechanism to choose which edges serve your content, with worldwide obviously being the most expensive. So far as actual charges go, CDNs mostly use two metrics: data transferred out of the edges and data pulled from the origin, the latter being more expensive.

With this in mind, let's pull a crayon out of the cup and do some back-of-the-coaster math: Obviously, worldwide everything is going to cost the most. When we stop and think about it, however, the heaviest part of a webpage is generally the images and videos and whatnot it's displaying. Someone visiting your site from Tokyo (or vice versa) probably won't notice the few extra milliseconds those images or that video take to load, but a delay in everything else you need to view a page? They'll probably notice that.

So, if we take our content host out of the mix, we're left with just a few kilobytes of HTML, JavaScript, and CSS. Even though they change more often, their combined total still weighs in at less than the average thumbnail, so I generally give my static and dynamic content hosts worldwide billing and stick to North America for my content because cheaper. At the end of the day we're really only talking about a few milliseconds round trip so don't get too hung up on this, because we've got a bit to go yet.
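
To put rough numbers on that coaster math, here's a minimal Python sketch. Every figure in it (the per-GB rates, the page weights, the traffic) is a made-up placeholder rather than anyone's real pricing, so swap in your own CDN's rate card before drawing conclusions.

```python
# All numbers below are hypothetical placeholders, not real CDN pricing.
RATE_PER_GB = {"north_america": 0.085, "worldwide": 0.120}


def monthly_cost(page_views, kb_per_view, price_class):
    """Estimate the monthly edge transfer cost for one distribution."""
    gb = page_views * kb_per_view / (1024 * 1024)
    return gb * RATE_PER_GB[price_class]


views = 100_000
light_kb = 50     # assumed HTML + CSS + JS per page view
heavy_kb = 2_000  # assumed images and video per page view

# Option 1: every distribution uses worldwide edges.
everything_worldwide = (monthly_cost(views, light_kb, "worldwide")
                        + monthly_cost(views, heavy_kb, "worldwide"))

# Option 2: only the light stuff goes worldwide, content stays in North America.
split = (monthly_cost(views, light_kb, "worldwide")
         + monthly_cost(views, heavy_kb, "north_america"))

print(f"everything worldwide: ${everything_worldwide:.2f}")
print(f"content NA only:      ${split:.2f}")
```

The heavy content dominates the bill either way, which is why moving only that distribution to the cheaper price class is where the savings come from.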

Static

This distribution is the easiest of all: Basically create the distribution and, since we know our caching game is already tight, set it to use origin headers. We do, naturally, have to make sure it's sending the Host, Origin, and Referer headers, so we'll need to whitelist them along with whatever custom headers you may or may not need.

Also, something to note: whitelisting the headers above basically does the same job as the Vary response header, in that it defines a request's "uniqueness" (i.e. the same URL and host could return content with a valid Referer and a 403/404 without one). So if your CDN or upstream proxy respects Vary, you should set it at the origin first, through something like .htaccess, as that's more correct.
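
If the "uniqueness" idea feels abstract, here's a conceptual Python sketch of a cache key built from the URL plus only the whitelisted headers. This is an illustration of the principle, not how CloudFront or any other CDN actually computes its keys; the function name and the example hostnames are made up.

```python
import hashlib

# The header trio from above; anything not listed is ignored by the key.
WHITELISTED_HEADERS = ("host", "origin", "referer")


def cache_key(method, url, headers, whitelist=WHITELISTED_HEADERS):
    """Build a key from the method, the URL, and only the whitelisted headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    parts = [method.upper(), url]
    for name in whitelist:
        parts.append(f"{name}={lowered.get(name, '')}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


url = "https://static.example.com/css/site.css"
base = {"Host": "static.example.com", "Referer": "https://example.com/"}

# A different User-Agent is not whitelisted, so it shares the cached object.
same = cache_key("GET", url, {**base, "User-Agent": "SomeBrowser/1.0"})

# A different Referer is whitelisted, so it gets its own cached object.
other = cache_key("GET", url, {**base, "Referer": "https://evil.example/"})

print(cache_key("GET", url, base) == same)   # same key
print(cache_key("GET", url, base) == other)  # different key
```

Every header you add to the whitelist multiplies the number of possible keys, so the shorter the list, the better your hit rate.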

Something else I do out of personal preference, if your CDN allows it (and most do), is lock this distribution down to only accepting GET and HEAD requests. In case you're worried about the lack of OPTIONS for CORS, remember that the browser only sends an OPTIONS preflight for "non-simple" requests, meaning methods other than GET, HEAD, and POST, or requests carrying custom headers or unusual content types, none of which we'll ever use on this host.

No dynamic content also means no use for query strings or anything fancy, so I disable those and cookies and pretty much everything else too.
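
Put together, the static distribution behaves roughly like the filter below. It's a Python sketch of the settings described above, not anything your CDN exposes; the function name is hypothetical and the real work happens in the control panel.

```python
from urllib.parse import urlsplit, urlunsplit

ALLOWED_METHODS = {"GET", "HEAD"}


def normalize_static_request(method, url, headers):
    """Return (method, url, headers) as they'd reach the origin, or None if refused."""
    method = method.upper()
    if method not in ALLOWED_METHODS:
        return None  # the edge rejects it; the origin never sees the request
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    clean_url = urlunsplit((scheme, netloc, path, "", ""))  # query string dropped
    forwarded = {k: v for k, v in headers.items() if k.lower() != "cookie"}
    return method, clean_url, forwarded


print(normalize_static_request(
    "GET",
    "https://static.example.com/js/app.js?cachebust=123",
    {"Host": "static.example.com", "Cookie": "SESSabc=xyz"},
))
print(normalize_static_request("POST", "https://static.example.com/js/app.js", {}))
```

Since the query string and cookies never make it into the request, they can't fragment the cache either.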

Content

For all intents and purposes this is exactly the same as static, but it lives on its own since we access it through a different hostname. If you're using CloudFront, went the S3 route towards the end of the ghetto CDN, and aren't using Drupal (or anything else that generates missing content on demand), you can even bolt this directly on to the S3 bucket for an only academically noticeable bit of faster. Doing so, however, will bypass any RedirectRules or CORS configurations we set up through the bucket's properties, so TL;DR: it'll break your image styles.

If you're on Drupal or not on CloudFront, you'll have to use the hostname of the bucket's website endpoint as the origin, which is probably content.[yourdomain.com].s3-website-[region].amazonaws.com, but you might want to double check that.

As with static, we can stick with the origin's caching instructions here since it's either coming from Apache or an S3 object with a Cache-Control header defined (courtesy of my RioFS fork). The only real difference is I allow query strings here since Drupal's image token would get swallowed otherwise.

Dynamic Content

This is the hardest distribution to set up, in that it requires a few more clicks and is a lot more dependent on how our CMS behaves. We use origin headers here too, since we believe in doing things properly, so if you don't see anti-cache headers (i.e. Cache-Control: max-age=0 and the like) in administrative areas and when previewing unpublished content, you probably shouldn't even think of hooking this up to a CDN, as those pages could wind up being cached.
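
If you'd rather check than eyeball, here's a rough Python sketch of the question to ask of each admin or preview response: would a shared cache that trusts origin headers be allowed to keep this? It's a simplification (it ignores Expires, Pragma, and your CDN's own default TTL settings), and the function name is made up.

```python
def is_cacheable_by_shared_cache(headers):
    """Rough check: could a CDN honouring origin headers store this response?"""
    lowered = {k.lower(): v for k, v in headers.items()}
    directives = {}
    for part in lowered.get("cache-control", "").split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    # Any of these means the response is off limits or must be revalidated.
    if any(d in directives for d in ("no-store", "no-cache", "private")):
        return False

    # For shared caches, s-maxage takes priority over max-age.
    for key in ("s-maxage", "max-age"):
        if key in directives:
            try:
                return int(directives[key]) > 0
            except ValueError:
                return False

    # No lifetime given at all: the CDN may fall back to its own default,
    # which is exactly the situation you don't want for admin pages.
    return True


print(is_cacheable_by_shared_cache({"Cache-Control": "no-cache, must-revalidate"}))
print(is_cacheable_by_shared_cache({"Cache-Control": "public, max-age=3600"}))
print(is_cacheable_by_shared_cache({}))
```

Run something like this against the headers from your admin pages and unpublished previews; if any of them come back cacheable, fix the CMS before fronting it with a CDN.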

Request-wise, I allow GET, POST, and OPTIONS because that's all Drupal needs. You should allow PUT, PATCH, DELETE, etc as required by your CMS. Obviously we allow query strings too since, well, they're not going anywhere any time soon.

"Uniqocity"-wise, you should allow the usual trio of headers plus whitelist whatever cookies your CMS needs for its session management (for Drupal this is SESS*, but it's really not that hard to find even if you have to watch your developer tools' Network tab). I found a few scary posts recommending you whitelist every cookie, but if something Javascripty sets one with a unique ID, you've basically turned your CDN into a personal caching proxy for everyone who visits your site.
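
To see why the cookie whitelist matters, here's a small Python sketch of what the CDN effectively does with the Cookie header before deciding whether two requests are "the same". The SESS* pattern comes from above; the function name and the sample cookie names are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Only cookies matching these patterns are forwarded and count towards uniqueness.
COOKIE_WHITELIST = ("SESS*",)


def forwarded_cookies(cookie_header, patterns=COOKIE_WHITELIST):
    """Keep only the cookies whose names match a whitelisted pattern."""
    kept = []
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and any(fnmatchcase(name, p) for p in patterns):
            kept.append(f"{name}={value}")
    return "; ".join(kept)


# Two anonymous visitors with different tracking IDs collapse to the same
# (empty) cookie set, so they can share one cached page.
print(repr(forwarded_cookies("_tracker=visitor-1; has_js=1")))
print(repr(forwarded_cookies("_tracker=visitor-2; has_js=1")))

# A logged-in user keeps their session cookie and gets their own responses.
print(repr(forwarded_cookies("SESSabc123=xyz; _tracker=visitor-3")))
```

Whitelist every cookie and those first two visitors would each get a private cache entry, which is the "personal caching proxy" problem in a nutshell.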

Testing

Now that we've dotted all the Is and crossed all the Ts (and the CNAMEs pointing our apex, static, and content hosts to the CDN have propagated, of course), we should find ourselves browsing a fully CDN-fronted site. As when we first got our ghetto CDN setup rolling, pop open your developer tools and pay attention to the headers being returned, keeping an eye out for not only the usual Cache-Control but also any your CDN may add. CloudFront, for example, adds an X-Cache header reading "Hit from cloudfront" or "Miss from cloudfront", and Akamai will set X-Cache to TCP_HIT or TCP_MISS, but your CDN's knowledge base will tell you what to look for. In most cases, however, you'll see an Age header that increases between requests if everything's working.
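
If you'd rather script the check than squint at developer tools, something like this Python sketch will do. The helper names are made up, the hit/miss matching is a loose substring check that happens to fit the X-Cache values mentioned above, and other CDNs may well use different headers entirely.

```python
import urllib.request


def describe_cdn_response(headers):
    """Summarise the cache-related headers of a single response."""
    lowered = {k.lower(): v for k, v in headers.items()}
    x_cache = lowered.get("x-cache", "").lower()
    if "hit" in x_cache:
        status = "hit"
    elif "miss" in x_cache:
        status = "miss"
    else:
        status = "unknown"
    age = lowered.get("age", "")
    return {
        "status": status,
        "age": int(age) if age.isdigit() else None,
        "cache_control": lowered.get("cache-control"),
    }


def fetch_headers(url):
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return dict(response.headers.items())


# Offline example using headers shaped like the ones described above.
print(describe_cdn_response({
    "X-Cache": "Hit from cloudfront",
    "Age": "42",
    "Cache-Control": "public, max-age=3600",
}))
```

Against a live site you'd call describe_cdn_response(fetch_headers(url)) a couple of times, a few seconds apart, with one of your own URLs: the first request is usually a miss, and after that you want hits with the age going up.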