Monday, January 2, 2012

Taking your web sites and apps offline with the HTML5 appcache

There’s a general (and understandable) belief by even many developers that web sites and web applications can only be used when the browser has a web connection. Indeed, this is routinely cited as one of the real advantages of “native” apps over web apps. But as unintuitive as it sounds, in almost every modern browser and device (except even for now IE10 developer previews, but here’s hoping that changes), that’s not the case, provided the developer does a little extra work to make their app or site persist when a browser is offline. (Of course the user must have visited your site while their browser did have a connection)
In this article, I hope to clear this whole areas up once and for all, show you how to do it, and point to some great resources out there for learning more about creating offline versions of your web sites and apps. Perhaps most importantly, introduce a simple new tool I’ve built to do the heavy lifting for you, ManifestR.
Even if you develop web sites, rather than applications, you can benefit from the techniques outlined here, because caching resources can seriously decrease the load time for your site, particularly on a visitors subsequent site visits.

Making a cache

As I’m sure you know, browsers cache HTML, CSS, JavaScript files, images and other resources of the sites you visit, to speed up the subsequent loading of pages. However, you never know when the browser might discard cached files, and so this is not a reliable way for sites to work offline. But what if we could tell the browser what to cache? Well, with HTML5 application caches (also known as applications caches or “appcaches”) we can do just that. Let’s look at how.

Making it manifest

The heart of the technique is to create an appcache manifest, a simple text file, which tells the browser what to cache (and also what not to). The resources are then cached in an “application cache”, or “appcache”, which is distinct from the cache a browser uses for its own purposes. The anatomy of an appcache manifest is straightforward, but there are a few subtleties.
An appcache manifest
  • begins with the string “CACHE MANIFEST” (this is required)
  • has a section, introduced by the string “CACHE:” which specifies the URLs of resources (either absolute, or relative to the location where the manifest file will be located on the server) to be cached.
  • We can also optionally specify which resources should not be cached, in a section of the manifest file introduced by the string “NETWORK:”. These resources aren’t just not cached, but further, won’t be used when the user is offline, even if the browser has cached them in its own caches.
  • We can also optionally specify fallback resources to be used when the user is not connected, in a section of the file called “FALLBACK:”
  • You can add comments to the file with, simply by beginning a line with “#”
It’s recommended that the extension for a manifest file is .appcache (previously, .manifest was the recommended extension).
Here is a very straightforward example
CACHE MANIFEST

CACHE:

#images
/images/image1.png
/images/image2.png

#pages
/pages/page1.html
/pages/page2.html

#CSS
/style/style.css

#scripts
/js/script.js

FALLBACK:
/ /offline.html

NETWORK:
signup.html
The CACHE section
In the CACHE section we list the resources we want cached. We can use either a URL relative to the .appcache file, or an absolute URL. We can cache resources both in the same domain as the cache, as well as (in most cases) other domains (we’ll cover this in more detail in a moment)
Often, the only section of an appcache manifest is this section, in which case, the CACHE: header may be omitted.
Be careful with what you cache. Once a resource, for example an HTML document is cached, the browser will continue to use this cached version, effectively forever, even if you change the file on the server. To ensure the browser updates the cache, you need to change the .appcache file. This can play havoc while you are developing a site, and we’ll cover some techniques for managing this in a moment.
One suggestion is to add a version, and or a date stamp to the manifest as a comment. This way, you can quickly change the date or version number, and then browsers will refresh the appcache. Browsers do this intelligently, checking to see which resources might have changed since they were last cached, and only re-​​caching those which have.
The Network Section
Probably the most subtle aspect of app caching is the NETWORK section. Because a great many web sites, and particularly applications, have dynamically generated content, pulled in from APIs, CGIs and so on, we may want to ensure certain resources aren’t cached, and are always directly loaded when the browser is online.
That’s where the NETWORK section of the cache comes in. Here, we list the resources we never want to be cached, which is referred to as an “online whitelist”. So, in our example above, we are specifying that the page signup.html (located at the root of our site — remember the entries in our manifest are either absolute or relative URLs) is never cached. When online, any request for this page will always cause the page to be loaded from the server (even if the browser might have previously cached it itself). When the user is offline, any request for this resource resuls in an error (again, even if the browser might have cached the resource in its own caches).
We can also specify a group of resources located within a site in the NETWORK section using a partial URL (technically a “prefix match pattern”) (note that we can’t use this technique in the CACHE section, where all resources must be explicitly listed to be cached in the appcache, with one exception we’ll get to shortly). Any resources which have URLs beginning with this pattern are included in the online whitelist, and never cached. As developers we then need to handle the cases where these resources aren’t available because the user is offline.
There’s also a special wildcard, *. The asterisk specifies that any resources that aren’t explicitly cached in the appcache manifest should not be cached.
Fallbacks
App caching also allow us to specify fallback resources. The form of an entry in the FALLBACK section is two resource identification patterns. The first (in the case above simply “/​”, which matches any resource in the site), specifies resources to be replaced with a fallback when the user is offline. The second specifies the resource to replace any resources matching the patter. So, in this case, when any resource in the site has not been cached, and the user is offline, the page offline.html will be used instead. We can also specify resources to be replaced more specifically. For example, we could specify an offline image for any images that haven’t been loaded like so
/images/ /images/missing.png
Here we’re specifying that any resources located in the directory images at the top level of our site that have not been cached, be replaced with the image called missing.png found in that same directory, when we are offline.

Using the appcache manifest

So now we’ve created our appcache manifest, we need to associate it with our HTML documents. We do this by adding the manifest attribute to the html element of a document, where the value of this attribute is the URL of the appcache file.
The current recommendation is the appcache file have the extension .appcache. So, if our manifest is located at the root of our site, we’d link to it like so
<html manifest='manifest.appcache'>
There are also suggestions that using the HTML5 doctype may be required for some browsers to use app cache, so, make sure you use the doctype
<!DOCTYPE html>
And we’re all set. Well, almost. In order for the browser to recognize the appcache file, it needs to be served with the mimetype text/cache-manifest. How you set this up depends on your site’s server. At the time of writing, it’s likely that this is a step you’ll need to take, so if caching isn’t working, that’s very likely why. One of the most common servers is Apache. There are two ways in which you can set up Apache to serve .appchace files as type text/​cache-​​manifest. At the root directory of your site, add a file with the name .htaccess, with the entry AddType text/cache-manifest .appcache (if there’s already a .htaccess file, just add this line to it).

Gotchas

App caching can be very powerful, allowing apps to work while the user is offline, and can increase site performance, but there are some definite gotchas it pays to be aware of. Here’s a few well worth knowing about.
Resources in style sheets
You’d be forgiven for thinking that any images in a style sheet that has been cached will be included in the appcache, but that’s not so. Images your style sheet refers to must be explicitly referenced in the CACHE section of the manifest as well.
Similarly, style sheets that are imported using @import, and resources included via JavaScript must also be explicitly cached.
To help build an appcache manifest, I’ve developed manifestR, which we’ll look at in detail in a moment. It will generate an appcache manifest for you, which includes all the scripts, style sheets, including those @imported, images, linked pages at the same site, and any images linked to in stylesheets.
There’s one exception to the rule that only explicitly listed resources are cached, and it is important to understand. Any HTML document that has a manifest attribute will be cached, even if it is not listed in the manifest. This can cause all kinds of headaches while developing, which we cover shortly.

Caching Cross Domain Resources

While there is some confusion on the issue, you can in general cache content from across different domains in an app cache. In fact, without this ability, the real world value of app caching would be limited, as content distributed via a CDN (content distribution networks like Akamai) could not be cached (even content served from a differently named server within the same domain couldn’t be cached). The exception to this is that when content is served over secure http (https), then the specification says all resources must come from the same origin. In an exception to this exception, Chrome in fact does not adhere to this part of the specification, and it has been argued that the single origin policy for https is too restrictive in the real world for app caching to be of genuine value.

Refreshing the cache

In effect, unlike most caching of web resources, appcaches do not expire. So, once the browser has cached a particular resource, it will continue to use that cached version, even if you change the resource on the server (for example by editing the contents of an HTML document). The exception to this is when a manifest file is edited. When the manifest is changed, the browser will recache all the resources listed in the manifest.
Caching can cause real headaches while developing a site or application, so it is recommended that during development, you avoid appcaching. One way of achieving this is to serve .appcache files with the wrong mimetype. This way, you can include the manifest attribute in your HTML elements, and serve the appcache file, just as you wold in production, but not have the effects of appcaching. Moving from development to production is as simple then as associating .appcache files with the right mimetype.
Resource hogging and lazy loading
To improve the performance of a site, you might be tempted to preload the entire site, by adding all the pages, images etc in it to the appcache manifest. And, in certain circumstances this might be desirable. It will however place considerable demands on your server, and use more bandwidth, as the first time a person visits your site, they will download more resources than they otherwise might have. Luckily, appcaches have an additional feature that can help here.
You might recall earlier that even when an HTML document with a link to an appcache manifest isn’t included in the manifest file, it will still be cached. The benefit of this is that rather than explicitly listing all the pages at your site in a manifest for them to be cached, each time someone visits a page that links to a manifest, it will then be cached.
If the primary motivation for using an appcache is to ensure your site or more likely app works offline, you’ll likely want to explicitly list the pages of the site, so that they’ll be available offline even if the user hasn’t visited them. If your primary motivation is increased performance, then let pages lazily cache when the user visits them, but cache scripts, CSS, and perhaps commonly used images.
Cache failure
An important, but subtle gotcha with appcaching is that if even one of the resources you include in your cache manifest is not available, then no resources will be cached. So, it is really important to ensure that any resource listed in your appcache manifest is available online. There’s a tool we discuss in a moment, the Cache Manifest Validator, to help ensure all those resources are online.
Size limits
While there specification places no limits on the size an appcache can be, different browsers, and different devices have different limits. Grinning Gecko reports that:
  • Safari desktop browser (Mac and Windows) have no limit
  • Mobile Safari has a 10MB limit
  • Chrome has a 5MB limit
  • Android browser has no limit to appcache size
  • Firefox desktop has unlimited appcache size
  • Opera’s appcache limit can be managed by the user, but has a default size of 50MB
User Permission
In Firefox, when the user first visits an appcached site, the browser asks the user’s permission (as it and other browsers do for location with the geo-​​location API). However, unlike with geo, other browsers don’t ask the user’s permission. Just something to be aware of, as there’ll be no guarantee with Firefox that appcaching is being used, even when supported.

Flakiness and browser support

It must also be noted that the general consensus is that appcaching is currently far from perfect across all browsers which support it. The specification is still in draft, but it should also be noted that most browsers have supported at least some appcaching for quite some time.
According to an amalgam of When can I use, Dive into HTML5 and other online resources:
  • Safari has supported offline web apps since version 4
  • Chrome has supported the feature since version 5
  • Mobile Safari has supported offline apps since iOS 2.1
  • Firefox has supported it since version 3.5
  • Opera has supported appcache since version 11
  • Internet Explorer as yet does not support offline web apps, including in IE10 developer previews
  • Android has support appcache since version 2.1

Introducing ManifestR

We’ve already mentioned that a particular challenge in creating an appcache is identifying all the resources you need to add to the manifest. To help you with this, I’ve developed ManifestR, an online tool to help you create an appcache manifest for any page. I don’t recommend you use it without at least a little additional fine tuning, as what it attempts to do is locate any resources referenced from a given page. As discussed above, depending on the purpose of your appcache, this is likely to be overkill.
Drag me to your bookmarks bar.
When you use ManifestR on a page, here’s what it looks for
  • images both in the same and other domains referenced in the src attribute of any img element in the page.
  • links to pages in the same domain. This can improve the performance of your site for visitors viewing other pages, and is vital if you want the entire site/​app to work offline, but means potentially considerable additional load on your server the first time someone visits the site. Whether you choose to keep this list, or remove some or all of the links is an important decision to make.
  • style sheets, linked, or included via @import statements, located both in your domain, or other domains
  • images linked to in any style sheet, both those in the same domain, or other domains
  • JavaScript files, both those in the same domain, and served from other domains. Here too, you’ll need to consider carefully which to include and which you want to add to the online whitelist via the NETWORK section of the manifest.
it then puts them all together in a manifest, ready for you to cut and paste, tweak, save and upload.
I hope you find it useful in building appcache manifests (and make sure you let me know via twitter what you think, and how we can improve it).

More reading

There’s quite a bit available online about app caching, though keep in mind the specification, and implementations are still somewhat in a state of flux. Here’s some articles and other online resources I have found very helpful —
Overviews and specifications
Tutorials and how-​​tos
Critiques and gotchas
as we know, all is not yet perfect in the world of the offline web just yet. Here are a couple of critiques of the current, and collections of gotchas discovered by offline pioneers.

Tools

We’ve already mentioned ManifestR, but you should find the The Cache Manifest Validator another really useful tool. Remember, for appcaching to work, every resource you list in your manifest must be available, or nothing will be cached. The Cache Manifest Validator can make sure all your resources are available.
Compatibility
Probably the best place to keep up to date with the ever changing field of HTML, CSS3 and other new web technology support in all modern browsers is When Can I Use?. You can find a snapshot of current browser support above.

No comments: