Thursday, August 4, 2011

Measuring the mobile web is hard

I believe strongly that you can't solve a problem until you can measure it. At Google, I've been charged with making the mobile web fast, so naturally, the first step is measuring mobile web performance across a wide range of devices, browsers, networks, and sites. As it turns out, the state of the art in mobile measurement is a complete mess. Different browsers report completely different timings for the same events. There is very little agreement on what metrics we should be optimizing for. Getting good timing out of a mobile device is harder than it should be, and there are many broken tools out there that report incorrect or even imaginary timings.

The desktop web optimization space is pretty complicated, of course, although there's a lot more experience in desktop than in mobile. It's also a lot easier to instrument a desktop web browser than a mobile phone running on a 3G network. Most mobile platforms are fairly closed and fail to expose basic performance metrics in a way that makes it easy for web developers to get at them. We currently resort to jailbreaking phones and running tcpdump and other debugging tools to uncover what is going on at the network and browser level. Clearly it would be better for everyone if this process were simpler.

When we talk about making the mobile web fast, what we are really trying to optimize for is some fuzzy notion of "information latency" from the device to the user. The concept of information latency will vary tremendously from site to site, and depend on what the user is trying to do. Someone trying to check a sports score or weather report only needs limited information from the page they are trying to visit. Someone making a restaurant reservation or buying an airline ticket will require a confirmation that the action was complete before they are satisfied. In most cases, users are going to care most about the "main content" of a page and not things like ads and auxiliary material.

If I were a UX person, I'd say we run a big user study and measure what human beings do while interacting with mobile web sites, using eye trackers, video recordings, instrumented phones -- the works. Unfortunately those techniques don't scale very well and we need something that can be automated.

It also doesn't help that there are (in my opinion) too many metrics out there, many of which have little to do with what matters to the user.

The HTTP Archive (HAR) format is used by a lot of (mostly desktop) measurement tools and is a fairly common interchange format. Steve Souders' httparchive.org site collects HAR files and has some nice tools for visualizing and aggregating them. The HAR spec defines two timing fields for a web page load: onLoad and onContentLoad. onLoad means the time when the "page is loaded (onLoad event fired)", but this has dubious value for capturing user-perceived latency. If you start digging around and trying to find out exactly what the JavaScript onLoad event actually means, you will be hard-pressed to find a definitive answer. The folklore is that onLoad is fired after all of the resources for a given page have been loaded, except that different browsers report this event at different times during the load and render cycle, and JavaScript and Flash can load additional resources after the onLoad event fires. So it's essentially an arbitrary, browser-specific measure of some point during the web page load cycle.

onContentLoad is defined in the HAR Spec as the time when the "Content of the page loaded ... Depeding [sic] on the browser, onContentLoad property represents DOMContentLoad [sic -- should be DOMContentLoaded] event or document.readyState == interactive." Roughly, this seems to correspond to the time when "just" the DOM for the page has been loaded. Normally you would expect this to happen before onLoad, but apparently in some sites and browsers it can happen after onLoad. So, it's hard to interpret what these two numbers actually mean.
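If you want to see the ordering for yourself on a particular page and browser, a small in-page probe like the following will log when each event fires. Note that it measures from the time the script starts executing, not from navigation start, since it doesn't rely on any timing API beyond Date.

```typescript
// In-page probe: log when DOMContentLoaded and load fire relative to when
// this script ran. Useful for spotting cases where the "content loaded"
// time lands surprisingly close to (or after) onLoad.
const t0 = Date.now();

document.addEventListener("DOMContentLoaded", () => {
  console.log(`DOMContentLoaded at ${Date.now() - t0} ms`);
  console.log(`readyState is now: ${document.readyState}`); // typically "interactive"
});

window.addEventListener("load", () => {
  console.log(`load (onLoad) at ${Date.now() - t0} ms`);
});
```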

The W3C Navigation Timing API goes a long way towards cleaning up this mess by exposing a bunch of events to JavaScript, including redirects, DNS lookups, load times, etc., and these times are fairly well-defined. While this API is supported by WebKit, many mobile browser platforms do not have it enabled, notably iOS (I hope this will be fixed in iOS 5; we will see). The HAR spec will need to be updated with these timings, and someone should carefully document how faithfully different browser platforms implement this API in order for it to be really useful.
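For browsers that do expose it, reading the timings is straightforward. Here's a rough sketch of the kind of breakdown you can get from window.performance.timing (all fields are epoch milliseconds, so differences give you durations):

```typescript
// Sketch: read a few W3C Navigation Timing fields after the page loads.
// Only works on browsers that expose window.performance.timing.
window.addEventListener("load", () => {
  const t = window.performance?.timing;
  if (!t) {
    console.log("Navigation Timing not available on this browser");
    return;
  }
  // Wait one tick so loadEventEnd has been filled in.
  setTimeout(() => {
    console.log("DNS lookup:   ", t.domainLookupEnd - t.domainLookupStart, "ms");
    console.log("TCP connect:  ", t.connectEnd - t.connectStart, "ms");
    console.log("Request/resp: ", t.responseEnd - t.requestStart, "ms");
    console.log("DOM complete: ", t.domComplete - t.navigationStart, "ms");
    console.log("Full load:    ", t.loadEventEnd - t.navigationStart, "ms");
  }, 0);
});
```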

The W3C Resource Timing API provides an expanded set of events for capturing individual resource timings on a page, which is essential for deep analysis. However, this API is still in the early design stages and there seems to be a lot of ongoing debate about how much information can and should be exposed through JavaScript, e.g., for privacy reasons.
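Just to give a flavor, here is a sketch of what collecting per-resource timings might look like, assuming an entry-list style interface (performance.getEntriesByType("resource")). Since the spec is still in flux, the exact names and shape may well change.

```typescript
// Sketch of per-resource timing collection, assuming an entry-list interface.
// Exact API shape is an assumption; the spec is still being debated.
window.addEventListener("load", () => {
  const entries = performance.getEntriesByType("resource") as PerformanceResourceTiming[];
  for (const r of entries) {
    // For cross-origin resources, most timing detail is hidden unless the
    // server opts in -- the privacy concern mentioned above.
    console.log(`${r.name}: ${r.duration.toFixed(1)} ms`);
  }
});
```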

A couple of other metrics depend less on the browser and more on empirical measures, which I tend to prefer.

Time to first byte generally means the time until the browser receives the first byte of the HTTP payload. For WebPageTest, this includes any redirects, so redirect time is folded into the measurement. Probably not that useful by itself, but perhaps in conjunction with other metrics. (And God bless Pat Meenan for carefully documenting the measures that WebPageTest reports -- you'd be surprised how often these things are hard to track down.)
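As a sanity check, you can approximate this in the browser from Navigation Timing: responseStart minus navigationStart, which (like WebPageTest's definition) folds redirect time in. A rough sketch:

```typescript
// Approximate time to first byte from Navigation Timing. Because
// navigationStart predates any redirects, redirect time is included,
// roughly matching WebPageTest's definition.
window.addEventListener("load", () => {
  const t = window.performance?.timing;
  if (!t) return;
  const ttfb = t.responseStart - t.navigationStart;
  const redirect = t.redirectEnd - t.redirectStart; // zero unless redirects were same-origin
  console.log(`Time to first byte: ${ttfb} ms (includes ~${redirect} ms of redirects)`);
});
```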

WebPageTest also reports time to first paint, which is the first time anything non-white appears in the browser window. This could be as little as a single pixel or a background image, so it's probably not that useful as a metric.

My current favorite metric is the above-the-fold render time, which reports the time for the first screen ("above the fold") of a website to finish rendering. This requires screenshots and image analysis to measure, but it's browser-independent and user-centric, so I like it. It's harder to measure than you would think, because of animations, reflow events, and so forth; see this nice technical presentation for how it's done. Video capture from mobile devices is pretty hard. Solutions like DeviceAnywhere involve hacking into the phone hardware to bring out the video signal, though my preference is for a high-frame-rate video camera in a calibrated environment (which happens to scale well across multiple devices).
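To give a flavor of the image-analysis side, here is a toy sketch of the core computation: given a sequence of above-the-fold screenshots (assumed already captured and decoded to raw RGBA), find the time at which the region stops differing from its final state. A real implementation has to mask out animations and other churn, as described in the presentation above; the Frame type and frame source here are assumptions for illustration.

```typescript
// Toy sketch of above-the-fold render time from a sequence of screenshots.
// Frames are assumed to be pre-captured and decoded to raw RGBA bytes.
interface Frame {
  timeMs: number;     // capture time relative to navigation start
  pixels: Uint8Array; // raw RGBA for the above-the-fold region
}

// Fraction of pixels whose RGB values differ between two equally sized frames.
function fractionChanged(a: Uint8Array, b: Uint8Array): number {
  let changed = 0;
  for (let i = 0; i < a.length; i += 4) {
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] || a[i + 2] !== b[i + 2]) changed++;
  }
  return changed / (a.length / 4);
}

function aboveTheFoldRenderTime(frames: Frame[], threshold = 0.001): number {
  if (frames.length === 0) throw new Error("no frames captured");
  const final = frames[frames.length - 1];
  // Walk backwards to find the last frame that still differs meaningfully
  // from the final rendered state; the next frame's time is the AFT time.
  for (let i = frames.length - 2; i >= 0; i--) {
    if (fractionChanged(frames[i].pixels, final.pixels) > threshold) {
      return frames[i + 1].timeMs;
    }
  }
  return frames[0].timeMs; // the region never changed after the first frame
}
```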

One of my team's goals is to provide a robust set of tools and best practices for measuring mobile websites that we can all agree on. In a future post I'll talk some more about the measurements we are taking at Google and some of the tools we are developing.
