Improving Drupal: Episodes integration

published on August 25, 2009

In this article, I explain what was required to integrate the Episodes page loading performance monitoring system with Drupal.
Episodes was written by Steve Souders, whom is well-known for his research on high performance web sites and has authored multiple books on this subject.


The work I am doing as part of bachelor thesis on improving Drupal’s page loading performance should be practical, not theoretical. It should have a real-world impact.

To ensure that that also happens, I wrote the Episodes module. This module integrates the Episodes framework for timing web pages (see the “Episodes” section in my “Page loading profiling tools” article) with Drupal on several levels — all without modifying Drupal core:

  • Automatically includes the necessary JavaScript files and settings on each appropriate page.
  • Automatically inserts the crucial initialization variables at the beginning of the head tag.
  • Automatically turns each behavior (in Drupal.behaviors) into its own episode.
  • Provides a centralized mechanism for lazy loading callbacks that perform the lazy loading of content. These are then also automatically measured.
  • For measuring the css, headerjs and footerjs episodes, you need to change a couple of lines in the page.tpl.php file of your theme. That is the only modification you have to make by hand. It is acceptable because a theme always must be tweaked for a given web site.
  • Provides basic reports with charts to make sense of the collected data.

I actually wrote two Drupal modules: the Episodes module and the Episodes Server module. The former is the actual integration and can be used without the latter. The latter can be installed on a separate Drupal web site or on the same. It provides basic reports. It is recommended to install this on a separate Drupal web site, and preferably even a separate web server, because it has to process a lot of data and is not optimized. That would have led me too far outside of the scope of this bachelor thesis.

You could also choose to not enable the Episodes Server module and use an external web service to generate reports, but for now, no such services yet exist. This void will probably be filled in the next few years by the business world. It might become the subject of my master thesis.

1. The goal

The goal is to measure the different episodes of loading a web page. Let me clarify that via a timeline, while referencing the HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
  <head>
    <title>Sample Drupal HTML</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <link rel="shortcut icon" href="/misc/favicon.ico" type="image/x-icon" />
    <link type="text/css" rel="stylesheet" media="all" href="main.css" />
    <link type="text/css" rel="stylesheet" media="print" href="more.css" />
    <script type="text/javascript" src="main.js"></script>
    <script type="text/javascript">
    <!--//--><![CDATA[//><!--
    jQuery.extend(Drupal.settings, { "basePath": "/drupal/", "more": true });
    //--><!]]>
    </script>
    <!--[if lt IE 7]>
      <link type="text/css" rel="stylesheet" media="all" href="fix-ie.css />
    <![endif]-->
  </head>
  <body>
    <!--
      lots
      of
      HTML
      here
    -->
    <script type="text/javascript" src="more.js"></script>
  </body>
</html>

The main measurement points are:

  • starttime: time of requesting the web page (when the onbeforeunload event fires, the time is stored in a cookie); not in the HTML file
  • firstbyte: time of arrival of the first byte of the HTML file (the JavaScript to measure this time should be as early in the HTML as possible for highest possible accuracy); line 1 of the HTML file
  • domready: when the entire HTML document is loaded, but just the HTML, not the referenced files
  • pageready: when the onload event fires, this happens when also all referenced files are loaded
  • totaltime: when everything, including lazily-loaded content, is loaded (i.e. pageready + the time to lazy-load content)

Which make for these basic episodes:

  • backend episode = firstbyte - starttime
  • frontend episode = pageready - firstbyte
  • domready episode = domready - firstbyte, this episode is contained within the frontend episode
  • totaltime episode = totaltime - starttime, this episode contains the backend and frontend episodes

These are just the basic time measurements and episodes. It is possible to also measure the time it took to load the CSS (lines 8-9, this would be the css episode) and JavaScript files in the header (line 10, this would be the headerjs episode) and in the footer (line 27, this would be the footerjs episode), for example. It is possible to measure just about anything you want.

For a visual example of all the above, see the attached figure “Episodes add-on for Firebug” in the earlier blog post about Episodes.

2. Making episodes.js reusable

The episodes.js file provided at the Episodes example is in fact just a rough sample implementation, an implementation that indicates what it should look like. It contained several hardcoded URLs, does not measure the sensible default episodes, contains a few bugs. In short, it is an excellent and solid start, but it needs some work to be truly reusable.

There also seems to be a bug in Episodes when used in Internet Explorer 8. It is actually a bug in Internet Explorer 8: near the end of the page loading sequence, Internet Explorer 8 seems to be randomly disabling the window.postMessage() JavaScript function, thereby causing JavaScript errors. After a while of searching cluelessly for the cause, I gave up and made Internet Explorer 8 also use the backwards-compatibility script (episodes-compat.js), which overrides the window.postMessage() method. The problem had vanished. This is not ideal, but at least it works reliably now.
Finally, there also was a bug in the referrer matching logic, or more specifically, it only worked reliably in Internet Explorer and intermittently worked in Firefox, due to the differences between browsers in cookie handling. Because of this bug, many backend episodes were not being measured, and now they are.

I improved episodes.js to make it reusable, so that I could integrate it with Drupal without adding Drupal-specific code to it. I made it so that all you have to do is something like this:

<head>

<!-- Initialize EPISODES. -->
<script type="text/javascript">
  var EPISODES = EPISODES || {};
  EPISODES.frontendStartTime = Number(new Date());
  EPISODES.compatScriptUrl = "lib/episodes-compat.js";
  EPISODES.logging = true;
  EPISODES.beaconUrl = "episodes/beacon";
</script>

<!-- Load episodes.js. -->
<script type="text/javascript" src="lib/episodes.js" />

<!-- Rest of head tag. -->
<!-- ... -->

</head>

This way, you can initialize the variables to the desired values without customizing episodes.js. Line 6 should be as early in the page as possible, because it is the most important reference time stamp.

3. Episodes module: integration with Drupal

Here is a brief overview with the highlights of what had to be done to integrate the Episodes framework with Drupal.

  • Implemented hook_install(), through which I set a module weight of -1000. This extremely low module weight ensures the hook implementations of this module are always executed before all others.
  • Implemented hook_init(), which is invoked at the end of the Drupal bootstrap process. Through this hook I automatically insert the JavaScript into the <head> tag that is necessary to make Episodes work (see the “Making episodes.js reusable” section). Thanks to the extremely low module weight, the JavaScript code it inserts is the first tag in the <head> tag.
  • Also through this same hook I add Drupal.episodes.js, which provides the actual integration with Drupal. It automatically creates an episode for each Drupal “behavior”. (A behavior is written in JavaScript and adds interactivity to the web page.) Each time new content is added to the page through AHAH, Drupal.attachBehaviors() is called and automatically attaches behaviors to new content, but not to existing content. Through Drupal.episodes.js, Drupal’s default Drupal.attachBehaviors() method is overridden — this is very easy in JavaScript. In this overridden version, each behavior is automatically measured as an episode.
    Thanks to Drupal’s existing abstraction and the override I have implemented, all JavaScript code can be measured through Episodes without hacking Drupal core.
    A simplified version of what it does can be seen here:
Drupal.attachBehaviors = function(context) {
  url = document.location;

  for (behavior in Drupal.behaviors) {
    window.postMessage("EPISODES:mark:" + behavior, url);
    Drupal.behaviors[behavior](context);
    window.postMessage("EPISODES:measure:" + behavior, url);
  }
};
  • Some of the Drupal behaviors are too meaningless to measure, so it would be nice to be able to mark some of the behaviors as ignored. That is also something I implemented. Basically I do this by locating every directory in which one or more *.js files exist, create a scan job for each of these and queue them in Drupal’s Batch API. Each of these jobs scans each *.js file, looking for behaviors. Every detected behavior is stored in the database and can be marked as ignored through a simple UI that uses the Hierarchical Select module.
  • For measuring the css and headerjs episodes, it is necessary to make a couple of simple (copy-and-paste) changes to the page.tpl.php of the Drupal theme(s) you are using. These changes are explained in the README.txt file that ships with the Episodes module. This is the only manual change to code that can be done — it is recommended, but not required.
  • And of course a configuration UI (see the attached figures: “Episodes module settings form” and “Episodes module behaviors settings form”) using the Forms API. It ensures the logging URL (this is the URL through which the collected data is logged to Apache’s log files) exists and is properly configured (i.e. returns a zero-byte file).

4. Episodes Server module: reports

Only basic reports are provided, highlighting the most important statistics and visualizing them through charts. Advanced/detailed reports are beyond the scope of this bachelor thesis, because they require extensive performance research (to be able to handle massive datasets), database indexing optimization and usability research.

4.1 Implementation

  • First of all, the Apache HTTP server is a requirement as this application’s logging component is used for generating the log files. Its logging component has been proven to be scalable, so there is no need to roll our own.
    The source of this idea lies with Jiffy (see the “Jiffy” section in my “Page loading profiling tools” article).
  • The user must make some changes to his httpd.conf configuration file for his Apache HTTP server. As just mentioned, my implementation is derived from Jiffy’s, yet every configuration line is different.
  • The ingestor parses the Apache log file and moves the data to the database. I was able to borrow a couple of regular expressions from Jiffy’s ingestor (which is written in Perl) but I completely rewrote it to obtain clean and simple code, conform the Drupal coding guidelines. It detects the browser, browser version and operating system from the User Agent that was logged with the help of the Browser.php library.
    Also, IPs are converted to country codes using the ip2country Drupal module.
    This is guaranteed to work thanks to the included meticulous unit tests.
  • For the reports, I used the Google Chart API. You can see an example result in the attached figures “Episodes Server module: overall analysis”, “Episodes Server module: page loading performance analysis” and “Episodes Server module: episodes analysis”. It is possible to compare the page loading performance of multiple countries by simply selecting as many countries as you would like in the “Filters” fieldset.
  • And of course again a configuration UI (see the attached figure: “Episodes Server module settings form”) using the Forms API. It ensures the log file exists and is accessible for reading.

4.2 Desired future features

Due to lack of time, the basic reports are … well … very basic. It would be nice to have more charts and to be able to filter the data of the charts. In particular, these three filters would be very useful:

  1. filter by timespan: all time, 1 year, 6 months, 1 month, 1 week, 1 day
  2. filter by browser and browser version
  3. filter by (parts of) the URL

5 Insights

  • Episodes module
    • Generating the back-end start time on the server can never work reliably because the clocks of the client (browser) and server are never perfectly in sync, which is required. Thus, I simply kept Steve Souders’ onbeforeunload method to log the time when a next page was requested. The major disadvantage of this method is that it is impossible to measure the backend episode for each page load: it is only possible to measure the backend episode when the user navigates through our site (more specifically, when the referrer is the same as the current domain).
    • Even just measuring the page execution time on the server cannot work because of this same reason. You can accurately measure this time, but you cannot relate it to the measurements in the browser. I implemented this using Drupal’s hook_boot() and hook_exit() hooks and came to this conclusion.
    • On the first page load, the onbeforeunload cookie is not yet set and therefor the backend episode cannot be calculated, which in turn prevents the pageready and totaltime episodes from being calculated. This is of course also a problem when cookies are disabled, because then the backend episode can never be calculated. There is no way around this until the day that browsers provide something like document.requestTime.
  • Episodes server module
    • Currently the same database as Drupal is being used. Is this scalable enough for analyzing the logs of web sites with millions of page views? No. Writing everything to a SQLite database would not be better. The real solution is to use a different server to run the Episodes Server module on or even an external web service. Better even is to log to your own server and then send the logs to an external web service. This way you stay in control of all your data! Because you still have your log data, you can switch to another external web service, thereby avoiding vendor lock-in. The main reason I opted for using the same database, is ease of development.
      Optimizing the profiling tool is not the goal of this bachelor thesis, optimizing page loading performance is. As I already mentioned before, writing an advanced profiling tool could be a master thesis on its own.

6. Feedback from Steve Souders

I explained Steve Souders what I wanted to achieve through this bachelor thesis and the initial work I had already done on integrating Episodes with Drupal. This is how his reply started:

Wow.

Wow, this is awesome.

So, at least he thinks that this was a worthwhile job, which suggests that it will probably be worthwhile/helpful for the Drupal community as well.

Unfortunately for me, Steve Souders is a very busy man, speaking at many web-related conferences, teaching at Stanford, writing books and working at Google. He did not manage to get back to the questions I asked him.


This is a republished part of my bachelor thesis text, with thanks to Hasselt University for allowing me to republish it. This is section eight in the full text.

Previously in this series:

Comments

Noah's picture

Noah

It looks like this is going to be a great tool! I like how it picks up all the Drupal.behaviors automatically, and you even gave us unit tests- nice… Regarding the ability to scale, I think your idea of using a external web-service would be good idea. Think mollom, but for profiling ;-) Thanks for all your hard work Wim.

Wim Leers's picture

Wim Leers

Yes, exactly, like Mollom, but for profiling. Now, if I only had the time … :) It’s a good candidate for a master thesis I think? I’ll see. You’ll definitely hear more from me on that topic! :)