Project: Bachelor thesis: File Conveyor

published on April 6, 2012

File Conveyor
Time range: March 2009 to May 2009
As a: 

School year: third bachelor year, second semester.

Also see a complete overview of technologies used.

Highlights: I received a massive 19/20 score, and the work was labeled as being at the level of a master thesis. I presented it at DrupalCon DC, FOSDEM and DrupalCon Paris. It helped me choose a master thesis topic, got to use my code and helped land me an internship at Facebook!

In this final bachelor year, Hasselt University Computer Science students get to build one big project on their own, to show what they’re capable of.

Since I had been getting deeper and deeper into WPO, particularly as applied to Drupal, I opted to do my bachelor thesis on “Improving Drupal’s page loading performance”1. CDNs were still relatively scarce back then: few sites used them, most of the somewhat affordable ones relied on either custom protocols or (S)FTP to get files onto them, and all Origin Pull CDNs were big names that only offered their services for big bucks.

So, I opted to build a daemon in C++/Qt to sync files from a web site’s web server (the “origin”) to one or more remote destinations (CDNs, static file servers, or any other remote destination, really). This way, one could switch CDN providers easily, avoiding expensive lock-in. The daemon would use inotify on Linux and FSEvents on OS X to be notified of file system events immediately, so that changed files could be synced, and thus served from the CDN instead of the origin, as soon as possible. Before files were transmitted, they could be processed first, e.g. to minify CSS/JS, losslessly optimize images or transcode videos. Anything, really!
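To make the idea concrete, here is a minimal Python sketch of that detect, process, sync loop. It is not File Conveyor's actual code. It makes three simplifying assumptions: change detection is done by comparing modification-time snapshots rather than through inotify or FSEvents, the "remote destination" is just another local directory, and files are treated as UTF-8 text. All names (`snapshot`, `diff`, `strip_blank_lines`, `sync`) are hypothetical.

```python
import os


def snapshot(root):
    """Map each file's path (relative to root) to its modification time."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            state[os.path.relpath(path, root)] = os.stat(path).st_mtime_ns
    return state


def diff(old, new):
    """Return (created_or_modified, deleted) relative paths between two snapshots."""
    changed = sorted(p for p, mtime in new.items() if old.get(p) != mtime)
    deleted = sorted(p for p in old if p not in new)
    return changed, deleted


def strip_blank_lines(text):
    """Toy stand-in for a real CSS/JS minifier: drop empty lines and indentation."""
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def sync(origin, destination, old_state, processors):
    """Process every changed file, write it to destination, and mirror deletions."""
    new_state = snapshot(origin)
    changed, deleted = diff(old_state, new_state)
    for rel in changed:
        with open(os.path.join(origin, rel), encoding="utf-8") as f:
            content = f.read()
        # Run the processor chain before the file leaves the origin.
        for process in processors:
            content = process(content)
        target = os.path.join(destination, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w", encoding="utf-8") as f:
            f.write(content)
    for rel in deleted:
        target = os.path.join(destination, rel)
        if os.path.exists(target):
            os.remove(target)
    return new_state, changed, deleted
```

Calling `sync(origin, destination, {}, [strip_blank_lines])` once performs an initial full sync. Feeding the returned state back into the next call transfers only what changed in between. An event-driven daemon replaces the repeated snapshot with notifications from the operating system, which is what makes near-immediate syncing possible.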

Later I changed my mind and decided to build it in Python instead, because the things it needed to do were network-bound, not CPU- or memory-bound, and because Python supposedly had nice abstractions readily available for everything. That turned out to be not quite true: there was no cross-platform “file system event notification” abstraction (only pyinotify existed; there was nothing for FSEvents on OS X, let alone the equivalent on Windows…), nor was there a nice “remote file system” abstraction (for FTP, SFTP, SSH, Amazon S3, Rackspace Cloud Files2, etc.). Eventually I did find something for the latter: django-storages, which meant I had to include and depend on parts of Django. Crazy? Yes. Necessary, given the very short timeframe? Yes!
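As an illustration of what a “remote file system” abstraction buys you, here is a small Python sketch of a common interface with interchangeable backends. It is an assumption-laden example, not the django-storages API and not File Conveyor's own code: the `Transporter` interface, the `DirectoryTransporter` backend, `sync_to_all` and the example.com URLs are all hypothetical. A real backend would speak FTP, SFTP or an object storage API instead of copying into a directory.

```python
import os
import shutil
from abc import ABC, abstractmethod


class Transporter(ABC):
    """Hypothetical common interface for any remote destination."""

    @abstractmethod
    def put(self, local_path, remote_name):
        """Upload local_path and return the URL it is now reachable at."""

    @abstractmethod
    def delete(self, remote_name):
        """Remove remote_name from the destination if it exists."""


class DirectoryTransporter(Transporter):
    """Stand-in backend that 'uploads' into a directory, e.g. a mounted static file server."""

    def __init__(self, root, base_url):
        self.root = root
        self.base_url = base_url.rstrip("/")

    def put(self, local_path, remote_name):
        target = os.path.join(self.root, remote_name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(local_path, target)
        return f"{self.base_url}/{remote_name}"

    def delete(self, remote_name):
        target = os.path.join(self.root, remote_name)
        if os.path.exists(target):
            os.remove(target)


def sync_to_all(local_path, remote_name, transporters):
    """Send one file to every configured destination and collect the resulting URLs."""
    return [t.put(local_path, remote_name) for t in transporters]
```

Because the calling code only ever sees `put` and `delete`, switching CDN providers comes down to swapping one backend object for another, which is the lock-in avoidance described earlier.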

Once I had that up and running, I created the Drupal CDN integration module, which made it easy to integrate a Drupal site with the Python daemon described above. That part was fairly straightforward.

But then I wanted to be able to prove that I was actually having a positive performance impact. So I used WPO guru Steve Souders’ (back then he still worked at Yahoo!) Episodes library to do Real User Monitoring: it’s a JavaScript library that can measure how long the various episodes of the page loading process take. You can measure anything you want. So I created deep integration of that with Drupal as well: Drupal has this concept of “Drupal behaviors”, which are self-contained pieces of JavaScript logic that add “behaviors” to HTML. With this integration, each Drupal behavior also received a corresponding episode, allowing you to track the performance of each Drupal behavior separately!

If you’ve read this far, you’ll probably also want to read more about it.

  1. I called it “page loading performance” because the term “WPO” (Web Performance Optimization) hadn’t been coined yet. ↩︎

  2. Then still called Mosso, before it was acquired by Rackspace. ↩︎