File Conveyor: design

published on February 15, 2010

In this extensive article, I explain the architecture of the “File Conveyor” daemon that I wrote to detect files immediately (through the file system event monitors of each OS, e.g. inotify on Linux), process them (e.g. recompress images, compress CSS/JS files, transcode videos …) and finally, sync them (FTP, Amazon S3, Amazon CloudFront and Rackspace CloudFiles are supported).



So now that we have the tools to accurately (or at least representatively) measure the effects of using a CDN, we still have to start actually using one. Next, we will examine how a web site can take advantage of a CDN.

As explained in “Key Properties of a CDN”, there are two very different methods for populating CDNs. Supporting pull is easy; supporting push is a lot of work. But if we want to avoid vendor lock-in, it is necessary to be able to transparently switch between pull and any of the transfer protocols for push. Suppose that you are using CDN A, which only supports FTP. When you want to switch to a cheaper, yet better CDN B, that would be a costly operation, because CDN B only supports a custom protocol.

To further reduce costs, it is necessary that we can do the preprocessing ourselves (be that video transcoding, image optimization or anything else). Note that many CDNs do not support processing of files, yet processing can significantly reduce the amount of bandwidth consumed, and thereby the bill received every month.

That is why the meat of this thesis is about a daemon that makes it just as easy to use either push or pull CDNs and that gives you full flexibility in what kind of preprocessing you would like to perform. All you will have to do to integrate your web site with a CDN is:

  1. install the daemon
  2. tell it what to do by filling out a simple configuration file
  3. start the daemon
  4. retrieve the URLs of the synced files from an SQLite database, so that you can replace the existing file URLs with their CDN counterparts (a hypothetical lookup is sketched right after this list)
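
For step 4, the lookup could be as simple as the following sketch. Note that the table and column names here are assumptions made purely for illustration; consult File Conveyor’s documentation for the actual schema of the synced files database.

import sqlite3

# Hypothetical schema: a "synced_files" table with "input_file" and "url"
# columns. The real schema used by File Conveyor may differ.
db = sqlite3.connect("/path/to/synced_files.db")
cursor = db.execute(
    "SELECT url FROM synced_files WHERE input_file = ?",
    ("/htdocs/drupal/misc/drupal.js",)
)
row = cursor.fetchone()
cdn_url = row[0] if row is not None else None  # fall back to the original URL if not synced yet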

Goals

As said before, the ability to use either push or pull CDNs is an absolute necessity, as is the ability to process files before they are synced to the CDN. However, there is more to it than just that, so here is a full list of goals.

  • Easy to use: the configuration file is the interface and should explain itself just by its structure
  • Transparency: the transfer protocol(s) supported by the CDN should be irrelevant
  • Mixing CDNs and static file servers
  • Processing before sync: image optimization, video transcoding …
  • Detect (and sync) new files instantly: through inotify on Linux, FSEvents on Mac OS X and the FindFirstChangeNotification API or ReadDirectoryChanges API on Windows (there is also the FileSystemWatcher class for .NET)
  • Robustness: when the daemon is stopped (or when it crashes), it should know where it left off and sync all files that it was still syncing, as well as all files that have been added, modified or deleted while it was not running
  • Scalable: syncing 1,000 or 1,000,000 files — and keeping them synced — should work just as well
  • Unit testing wherever feasible
  • Design for reuse wherever possible
  • Low resource consumption (except for processors, which may be very demanding because of their nature)
  • No dependencies other than Python (but processors can have additional dependencies)
  • All the logic of the daemon should be contained in a single module, to allow for quick refactoring.

A couple of these goals need more explaining.

The transparency goal should speak for itself, but you may not yet have realized its impact. This is what will avoid high CDN provider switching costs, that is, it helps to avoid vendor lock-in.

Detecting and syncing files instantly is a must to ensure CDN usage is as high as possible. If new files were only detected every 10 minutes, visitors might be downloading files directly from the web server instead of from the CDN. This increases the load on the web server unnecessarily and also increases the page load time for the visitors.
For example, suppose a visitor has uploaded images as part of the content he created. Until those images are detected and synced, all visitors will be downloading them from the web server, which is suboptimal, considering that they could have been downloading them from the CDN.

The ability to mix CDNs and static file servers makes it possible to either maximize the page loading performance or minimize the costs. Depending on your company’s customer base, you may want to pay for either a global CDN or a local one. If you are a global company, a global CDN makes sense. But if you are present only in a couple of countries, say the U.S.A., Japan and France, it does not make sense to pay for a global CDN. It is probably cheaper to pay for a North-American CDN and a couple of strategically placed static file servers in Japan and France to cover the rest of your customer base. Without this daemon, this is rather hard to set up. With it, however, it becomes child’s play: all you have to do is configure multiple destinations. That is all there is to it. It is then still up to you how you use these files, though. To decide from which server you will let your visitors download the files, you could look at their IP address or, if your visitors must register, at the country they have entered in their profile. This also allows for event-driven server allocation. For example, if a big event is being hosted in Paris, you could temporarily hire another server in Paris to ensure low latency and high throughput.

Other use cases

The daemon, or at least one or more of the modules that were written for it, can be reused in other applications. For example:

  • Back-up tool
  • Video transcoding server (e.g. to transcode videos uploaded by visitors to H.264 or Flash video)
  • Key component in creating your own CDN
  • Key component in a file synchronization tool for consumers

2. Configuration file design

Since the configuration file is the interface and I had a good idea of the features I wanted to support, I started by writing a configuration file. That might be unorthodox, but in the end, this is the most important part of the daemon. If it is too hard to configure, nobody will use it. If it is easy to use, more people will be inclined to give it a try.

Judge for yourself how easy it is by looking at the listing below. Beneath the config root node, there are 3 child nodes, one for each of the 3 major sections:

<?xml version="1.0" encoding="UTF-8"?>
<config>
  <!-- Sources -->
  <sources ignoredDirs="CVS:.svn">
    <source name="drupal" scanPath="/htdocs/drupal" documentRoot="/htdocs" basePath="/drupal/" />
    <source name="downloads" scanPath="/Users/wimleers/Downloads" />
  </sources>

  <!-- Servers -->
  <servers>
    <server name="origin pull cdn" transporter="symlink_or_copy">
      <location>/htdocs/drupal/staticfiles</location>
      <url>http://mydomain.mycdn.com/staticfiles</url>
    </server>
    <server name="ftp push cdn" transporter="ftp" maxConnections="5">
      <host>localhost</host>
      <username>daemontest</username>
      <password>daemontest</password>
      <url>http://localhost/daemontest/</url>
    </server>
  </servers>

  <!-- Rules -->
  <rules>
    <rule for="drupal" label="CSS, JS, images and Flash">
      <filter>
        <paths>modules:misc</paths>
        <extensions>ico:js:css:gif:png:jpg:jpeg:svg:swf</extensions>
      </filter>
      <processorChain>
        <processor name="image_optimizer.KeepFilename" />
        <processor name="yui_compressor.YUICompressor" />
        <processor name="link_updater.CSSURLUpdater" />
        <processor name="unique_filename.Mtime" />
      </processorChain>
      <destinations>
        <destination server="origin pull cdn" />
        <destination server="ftp push cdn" path="static" />
      </destinations>
    </rule>

    <rule for="drupal" label="Videos">
      <filter>
        <paths>modules:misc</paths>
        <extensions>flv:mov:avi:wmv</extensions>
        <ignoredDirs>CVS:.svn</ignoredDirs>
        <size conditionType="minimum">1000000</size>
      </filter>
      <processorChain>
        <processor name="unique_filename.MD5" />
      </processorChain>
      <destinations>
        <destination server="ftp push cdn" path="videos" />
      </destinations>
    </rule>

    <rule for="downloads" label="Mirror">
      <filter>
        <extensions>mov:avi</extensions>
      </filter>
      <destinations>
        <destination server="origin pull cdn" path="mirror" />
        <destination server="ftp push cdn" path="mirror" />
      </destinations>
    </rule>
  </rules>
</config>

  1. sources: indicate each data source in which new, modified and deleted files will be detected recursively. Each source has a name (that we will reference later in the configuration file) and of course a scanPath, which defines the root directory within which new/modified/deleted files will be detected. It can also optionally have the documentRoot and basePath attributes, which may be necessary for some processors that perform magic with URLs. sources itself also has an optional ignoredDirs attribute, which will subsequently be applied to all filter nodes. While optional, it prevents needless duplication of ignoredDirs nodes inside filter nodes.
  2. servers: provide the settings for all servers that will be used in this configuration. Each server has a name and a transporter that it should use. The child nodes of the server node are the settings that are passed to that transporter.
  3. rules: this is the heart of the configuration file, since this is what determines what goes where. Each rule is associated with a source (via the for attribute), must have a label attribute and may consist of up to three parts:
    1. filter: can contain paths, extensions, ignoredDirs, pattern and size child nodes. The text values of these nodes will be used to filter the files that have been created, modified or deleted within the source to which this rule applies. If it is a match, then the rule will be applied (and therefore the processor chain and destinations associated with it). Otherwise, this rule is ignored for that file. See the filter module explanation for details.
    2. processorChain: accepts any number of processor nodes through which you reference (via the name attribute) the processor module and the specific processor class within that processor module that you would like to use. They will be chained in the order you specify here.
    3. destinations: accepts any number of destination nodes through which you specify all servers to which the file should be transported. Each destination node must have a server attribute and can have a path attribute. The path attribute sets a parent path (on the server) inside which the files will be transported.

Reading the above explanation probably made less sense than simply reading the configuration file itself. If that is the case for you too, then I succeeded.

3. Python modules

All modules have been written with reusability in mind: none of them make assumptions about the daemon itself, so they are reusable in other Python applications.

3.1. filter.py

This module provides the Filter class. Through this class, you can check if a given file path matches a set of conditions. This class is used to determine which processors should be applied to a given file and to which CDN it should be synced.

This class has just 2 methods: set_conditions() and matches(). There are 5 different conditions you can set. The last two should be used with care, because they are a lot slower than the first three. Especially the last one can be very slow, because it must access the file system.
If there are several valid options within a single condition, a match with any of them is sufficient (OR). Finally, all conditions must be satisfied (AND) before a given file path will result in a positive match.
The five conditions that can be set (as soon as one or more conditions are set, Filter will work) are:

  1. paths: a list of paths (separated by colons) in which the file can reside
  2. extensions: a list of extensions (separated by colons) the file can have
  3. ignoredDirs: a list of directories (separated by colons) that should be ignored, meaning that if the file is inside one of those directories, Filter will mark this as a negative match — this is useful to ignore data in typical CVS and .svn directories
  4. pattern: a regular expression the file path must match
  5. size
    1. conditionType: either minimum or maximum
    2. threshold: the threshold in bytes

This module is fully unit-tested and is therefore guaranteed to work flawlessly.
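
For illustration, here is a minimal sketch of how Filter might be used, assuming set_conditions() accepts a dictionary keyed by the condition names listed above (the exact format may differ slightly):

from filter import Filter

f = Filter()
f.set_conditions({
    "paths":       "modules:misc",
    "extensions":  "ico:js:css:gif:png:jpg:jpeg:svg:swf",
    "ignoredDirs": "CVS:.svn",
})

print(f.matches("/htdocs/drupal/misc/drupal.js"))     # match: path and extension are OK
print(f.matches("/htdocs/drupal/misc/.svn/entries"))  # no match: inside an ignored directory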

3.2. pathscanner.py

As is to be expected, this module provides the PathScanner class, which scans paths and stores them in an SQLite database. You can use PathScanner to detect changes in a directory structure. For efficiency, only creations, deletions and modifications are detected, not moves. This class is used to scan the file system for changes when no supported file system monitor is available on the current operating system. It is also used for persistent storage: when the daemon has been stopped, the database built and maintained by this class is used as a reference to detect changes that happened before it was started again. This means PathScanner is used during the initialization of the daemon, regardless of the available file system monitors.

The database schema is very simple: (path, filename, mtime). Directories are also stored; in that case, path is the path of the parent directory, filename is the directory name and mtime is set to -1. Modified files are detected by comparing the current mtime with the value stored in the mtime column.

Changes to the database are committed in batches, because changes in the file system typically occur in batches as well. Changes are committed to the database per directory. However, if many changes occurred in a single directory and every change were committed separately, the concurrency level would rise unnecessarily. By default, changes inside a directory are therefore committed in batches of 50.

This class provides you with the following methods:

  • initial_scan() to build the initial database — works recursively
  • scan() to get the changes — does not work recursively
  • scan_tree() (uses scan()) to get the changes in an entire directory structure — obviously works recursively
  • purge_path() to purge all the metadata for a path from the database
  • add_files(), update_files(), remove_files() to add/update/remove files manually (useful when your application has more/faster knowledge of changes)

Special care had to be taken to not scan directory trees below directories that are in fact symbolic links. This design decision was made to mimic the behavior of file system monitors, which are incapable of following symbolic links.

This module does not have any tests yet, because it requires a lot of mock functions to simulate system calls. It has been tested thoroughly by hand, though.
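
A short sketch of how PathScanner might be driven; the constructor arguments and the exact structure of the scan results are assumptions based on the description above:

import sqlite3
from pathscanner import PathScanner

db = sqlite3.connect("fileconveyor.db")
scanner = PathScanner(db, table="pathscanner")

# One-time initial scan: builds the (path, filename, mtime) records recursively.
scanner.initial_scan("/htdocs/drupal")

# Later (e.g. after a restart): detect everything that changed in the meantime.
for path, result in scanner.scan_tree("/htdocs/drupal"):
    print("changes in %s: %s" % (path, result))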

3.3. fsmonitor.py

This time around, there is more to it than it seems. fsmonitor.py provides FSMonitor, a base class from which subclasses derive: fsmonitor_inotify.py has the FSMonitorInotify class (inotify on Linux), fsmonitor_fsevents.py has FSMonitorFSEvents (FSEvents on Mac OS X) and fsmonitor_polling.py has FSMonitorPolling (a polling fallback, based on PathScanner, for operating systems without a supported file system monitor).
Put these together and you have a single, easy-to-use abstraction for each major operating system’s file system monitor.

Windows support is possible, but has not been implemented yet due to time constraints. There are two APIs to choose between: FindFirstChangeNotification and ReadDirectoryChanges. There is a third, the FileSystemWatcher class, but this is only usable from within .NET and Visual C++, so it is an unlikely option because it is not directly accessible from within Python. This was already mentioned in the goals.

The ReadDirectoryChanges API is more similar to inotify in that it triggers events on the file level. The disadvantage is that this is a blocking API. FindFirstChangeNotification is a non-blocking API, but is more similar to FSEvents, in that it triggers events on the directory level. A comprehensive, yet concise comparison is available at “Watch a Directory for Changes”.

Implementation obstacles

To make this class work consistently, less critical features that are only available for specific file system monitors are abstracted away. And other features are emulated. It comes down to the fact that FSMonitor’s API is very simple to use and only supports 5 different events: CREATED, MODIFIED, DELETED, MONITORED_DIR_MOVED and DROPPED_EVENTS. The last 2 events are only triggered for inotify and FSEvents.
A persistent mode is also supported, in which all metadata is stored in a database. This allows you to even track changes when your program was not running.

As you can see, only 3 “real” events of interest are supported: the most common ones. This is because not every API supports all features of the other APIs.
inotify is the most complete in this regard: it supports a boatload of different events, on a file-level. But it only provides realtime events: it does not maintain a complete history of events. And that is understandable: it is impossible to maintain a history of every file system event. Over time, there would not be any space left to store actual data.
FSEvents, on the other hand, works on the directory level, so you have to maintain your own directory tree state to detect created, modified and deleted files. It only triggers “a change has occurred in this directory” events. This is why, for example, there is no “file moved” event: it would be too resource-intensive to detect, or at the very least it would not scale. On the plus side, FSEvents maintains a complete history of events. PathScanner’s scan() method is used to detect the changes to the files in each directory that was changed.

Implementations that do not support file-level events (FSEvents and polling) are persistent by design. Because the directory tree state must be maintained to be able to trigger the correct events, PathScanner (see the pathscanner.py section) is used for storage. They use PathScanner’s (add|update|remove)_files() functions to keep the database up-to-date. And because the entire directory tree state is always stored, you can compare the current directory tree state with the stored one to detect changes, either as they occur (by being notified of changes on the directory level) or as they have occurred (by scanning the directory tree manually).

For the persistent mode, we could take advantage of FSEvents’ ability to look back in time. However, due to time constraints, the same approach is used for every implementation: a manual scanning procedure — using PathScanner — is started after the file system monitor starts listening for new events on a given path. That way, no new events are missed. This works just as well as using FSEvents’ special support for this; it is just slower, but it is sufficient for now.
To implement this, one would simply need to get the highest mtime (modification time of a file) stored in the database, and then ask FSEvents to send the events from that starting time to our callback function, instead of starting from the current time.
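
To illustrate the abstraction, here is a hypothetical sketch of how an FSMonitor subclass might be used. The helper that picks the right subclass for the platform, the method names and the way the event constants are passed are assumptions, not necessarily File Conveyor’s exact API:

from fsmonitor import FSMonitor, get_fsmonitor  # get_fsmonitor() is an assumed helper

def callback(monitored_path, event_path, event):
    # React to the three "real" events described above.
    if event == FSMonitor.CREATED:
        print("created: %s" % event_path)
    elif event == FSMonitor.MODIFIED:
        print("modified: %s" % event_path)
    elif event == FSMonitor.DELETED:
        print("deleted: %s" % event_path)

fsmonitor_class = get_fsmonitor()   # picks FSMonitorInotify, FSMonitorFSEvents or FSMonitorPolling
monitor = fsmonitor_class(callback, persistent=True)
monitor.start()                     # the monitor runs in its own thread
monitor.add_dir("/htdocs/drupal")   # watch this directory (and everything below it)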

3.4. persistent_queue.py and persistent_list.py

In order to provide data persistence, I wrote a PersistentQueue class and a PersistentList class. As their names indicate, these provide you with a persistent queue and a persistent list. They again use an SQLite database for persistent storage. For each instance you create, you must choose the table name and you can optionally choose which database file to write to. This allows you to group persistent data structures in a logical manner (i.e. related persistent data structures can be stored in the same database file, thereby also making it portable and easy to back up).

To prevent excessive file system access due to an overreliance on SQLite, I also added in-memory caching. To ensure low resource consumption, only the first X items in PersistentQueue are cached in-memory (a minimum and maximum threshold can be configured), but for PersistentList there is no such restriction: it is cached in-memory in its entirety. PersistentList is not designed for large datasets, but PersistentQueue is.

PersistentQueue is used to store the events that have been triggered by changes in the file system via fsmonitor.py: (input_file, event) pairs are stored in a persistent queue. Because the backlog (especially the one after the initial scan) can become very large (imagine 1 million files needing to be synced), this was a necessity.
PersistentList is used to store the files that are currently being processed and the list of files that have failed to sync. These must also be stored persistently, because if the daemon is interrupted while still syncing files, these lists will not be empty, and data could be lost. Because they are persistent, the daemon can add the files in these lists to the queue of files to be synced again, and the files will be synced, as if the daemon was never interrupted.
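
A sketch of how these data structures might be used together, mirroring the peek-then-remove pattern described later in the arbitrator section; the table and file names are examples and the method names are assumptions:

from persistent_queue import PersistentQueue
from persistent_list import PersistentList

# Related persistent data structures can share one database file.
pipeline_queue    = PersistentQueue("pipeline_queue", "persistent_data.db")
files_in_pipeline = PersistentList("pipeline_list", "persistent_data.db")

# A file system event enters the pipeline as an (input_file, event) pair.
pipeline_queue.put(("/htdocs/drupal/misc/drupal.js", "created"))

item = pipeline_queue.peek()     # look at the next item without removing it
files_in_pipeline.append(item)   # remember that this file is now in the pipeline
pipeline_queue.get()             # only now is it safe to remove it from the queue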

The first question to arise is: “Why use SQLite instead of Python’s built-in shelve module?” Well, the answer is simple: aside from the benefit of having all persistent data in a single file, it must also scale to tens of thousands or even millions of files. shelve is not scalable because its data is loaded into memory in its entirety. This could easily result in hundreds of megabytes of memory usage. Such excessive memory usage should be avoided at all costs when the target environment is a (web) server.

Your next question would probably be: “How can you be sure the SQLite database will not get corrupt?” The answer is: we can not. But the same applies to Python’s shelve module. However, the aforementioned advantages of SQLite give plenty of reasons to choose SQLite over shelve. Plus, SQLite is thoroughly tested, even against corruption. It is also used for very large datasets (it works well for multi-gigabyte databases but is not designed for terabyte-scale databases — see “Appropriate Uses For SQLite”) and by countless companies, amongst which Adobe, Apple, Google, Microsoft and Sun (see “Well-Known Users of SQLite”). So it is the best bet you can make.

Finally, you would probably ask “Why not use MySQL or PostgreSQL or …?”. Again the answer is brief: because SQLite requires no additional setup since it is serverless, as opposed to MySQL and PostgreSQL.

Both modules are fully unit-tested and are therefore guaranteed to work flawlessly.

3.5. Processors

processor.py

This module provides several classes: Processor, ProcessorChain and ProcessorChainFactory. Processor is a base class for processors, which are designed to be easy to write yourself. Processors receive an input file, do something with it and return the output file. The Processor class takes a lot of the small, annoying tasks on its shoulders, such as checking whether the file is actually a file this processor can process, calculating the default output file, checking whether the processor (and therefore the entire chain of processors) should be run per server or not, and providing a simple abstraction around an otherwise multi-line construction to run a command.
Upon completion, a callback will be called. Another callback is called in case of an error.
An example can be found in the code listed below. For details, please consult File Conveyor’s documentation.

  class YUICompressor(Processor):
      """compresses .css and .js files with YUI Compressor"""


      valid_extensions = (".css", ".js")


      def run(self):
          # We do not rename the file, so we can use the default output file.

          # Remove the output file if it already exists, otherwise YUI
          # Compressor will fail.
          if os.path.exists(self.output_file):
              os.remove(self.output_file)

          # Run YUI Compressor on the file.
          yuicompressor_path = os.path.join(self.processors_path, "yuicompressor.jar")
          args = (yuicompressor_path, self.input_file, self.output_file)
          (stdout, stderr) = self.run_command("java -jar %s %s -o %s" % args)

          # Raise an exception if an error occurred.
          if not stderr == "":
              raise ProcessorError(stderr)

          return self.output_file

Processors are allowed to make any change they want to the file’s contents and are executed before a file is synced, most often to reduce the size of the file and thereby to decrease the page loading time.
They are also allowed to change the base name of the input file, but they are not allowed to change its path. This measure was taken to limit the amount of data that needs to be stored in the database of synced files to know where exactly each file is stored (see the “Putting it all together: arbitrator.py” section). This is enforced by convention, because in Python it is impossible to truly enforce anything. If you do change the path, the file will sync just fine, but it will be impossible to delete the old version of a modified file, unless it results in the exact same path and base name each time it runs through the processor chain.

The Processor class accepts a parent logger which subclasses can optionally use to perform logging.

Then there is the ProcessorChain class, which receives a list of processors and then runs them as a chain: the output file of processor one is the input file of processor two, and so on. ProcessorChains run in their own threads. ProcessorChain also supports logging and accepts a parent logger.

There are two special exceptions Processor subclasses can throw:

  1. RequestToRequeueException: when raised, the ProcessorChain will stop processing this file and will pretend the processing failed. This effectively means that the file will be reprocessed later. A sample use case is the CSSURLUpdater class (described later on in this section), in which the URLs of a CSS file must be updated to point to the corresponding URLs of the files on the CDN. But if not all of these files have been synced already, that is impossible. So it must be retried later.
  2. DocumentRootAndBasePathRequiredException: when raised, the ProcessorChain will stop applying the processor that raised this exception to this file, because the source to which this file belongs does not have the documentRoot and basePath attributes set, and therefore the processor cannot do its work.

Finally, there is ProcessorChainFactory: this is simply a factory that generates ProcessorChain objects, with some parameters already filled out.
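
As a purely hypothetical illustration of RequestToRequeueException (the first of the two exceptions above), a processor that depends on other files having been synced already might look like the sketch below; the helper functions are made up for the example:

class HypotheticalURLRewriter(Processor):
    """Hypothetical processor that needs the synced URLs of other files."""

    valid_extensions = (".css",)

    def run(self):
        # find_referenced_files() and lookup_synced_url() are made-up helpers.
        for referenced_file in find_referenced_files(self.input_file):
            if lookup_synced_url(referenced_file) is None:
                # The referenced file has not been synced yet: give up for now
                # and ask the ProcessorChain to retry this file later.
                raise RequestToRequeueException("%s has not been synced yet" % referenced_file)
        return self.output_file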

filename.py

This processor module provides two processor classes: SpacesToUnderscores and SpacesToDashes. They respectively replace spaces with underscores and spaces with dashes in the base name of the file.

This one is not very useful, but it is a good simple example.

unique_filename.py

Also in this processor module, two processor classes are provided: Mtime and MD5. Mtime appends the mtime (last modification time) as a UNIX time stamp to the file’s base name (preceded by an underscore). MD5 does the same, but instead of the mtime, it appends the MD5 hash of the file to the file’s base name.

These processors are useful if you want to ensure that files have unique filenames, so that they can be given far-future Expires headers (see “Improving Drupal’s page loading performance”).

image_optimizer.py

This processor module is inspired by smush.it. It optimizes images losslessly, i.e. it reduces the filesize without touching the quality. The research necessary was not performed by me, but by Stoyan Stefanov, a Yahoo! web developer working for the Exceptional Performance team, and was thoroughly laid out in a series of blog posts at the Yahoo! user interface blog:

  1. “Image Optimization Part 1: The Importance of Images”
  2. “Image Optimization Part 2: Selecting the Right File Format”
  3. “Image Optimization Part 3: Four Steps to File Size Reduction”
  4. “Image Optimization Part 4: Progressive JPEG … Hot or Not?”

For GIF files, a conversion to PNG8 is performed using ImageMagick’s convert. PNG8 offers lossless image quality, as does GIF, but results in a smaller file size. PNG8 is also supported in all browsers, including IE6. The alpha channels of true color PNG (PNG24 & PNG32) are not supported in IE6.

PNG files are stored in so-called “chunks” and not all of these are required to display the image — in fact, most of them are not used at all. pngcrush is used to strip all the unneeded chunks. pngcrush is also applied to the PNG8 files that are generated by the previous step. I decided not to use the brute force method, which tries over a hundred different methods for optimization, but just the 10 most common ones. The brute force method would result in 30 seconds of processing versus less than a second otherwise.

JPEG files can be optimized in three complementary ways: stripping metadata, optimizing the Huffman tables and making them progressive. There are two variations to store a JPEG file: baseline and progressive. A baseline JPEG file is stored as one top-to-bottom scan, whereas a progressive JPEG file is stored as a series of scans, with each scan gradually improving the quality of the overall image. Stoyan Stefanov’s tests have pointed out that there is a 75% chance that the JPEG file is best saved as baseline when it is smaller than 10 KB. For JPEG files larger than 10 KB, it is 94% likely that progressive JPEG will result in a better compression ratio. That is why the third optimization (making JPEG files progressive) is only applied when the file is larger than 10 KB. All these optimizations are applied using jpegtran.

Finally, animated GIF files can be optimized by stripping the pixels from each frame that do not change from the previous to the next frame. I use gifsicle to achieve that.
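
To make this concrete, these are the kinds of commands such a processor might run. The invocations below are illustrative only; the exact flags File Conveyor passes to these tools may differ.

# Illustrative commands only; File Conveyor's actual flags may differ.
example_commands = [
    # GIF -> PNG8 conversion with ImageMagick
    "convert background.gif PNG8:background.png",
    # strip unneeded chunks from a PNG
    "pngcrush -rem alla background.png background-crushed.png",
    # strip JPEG metadata and optimize the Huffman tables
    "jpegtran -copy none -optimize -outfile photo-optimized.jpg photo.jpg",
    # for JPEG files larger than 10 KB, additionally make them progressive
    "jpegtran -copy none -optimize -progressive -outfile photo-optimized.jpg photo.jpg",
    # optimize an animated GIF by stripping pixels that do not change between frames
    "gifsicle -O2 animation.gif -o animation-optimized.gif",
]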

There is one important nuance though: stripping metadata may also remove the copyright information, which may have legal consequences. So it is not recommended to strip metadata when some of the photos being hosted have been bought, which may be the situation for a newspaper web site, for example.

Now that you know how the optimizations are done, here is the overview of all processor classes that this processor module provides:

  1. Max: optimizes image files losslessly (GIF, PNG, JPEG, animated GIF)
  2. KeepMetadata: same as Max, but keeps JPEG metadata
  3. KeepFilename: same as Max, but keeps the original filename (no GIF optimization)
  4. KeepMetadataAndFilename: same as Max, but keeps JPEG metadata and the original filename (no GIF optimization)

link_updater.py

Thanks to this processor module, it is possible to serve CSS files from a CDN while updating the URLs in the CSS file to reference the new URLs of these files, that is, the URLs of the synced files. It provides a sole processor class: CSSURLUpdater. This processor class should only be used when either of these conditions is true:

  • The base path of the URLs changes and the CSS file uses relative URLs that are relative to the document root to reference images (or other media).
    For example:

    http://example.com/static/css/style.css
    

    becomes

    http://cdn.com/example.com/static/css/style.css
    

    and its referenced file

    http://example.com/static/images/background.png
    

    becomes

    http://cdn.com/example.com/static/images/background.png
    

    after syncing. If the style.css file on the original server references background.png through the relative URL /static/images/background.png, then the CSSURLUpdater processor must be used to update the URL. Otherwise this relative URL would become invalid, since the correct relative URL for the CSS file on the CDN to reference has changed (because the base path has changed).

  • The base names of the referenced files changes.
    For example:

    http://example.com/static/css/style.css
    

    becomes

    http://cdn.com/example.com/static/css/style_1242440815.css
    

    and its referenced file

    http://example.com/static/images/background.png
    

    becomes

    http://cdn.com/static/images/background_1242440827.png
    

    after syncing. In this case, CSSURLUpdater must always be used. Otherwise the URL would become invalid, as the file’s base name has changed.

CSSURLUpdater uses the cssutils Python module to parse CSS files. This unfortunately also negatively impacts its performance, because it validates the CSS file while tokenizing it. But as this will become open source, others will surely improve this. A possibility is to use regular expressions instead to filter out the URLs.

The CSSURLUpdater processor must be run per server (and therefore so must the entire processor chain it is part of), because it needs the referenced files (typically images, but possibly also fonts) to be on the same server.
All it does is resolve relative URLs (relative to the CSS file or relative to the document root) to absolute paths on the file system, look up the corresponding URLs on the CDN and place those in the CSS file instead. If one of the referenced files cannot be found on the file system, its URL remains unchanged. If one of the referenced files has not yet been synced to the CDN, then a RequestToRequeueException exception will be raised (see the processor.py section) so that another attempt will be made later, when hopefully all referenced files have been synced.
For details, see the daemon’s documentation.

yui_compressor.py

This is the processor module that could be seen in the sample code in the introductory processor.py section. It accepts CSS and JS files and runs the YUI Compressor on them, compressing them by stripping out all whitespace and comments. For JavaScript, it relies on Rhino to tokenize the JavaScript source, so it is very safe: it will not strip out whitespace where that could potentially cause problems. Thanks to this, it can also optimize more aggressively: it saves over 20% more than JSMin. For CSS (which is supported since version 2.0) it uses a regular-expression based CSS minifier.

3.6. Transporters

transporter.py

Each transporter is a persistent connection to a server via a certain protocol (FTP, SCP, SSH, or custom protocols such as Amazon S3, any protocol really) that is running in its own thread. It allows you to queue files to be synced (save or delete) to the server.
Transporter is a base class for transporters, which are in turn very (very!) thin wrappers around custom Django storage systems. If you need support for another storage system, you should write a custom Django storage system first. Transporters’ settings are automatically validated in the constructor. Also in the constructor, an attempt is made to set up a connection to their target server. When that fails, an exception (ConnectionError) is raised. Files can be queued for synchronization through the sync_file(src, dst, action, callback, error_callback) method.
Upon completion, the callback function will be called. The error_callback function is called in case of an error.
Transporter also supports logging and accepts a parent logger.
A sample transporter can be found below. For details, please consult the daemon’s documentation.

class TransporterFTP(Transporter):

    name              = 'FTP'
    valid_settings    = ImmutableSet(["host", "username", "password", "url", "port", "path"])
    required_settings = ImmutableSet(["host", "username", "password", "url"])

    def __init__(self, settings, callback, error_callback, parent_logger=None):
        Transporter.__init__(self, settings, callback, error_callback, parent_logger)

        # Fill out defaults if necessary.
        configured_settings = Set(self.settings.keys())
        if not "port" in configured_settings:
            self.settings["port"] = 21
        if not "path" in configured_settings:
            self.settings["path"] = ""

        # Map the settings to the format expected by FTPStorage.
        location = "ftp://" + self.settings["username"] + ":"
        location += self.settings["password"] + "@" + self.settings["host"]
        location += ":" + str(self.settings["port"]) + self.settings["path"]
        self.storage = FTPStorage(location, self.settings["url"])
        try:
            self.storage._start_connection()
        except Exception, e:
            raise ConnectionError(e)
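
A hedged sketch of how such a transporter might be driven follows. The settings mirror the “ftp push cdn” server node from the sample configuration file, while the action value and the callback signatures are assumptions rather than File Conveyor’s exact API.

from transporter_ftp import TransporterFTP

def callback(*args):
    print("synced: %s" % (args,))

def error_callback(*args):
    print("sync failed: %s" % (args,))

settings = {
    "host":     "localhost",
    "username": "daemontest",
    "password": "daemontest",
    "url":      "http://localhost/daemontest/",
}

t = TransporterFTP(settings, callback, error_callback)
t.start()  # each transporter runs in its own thread
t.sync_file("/tmp/output/drupal_1242440815.js",  # src: the processed file on disk
            "static/drupal_1242440815.js",       # dst: the path on the server
            "add_modify",                        # action: assumed value
            callback, error_callback)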

Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. It is growing strongly in popularity. It is very different from Drupal, which aims to be both a CMS (install it and you have a working web site) and a framework; Django just aims to be a framework. It has APIs for many things, ranging from caching, database abstraction and forms to sessions, syndication, authentication and, of course, storage systems. I have extracted that single API (and its dependencies) from Django and am reusing it.

Now, why the dependency on Django’s Storage class? For three reasons:

  1. Since Django is a well-known, widely used open source project with many developers and is powering many web sites, it is fair to assume that the API is stable and solid. Reinventing the wheel is meaningless and will just introduce more bugs.
  2. Because the daemon relies on (unmodified!) Django code, it can benefit from bugfixes/features applied to Django’s code and can use custom storage systems written for Django. The opposite is also true: changes made by contributors to the daemon (and initially myself) can be contributed back to Django and its contributed custom storage systems.
  3. django-storages is a collection of custom storage systems, which includes these classes:
    • DatabaseStorage: store files in any database that Django supports (MySQL, PostgreSQL, SQLite and Oracle)
    • MogileFSStorage: MogileFS is an open source distributed file system
    • CouchDBStorage: Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API
    • CloudFilesStorage: Mosso/Rackspace Cloud Files, an alternative to Amazon S3
    • S3Storage: uses the official Amazon S3 Python module
    • S3BotoStorage: uses the boto module to access Amazon S3 and Amazon CloudFront
    • FTPStorage: uses the ftplib Python module

For the last two, transporters are available. The first three are not so widely used and thus not yet implemented, although it would be very easy to support them, exactly because all that is necessary is to write thin wrappers. For the fourth, a transporter was planned, but scrapped due to time constraints. The fifth is not very meaningful to use, since the sixth is better (better maintained and higher performance).
The CouchDBStorage custom storage class was added about a month after I selected django-storages as the package I would use, indicating that the project is alive and well and thus a good choice.
The CloudFilesStorage custom storage class is part of django-storages thanks to my mediation. An early tester of the daemon requested support for Cloud Files and, interestingly enough, one day before, somebody had published a Python package (django-cumulus) that did just that. Within a week, django-cumulus gained a dependency on django-storages and django-storages gained Cloud Files support. This is further confirmation that I had made a good choice: it suggests that my choice for django-storages is beneficial for both projects.

So I clearly managed to take a big shortcut (although the storage code had to be made to work outside of Django itself) to achieve my goal: supporting CDNs that rely on FTP or origin pulling (see “Key Properties of a CDN”), as well as the Amazon S3 and Amazon CloudFront CDNs.

However, supporting origin pull was trickier than it would seem at first. Normally, you just rewrite your URLs and are done with it. However, I wanted to support processing files prior to syncing them to the CDN, and I wanted to keep following the “do not touch the original file” rule. With push, that is no problem: you just process the file, store the output file in a temporary directory, push the file and delete it afterwards. But what about pull?
I had to be creative here. Since files must remain available for origin pull (in case the CDN wants/needs to update its copy), all files must be copied to another publicly accessible path in the web site. But what about files that are not modified? Or have just changed filenames (for unique URLs)? Copying these means storing the exact same data twice. The answer is fortunately very simple: symbolic links. Although available only on UNIX, it is very much worth it: it reduces redundant data storage significantly. This was then implemented in a new custom storage system: SymlinkOrCopyStorage, which copies modified files and symlinks unmodified ones.

In total, I have contributed three patches to django-storages:

  1. FTPStorage: saving large files + more robust exists() (see the issue)

    • It enables the saving of large files by no longer reading all the chunks of the file in a single string. Instead it uses ftplib.storbinary() directly with a file pointer, which then handles the writing in chunks automatically.
    • It makes exists() more reliable: it has been tested with two different FTP servers and so far works without problems with both of the following, whereas it did not work with either of them before:

      1. Xlight FTP Server 3.2 (used by SimpleCDN)
      2. Pure-FTPd (used by Rambla)

      This improves the number of use cases where you can use the FTPStorage custom storage system.

  2. S3BotoStorage: set Content-Type header, fixed the setting of permissions, use HTTP and disable query authentication by default (see the issue)

    • The Content-Type header is set automatically by guessing based on the file extension, through mimetypes.guess_type. Before this patch, no Content-Type was set, so the default binary mimetype was used: application/octet-stream. This causes browsers to download files instead of displaying them.
    • The ACL (i.e. file permissions) now actually gets applied properly to the bucket and to each file that is saved to the bucket.
    • Currently, URLs are generated with query-based authentication (which implies ridiculously long URLs will be generated) and HTTPS is used instead of HTTP, thereby preventing browsers from caching files. I have disabled query authentication and HTTPS, as this is the most common use case for serving files. This probably should be configurable, but that can be done in a revised patch or a follow-up patch.
    • It allows you to set custom headers through the constructor (which I really needed for my daemon).

      This greatly improves the usability of the S3BotoStorage custom storage system in its most common use case: as a CDN for publicly accessible files.

  3. SymlinkOrCopyStorage: new custom storage system (see the issue)

The maintainer was very receptive to these patches and replied a mere 23 minutes after I contacted him (via Twitter):

davidbgk: “@wimleers Impressive patches, I will merge your work ASAP. Thanks for contributing! Interesting bachelor thesis :)”

The patches were submitted on May 14, 2009. The first and third patch were committed on May 17, 2009. The second patch needs a bit more work (more configurable, less hard coded, which it already was though).

transporter_ftp.py

Provides the TransporterFTP class, which is a thin wrapper around FTPStorage, with the aforementioned patch applied.

transporter_s3.py

Provides the TransporterS3 class, which is a thin wrapper around S3BotoStorage, with the aforementioned patch applied.

transporter_cf.py

Provides the TransporterCF class, which is not a thin wrapper around S3BotoStorage, but around TransporterS3. In fact, it just implements the alter_url() method to alter the Amazon S3 URL to an Amazon CloudFront URL (see “Key Properties of a CDN”).

It also provides the create_distribution() function to create a distribution for a given origin domain (a domain for a specific Amazon S3 bucket). Please consult the daemon’s documentation for details.

transporter_symlink_or_copy.py

Provides the TransporterSymlinkOrCopy class, which is a thin wrapper around SymlinkOrCopyStorage, which is a new custom storage system I contributed to django-storages, as mentioned before.

3.7. config.py

This module contains just one class: Config. Config can load a configuration file (parse the XML) and validate it. Validation does not happen through an XML schema, but through “manual” validation. The filter node is validated through the Filter class to ensure it is error-free: a Filter object is created, the conditions from the filter node are set, and when no exceptions are raised, the conditions are valid. All references (to sources and servers) are also validated. Its validation routines are pretty thorough, but by no means perfect.
Config also supports logging and accepts a parent logger.

This module should be unit tested, but is not — yet.

3.8. daemon_thread_runner.py

I needed to be able to run the application as a daemon. Great, but then how do you stop it? Through signals. That is also how, for example, the Apache HTTP server does it. To send a signal, you need to know the process’ pid (process id), so the pid must be stored in a file somewhere.

This module contains the DaemonThreadRunner class, which accepts an object and the name of the file that should contain the pid. The object should be a subclass of Python’s threading.Thread class. As soon as you start() the DaemonThreadRunner object, the pid will be written to the specified pid file name, the object will be marked as a daemon thread and started. While it is running, the pid is written to the pid file every sixty seconds, in case the file is deleted accidentally.

When an interrupt is caught (SIGINT for interruption, SIGTSTP for suspension and SIGTERM for termination), the thread (of the object that was passed) is stopped, and DaemonThreadRunner waits for the thread to join and then deletes the pid file.

This module is not unit tested, because it makes very little sense to do so (there is not much code). Having used it hundreds of times, it did not fail once, so it is reliable enough.
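
A minimal sketch of how this is meant to be used; the Arbitrator constructor argument and the pid file name are assumptions made for the example:

from arbitrator import Arbitrator
from daemon_thread_runner import DaemonThreadRunner

arbitrator = Arbitrator("config.xml")  # a threading.Thread subclass; constructor argument is assumed
runner = DaemonThreadRunner(arbitrator, "fileconveyor.pid")
runner.start()  # writes the pid file, marks the thread as a daemon thread and starts it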

4. Putting it all together: arbitrator.py

4.1. The big picture

The arbitrator is what links together all Python modules I have described in the previous section. Here is a hierarchical overview, so you get a better understanding of the big picture:

See the attached figure “File Conveyor — the big picture”.

Clearly, Arbitrator is what links everything together: it controls the 5 components: Config, FSMonitor, Filter, Processor and Transporter. There are three subclasses of FSMonitor to take advantage of the platform’s built-in file system monitor. Processor must be subclassed for every processor. Transporter must be subclassed for each protocol.

Now that you have insight into the big picture, let us examine how exactly Arbitrator controls all components, and what happens before the main function.

4.2. The flow

First, an Arbitrator object is created and its constructor does the following:

  1. create a logger
  2. parse the configuration file
  3. verify the existence of all processors and transporters that are referenced from the configuration file
  4. connect to each server (as defined in the configuration file) to ensure it is working

Then, the Arbitrator object is passed to a DaemonThreadRunner object, which then runs the arbitrator in such a way that it can be stopped through signals. The arbitrator is then started. The following happens:

  1. setup
    1. create transporter pools (analogous to worker thread pools) for each server. These pools remain empty until transporters are needed: transporters are created on demand.
    2. collect all metadata for each rule
    3. initialize all data structures for the pipeline (queues, persistent queues and persistent lists)
    4. move files from the ‘files in pipeline’ persistent list to the ‘pipeline’ persistent queue
    5. move files from the ‘failed files’ persistent list to the ‘pipeline’ persistent queue
    6. create a database connection to the ‘synced files’ database
    7. initialize the file system monitor (FSMonitor)
  2. run
    1. start the file system monitor
    2. start the processing loop and keep it running until the thread is being stopped
      1. process the discover queue
      2. process the pipeline queue
      3. process the filter queue
      4. process the process queue
      5. process the transport queues (1 per server)
      6. process the db queue
      7. process the retry queue
      8. allow retry (move files from the ‘failed files’ persistent list to the ‘pipeline’ persistent queue)
      9. sleep 0.2 seconds
    3. stop the file system monitor
    4. process the discover queue once more to sync the final batch of files to the persistent pipeline queue
    5. stop all transporters
    6. log some statistics

That is roughly the logic of the daemon. It should already make some sense, but it is likely that it is not yet clear what all the queues are for. And how they are being filled and emptied. So now it is time to learn about the daemon’s pipeline.
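
The processing loop can be summarized in a few lines of illustrative code; the method names below simply mirror the steps listed above and are not File Conveyor’s exact internals:

import time

def run_processing_loop(arbitrator):
    while not arbitrator.must_stop():           # assumed stop condition
        arbitrator.process_discover_queue()
        arbitrator.process_pipeline_queue()
        arbitrator.process_filter_queue()
        arbitrator.process_process_queue()
        arbitrator.process_transport_queues()   # one transport queue per server
        arbitrator.process_db_queue()
        arbitrator.process_retry_queue()
        arbitrator.allow_retry()
        time.sleep(0.2)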

4.3. Pipeline design pattern

This design pattern, which is also sometimes called “Filters and Pipes” (on which a fair bit has been written), is slightly underdocumented, but it is still a very useful design pattern. Its premise is to deliver an architecture that divides a large processing task into smaller, sequential steps (“Filters”) that can be performed independently — and therefore in parallel — and that are finally connected via Pipes. The output of one step is the input of the next.

For all that follows in this subsection, you may want to look at the figure below while reading. Note that this figure does not contain every detail: it is intended to help you gain some insight into how the daemon works, not how every detail is implemented.

See the attached figure “File Conveyor — flowchart”.

In my case, files are discovered and are then put into the pipeline queue. When they actually move into the pipeline (at which point they are added to the ‘files in pipeline’ persistent list), they start by going into the filter queue. After being filtered, they go into the process queue (possibly more than once); after being processed, into the transport queue (again possibly more than once); after being transported, into the db queue. After being stored in the database, they are removed from the ‘files in pipeline’ persistent list and we are done for this file. Repeat for every discovered file. This is the core logic of the daemon.

So many queues are used because there are so many stages in the pipeline. There is a queue for each stage in the pipeline, plus some additional ones, because the persistent data structures use the pysqlite module, which only allows you to access the database from the same thread in which the connection was created. Because I (have to) work with callbacks, the calling thread may be different from the creating thread, and therefore several queues exist solely for exchanging data between threads.
There is one persistent queue and two persistent lists. The persistent queue is the pipeline queue, which contains all files that are queued to be sent through the pipeline. The first persistent list is ‘files in pipeline’. It is used to ensure files still get processed if the daemon was killed (or crashed) while they were in the pipeline. The second persistent list is ‘failed files’ and contains all files for which either a processor in the processor chain or a transporter failed.
When the daemon is restarted, the contents of the ‘files in pipeline’ and ‘failed files’ lists are pushed into the pipeline queue, after which they are erased.

Queues are either filled through the Arbitrator (because it moves data from one queue to the next):

  • The pipeline queue is filled by the “process discover queue” method, which always syncs all files in the discover queue to the pipeline queue.
  • The filter queue is filled by the “process pipeline queue” method, which processes up to 20 files (this is configurable) in one run, or until there are 100 files in the pipeline (this is also configurable), whichever limit is hit first.
  • The process queue is filled by the “process filter queue” method, which processes up to 20 files in one run.

or through callbacks (in case data gets processed in a separate thread):

  • The discover queue is filled through FSMonitor’s callback (which gets called for every discovered file).
  • The transport queue is filled through a ProcessorChain’s callback or directly from the “process filter queue” method (if the rule has no processor chain associated with it). To know when a file has been synced to all its destinations, the ‘remaining transporters’ list gets a new key (the concatenation of the input file, the event and the string representation of the rule) and the value of that key is a list of all servers to which this file will be synced.
  • The db queue is filled through a Transporter’s callback. Each time this callback fires, it also carries information on which server the file has just been transported to. This server is then removed from the ‘remaining transporters’ list for this file. When no servers are left in this list, the sync is complete and the file can be removed from the ‘files in pipeline’ persistent list.

Because the ProcessorChain and Transporter callbacks only carry information about the file they have just been operating on, I had to find an elegant method to transfer the additional metadata for this file, which is necessary to let the file continue through the pipeline. I have found this in the form of currying. Currying is dynamically creating a new function that calls another function, but with some arguments already filled out. An example:

curried_callback = curry(self.processor_chain_callback, event=event, rule=rule)

The function self.processor_chain_callback accepts the event and rule arguments, but the ProcessorChain class has no way of accepting “additional data” arguments. So instead of rewriting ProcessorChain (and the exact same thing applies to Transporter), I simply create a curried callback, that will automatically fill out the arguments that the ProcessorChain callback by itself could never fill out.
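
What is called currying here is essentially partial application, which the Python standard library offers as functools.partial; a curry() helper along these lines could be as small as this sketch:

import functools

def curry(func, **kwargs):
    """Return a callable that calls func with kwargs already filled out."""
    return functools.partial(func, **kwargs)

# The ProcessorChain can now call curried_callback with only the arguments it
# knows about (e.g. the output file); event and rule are supplied automatically.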

Each of the “process X queue” methods acquires Arbitrator’s lock before accessing any of the queues. Before a file is removed from the pipeline queue, it is added to the ‘files in pipeline’ persistent list (this is possible thanks to PersistentQueue’s peek() method), and only then is it removed from the pipeline queue. This implies that once a file has been added to the pipeline queue, it can no longer be lost. The worst-case scenario is that the daemon crashes between adding the file to the ‘files in pipeline’ persistent list and removing it from the pipeline queue. Then it will end up twice in the queue. But the second sync will just overwrite the first one, so all that is lost is CPU time.

The “allow retry” method allows failed files (in the ‘failed files’ persistent list) to be retried, by adding them back to the pipeline queue. This happens whenever the pipeline queue is getting empty, or every 30 seconds. This ensures processors that use the RequestToRequeueException exception can retry.

The only truly weak link is unavoidable: if the daemon crashes somewhere between FSMonitor performing its callback, the file being added to the discover queue and the file being moved from the discover queue to the pipeline queue (a step that is necessary due to the thread locality restriction of pysqlite), that file is lost.

5. Performance tests

I have performed fairly extensive tests on both Mac OS X and Linux. The application behaved identically on both platforms, despite the fact that different file system monitors are being used in the background. That everything else works across platforms without problems is thanks to Python.

All tests were performed on the local network, i.e. with an FTP server running on localhost. Very small-scale tests have been performed with the Amazon S3 and CloudFront transporters, and since they worked, the results should apply to those as well. It does not and should not matter which transporter is being used.

At all times, the memory usage remained below 17 MB on Mac OS X and below 7 MB on Linux (unless the link_updater processor module was used, in which case it leaks memory like a madman — the cssutils Python module is to blame). A backlog of more than 10,000 files was no problem. Synchronizing 10 GB of files was no problem. I also tried a lot of variations in the configuration and all of them worked (well, sometimes it needed some bug fixing of course). Further testing should happen in real-world environments. Even tests in which I forced processors or transporters to crash were completed successfully: no files were lost and they would be synced again after restarting the daemon.

6. Possible further optimizations

  • Files should be moved from the discover queue to the pipeline queue in a separate thread, to minimize the risk of losing files due to a crashed application before files are moved to the pipeline queue. In fact, the discover queue could be eliminated altogether thanks to this.
  • Track progress of transporters and allow them to be stopped while still syncing a file.
  • Make processors more error-resistant by allowing them to check the environment, so they can ensure that third-party applications, such as YUI Compressor or jpegtran, are installed.
  • Figure out an automated way of ensuring the correct operation of processors, since they are the most likely cause of problems, because users can easily write their own processors.
  • Automatically copy the synced files DB every X seconds, to prevent long delays for read-only clients. This will only matter on sites where uploads happen more than once per second or so.
  • Reorganize code: make a proper packaged structure.
  • Make the code redistributable: as a Python egg, or maybe even as binaries for each supported platform.
  • Automatically stop transporters after a period of idle time.

7. Desired future features

  • Polling the daemon for its current status (number of files in the queue, files in the pipeline, processors running, transporters running, et cetera)
  • Support for Munin/Nagios for monitoring (strongly related to the previous feature)
  • Ability to limit network usage by more than just the number of connections: also by throughput.
  • Ability to limit CPU usage by more than just the number of simultaneous processors.
  • Store characteristics of the operations, such as TTS (Time-To-Sync), so that you can analyze this data to configure the daemon to better suit your needs.
  • Cache the latest configuration file and compare it with the new one

This is a republished part of my bachelor thesis text, with thanks to Hasselt University for allowing me to republish it. This is section nine in the full text, in which it was called “Daemon” instead of “File Conveyor: design”.


Comments

Chris Cohen

A fantastic write-up of a brilliantly simple idea, yet so complicated to implement! Thanks for sharing, and in such great detail. We have always been put off the idea of CDNs because of the complexity involved but something like this could well persuade us to give it a try.


Wim Leers

Thanks for the praise Chris!

While it’s “only” a copy of the text from my bachelor thesis, this still took me many, many hours to put together. I was afraid it was going to be in vain because it’s so long, but I’m glad that at least one person found it very useful :)

It’s actually not very complicated to implement, it’s just a lot of work :) But with this design, it’s easy to replace the pieces. Heck, it’s even easy to rewrite the entire logic of the daemon, because it’s all in arbitrator.py, i.e. a single file!

So definitely give File Conveyor a try! :)


Steven K

Can you please point me in the direction of sample Amazon S3 server configuration. I’m not able to find the syntax anywhere.

From the looks of the transporter_s3.py the listed properties are required: access_key_id secret_access_key bucket_name

It looks like you’re using camel case in your config.xml, so is it safe to assume the property names would look like: accessKeyId secretAccessKey bucketName

So would the server entry look like this? <accessKeyId>234lkj234lk2j32</accessKeyId> <secretAccessKey>s098098sd0f9sd8f</secretAccessKey> <bucketName>blahblahbucket</bucketName>

Also is it possible to transfer the files directly to cloudfront? Or tell cloudfront about the updated files on S3?

Thank you!


Wim Leers

The properties are not automatically camelcased, so foo_bar would map to <foo_bar>, not to <fooBar>.

And yes, you can transfer files directly to CloudFront (well, CloudFront is just a proxy for whatever is in S3, so technically files are still transferred to S3). If you look at fileconveyor/transporters/transporter_cf.py, you’ll see:

required_settings = ImmutableSet(["access_key_id", "secret_access_key", "bucket_name", "distro_domain_name"])

So the fourth property, the one you were missing, is distro_domain_name.

You can also see the full list of available settings there:

valid_settings = ImmutableSet(["access_key_id", "secret_access_key", "bucket_name", "distro_domain_name", "bucket_prefix"])