FP-Growth-powered association rule mining with support for constraints

published on June 1, 2011

The previous blog post covering my master thesis was about the libraries I wrote for detecting browsers and locations: QBrowsCap and QGeoIP.
On the very day that was published, I reached the first implementation milestone, which implied that it was already finding causes of slow page loads, but not over exactly specified periods of time, but rather over each chunk of 4,000 lines that was read from an Episodes log file. To achieve this, an implementation of the FP-Growth algorithm was completed, which was then modified to add support for item constraints.

FP-Growth

Thoroughly explaining the FP-Growth algorithm would lead us too far. Hence, I’ll include a brief explanation below. For details, I refer to the original paper, “Mining frequent patterns without candidate generation” by J. Han, J. Pei, Y. Yin and R. Mao which can easily be downloaded when searched for through Google Scholar.

In this high-level explanation, none of the technical details will be included. For a quick introduction with more details, I recommend reading the excellent slides by Florian Verhein. The goal of this blog post is not to provide a complete insight into how the FP-Growth algorithm works, but to provide a high-level insight into how my implementation of it works, which modifications I had to make and which important optimizations are present.

Before calculations begin, the user chooses a minimum support minSup. A higher minimum support will obviously lead to less ‘frequent itemsets’ being found.
The user also chooses a minimum confidenc minConf that will be used when performing rule mining. Analogously, a higher minimum confidence will lead to less association rules being found.

See the FPGrowth class in my implementation.

FP-Growth: initial pass

In an initial pass, the entire data set (a batch of transactions) is scanned to learn the support (i.e. frequency) of each unique item by storing them in some datastructure frequentSupportCounts and incrementing the corresponding support each time the item is encountered in the data set.
If you’re aiming for maximum efficiency (as I am), it is wise to also maintain ItemName → ItemID and ItemID â†’ ItemName mappings, so the same strings do not have to be stored many times, but only some small numeric identifier. I chose for a 32-bit identifier just to be futureproof, although a 16-bit identifier would have been sufficient for the data sets I tested with:

typedef uint32_t ItemID; // Supports 2^32 *different* items. Upgradable to  uint64_t.
typedef QString ItemName;

So then the type of frequentSupportCounts looks like this: QHash<ItemID, SupportCount>.
After this initial pass is done, items whose support σ (sigma) is smaller than minSup are dropped from frequentSupportCounts.
Thanks to frequentSupportCounts, it is now very easy to check if an item is frequent or not: if it doesn’t exist in frequentSupportCounts, it is infrequent, otherwise it is frequent, and its support count can be accessed immediately.

See FPGrowth::scanTransactions() in my implementation.

FP-Growth: FP-Tree

The FP-Growth algorithm then continues to build an FP-Tree, a Frequent Pattern Tree. This is a prefix tree (also called a trie) that effectively compresses the data that needs to be stored. An FP-Tree is designed to store ‘frequent patterns’, which is just another name for ‘frequent itemsets’. By ensuring that the order of items (ItemIDs, really, since we don’t work with the actual ItemNames but the corresponding ItemIDs to conserve memory usage) within the frequent patterns is always the same by ordering them by descending frequency (which we can thanks to frequentSupportCounts from the initial pass), all frequent itemsets containing the most frequent item A will always have A as the first item. (This of course requires that an itemset is not really a set, but a list, because by definition there is no order in a set.)
Hence, even if there are a million frequent itemsets that contain item A, there will only be one node in the FP-Tree for A: this is the compression happening. When multiple frequent itemsets correspond to a single node in the FP-Tree, their supports will be summed and stored in this node in the FP-Tree.

See slide 4 in Florian Verhein’s presentation for a graphical explanation.

See FPGrowth::buildFPTree() in my implementation, as well as the FPTree and FPNode classes.

FP-Growth: generate frequent itemsets

When the FP-Tree data structure has been built, everything is in place to efficiently mine frequent itemsets. Frequent itemsets are extracted in a bottom-up fashion, through a divide and conquer approach: it first looks for frequent itemsets ending in E (i.e. with the suffix E), then DE, etc. Then it looks for frequent itemsets ending in D, then CD, etc. After that, it looks for frequent itemsets ending in C, then BC and finally ABC.

This growing of smaller frequent itemsets into larger frequent itemsets is also where the name of the algorithm comes from: FP-Growth stands for Frequent Pattern Growth.

It can do this efficiently by only looking at parent nodes of the nodes corresponding to the current suffix’s first item. See slides 9—18 in Florian Verhein’s presentation (also attached) for a detailed explanation with many graphical aids.

Adding support for item constraints

Now, the goal of my master thesis is to automatically pinpoint causes of slow episodes. Hence, I want to find association rules of the form {A, B, episode:*, F, G, K, Z} → {duration:slow}, where episode:* can be episode:css, episode:headerjs, and so on. (If you’re wondering how it is determined that an episode is slow, please see my previous blog post’s section on EpisodesDurationDiscretizer.)

To find only the association rules of this form, we can add two ‘positive item constraints’: one on duration:slow and one on episode:* — this means that during the frequent itemset generation step of the FP-Growth algorithm, we only accept frequent itemsets that match these positive item constraints (see FPGrowth::generateFrequentItemsets, line 402).

Finally, the remaining search space is only searched for additional frequent itemsets if it has the potential to match these positive constraints (see FPGrowth::generateFrequentItemsets, line 421). This potential is determined by checking whether the constraints are matched by the current frequent itemset (which will be a suffix for future frequent itemsets) and the ‘prefix paths support counts’ simultaneously, that is, if either the frequent itemset matches the constraints or the prefix path support counts match the constraints. This makes sense, because prefix paths (and thus the corresponding prefix paths support counts) indicate possible future extensions of the current frequent itemset. Thus, we only continue the search if it is possible that some offspring of the current frequent itemset will be able to match the constraints, or in other words, if it has potential.

(The ‘prefix paths’ are the previously mentioned parent nodes, i.e., it is looking at all paths to the root node of the FP-Tree from all nodes in the FP-Tree that contain the first item of the frequent itemset. This first item is in fact the prefix that was prepended to the suffix, thus resulting in the current frequent itemset. Hence the name ‘prefix paths’ makes sense. ‘prefix paths support counts’, then, refers to all unique items’ support counts in these prefix paths.
Put more simply, ‘prefix paths support counts’ are the support counts of all possible items that may be added to the growing frequent itemset, and hence they represent the future search space.)

This part I had to figure out on my own, since I was unable to find a paper explaining how to make FP-Growth perform constrained frequent itemset mining. Fortunately, it was trivial.

See the Constraints class in my implementation.

Mining rules from the generated frequent itemsets

Association rule mining is performed by the Apriori algorithm.

While association rule mining over FP-Growth without constraints is trivial, when constraints are in play, it is not trivial.

To calculate the confidence of a rule {A, B} ⇒ {C} (where {A, B} is called the rule antecedent and {C} is called the rule consequent), one must use the following formula:

a = rule.antecedent
c = rule.consequent
f = a UNION c = frequent itemset

confidence(a ⇒ c) = support(a UNION c) / support(a)
                   = support(f) / support(a)

(Alternatively, you can read the more wordy explanation on Wikipedia.)

So to calculate the confidence of a rule, one needs two values: the support of the entire (frequent) itemset of which the rule consists (support(f)) and the support of the antecedent (support(a)). When the resulting confidence is smaller than minConf, the candidate association rule is dropped, otherwise it is added to the result.
Since we already have support(f), whether constraints are being used or not, because that is what we started from, getting this value is a non-issue. It is support(a) where the challenge lies. This will become clear in the next paragraphs.

Rule mining without constraints

When constraints are not being used, all frequent itemsets are calculated. This implies that frequent itemsets corresponding to antecedents are also generated; after all, they are frequent too!
This claim is easy to understand: if {A, B, C} is frequent, then {A, B} must be frequent too, since {A, B, C} is more specific than {A, B}, this must hold:

support({A, B}) ≥ support({A, B, C})

To retrieve the support of the antecedent support(a), all that needs to be done, is looking up a in the set of frequent itemsets generated by FP-Growth. One can then calculate the confidence of the candidate association rule and decide whether to accept or discard it.

Rule mining with constraints

While the aforementioned property still holds when constraints are being used, it is the requirement for a frequent itemset to match the constraints that prevents the frequent itemset from either being accepted or from being generated at all.
Therefore we need a way to calculate the support of antecedents efficiently. This is where things get interesting.

When the antecedent contains only one item, there is a nice trick we can apply: since we already have frequentSupportCounts, we can simply look it up directly here! In this case, no calculations are required at all.

Because we know exactly what we are looking for, we can quickly calculate the support for the given frequent itemset (the antecedent) from the FP-Tree. The algorithm is entirely analogous to the frequent itemset generation step, but this time we don’t have to generate all potential frequent itemsets: we can simply traverse the FP-Tree to get exactly the data we need. For full details, see FPGrowth::calculateSuportCountExactly().

If the resulting confidence from this calculation did not meet minimum support, the candidate association rule is discarded. In the other case, it is accepted.

Conclusion

My master thesis’ Analytics module is able to mine the association rules of a 51,927-line long sample log file per chunk of 4,000 lines at over 1,500 lines per second or over 16,500 transactions per second (and that includes the parsing and processing of EpisodesParser!) on my 2.66 GHz Core 2 Duo machine! :)

Add new comment

Comments

Tries are cool!

Jo Vermeulen ∞

A trie is such a useful data structure for this kind of problem! I also used it for my second year project (TRPR) once, where we used it to find patterns in DNA strings…

Tries are very cool indeed :)

Wim Leers ∞

DNA strings and tries in your second year? That sure sounds like you were working hard already in your second year… What did you build exactly for TRPR?

It was mainly a lot of fun :)

Jo Vermeulen ∞

Just ask your thesis advisor :)

We did this project under supervision of Prof. Vandenbussche to help people @ BIOMED identify certain chains in DNA sequences they extracted from the T-cell receptor (if I recall correctly). Basically, they had these long DNA strings, in which they needed to identify four main components (C, J, D and V chains, if I’m not mistaken). However, these chains could be suffixes or prefixes of the original string, and sometimes also contained some garbage (e.g. “AAAGGGCCC” could become “..AG?GCC”). Our program tried to identify the chains anyway, which made this job a lot easier for them, even if we got it wrong sometimes.

What I really liked about this project, was the fact that we got the opportunity to translate a real problem into working code that solved that problem (or at least made it a lot easier to tackle).