# FP-Growth-powered association rule mining with support for constraints

The previous blog post covering my master thesis was about the libraries I wrote for detecting browsers and locations: QBrowsCap and QGeoIP.

On the very day that post was published, I reached the first implementation milestone: it was already finding causes of slow page loads, albeit not over exactly specified periods of time, but over each chunk of 4,000 lines read from an Episodes log file. To achieve this, I completed an implementation of the FP-Growth algorithm, which I then modified to add support for item constraints.

### FP-Growth

Thoroughly explaining the FP-Growth algorithm would lead us too far, so I’ll include only a brief explanation below. For details, I refer to the original paper, “Mining frequent patterns without candidate generation” by J. Han, J. Pei, Y. Yin and R. Mao, which can easily be found through Google Scholar.

In this high-level explanation, none of the technical details will be included. For a quick introduction with more details, I recommend reading the excellent slides by Florian Verhein. The goal of this blog post is not to provide a complete insight into how the FP-Growth algorithm works, but to provide a high-level insight into how *my implementation* of it works, which modifications I had to make and which important optimizations are present.

Before calculations begin, the user chooses a minimum support `minSup`. A higher minimum support will obviously lead to fewer ‘frequent itemsets’ being found.

The user also chooses a minimum confidence `minConf` that will be used when performing rule mining. Analogously, a higher minimum confidence will lead to fewer association rules being found.

See the `FPGrowth` class in my implementation.

#### FP-Growth: initial pass

In an initial pass, the entire data set (a batch of transactions) is scanned to learn the support (i.e. frequency) of each unique item, by storing the items in some data structure `frequentSupportCounts` and incrementing the corresponding support each time an item is encountered in the data set.

If you’re aiming for maximum efficiency (as I am), it is wise to also maintain `ItemName → ItemID` and `ItemID → ItemName` mappings, so the same strings do not have to be stored many times, but only a small numeric identifier. I chose a 32-bit identifier to be future-proof, although a 16-bit identifier would have been sufficient for the data sets I tested with:

```
typedef uint32_t ItemID; // Supports 2^32 *different* items. Upgradable to uint64_t.
typedef QString ItemName;
```
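The two mappings can be sketched as follows. This is a hypothetical illustration, not the thesis code: it uses `std::string` and `std::unordered_map` where the actual implementation uses `QString` and `QHash`, and the `ItemInterner` name is mine.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

typedef uint32_t ItemID;
typedef std::string ItemName;  // QString in the actual implementation

// Intern each item name once and hand out a small numeric identifier,
// so repeated occurrences cost 4 bytes instead of a full string.
struct ItemInterner {
    std::unordered_map<ItemName, ItemID> idForName;  // ItemName → ItemID
    std::vector<ItemName> nameForId;                 // ItemID → ItemName (index = id)

    ItemID intern(const ItemName& name) {
        auto it = idForName.find(name);
        if (it != idForName.end())
            return it->second;  // already known: reuse its ID
        ItemID id = static_cast<ItemID>(nameForId.size());
        idForName.emplace(name, id);
        nameForId.push_back(name);
        return id;
    }
};
```

From here on, transactions can be processed as lists of `ItemID`s, and names are only looked up again when presenting results.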

So then the type of `frequentSupportCounts` looks like this: `QHash<ItemID, SupportCount>`.

After this initial pass is done, items whose support σ (sigma) is smaller than `minSup` are dropped from `frequentSupportCounts`.

Thanks to `frequentSupportCounts`, it is now very easy to check if an item is frequent or not: if it doesn’t exist in `frequentSupportCounts`, it is infrequent; otherwise it is frequent, and its support count can be accessed immediately.

See `FPGrowth::scanTransactions()` in my implementation.
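The initial pass can be sketched like this. Again a self-contained approximation rather than the actual `FPGrowth::scanTransactions()`: `std::unordered_map` stands in for `QHash`, and `SupportCount` is assumed to be an unsigned integer.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

typedef uint32_t ItemID;
typedef uint32_t SupportCount;  // assumed type; not spelled out in the post

// One scan over all transactions: count each item's support, then drop
// the items below minSup. Afterwards, absence from the returned map
// means "infrequent".
std::unordered_map<ItemID, SupportCount> scanTransactions(
        const std::vector<std::vector<ItemID>>& transactions,
        SupportCount minSup) {
    std::unordered_map<ItemID, SupportCount> supportCounts;
    for (const auto& transaction : transactions)
        for (ItemID item : transaction)
            ++supportCounts[item];
    // Prune infrequent items.
    for (auto it = supportCounts.begin(); it != supportCounts.end(); ) {
        if (it->second < minSup)
            it = supportCounts.erase(it);
        else
            ++it;
    }
    return supportCounts;
}
```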

#### FP-Growth: FP-Tree

The FP-Growth algorithm then continues to build an *FP-Tree*, a **F**requent **P**attern Tree. This is a *prefix tree* (also called a trie) that effectively *compresses* the data that needs to be stored. An FP-Tree is designed to store ‘frequent patterns’, which is just another name for ‘frequent itemsets’. The order of items *within* the frequent patterns (`ItemID`s, really, since we don’t work with the actual `ItemName`s but the corresponding `ItemID`s to conserve memory) is kept consistent by sorting them by descending frequency, which we can do thanks to `frequentSupportCounts` from the initial pass. As a result, all frequent itemsets containing the most frequent item `A` will *always* have `A` as the first item. (This of course requires that an itemset is not really a *set*, but a list, because by definition there is no order in a set.)

Hence, even if there are a million frequent itemsets that contain item `A`, there will only be *one* node in the FP-Tree for `A`: this is the compression happening. When multiple frequent itemsets correspond to a single node in the FP-Tree, their supports will be summed and stored in this node in the FP-Tree.

See slide 4 in Florian Verhein’s presentation for a graphical explanation.

See `FPGrowth::buildFPTree()` in my implementation, as well as the `FPTree` and `FPNode` classes.
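The insertion step described above can be sketched as follows. This is a hypothetical, simplified stand-in for the actual `FPTree`/`FPNode` classes: items are sorted by descending support before insertion, so shared prefixes collapse into shared nodes and their supports are summed.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <memory>
#include <unordered_map>
#include <vector>

typedef uint32_t ItemID;
typedef uint32_t SupportCount;

struct FPNode {
    ItemID item = 0;
    SupportCount support = 0;
    std::map<ItemID, std::unique_ptr<FPNode>> children;
};

// Insert one transaction (already pruned to frequent items) into the tree.
// 'freq' is the frequentSupportCounts map from the initial pass.
void insertTransaction(FPNode& root, std::vector<ItemID> items,
                       const std::unordered_map<ItemID, SupportCount>& freq) {
    // Descending frequency; tie-break on ItemID for a deterministic order.
    std::sort(items.begin(), items.end(), [&](ItemID a, ItemID b) {
        if (freq.at(a) != freq.at(b)) return freq.at(a) > freq.at(b);
        return a < b;
    });
    FPNode* node = &root;
    for (ItemID item : items) {
        auto& child = node->children[item];
        if (!child) {
            child = std::make_unique<FPNode>();
            child->item = item;
        }
        ++child->support;  // supports of coinciding itemsets are summed
        node = child.get();
    }
}
```

Because every transaction is sorted the same way, two transactions sharing their most frequent items share the top of their path in the tree: that shared path is the compression.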

#### FP-Growth: generate frequent itemsets

When the FP-Tree data structure has been built, everything is in place to efficiently mine frequent itemsets. Frequent itemsets are extracted in a bottom-up fashion, through a divide and conquer approach: it first looks for frequent itemsets ending in `E` (i.e. with the suffix `E`), then `DE`, etc. Then it looks for frequent itemsets ending in `D`, then `CD`, etc. After that, it looks for frequent itemsets ending in `C`, then `BC` and finally `ABC`.

This growing of smaller frequent itemsets into larger frequent itemsets is also where the name of the algorithm comes from: FP-Growth stands for **F**requent **P**attern Growth.

It can do this efficiently by only looking at parent nodes of the nodes corresponding to the current suffix’s first item. See slides 9–18 in Florian Verhein’s presentation (also attached) for a detailed explanation with many graphical aids.
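The divide-and-conquer growth step can be sketched as below. Note the simplification: instead of building actual conditional FP-Trees, this sketch recurses on flat ‘conditional pattern bases’ (lists of item-sequence/count pairs), which shows the suffix-growing logic without the tree compression. Transactions are assumed to be pre-sorted in one fixed global order, as in the previous section.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

typedef uint32_t ItemID;
typedef uint32_t SupportCount;
typedef std::vector<std::pair<std::vector<ItemID>, SupportCount>> PatternBase;

// For each frequent item in the base, record {item} ∪ suffix as a frequent
// itemset, then recurse on the conditional pattern base (everything that
// precedes the item), growing larger itemsets from smaller ones.
void mine(const PatternBase& base, const std::set<ItemID>& suffix,
          SupportCount minSup,
          std::map<std::set<ItemID>, SupportCount>& result) {
    std::map<ItemID, SupportCount> counts;
    for (const auto& [items, count] : base)
        for (ItemID item : items)
            counts[item] += count;
    for (const auto& [item, support] : counts) {
        if (support < minSup)
            continue;  // infrequent: prune this whole branch
        std::set<ItemID> itemset = suffix;
        itemset.insert(item);
        result[itemset] = support;  // the suffix has grown by one item
        // Conditional pattern base: items preceding 'item' in each sequence.
        PatternBase conditional;
        for (const auto& [items, count] : base) {
            auto it = std::find(items.begin(), items.end(), item);
            if (it != items.end() && it != items.begin())
                conditional.emplace_back(
                    std::vector<ItemID>(items.begin(), it), count);
        }
        mine(conditional, itemset, minSup, result);
    }
}
```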

### Adding support for item constraints

Now, the goal of my master thesis is to automatically pinpoint causes of *slow* episodes. Hence, I want to find association rules of the form `{A, B, episode:*, F, G, K, Z} → {duration:slow}`, where `episode:*` can be `episode:css`, `episode:headerjs`, and so on. (If you’re wondering how it is determined that an episode is slow, please see my previous blog post’s section on `EpisodesDurationDiscretizer`.)

To find *only* the association rules of this form, we can add two ‘positive item constraints’: one on `duration:slow` and one on `episode:*`. This means that during the frequent itemset generation step of the FP-Growth algorithm, we only accept frequent itemsets that match these positive item constraints (see `FPGrowth::generateFrequentItemsets`, line 402).

Finally, the remaining search space is only searched for additional frequent itemsets if it has the *potential* to match these positive constraints (see `FPGrowth::generateFrequentItemsets`, line 421). This potential is determined by checking the constraints against the current frequent itemset (which will be a suffix for future frequent itemsets) *and* the ‘prefix paths support counts’ combined: the search continues if either the frequent itemset matches the constraints *or* the prefix paths support counts match the constraints. This makes sense, because prefix paths (and thus the corresponding prefix paths support counts) indicate possible future extensions of the current frequent itemset. Thus, we only continue the search if it is possible that some offspring of the current frequent itemset will be able to match the constraints — in other words, if it has potential.

(The ‘prefix paths’ are the previously mentioned parent nodes, i.e., it is looking at all paths to the root node of the FP-Tree from all nodes in the FP-Tree that contain the first item of the frequent itemset. This first item is in fact the prefix that was prepended to the suffix, thus resulting in the current frequent itemset. Hence the name ‘prefix paths’ makes sense. ‘prefix paths support counts’, then, refers to all unique items’ support counts in these prefix paths.

Put more simply, ‘prefix paths support counts’ are the support counts of all possible items that may be added to the growing frequent itemset, and hence they represent the future search space.)

This part I had to figure out on my own, since I was unable to find a paper explaining how to make FP-Growth perform *constrained* frequent itemset mining. Fortunately, it was trivial.

See the `Constraints` class in my implementation.
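The ‘potential’ check just described can be sketched as a small predicate. The function name and signature are mine, not the thesis code’s: a positive item constraint can still be satisfied if the required item is already in the current (suffix) itemset, or still appears in the prefix paths’ support counts, i.e. the future search space.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

typedef uint32_t ItemID;
typedef uint32_t SupportCount;

// Returns true if every required item is either already in the current
// itemset or still reachable via the prefix paths support counts.
bool hasPotential(
        const std::set<ItemID>& constraintItems,
        const std::vector<ItemID>& currentItemset,
        const std::unordered_map<ItemID, SupportCount>& prefixPathsSupportCounts) {
    for (ItemID required : constraintItems) {
        bool inItemset = std::find(currentItemset.begin(), currentItemset.end(),
                                   required) != currentItemset.end();
        bool inSearchSpace = prefixPathsSupportCounts.count(required) > 0;
        if (!inItemset && !inSearchSpace)
            return false;  // this constraint can never be matched anymore
    }
    return true;
}
```

When this returns false, the whole branch of the search can be abandoned: no extension of the current frequent itemset will ever satisfy the constraints.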

### Mining rules from the generated frequent itemsets

Association rule mining is performed by the Apriori algorithm.

While association rule mining over FP-Growth output is trivial without constraints, it is *not* trivial when constraints are in play.

To calculate the confidence of a rule `{A, B} ⇒ {C}` (where `{A, B}` is called the rule *antecedent* and `{C}` is called the rule *consequent*), one must use the following formula:

```
a = rule.antecedent
c = rule.consequent
f = a UNION c = frequent itemset
confidence(a ⇒ c) = support(a UNION c) / support(a)
                  = support(f) / support(a)
```

(Alternatively, you can read the more wordy explanation on Wikipedia.)

So to calculate the confidence of a rule, one needs two values: the support of the entire (frequent) itemset of which the rule consists (`support(f)`) and the support of the antecedent (`support(a)`). When the resulting confidence is smaller than `minConf`, the candidate association rule is dropped; otherwise it is added to the result.

Since we already have `support(f)` whether constraints are being used or not (because that is what we started from), getting this value is a non-issue. It is `support(a)` where the challenge lies. This will become clear in the next paragraphs.
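The confidence formula and the `minConf` filter translate directly into code. A trivial sketch; the function names are mine, not the thesis implementation’s.

```cpp
#include <cstdint>

typedef uint32_t SupportCount;
typedef double Confidence;

// supportF is support(antecedent UNION consequent), i.e. support(f);
// supportA is support(antecedent), i.e. support(a).
Confidence confidence(SupportCount supportF, SupportCount supportA) {
    return static_cast<Confidence>(supportF) / supportA;
}

// A candidate association rule survives only if its confidence reaches minConf.
bool acceptRule(SupportCount supportF, SupportCount supportA, Confidence minConf) {
    return confidence(supportF, supportA) >= minConf;
}
```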

#### Rule mining without constraints

When constraints are not being used, *all* frequent itemsets are calculated. This implies that frequent itemsets corresponding to antecedents are also generated; after all, they are frequent too!

This claim is easy to understand: if `{A, B, C}` is frequent, then `{A, B}` must be frequent too. Since `{A, B, C}` is more *specific* than `{A, B}`, this must hold:

```
support({A, B}) ≥ support({A, B, C})
```

To retrieve the support of the antecedent, `support(a)`, all that needs to be done is looking up `a` in the set of frequent itemsets generated by FP-Growth. One can then calculate the confidence of the candidate association rule and decide whether to accept or discard it.

#### Rule mining with constraints

While the aforementioned property still holds when constraints *are* being used, the requirement that a frequent itemset match the constraints may prevent an antecedent’s frequent itemset from being accepted, or from being generated at all.

Therefore we need a way to calculate the support of antecedents efficiently. This is where things get interesting.

When the antecedent contains only one item, there is a nice trick we can apply: since we already have `frequentSupportCounts`, we can simply look it up directly here! In this case, no calculations are required at all.

When the antecedent contains multiple items, we can exploit the fact that we know exactly what we are looking for: we can quickly calculate the support for the given frequent itemset (the antecedent) from the FP-Tree. The algorithm is entirely analogous to the frequent itemset generation step, but this time we don’t have to generate *all* potential frequent itemsets: we can simply traverse the FP-Tree to get exactly the data we need. For full details, see `FPGrowth::calculateSuportCountExactly()`.

If the resulting confidence from this calculation does not meet the minimum confidence `minConf`, the candidate association rule is discarded. In the other case, it is accepted.
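The exact support calculation can be sketched as a targeted tree walk. This is an independent illustration of the idea, not the thesis’s `FPGrowth::calculateSuportCountExactly()`: as soon as a path from the root has covered every item of the antecedent, that node’s support is added and the walk stops descending, since all descendants are already counted in it.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <vector>

typedef uint32_t ItemID;
typedef uint32_t SupportCount;

struct Node {
    SupportCount support = 0;
    std::map<ItemID, std::unique_ptr<Node>> children;
};

// Build helper; transactions must already use one fixed global item order.
void insert(Node& root, const std::vector<ItemID>& items) {
    Node* node = &root;
    for (ItemID item : items) {
        auto& child = node->children[item];
        if (!child) child = std::make_unique<Node>();
        ++child->support;
        node = child.get();
    }
}

// Sum the supports of all nodes where the path from the root has covered
// every item in 'remaining'; only relevant paths are visited.
SupportCount exactSupport(const Node& node, const std::set<ItemID>& remaining) {
    SupportCount total = 0;
    for (const auto& [item, child] : node.children) {
        std::set<ItemID> rest = remaining;
        rest.erase(item);
        if (rest.empty())
            total += child->support;  // whole antecedent found on this path
        else
            total += exactSupport(*child, rest);
    }
    return total;
}
```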

### Conclusion

My master thesis’ `Analytics` module is able to mine the association rules of a 51,927-line sample log file per chunk of 4,000 lines at **over 1,500 lines per second or over 16,500 transactions per second** (and that *includes* the parsing and processing of `EpisodesParser`!) on my 2.66 GHz Core 2 Duo machine! :)

## Comments

## Tries are cool!

A trie is such a useful data structure for this kind of problem! I also used it for my second year project (TRPR) once, where we used it to find patterns in DNA strings…

## Tries are very cool indeed :)

DNA strings and tries in your second year? That sure sounds like you were working hard already in your second year… What did you build exactly for TRPR?

## It was mainly a lot of fun :)

Just ask your thesis advisor :)

We did this project under supervision of Prof. Vandenbussche to help people @ BIOMED identify certain chains in DNA sequences they extracted from the T-cell receptor (if I recall correctly). Basically, they had these long DNA strings, in which they needed to identify four main components (C, J, D and V chains, if I’m not mistaken). However, these chains could be suffixes or prefixes of the original string, and sometimes also contained some garbage (e.g. “AAAGGGCCC” could become “..AG?GCC”). Our program tried to identify the chains anyway, which made this job a lot easier for them, even if we got it wrong sometimes.

What I really liked about this project, was the fact that we got the opportunity to translate a real problem into working code that solved that problem (or at least made it a lot easier to tackle).