FP-Stream-powered association rule mining over data streams with support for constraints
The last blog post I wrote about my master thesis was on June 1st. The final blog post has been long overdue. To the (very few) readers interested in the technical details, I apologize for the long delay in writing about the last part.
That last blog post was about FP-Growth. This one is about FP-Stream. Whereas FP-Growth can analyze static data sets for patterns, FP-Stream is capable of finding patterns over data streams. FP-Stream relies on FP-Growth for significant parts, but it's considerably more advanced. In essence, this phase only adds the capability to mine over a stream of data. While that may not sound like much, the added complexity of achieving it turns this into a fairly large undertaking.
In the case of WPO Analytics, we're clearly working with a data stream, since there's an endless stream of visitors to your website (unless it's dead or offline, of course). Therefore, there is also a data stream of Episodes performance measurements (one set of Episodes measurements per page viewed by a visitor), and thus we need FP-Stream to get to the ultimate goal of this master thesis: WPO Analytics. The FP-Stream implementation was the second implementation milestone (on May 23), but was continuously refined and improved for a couple of weeks after that. Like FP-Growth, FP-Stream had to be extended to support item constraints.
Tilted Time Window
A key data structure required for an FP-Stream implementation is a tilted time window. One can opt for either a natural tilted time window model or a logarithmic tilted time window model. In the context of my thesis, a natural tilted time window model makes more sense, since it allows you to mine frequent itemsets over the last week, the last month, and so on, whereas a logarithmic tilted time window model would only allow for the last hour, the last 2 hours, the last 4, 8, 16, 32 hours, and so on. In the context of WPO analytics, results over such windows are clearly harder to interpret than results over a natural tilted time window.
I opted for a natural tilted time window with a precision of a quarter of an hour that keeps the data of up to 1 year ago. Given granularities of a quarter, an hour, a day, a month and a year (thus, there is a "quarter" (that's 15 minutes for American readers) granularity, an "hour" granularity, and so on), that results in a grand total of 4 + 24 + 31 + 12 + 1 = 72 units of time. For each such unit, there is a bucket in the `TiltedTimeWindow`. That is, there are 4 quarter buckets, 24 hour buckets, 31 day buckets, 12 month buckets and 1 year bucket. The first quarter bucket corresponds to the last quarter, the second quarter bucket corresponds to the last but one quarter (i.e. the quarter before that), and so on.
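To make the bucket mechanics concrete, here is a minimal sketch of such a natural tilted time window in C++. It is illustrative only: the names and details are simplified assumptions for this post, not the actual `TiltedTimeWindow` interface.

```cpp
#include <numeric>
#include <vector>

typedef unsigned int SupportCount;

// Minimal sketch of a natural tilted time window (illustrative only).
class TiltedTimeWindow {
public:
    TiltedTimeWindow() : buckets(72, 0), used(5, 0) {}

    // Append the support counted over the most recent quarter.
    void appendQuarter(SupportCount support) { append(support, 0); }

private:
    static const int GRANULARITIES = 5;
    static const int capacity[GRANULARITIES];
    static const int offset[GRANULARITIES];
    std::vector<SupportCount> buckets; // 4 + 24 + 31 + 12 + 1 = 72 buckets
    std::vector<int> used;             // buckets in use per granularity

    void append(SupportCount support, int g) {
        if (used[g] == capacity[g]) {
            // This granularity is full: summarize its buckets into one
            // bucket of the next coarser granularity. Data older than a
            // year (the coarsest granularity) is simply discarded.
            SupportCount sum = std::accumulate(
                buckets.begin() + offset[g],
                buckets.begin() + offset[g] + capacity[g], 0u);
            used[g] = 0;
            if (g + 1 < GRANULARITIES)
                append(sum, g + 1);
        }
        // Shift this granularity's buckets: index 0 is the most recent.
        for (int i = used[g]; i > 0; i--)
            buckets[offset[g] + i] = buckets[offset[g] + i - 1];
        buckets[offset[g]] = support;
        used[g]++;
    }
};

// quarters, hours, days, months, years
const int TiltedTimeWindow::capacity[] = { 4, 24, 31, 12, 1 };
const int TiltedTimeWindow::offset[]   = { 0, 4, 28, 59, 71 };
```

The nice property of this scheme is that appending a quarter is cheap: at most one summarization cascades per granularity, and anything older than the single year bucket falls off the end.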
The FP-Stream paper also describes how to prune data that is no longer needed for the resulting frequent itemsets to remain sufficiently accurate. This happens continuously (with every new batch of transactions that arrives), ensuring that stale data is deleted from memory, thus keeping memory consumption low.
PatternTree
Another key data structure is the Pattern Tree, which includes a `TiltedTimeWindow` in each of its nodes. You may recall from the FP-Growth implementation that there was another tree data structure, called FP-Tree. Well, in this phase of the implementation I reused the class I developed for the nodes in the FP-Tree (`FPNode`), but refactored it into a template class. For FP-Growth's FP-Tree, I thus use `FPNode<SupportCount>`, and for FP-Stream's `PatternTree`, I use `FPNode<TiltedTimeWindow>`.
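A rough sketch of what that refactoring boils down to (the real class has more bookkeeping; the member names here are assumptions):

```cpp
#include <QHash>

typedef quint32 ItemID;
typedef quint32 SupportCount;

// Sketch of the templated tree node: the payload type T is all that
// differs between the two trees.
template <class T>
class FPNode {
public:
    FPNode(ItemID item, FPNode<T>* parent) : item(item), parent(parent) {}

    ItemID item;
    T value; // SupportCount in an FP-Tree, TiltedTimeWindow in a PatternTree
    FPNode<T>* parent;
    QHash<ItemID, FPNode<T>*> children;
};

// FP-Growth's FP-Tree uses FPNode<SupportCount>;
// FP-Stream's PatternTree uses FPNode<TiltedTimeWindow>.
```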
The support of patterns (frequent itemsets) stored in a `PatternTree` should be interpreted differently than in an `FPTree`. The patterns can be read in the same way, but each node now contains only the support for the pattern defined by that node (a pattern is defined as the items encountered when traversing the tree from the root node to a given node), instead of a cumulative support that also includes the support of the frequent itemsets beneath it (i.e., its supersets). This class was trivial to implement, since most of the complex logic resides in the `TiltedTimeWindow` class.
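For instance, reading the pattern that a given node defines is just a walk from that node up to the root. A sketch, reusing the `FPNode<T>` shape from above, with an assumed `ROOT_ITEM` sentinel:

```cpp
#include <QList>

const ItemID ROOT_ITEM = 0xFFFFFFFF; // assumed sentinel for the root node

// Sketch: the pattern defined by a node is the set of items on the
// path from the root to that node.
QList<ItemID> patternForNode(const FPNode<TiltedTimeWindow>* node) {
    QList<ItemID> pattern;
    while (node != NULL && node->item != ROOT_ITEM) {
        pattern.prepend(node->item);
        node = node->parent;
    }
    return pattern;
}
```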
FP-Stream
As was the case for FP-Growth, thoroughly explaining the FP-Stream algorithm would lead us too far. I'd recommend reading the original paper, "Mining Frequent Patterns in Data Streams at Multiple Time Granularities" by C. Giannella, J. Han, J. Pei, X. Yan and P. S. Yu. To grasp the essence, I'd recommend reading only section 3.6, which covers the actual algorithm at a high level and is only one page long.
Besides the data stream itself, the only inputs are the following (note the overlap of the required parameters with the FP-Growth algorithm): the user chooses a minimum support `minSup`, a minimum confidence `minConf`, but now also a maximum support error `ε` (which is a poorly chosen name; "initial minimum support" or "Pattern Tree minimum support" would have been more apt).
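Summarized in code (a hypothetical struct, purely for clarity):

```cpp
// The three user-chosen parameters (hypothetical names and types).
struct FPStreamParameters {
    float minSup;  // σ: minimum support of the frequent itemsets that are
                   //    ultimately reported when querying a time range
    float minConf; // minimum confidence for mined association rules
    float epsilon; // ε: "maximum support error"; itemsets with support ≥ ε
                   //    ("subfrequent") are tracked in the PatternTree
};
```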
FP-Stream: initial batch
The first batch of data (the data is divided into batches per minimum time window, thus in our case a batch is created for every 15 minutes' worth of data) is treated differently from the rest: it is used as an initialization step. An empty `f_list` is created and passed to an `FPGrowth` instance, which mines frequent itemsets that have `ε` as their minimum support. The `FPGrowth` instance then applies the FP-Growth algorithm (with support for constraints, as in phase 1) to this initial batch, thereby creating an ordering of the items by decreasing frequency and storing it in `f_list`, which will be reused for subsequent batches. All frequent itemsets found by the FP-Growth algorithm (note that since all frequent itemsets are stored, we can call `FPGrowth` in blocking (synchronous) mode) are then stored in the `PatternTree`.
(The FP-Stream paper explains this particularly poorly; instead of simply stating that the FP-Growth algorithm is used for the initial batch, it provides a rough, inaccurate and suboptimal description of FP-Growth.)
See the `FPStream` class in my implementation.
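Condensed into code, the initialization step looks roughly like this. This is a sketch: the `FPGrowth` interface shown here is simplified and partly assumed, as are `Transaction` and `FrequentItemset`.

```cpp
// Sketch of FPStream's initial-batch handling (simplified, assumed API).
void FPStream::processInitialBatch(const QList<Transaction>& batch) {
    f_list.clear();
    // Mine with ε (not σ!) as the minimum support, honoring the constraints.
    FPGrowth fpgrowth(batch, epsilon, constraints, &f_list);
    // Blocking (synchronous) mode: all frequent itemsets are returned at
    // once, and f_list now orders the items by decreasing frequency.
    QList<FrequentItemset> frequentItemsets = fpgrowth.mineFrequentItemsets();
    foreach (const FrequentItemset& itemset, frequentItemsets)
        patternTree.addFrequentItemset(itemset); // fills a first quarter bucket
}
```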
FP-Stream: subsequent batches
The initial batch is very uninteresting, since it is essentially identical to an execution of the FP-Growth algorithm. Now that we have arrived at the subsequent batches, the FP-Stream algorithm becomes much more interesting.
As a subsequent batch is received, an `FPGrowth` instance is created (and passed `ε` as the minimum support and the previously created `f_list`). In the first pass of the FP-Growth algorithm, the transactions are scanned and frequent items that are not yet in `f_list` are added to it, in descending order of frequency (i.e. new frequent items are sorted descendingly and then appended to `f_list`, thus maintaining `f_list`'s previous order, only extending it).
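A sketch of that `f_list` maintenance (method and helper names are assumptions):

```cpp
#include <QtAlgorithms>

// Hypothetical comparator: orders items by descending batch frequency.
struct DescendingFrequency {
    const QHash<ItemID, SupportCount>& frequency;
    DescendingFrequency(const QHash<ItemID, SupportCount>& f) : frequency(f) {}
    bool operator()(ItemID a, ItemID b) const {
        return frequency.value(a) > frequency.value(b);
    }
};

// newFrequentItems: items that are (sub)frequent in this batch but that
// do not occur in f_list yet.
void FPGrowth::extendFList(QList<ItemID> newFrequentItems,
                           const QHash<ItemID, SupportCount>& frequency) {
    // Sort only the *new* items by descending frequency...
    qSort(newFrequentItems.begin(), newFrequentItems.end(),
          DescendingFrequency(frequency));
    // ...and append them: f_list's existing order never changes,
    // it is only extended.
    f_list += newFrequentItems;
}
```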
Each time `FPGrowth` encounters a new frequent itemset, the following happens (a code sketch follows after this list):
- its constraints are checked; the result is stored in the boolean `frequentItemsetMatchesConstraints`
- `FPGrowth::considerFrequentItemsupersets()` then checks whether there are supersets that can be mined, i.e. whether a conditional FP-Tree can be found on which the mining can continue; if there is not, `NULL` is returned; otherwise, it is checked whether the combination of the currently found frequent itemset plus the items in the conditional FP-Tree has the potential to match the constraints; if this is not the case, `NULL` is returned; otherwise, the conditional FP-Tree is built and returned
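Condensed, that second step looks like this. This is a sketch of my reading of the decision logic; the helper names are assumptions, not the actual methods.

```cpp
// Sketch of FPGrowth::considerFrequentItemsupersets() (assumed helpers).
// Returns the conditional FP-Tree to continue mining on, or NULL.
FPTree* FPGrowth::considerFrequentItemsupersets(
        const ItemIDList& frequentItemset, const FPTree* tree) {
    // 1. Are there any supersets to mine at all?
    if (!conditionalFPTreeExists(frequentItemset, tree))
        return NULL;
    // 2. Can "frequent itemset + items in the conditional FP-Tree" still
    //    possibly match the constraints?
    ItemIDList searchSpace = itemsInConditionalFPTree(frequentItemset, tree);
    if (!constraints.canMatch(frequentItemset, searchSpace))
        return NULL;
    // 3. Yes on both counts: build and return the conditional FP-Tree.
    return buildConditionalFPTree(frequentItemset, tree);
}
```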
Now, the signal `FPGrowth::minedFrequentItemset()` is emitted, and it includes the following parameters:

- the frequent itemset that was found
- `frequentItemsetMatchesConstraints`
- the conditional FP-Tree (which may either be `NULL` or point to an FP-Tree)
This signal is received in the slot `FPStream::processFrequentItemset()`, which is an exact implementation of the "FP-Streaming" algorithm ("Incremental update of the PatternTree structure with incoming stream data") in the FP-Stream paper, with minor modifications to add support for item constraints; more about that later in this blog post.
Frequent itemsets that match the constraints are inserted into the `PatternTree` by `FPStream::processFrequentItemset()`. When they already exist in the `PatternTree`, the corresponding `TiltedTimeWindow` is updated: a new quarter bucket is filled, and when the quarter granularity's 4 buckets are full, they're summarized into an hour bucket, and so on.
Finally, the user can ask to retrieve the frequent itemsets over any desired time range, after which the `PatternTree` will be traversed and each visited node's `TiltedTimeWindow` will be asked to return the total support for that time range (which maps to a range of buckets in the `TiltedTimeWindow`).
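A sketch of what such a query could look like, reusing the `FPNode<TiltedTimeWindow>` shape from earlier and an assumed `TiltedTimeWindow::supportForRange()`:

```cpp
#include <QPair>

// Sketch: collect all patterns whose total support over the bucket range
// [from, to] is at least minSup (assumed node/window interface).
void collectPatterns(const FPNode<TiltedTimeWindow>* node,
                     QList<ItemID> pattern,
                     int from, int to, SupportCount minSup,
                     QList<QPair<QList<ItemID>, SupportCount> >* results) {
    if (node->item != ROOT_ITEM) {
        pattern.append(node->item);
        SupportCount support = node->value.supportForRange(from, to);
        if (support >= minSup)
            results->append(qMakePair(pattern, support));
    }
    foreach (const FPNode<TiltedTimeWindow>* child, node->children)
        collectPatterns(child, pattern, from, to, minSup, results);
}
```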
Adding support for item constraints
FP-Stream was not designed with constraint matching in mind. A thorough search through the related literature led to the discovery of only a single paper on the subject ("Efficient Mining of Constrained Frequent Patterns from Streams" by C. K. Leung and Q. I. Khan). Unfortunately, this paper only provided trivial extensions that I had already figured out on my own, whereas the truly difficult part was to also get the support of antecedents, to be able to perform association rule mining.
In FP-Stream, whether supersets of an itemset are considered for mining through FP-Growth and then included in the Pattern Tree depends solely on two factors:
- the itemset must be subfrequent (meaning that it must have at least `ε` support, instead of `σ`)
- the corresponding node in the `PatternTree` must not have an empty `TiltedTimeWindow` after conducting tail pruning
However, when adding support for constraints, it becomes obvious that these factors are not sufficient. It is possible that:
- a frequent itemset is not accepted when it doesn't match the constraints
- that a frequent itemset is not accepted (because it doesn't match the constraints) does not imply that its supersets should not be examined: after all, the supersets may still be frequent, and more importantly, they may be able to match the constraints (such supersets are said to have potential)

The above two remarks imply that there are three cases in which either something was found (a frequent itemset that matches the constraints), something may be found upon further mining (among the supersets of this itemset, there is also or still something to be found), or both. In the table below you'll see that these are cases 1, 2 and 3, whereas case 4 represents the dead end, where absolutely nothing could be found and no further work is required:
| case | frequent itemset | conditional FP-Tree | explanation |
|------|------------------|---------------------|-------------|
| 1 | `NOT NULL` | `NULL` | frequent itemset found, but nothing left to explore |
| 2 | `NOT NULL` | `NOT NULL` | frequent itemset found, and supersets may contain more frequent itemsets |
| 3 | `NULL` | `NOT NULL` | frequent itemset does not match the constraints, but supersets may contain more frequent itemsets that do match the constraints |
| 4 | `NULL` | `NULL` | dead end |
As explained in the section about subsequent batches, when the FP-Growth algorithm has found a frequent itemset, the `FPGrowth::minedFrequentItemset()` signal is emitted, and it includes the following parameters:

- the frequent itemset (pattern) that was found
- `frequentItemsetMatchesConstraints`
- the conditional FP-Tree (which may either be `NULL` or point to an FP-Tree)
Now, how is support for constraints integrated with FP-Stream’s “Incremental update of the PatternTree structure with incoming stream data” algorithm?
There are two major branches in this algorithm (a condensed sketch follows after this list):

- the pattern already exists in the Pattern Tree:
  If the pattern is already in the Pattern Tree, the constraints were already matched when it was first inserted. Hence, we do not need to make any changes here.
- the pattern does not yet exist in the Pattern Tree:
  In this case, the FP-Stream paper states that the frequent itemset should be added. However, I again added the additional requirement that the pattern should also match the constraints.
  When the pattern does not match the constraints, I added the following logic to the algorithm: if the conditional FP-Tree that was passed with the signal does not equal `NULL` (meaning that the search space still has potential to match the constraints; again, as explained in the section about subsequent batches), then its supersets will also be calculated.
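In condensed form, this first constraint-aware version of the branches looks as follows. This is a sketch with a simplified, assumed API (`FrequentItemset` with `itemset` and `support` members, `PatternTree::find()`, and so on); as we will see below, the approach itself turned out to be flawed.

```cpp
// Sketch of the first constraint-aware FPStream::processFrequentItemset().
void FPStream::processFrequentItemset(const FrequentItemset& itemset,
                                      bool itemsetMatchesConstraints,
                                      const FPTree* conditionalFPTree) {
    FPNode<TiltedTimeWindow>* node = patternTree.find(itemset.itemset);
    if (node != NULL) {
        // Branch 1: already tracked; just update its TiltedTimeWindow.
        node->value.appendQuarter(itemset.support);
    } else if (itemsetMatchesConstraints) {
        // Branch 2: insert, but only patterns that match the constraints.
        patternTree.addFrequentItemset(itemset);
    }
    // Independently of the above: supersets are mined whenever the
    // conditional FP-Tree != NULL, i.e. whenever there is potential.
}
```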
The overall rationale is to ensure that the `PatternTree` only stores patterns (frequent itemsets) that match the constraints, while at the same time ensuring that the search for those patterns is not stopped too early (by means of the additional exploration of supersets in the second branch, when there is potential).
This all seemed reasonable, and it does correctly generate only frequent itemsets that match the constraints, but as we will see in the next subsection, it was not without (unforeseen) consequences.
Rule mining with constraints
You may recall the identically titled section in the blog post about FP-Growth. The problem described in this subsection is strongly reminiscent of (and, of course, correlated with) the problem described in that previous subsection. The solution, however, is completely different.
When I finally got `FPStream` working, there was another problem. I had always thought that once I got `FPStream` working, the hard part would be over. Instead, it turned out that there was one small oversight with major repercussions, which neither I nor my thesis advisor had noticed: it is in fact impossible to calculate the exact support of rule antecedents, since they cannot match the constraints. Would there be a work-around, like there was for the FP-Growth implementation with support for constraints?
Given a hypothetical candidate association rule `X ⇒ Y`, we need `support(X)` and `support(X ∪ Y)` to calculate the confidence of the association rule: `confidence(X ⇒ Y) = support(X ∪ Y) / support(X)`. Since `support(X) ≥ support(X ∪ Y)` by definition, it must follow that if `X ∪ Y` is stored in the `PatternTree`, then `X` must be stored in the `PatternTree` as well. However, when we add constraint matching to the picture, this no longer holds!
But here, a simple yet very elegant solution exists. Its only downside is that it implies the storage of more data (but still less data than in the case where no constraint matching is used before storing the data in the `PatternTree`, and constraint matching is only applied when mining association rules, i.e. after mining frequent itemsets).

The solution is: if a (sub)frequent itemset's supersets have the potential to match the constraints, then store it in the `PatternTree` anyway.
Let us again review the two branches that we altered in the previous subsection; now we will alter them in a different way that allows all antecedents to be found:

- the pattern already exists in the Pattern Tree:
  If the pattern is already in the Pattern Tree, the decision to track it has already been made. Hence, we still do not need to make any changes here.
- the pattern does not yet exist in the Pattern Tree:
  This time, I add the frequent itemset not only when it matches the constraints, but also when the conditional FP-Tree does not equal `NULL`. The reasoning behind this is that possible antecedents should also be stored in the Pattern Tree (i.e. when the constraints aren't matched, but the conditional FP-Tree does not equal `NULL` and thus has potential). The supersets aren't evaluated, though; but since the antecedent is already stored, its direct supersets will be evaluated in the next batch (if they occur in that batch). This is exactly how the original algorithm works.
This approach follows the “spirit” of the original algorithm more closely and succeeds in adding support for constraints, while still allowing for association rule mining.
This is how the above can be integrated with existing FP-Stream implementations:
```
let F be a frequent itemset found by the regular FP-Growth algorithm;
let T be the conditional FP-Tree for F;
let C be the constraints that must be matched for a frequent itemset to be accepted;
let P be the Pattern Tree;
let N be the node for F in P;

if (N == NULL) then {
    if (F matches C || T != NULL) then {
        add F to P;
    }
}
```
So now, from the perspective of `FPStream`, antecedents will be stored in the `PatternTree`, and thus we will be able to calculate the confidence of candidate association rules.
However, one important fact was still overlooked: whether antecedents are also mined depends on the order in which frequent itemsets are mined by `FPGrowth`! This problem was previously encountered while adding support for constraints to the FP-Growth algorithm; see the blog post about FP-Growth.
The solution there was to simply calculate the support of an antecedent on the spot, which was perfectly possible thanks to the availability of the `FPTree` (from which this can easily be read). We cannot apply that same tactic here, because it is impossible to keep every `FPTree` of every `FPGrowth` instance in memory (remember that one `FPGrowth` instance is created for each batch, and upon completion this instance is deleted).
Let us consider the same (brief) example again. Suppose `{duration:slow, episodes:css, location:EU}` was generated in the following order:

{duration:slow}
↓
{duration:slow, episodes:css}
↓
{duration:slow, episodes:css, location:EU}

where each intermediate frequent itemset was also added to the set of frequent itemsets. Then, clearly, the frequent itemset that corresponds to the candidate association rule antecedent, `{episodes:css, location:EU}`, was not generated, and thus its support is not readily available.
Since we cannot keep every FP-Tree ever built in memory, there is only one solution: ensure that the frequent itemsets are generated in such an order that it is guaranteed that all possible antecedents will have been generated as well. That is the key insight.
We know that the FP-Growth algorithm generates frequent itemsets in a bottom-up fashion: it starts at the bottom of the FP-Tree, then adds each possible prefix, recursively repeating this until the root node is reached. To ensure that all antecedents are generated, the logical thing to do is therefore to make sure that the association rule consequent items are the last prefixes to be encountered. Since FP-Growth works in a bottom-up fashion, we must simply ensure that association rule consequent items are the items at the very top of the FP-Tree.
(Note that this approach only works when there is a very limited set of association rule consequent items. In our case, this set contains only one item: `{duration:slow}`.)
Implementing this was trivial: it only required a minor modification to `FPGrowth::optimizeTransaction()`: it still orders the items in a transaction by descending frequency, but it ensures that items matching positive rule consequent constraints end up at the front of the transaction. Transactions are always passed through this method before they are inserted into the `FPTree`, and thus this is all that needs to be changed.
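A sketch of that modification (names are assumptions; `Transaction` is taken to be a list of item identifiers):

```cpp
#include <QSet>
#include <QtAlgorithms>

// Hypothetical comparator: rule consequent items first, then descending
// frequency, so that consequent items end up at the top of the FP-Tree
// and are therefore the *last* prefixes bottom-up mining encounters.
struct ConsequentsFirstThenFrequency {
    const QHash<ItemID, SupportCount>& frequency;
    const QSet<ItemID>& ruleConsequents; // in our case: only {duration:slow}
    ConsequentsFirstThenFrequency(const QHash<ItemID, SupportCount>& f,
                                  const QSet<ItemID>& c)
        : frequency(f), ruleConsequents(c) {}
    bool operator()(ItemID a, ItemID b) const {
        bool ca = ruleConsequents.contains(a);
        bool cb = ruleConsequents.contains(b);
        if (ca != cb)
            return ca; // consequent items sort to the front
        return frequency.value(a) > frequency.value(b);
    }
};

// Sketch of the modified FPGrowth::optimizeTransaction().
Transaction FPGrowth::optimizeTransaction(Transaction transaction) const {
    qSort(transaction.begin(), transaction.end(),
          ConsequentsFirstThenFrequency(itemFrequencies, ruleConsequentItems));
    return transaction;
}
```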
Now, the order in which the above example's itemsets are generated is as follows:
{episodes:css}
↓
{episodes:css, location:EU}
↓
{duration:slow, episodes:css, location:EU}
End result
We now have an application that is capable of parsing Episodes log files, mapping each Episodes log line to many transactions, mining the subfrequent itemsets from these transactions using the FP-Growth algorithm, inserting these into a Pattern Tree when the FP-Stream algorithm deems it fit, and then retrieving the frequent itemsets over any given time range.
The end result at any point in time is a `PatternTree` that contains all subfrequent itemsets (i.e. all itemsets with frequency ≥ `ε`). We can now ask for any time range (i.e. any range `[x, y] | x ≤ y | x, y ∈ [0, 71]`) supported by the buckets present in all `TiltedTimeWindow`s to retrieve the frequent itemsets (i.e. all subfrequent itemsets with frequency ≥ `σ`).
Now, if we look at the debug output of some of the batches being processed, we can learn a lot of things:
```
Processed batch of 246 lines!
Transactions generated: 2395. (9.73577 transactions/event)
Avg. transaction length: 11. (26345 items in total)
Events occurred between 2010-11-14 06:45:09 and 2010-11-14 06:59:56.
PatternTree size: 1045
ItemIDNameHash size: 298
f_list size: 277

Processed batch of 258 lines!
Transactions generated: 2488. (9.64341 transactions/event)
Avg. transaction length: 11. (27368 items in total)
Events occurred between 2010-11-14 07:00:01 and 2010-11-14 07:14:59.
PatternTree size: 1261
ItemIDNameHash size: 383
f_list size: 338

Processed batch of 135 lines!
Transactions generated: 1282. (9.4963 transactions/event)
Avg. transaction length: 11.0062. (14110 items in total)
Events occurred between 2010-11-14 07:15:00 and 2010-11-14 07:29:52.
PatternTree size: 889
ItemIDNameHash size: 444
f_list size: 404
```
Noticeable properties are:
- each batch contains the transactions generated from the Episodes log lines over a 15-minute window; this is evidenced by the timestamps
- the number of page views can vary strongly between 15-minute windows, and thus the number of transactions per corresponding batch varies equally strongly
- the `PatternTree` size increases most of the time, but sometimes the effects of pruning are very clear: in the 3rd batch, the size decreases significantly
- the `ItemIDNameHash` variable's size (it maintains the mapping from efficient item identifiers to their full string equivalents) is a measure of the number of unique items encountered so far in the data stream
- the `f_list` size can only increase, but as it gets to know most items, its growth rate will decelerate (note that it is by definition smaller than `ItemIDNameHash`'s size, since `f_list` does not include infrequent items and `ItemIDNameHash` does)
Performance
While it was fairly easy to describe the performance characteristics of the `EpisodesParser` and `Analytics` modules in phase 1, it has now (with `Analytics` in phase 2) become relatively hard.
After all, there is no single desirable output anymore: the desired output (association rules) depends on the desired time range. It is clear, though, that the association rule mining itself is still very fast. However, given an Episodes log file of e.g. 50,000 lines, it is clearly far less efficient to mine it for association rules using FP-Stream, if only because many `FPGrowth` instances need to be created (one for every batch that corresponds to a 15-minute window).

The consequence is that there is a negligible performance difference between different batch sizes. In the test data set that I'm using, there are typically only between 100 and 300 Episodes log lines for each 15-minute window, resulting in about 1,000 to 3,000 transactions. Let us assume the average is 1,500 transactions. The difference in processing time for 1,500 transactions versus, say, 12,000 transactions (8 times as many!) is not that large: less than a second for 1,500 transactions, and still less than a second for 12,000 transactions (see the conclusion of the blog post about FP-Growth: `FPGrowth` can handle over 16,500 transactions per second).
Due to the overhead incurred by having FP-Stream decide for each individual frequent itemset whether mining should continue or not, this number will be lower in practice, but the point is nevertheless clear.
Debug mode versus release mode
It’s worth mentioning that in this test case of 50,000 lines, the processing takes 6 to 7 minutes in debug mode, but only 2 minutes in release mode. Clearly, release mode is far more efficient!
Memory consumption
While performing the calculations for a ±50,000-line Episodes log file, memory consumption reaches an all-time high of ±46 MB, but upon completion it drops to ±28 MB, which corresponds to the memory consumed by QBrowsCap's and QGeoIP's in-memory caches, the Qt libraries, some data cached by the `Analytics` module, plus the `PatternTree`.
When you compare this to the memory consumption of `EpisodesParser` and `Analytics` at the end of phase 1, it is clear that the memory consumed by the `PatternTree` is very small (a few megabytes), and that there are likely no memory leaks whatsoever. (Running the application through valgrind also reveals no memory leaks.)
Conclusion
My master thesis' `Analytics` module is able to mine the association rules of a stream of Episodes log files at over 12,000 transactions per second on my 2.66 GHz Core 2 Duo machine! :)