QBrowsCap & QGeoIP: detecting browsers and locations

published on March 1, 2011

In December and January, I continued working on my master's thesis while simultaneously preparing for my exams in January (which I passed without problems).
In a previous blog post, I had indicated that I ran into problems while parsing dates: Qt uses the system locale for this, but on Mac OS X there turned out to be a severe performance problem with that functionality. I solved that by developing QCachingLocale, a class that introduces a caching layer to prevent said performance degradation.

Further parsing

Now, parsing the date was of course only one tiny part of the problem: I also had to parse the Episodes information embedded in each Episodes log file line (which is trivial), map the IP address to a physical location and an ISP, and map the user-agent string to a platform and actual browser.
Finally, we also want to map the episode duration to either duration:slow, duration:acceptable or duration:fast. This is called ‘discretization’: continuous values (in our case: durations) are mapped to discrete values.

QBrowsCap: user-agent string → platform + browser

After spending a very long time looking for a C++ library to map user-agent strings to their corresponding browser name and version (and platform), I had to give up. No such library existed.
Because it is impossible to write a single, standardized routine that parses this information from the user-agent string, I had to rely on BrowsCap, the Browser Capabilities Project. This is the same dataset the PHP language relies on to identify browsers.

I have developed a C++ library that relies on BrowsCap to do just that, and baptized it ‘QBrowsCap’ because it is specifically optimized for use in Qt-based projects.
QBrowsCap makes it easy to download this dataset and keep it up-to-date. It parses the browscap.csv file and stores it in a SQLite database, which allows for faster mapping of user-agent strings (BrowsCap relies on ‘globbing’, and SQLite has built-in support for this). To maximize performance, it additionally uses an in-memory hash table. QBrowsCap is also thread-safe, so multiple threads can look up user-agent details concurrently (e.g. via a MapReduce approach, implemented in C++/Qt with Qt’s QtConcurrent), allowing for much greater lookup throughput.
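
To make that concrete, here is a minimal sketch of the SQLite-backed globbing lookup. The table and column names (browscap, pattern, platform, …) are illustrative assumptions, not QBrowsCap’s actual schema:

#include <QtSql/QSqlDatabase>
#include <QtSql/QSqlQuery>
#include <QString>
#include <QStringList>
#include <QVariant>

// Minimal sketch of a SQLite-backed globbing lookup; the schema below is an
// illustrative assumption, not QBrowsCap's actual database layout.
QStringList lookupUserAgent(const QString &ua)
{
    // In a real implementation the connection would be opened once and reused.
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE");
    db.setDatabaseName("browscap.db"); // SQLite index built from browscap.csv
    QStringList details;
    if (!db.open())
        return details;

    // SQLite's built-in GLOB operator matches the user-agent string against
    // the wildcard patterns in the BrowsCap dataset; the longest matching
    // pattern is the most specific one.
    QSqlQuery query(db);
    query.prepare("SELECT platform, browser, version_major, version_minor "
                  "FROM browscap WHERE :ua GLOB pattern "
                  "ORDER BY length(pattern) DESC LIMIT 1");
    query.bindValue(":ua", ua);
    if (query.exec() && query.next())
        for (int i = 0; i < 4; ++i)
            details << query.value(i).toString();
    return details;
}

A lookup like this is the slow path; the in-memory hash table mentioned above can then cache its result, so repeated user-agent strings (very common in log files) presumably never hit SQLite a second time.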

Sample result

The user-agent string Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) is mapped to

("ua:WinXP",
 "ua:WinXP:IE",
 "ua:WinXP:IE:6",
 "ua:WinXP:IE:6:0")

QGeoIP: IP address → location + ISP

Unfortunately, no library was available for Qt to map IP addresses to physical locations either. Fortunately, I did find a C library, MaxMind’s libGeoIP, which I made easier to use by wrapping it in a Qt-friendly manner; I called the end result ‘QGeoIP’. QGeoIP also simplifies the build process of libGeoIP by using Qt’s build system instead of an arcane Makefile.
I encountered one major problem with libGeoIP, though: GeoIP_delete() does not actually free the memory consumed by libGeoIP’s in-memory cache. I tried debugging this, but could not figure it out. Maybe it is an OS X-specific issue? I did not try it on Linux. Likely related to these memory release issues, it also seems to be impossible to make QGeoIP thread-safe, which unfortunately rules out concurrent IP-to-location/ISP mapping from multiple threads.
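
For reference, the underlying libGeoIP calls look roughly like this. This is a minimal sketch, with the database file name as an assumption, rather than QGeoIP’s actual wrapper code:

#include <QString>
#include <QDebug>
extern "C" {
#include <GeoIP.h>
#include <GeoIPCity.h>
}

// Minimal sketch of the libGeoIP calls that QGeoIP wraps; a real wrapper
// would keep the GeoIP handle open for the lifetime of the object.
void lookupLocation(const QString &ip)
{
    // GeoLite City database; the file name is an assumption for illustration.
    GeoIP *cityDb = GeoIP_open("GeoLiteCity.dat", GEOIP_MEMORY_CACHE);
    if (cityDb == NULL)
        return;

    GeoIPRecord *r = GeoIP_record_by_addr(cityDb, ip.toLatin1().constData());
    if (r != NULL) {
        qDebug() << "continent:" << r->continent_code
                 << "country:"   << r->country_name
                 << "region:"    << r->region;
        GeoIPRecord_delete(r);
    }

    // As noted above, GeoIP_delete() does not appear to actually release the
    // in-memory cache (at least on Mac OS X).
    GeoIP_delete(cityDb);
}

The ISP lookup works analogously against MaxMind’s ISP/organization database, e.g. via GeoIP_name_by_addr().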

Sample result

The IP address 218.56.155.59 is mapped to

("location:AS",
 "location:AS:China",
 "location:AS:China:Shandong",
 "location:isp:China:AS4837 CNCGROUP China169 Backbone")

EpisodesDurationDiscretizer: continuous episode durations → discrete speeds

The Episodes timing information contains <episode name>:<episode duration> pairs. It is far more difficult to perform association rule mining on continuous data than on discrete data; therefore we discretize the continuous data in these pairs: the episode durations. As explained in the introduction, we want to discretize these continuous episode durations to discrete speeds: duration:slow, duration:acceptable or duration:fast.
To do this, I wrote the EpisodesDurationDiscretizer class (.h/.cpp), which accepts a .csv file that defines the mappings. Such a .csv file looks like this:

domready,fast,150,acceptable,1000,slow
frontend,fast,100,acceptable,1500,slow
headerjs,fast,100,acceptable,1000,slow
footerjs,fast,100,acceptable,1000,slow
css,fast,100,acceptable,500,slow
DrupalBehaviors,fast,100,acceptable,200,slow
tabs,fast,10,acceptable,20,slow
ToThePointShowHideChangelog,fast,10,acceptable,20,slow

As you probably derived yourself, the first column contains the episode name, and the second column contains the “speed name” for the fastest discretization level, which covers durations from 0 ms up to the value in the third column. As many discretization levels as desired can be defined; in our case, there are three for each episode. For example, these are the three discretization levels for the domready episode durations (a sketch of the mapping logic follows the list):

  1. “fast”: 0—150 ms
  2. “acceptable”: 151—1000 ms
  3. “slow”: 1001—∞ ms
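
Here is a minimal sketch of that mapping logic. The class and method names follow the description above, but the internals are an illustrative reconstruction, not the actual EpisodesDurationDiscretizer implementation:

#include <QFile>
#include <QHash>
#include <QPair>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <QVector>

class EpisodesDurationDiscretizer
{
public:
    typedef QPair<int, QString> Threshold; // (upper bound in ms, speed name)

    bool parseCsvFile(const QString &fileName)
    {
        QFile file(fileName);
        if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
            return false;
        QTextStream in(&file);
        while (!in.atEnd()) {
            // e.g. "domready,fast,150,acceptable,1000,slow": after the episode
            // name, speed names alternate with the upper bound (in ms) of the
            // duration range they cover; the last speed name is unbounded.
            QStringList fields = in.readLine().split(',');
            QVector<Threshold> levels;
            for (int i = 1; i + 1 < fields.size(); i += 2)
                levels.append(qMakePair(fields[i + 1].toInt(), fields[i]));
            thresholds[fields[0]] = levels;
            lastSpeed[fields[0]] = fields.last();
        }
        return true;
    }

    QString mapToSpeed(const QString &episode, int duration) const
    {
        foreach (const Threshold &t, thresholds.value(episode))
            if (duration <= t.first)
                return t.second;
        return lastSpeed.value(episode); // past the last bound, e.g. "slow"
    }

private:
    QHash<QString, QVector<Threshold> > thresholds;
    QHash<QString, QString> lastSpeed;
};

With the sample .csv above, mapToSpeed("domready", 843) returns "acceptable" (843 falls between 150 and 1000 ms), and mapToSpeed("tabs", 110) falls past the last bound and returns "slow".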

Sample result

The Episodes timing information

css:203,headerjs:94,footerjs:500,domready:843,tabs:110,ToThePointShowHideChangelog:15,DrupalBehaviors:141,frontend:1547

is mapped to

(("episode:css", "duration:acceptable"),
 ("episode:headerjs", "duration:fast"),
 ("episode:footerjs", "duration:acceptable"),
 ("episode:domready", "duration:acceptable"),
 ("episode:tabs", "duration:slow"),
 ("episode:ToThePointShowHideChangelog", "duration:acceptable"),
 ("episode:DrupalBehaviors", "duration:acceptable"),
 ("episode:frontend", "duration:slow"))

End result

Now that we can map meaningless strings and numbers to meaningful items, we can apply association rule mining. But more on that in a future blog post.

We will end by looking at a single line as it gets parsed and processed by my master's thesis code.
We begin with:

"218.56.155.59 [Sunday, 14-Nov-2010 06:27:03 +0100] "?ets=css:203,headerjs:94,footerjs:500,domready:843,tabs:110,ToThePointShowHideChangelog:15,DrupalBehaviors:141,frontend:1547" 200 "http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" "driverpacks.net"

This gets parsed and processed into:

("episode:css", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:headerjs", "duration:fast", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:footerjs", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:domready", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:tabs", "duration:slow", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:ToThePointShowHideChangelog", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:DrupalBehaviors", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0") 
("episode:frontend", "duration:slow", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")

As you can see, this single Episodes log file line results in eight transactions; the careful reader will have noticed that this matches the number of episodes in the original log file line. More specifically, each episode gets its own transaction, along with its corresponding discretized speed and all request metadata (URL, location, ISP, platform, browser). (Note that this is a simplified example: in the actual implementation, the HTTP status code is also included when it is not 200, and a ua:isMobile item is included in the transaction when the request came from a mobile user agent.)
This is because we want to find associations for specific episodes' speeds; hence we need a transaction for each episode with its speed, plus all possible environmental factors that can cause that particular speed. On these resulting transactions, we can then apply association rule mining.
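
In code, the expansion step is essentially the following. This is a minimal sketch in which the ParsedLine structure and its field names are illustrative assumptions, not the actual EpisodesParser types:

#include <QList>
#include <QPair>
#include <QString>
#include <QStringList>

// Hypothetical container for one parsed log line: per-episode items plus the
// request metadata shared by all of them.
struct ParsedLine
{
    QList<QPair<QString, QString> > episodes; // (episode item, duration item)
    QStringList metadata;                     // url, location, ISP and ua items
};

QList<QStringList> expandToTransactions(const ParsedLine &line)
{
    typedef QPair<QString, QString> EpisodeSpeed;
    QList<QStringList> transactions;
    // One transaction per episode: its discretized speed plus all request
    // metadata, so that associations can be mined per episode.
    foreach (const EpisodeSpeed &e, line.episodes) {
        QStringList transaction;
        transaction << e.first << e.second << line.metadata;
        transactions.append(transaction);
    }
    return transactions;
}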

Conclusion

Both QBrowsCap and QGeoIP are unlicensed (as is the master's thesis as a whole), so feel free to use them in your applications and contribute back! They also include unit tests and tiny sample applications that can easily be built with Qt Creator.

When integrated with my master's thesis' EpisodesParser, I'm able to achieve over 4,000 parsed & processed lines per second on my 2.66 GHz Core 2 Duo machine, resulting in roughly 40,000 transactions per second. Not bad! :)

Comments

Wim Leers

BrowserScope’s code base is too big for a single person without prior deep knowledge (that’s me) to manage, let alone port in its entirety to C++. If it were sufficiently documented to make clear which parts change relatively frequently (because they depend on new user agents appearing) and which parts don’t, then porting it might have been doable. But in its current state, I’m afraid it’s just too hacky. (And calling Python code from within my C++ code … that would just have been too ugly.)

I did consider it, but evaluated the option as too unwieldy, convoluted and non-futureproof (BrowsCap updates are easy; BrowserScope’s UA parsing updates have to be done by hand).

When BrowserScope’s UA parsing becomes more feasible, it’ll be very easy to just replace QBrowsCap with QBrowserScopeUA! :)

Jurgen Goelen

We are using the boomerang library from Yahoo in our web pages to measure page load times (currently running on our staging servers). Our Apache logs get collected (near real-time) using Flume agents and are stored in an OpenTSDB instance.

We have also written our own framework to parse the UA, GeoIP and ISP info from the log lines.

Looking forward to your datamining work!

Wim Leers

Thank you so much for your excellent comment, Jurgen!

I had never heard of Flume nor OpenTSDB. Both are very interesting.

If I understand it correctly after my quick skim, Flume uses Apache Hadoop to parallelize the workload. In fact, I’m doing something like that: I’m using QtConcurrent to apply a MapReduce approach within my own codebase (i.e. without external dependencies; I updated the blog post to reflect this), but without the “Reduce” step: no reduction is necessary, only regular processing. I had thought about using a full-fledged Apache Hadoop setup, but considered it too much work to add yet another dependency (and to learn to install, configure and work with it). It appears that Flume requires less set-up, so it may be a viable option in the future.
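
For the curious, that “Map without Reduce” pattern boils down to something like this minimal sketch; processLine is a hypothetical stand-in for the actual per-line processing function, not code from my thesis:

#include <QtConcurrentMap>
#include <QFuture>
#include <QString>
#include <QStringList>

// Hypothetical stand-in for the actual per-line parse & process function.
QStringList processLine(const QString &line);

void processAll(const QStringList &logLines)
{
    // "Map" without "Reduce": every log line is processed independently on
    // Qt's global thread pool; the results are collected from the QFuture.
    QFuture<QStringList> results = QtConcurrent::mapped(logLines, processLine);
    results.waitForFinished();
}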

OpenTSDB appears to be insanely useful and unbelievably powerful. But again, it’d be yet another dependency, one that itself apparently has boatloads of dependencies: it requires a fairly elaborate Java environment. And I personally severely dislike working with Java: I’ve only had bad experiences with it over the past 5 years. If I were to integrate with OpenTSDB, the scope of my master's thesis implementation would grow too unwieldy.

Nevertheless, had I known about these right from the beginning, I might have built upon them. Right now, it’s too late to start changing it all.

Is your framework to parse the UA, GeoIP and ISP open source? Did you write it yourself, or did you fork it from another open source project? And possibly most importantly: what are you using it for? For the next site of the K.U. Leuven, maybe?

Looking forward to further feedback from you! :)

Jurgen Goelen

P.S.: Have a look at the following framework for mining your log data: http://mahout.apache.org/ ;-)

Wim Leers

Thanks for the additional links with introductory videos :) And once more, you stun me with a piece of software I did not know about: I’d never heard of Apache Mahout before… Nor did I find it in my search 1.5 years ago for frameworks that I could leverage.

You can find my current association rule mining code here on GitHub.

Avea

Consider this a bug report since I’m too lazy to sign up at GitHub.

QGeoIP.cpp line 76 calls GeoIP_time_zone_by_country_and_region using r->country_code and r->region before checking if r is null on line 78.

In addition, line 120 attempts to open the databases using the GEOIP_MMAP_CACHE flag, which is unavailable on Windows. This results in a segfault at the first lookup attempt, so a different flag or OS checking may be wise.
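
(Both fixes boil down to something like the following sketch; function and variable names are illustrative, not QGeoIP.cpp’s actual internals.)

extern "C" {
#include <GeoIP.h>
#include <GeoIPCity.h>
}

// Fix 1: only dereference the record after the null check.
const char *timeZoneForRecord(GeoIPRecord *r)
{
    if (r == NULL) // null-check *before* using r->country_code and r->region
        return NULL;
    return GeoIP_time_zone_by_country_and_region(r->country_code, r->region);
}

// Fix 2: GEOIP_MMAP_CACHE is unavailable on Windows, so fall back to a
// memory cache there.
GeoIP *openDatabase(const char *path)
{
#ifdef _WIN32
    return GeoIP_open(path, GEOIP_MEMORY_CACHE);
#else
    return GeoIP_open(path, GEOIP_MMAP_CACHE);
#endif
}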

Wim Leers

Your bug report is much appreciated!

However, as you can tell by looking at the code on GitHub, it’s been two years since I wrote that code. I’d be happy to apply a patch or merge your fork, but I’m not going to fix this myself.

I did open an issue on GitHub with this exact information though: https://github.com/wimleers/QGeoIP/issues/1 — hopefully that will help or even entice future users of the code :)