In part one of this series, we looked at some of the challenges of reporting on the modern web. Advertising, visitor tracking, CDNs, widgets and APIs all contribute to the problem of cluttering employee Internet reports and making it difficult to find out what sites a user actually visited.
In this part of the series, we look at how we can derive some sense from all the noise the modern web creates.
Making Employee Internet Reports Make Sense Again
So how do we make sense of all this and create meaningful web reports that actually reflect people’s online activity? A knee-jerk reaction would be to create a domain filter that removes all the advertising, tracking, CDNs and widgets from your reports. But then what would you be left with? Perhaps a 10 KB hit to the original website? That’s not very useful.
A better solution is to utilize both Referrer URLs and Mime Types to group resources under the original requesting site.
The Referrer URL is the URL someone was on before they accessed the current URL. Referrer URLs are typically used to discover how a visitor found your website. For example, if the Referrer URL is a Google search page, then the visitor found your site through a Google search.
However, Referrer URLs can also be used to find the original site that requested a web resource. For example, when browsing Facebook, a friend’s profile picture will be downloaded from the CDN fbexternal-a.akamaihd.net, but the Referrer URL is still set to facebook.com as shown below.
So wouldn’t it be cool if web reports displayed the Referrer URL instead of the requesting URL for Advertising, Tracking, CDNs and Widgets? Yes. It would be very cool.
But how do we work out if a web resource is one of these things?
Again, a knee-jerk reaction would be to create a domain list, or even use URL Categories if you’re analyzing logs from a secure web gateway or UTM that does URL Filtering. But creating and maintaining a list like this would be a nightmare! Fortunately, there is a better way.
If you look at all the advertising, tracking, and CDN hits, you’ll notice that most of them are images, scripts, CSS files, streaming media content and so on. Fortunately, most log files contain a field called Mime Type that identifies the type of resource. Here are some Mime Types for common web resources:
For normal original web pages, the Mime Type is usually one of the following four types:
- text/plain
- text/html
- text/html;charset=utf-8
- text/html; charset=iso-8859-1
Note: The charset=… part is not technically part of the Mime Type; however, Forefront TMG logs it as such. When analyzing Forefront TMG logs, we therefore need to take these strings into account.
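One way to take these strings into account is to split the logged value at the semicolon before comparing it with anything. The following is only a minimal sketch in Python: the function name split_mime_type is made up for illustration, and it assumes the log field arrives as a plain string in the form shown above.

```python
def split_mime_type(logged_value: str) -> tuple[str, dict[str, str]]:
    """Split a logged Mime Type string into its base type and parameters.

    Hypothetical helper: assumes values such as
    "text/html; charset=iso-8859-1" or "text/html;charset=utf-8".
    """
    # Everything before the first ";" is the Mime Type itself.
    base, *raw_params = logged_value.split(";")

    # Anything after it is a name=value parameter such as charset.
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        if name.strip():
            params[name.strip().lower()] = value.strip()

    # Mime Types are case-insensitive, so compare them in lower case.
    return base.strip().lower(), params
```

With this in place, text/html;charset=utf-8 and text/html; charset=iso-8859-1 both reduce to text/html, so the spacing and the charset no longer matter when comparing types.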
The Facebook profile picture from my example above has a Mime Type of image.
So in theory, we can make a more sensible looking web report with the following assumption:
Anything with a Mime Type other than text/plain, text/html, text/html;charset=utf-8, or text/html; charset=iso-8859-1 is a web resource (image, script, etc.).
For web resources, display the Referrer URL, and for everything else (original HTML pages), display the original requesting URL.
Of course, this will also display the referrer URL for web resources hosted on the original site, but that doesn’t really matter, as the referrer URL will still be the original requesting site.
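To make the assumption concrete, here is a rough sketch of the rule in Python. It is not how WebSpy Vantage implements it. The field names (url, referrer, mime_type, bytes), the two sample hits and their values are all invented for illustration. The sketch also makes two assumptions beyond the rule above: it compares only the part of the Mime Type before the semicolon, so any charset counts as a page, and it falls back to the requesting URL when a resource has no Referrer URL.

```python
from collections import Counter
from urllib.parse import urlsplit

# Base Mime Types treated as original web pages (assumption: the charset
# suffix is ignored, so every text/html variant counts as a page).
PAGE_MIME_TYPES = {"text/plain", "text/html"}


def base_mime_type(logged_value):
    """Return the Mime Type without any "; charset=..." suffix."""
    return logged_value.split(";")[0].strip().lower()


def display_url(hit):
    """Pick the URL to show in the report for one log record."""
    is_page = base_mime_type(hit["mime_type"]) in PAGE_MIME_TYPES
    # Assumption: a resource with no Referrer URL keeps its own URL.
    if is_page or not hit.get("referrer"):
        return hit["url"]
    return hit["referrer"]


def bytes_by_site(hits):
    """Total the bytes of every hit under the host of its display URL."""
    totals = Counter()
    for hit in hits:
        totals[urlsplit(display_url(hit)).hostname] += hit["bytes"]
    return totals


# Made-up sample records, loosely based on the Facebook example above.
hits = [
    {"url": "https://www.facebook.com/", "referrer": "",
     "mime_type": "text/html; charset=utf-8", "bytes": 10_000},
    {"url": "https://fbexternal-a.akamaihd.net/photo.jpg",
     "referrer": "https://www.facebook.com/",
     "mime_type": "image/jpeg", "bytes": 50_000},
]

for site, total in bytes_by_site(hits).items():
    print(site, total)
```

In this sample, the picture served from the CDN is counted under www.facebook.com along with the page itself, so the CDN host never appears in the report.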
So how do we go about doing this in WebSpy Vantage? Continue on to part three in this series to find out!