WebSpy Vantage 3.0

Making Sensible Employee Internet Reports for the Modern Web (Part 1)

Update: The technique described in this series of blog articles has since been improved upon and integrated into WebSpy Vantage 3.0 via the Origin Domain summary, which is present when analyzing any log files that contain URLs. We’ve called this feature Site Clean. It is also available in our separate Fastvue Reporter applications. See further details about our unique Site Clean engine.

The modern web presents many challenges for web reporting. You have probably come across the issue in your employee Internet reports where it is difficult to work out which sites someone was actually visiting, as opposed to everything their browser requested along the way.

To demonstrate the issue, I went and checked Facebook.com and also watched a video on Techcrunch.com. Then I imported my latest Forefront TMG logs into a new WebSpy Vantage storage and ran a report. Here are the results for the regular Site Domains section:

Notice how the two sites I actually visited appear 7th and 9th down the list, and my top site by a long way is something called 5min.com.

5min.com is the Content Delivery Network (or CDN) that Techcrunch.com uses for streaming videos. The report is technically accurate as my browser did download most content from this site. But if you told your employees they needed to stop browsing 5min.com they would have no idea what you’re talking about!
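To see why a CDN can end up at the top of the list, here is a minimal Python sketch of a hit count per domain. It is not how WebSpy Vantage builds its Site Domains summary; the URLs are made up for illustration (hostnames such as videos.5min.com and static.akamaihd.net are assumptions, not entries from my logs), and the "last two labels" rule for finding the site domain is a deliberate simplification that would fail for domains like example.co.uk.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical requests standing in for proxy log entries (not real log data).
requested_urls = [
    "http://techcrunch.com/",
    "http://techcrunch.com/style.css",
    "http://videos.5min.com/segment-1.mp4",
    "http://videos.5min.com/segment-2.mp4",
    "http://videos.5min.com/segment-3.mp4",
    "http://www.facebook.com/",
    "http://static.akamaihd.net/app.js",
    "http://static.akamaihd.net/sprite.png",
]


def site_domain(url: str) -> str:
    """Naively reduce a URL to its last two hostname labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def top_domains(urls):
    """Count requests per site domain, busiest domain first."""
    return Counter(site_domain(u) for u in urls).most_common()


for domain, hits in top_domains(requested_urls):
    print(domain, hits)
```

Even in this tiny sample, the video segments from the CDN outnumber the requests to the site that was actually typed into the address bar.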

The second site is akamaihd.net. This is the CDN that Facebook uses to deliver most of its content. Then you have wordpress.com, google.com, fbcdn.net, wp.com, fyre.co, google-analytics.com, 2mdn.net and a bunch of other sites that I certainly did not browse to.

So what is all this junk and how do we make sense of it?

Understanding the Modern Web

Once upon a time in the early days of the Internet, web developers would build a website and host everything on the same domain. The web page would be served from mysite.com along with all the images, scripts, style sheets and so on. Web reporting was easy in those days. If you saw mysite.com in your web report, chances were you had actually entered mysite.com into your browser.


Then came the advertising servers. Instead of hosting advertising banners on mysite.com, it was easier to simply link to the original advert on an advertising server. This also enabled the advertising company to track clicks and distribute money to the original site owner. But now your browser is requesting a resource (the advertisement) from a different domain. So instead of simply seeing mysite.com, you’re now seeing mysite.com as well as something like adserver.net in your reports.

In my example above, you can see several advertising sites: doubleclick.net, 2mdn.net, gravity.com, doubleverify.com and so on.

Visitor Tracking

The emerging online advertising industry quickly spawned visitor tracking sites. In order to serve you more relevant and tasty advertisements, advertising companies needed to track the number of times you return to a site, and get an idea of the content you’re interested in. This is also useful for web site administrators to get an idea of where visitors are coming from, what they’re interested in and how long they spend on the site. This is easily achieved by simply adding a script to your web site from a site tracking company.

But now your browser is also requesting a script from another domain. So now your web report is showing mysite.com, adserver.net and visitortrack.com (these are example/fictitious domains).
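As a rough illustration, the Python sketch below separates the resources a single page load pulls from its own domain from those it pulls from elsewhere. The function name and the resource URLs are hypothetical, and the domains are the same fictitious ones used above. It also assumes the page is served from the bare domain (mysite.com rather than www.mysite.com), so that subdomains count as first party.

```python
from urllib.parse import urlsplit


def third_party_domains(page_url, resource_urls):
    """Return hosts requested during a page load that are not the page's own."""
    page_host = urlsplit(page_url).hostname or ""
    seen = []
    for url in resource_urls:
        host = urlsplit(url).hostname or ""
        # Treat the page's host and any of its subdomains as first party.
        is_first_party = host == page_host or host.endswith("." + page_host)
        if not is_first_party and host not in seen:
            seen.append(host)
    return seen


# Hypothetical resources loaded while viewing http://mysite.com/
resources = [
    "http://mysite.com/style.css",
    "http://images.mysite.com/logo.png",
    "http://adserver.net/banner.gif",
    "http://visitortrack.com/track.js",
    "http://adserver.net/banner2.gif",
]

print(third_party_domains("http://mysite.com/", resources))
```

Every host in the returned list ends up in the log files next to mysite.com, even though the visitor only ever typed one address.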

The most common example you’ll see in your web reports is google-analytics.com (also seen in my example above), but there are many others such as chartbeat.com, woopra.com and mixpanel.com.

Content Delivery Networks (CDNs)

As web developers put more information on websites such as videos and other rich media, speed became a very important element in a website’s success. If your site is hosted on a server in New York, but your visitors are in Perth, Australia (incidentally the most isolated capital city in the world, and where yours truly is from originally), then it will take a long time for your site to download, and your visitors will get frustrated and leave.

CDNs came along to solve this problem.

We can see some of these CDNs in my example at the top of this article. When you browse Facebook, most of the content is actually delivered from a site called akamaihd.net. Akamai is a very popular CDN with an infrastructure that allows them to host content at locations all around the world, allowing sites like Facebook to deliver content to you faster as it is not traversing the globe to get to you.

Unfortunately, this means your web browser is now requesting most of a website’s content from a different domain to the one you typed in the address bar. This is great technology for web developers, but bad for anyone trying to understand where users are going on the Internet by analyzing their log data.

Social Sharing and Widgets

With the rise of social media, a great way for web developers to distribute their content is to make it easy for visitors to share it through their social networks. Like this on Facebook, tweet this on Twitter, share this on LinkedIn. They’re everywhere!

Just like advertising, visitor tracking, and CDN content, each of these widgets is hosted on Facebook, or Twitter, or LinkedIn, not on the actual site you’re looking at.

So now your log files and your web reports are showing people accessing social media sites, when they’re probably just reading a tech blog.


You’re likely to see all four types of sites (Advertising, Visitor Tracking, CDNs and Social Sharing Widgets) in your logs and reports, but the problem extends well beyond these four categories. Another modern trend for web applications is to have a public Application Programming Interface (API). This allows web developers to programmatically access functionality on other sites and applications from within their own site.

Examples include Stripe.com, which offers an API for credit card payments, MailChimp.com, which offers an API to add and remove visitors from mailing lists, and salesforce.com, which offers an API to edit accounts and contacts in their CRM.

Again, these APIs result in more resources being requested by your browser from domains that are not the same as what you typed into the address bar.
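To make the categories concrete, here is a small Python sketch that labels a domain using a hand-made lookup table built from the sites named in this article. The table and the function name are illustrative assumptions, not a complete list and not the approach WebSpy Vantage takes. A label like this also cannot tell you whether someone really browsed to facebook.com or merely loaded a Like button, which is exactly the problem described above.

```python
# Illustrative lookup only: domains are the examples named in this article.
DOMAIN_CATEGORIES = {
    "doubleclick.net": "Advertising",
    "2mdn.net": "Advertising",
    "gravity.com": "Advertising",
    "doubleverify.com": "Advertising",
    "google-analytics.com": "Visitor Tracking",
    "chartbeat.com": "Visitor Tracking",
    "woopra.com": "Visitor Tracking",
    "mixpanel.com": "Visitor Tracking",
    "akamaihd.net": "CDN",
    "5min.com": "CDN",
    "facebook.com": "Social Sharing",
    "twitter.com": "Social Sharing",
    "linkedin.com": "Social Sharing",
    "stripe.com": "API",
    "mailchimp.com": "API",
    "salesforce.com": "API",
}


def categorize(domain: str) -> str:
    """Return the category for a domain, or 'Unknown' if it is not listed."""
    return DOMAIN_CATEGORIES.get(domain.lower(), "Unknown")


for d in ["google-analytics.com", "akamaihd.net", "techcrunch.com"]:
    print(d, "->", categorize(d))
```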

Making sense of it all

So how do we make sense of all this and create meaningful employee Internet reports that actually reflect people’s online activity? Continue to part two of this series to find out!
