Optimizing Log File Size For Analysis And Reporting

Firewalls and proxies generate a lot of log data. Multiple gigabytes per day are commonplace now. The log files themselves are generally simple flat text files. Their size comes from the sheer volume of entries, not from being rich data types.

Large log files not only consume disk space during logging, storage, and archiving, but also add processing overhead when you import and analyse the data and generate reports.

As a result, there is value in making sure that you are logging the right level of information. This article will help you find the right balance between logging too little and too much.

Field Types

The log fields included in your log files will differ per device. Some devices allow you to customize the output by selecting certain fields, while others give you no options at all. Some devices may log descriptive full-text data, where others may use a single integer number or hex code.

In the image below, you can see how even a single system does not log data types uniformly.

[Image: LogSample]

This means that depending on the system, there may be differing advantages in excluding specific fields.

Example 1: Website Category

“Web Gateway A” may log a website category as Edge Content Servers/Infrastructure [35 chars], while “Web Gateway B” may log the category’s internal reference number 05 [2 chars]. In this case, excluding the log field for Web Gateway A would save 35 bytes per entry, while excluding the same field on Web Gateway B would save only 2 bytes per entry.

Since this is a useful field to include for Human Resources reporting, you may have to include it in logging. Even if you could omit it, the saving would be insignificant – especially using “Web Gateway B”.
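To put those per-entry figures in perspective, here is a rough back-of-the-envelope sketch. The volume of one million entries per day is purely an assumption for illustration, and the function name is made up for this example; substitute your own numbers.

```python
def daily_saving_bytes(field_bytes: int, entries_per_day: int) -> int:
    """Bytes saved per day by dropping one field from every log entry."""
    return field_bytes * entries_per_day


# Assumed volume, for illustration only.
ENTRIES_PER_DAY = 1_000_000

gateway_a = daily_saving_bytes(35, ENTRIES_PER_DAY)  # full-text category name
gateway_b = daily_saving_bytes(2, ENTRIES_PER_DAY)   # two-digit category code

print(f"Web Gateway A: {gateway_a / 1_000_000:.0f} MB per day")
print(f"Web Gateway B: {gateway_b / 1_000_000:.0f} MB per day")
```

Against multiple gigabytes of daily log data, neither figure is large, which is why this field is usually worth keeping.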

Example 2: User Agent

User-agent strings are always logged as a text field, and that text can be short or very long. Here are some real user-agent values from a proxy log, along with their lengths in characters (equivalent to bytes here):

  • Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) [65 chars]
  • SeaPort/3.0 [11 chars]
  • [1 char]
  • Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 [110 chars]
  • Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Win64; x64; Trident/4.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E) [163 chars]
  • Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 [103 chars]

You can see that the size of this field can vary from 1 byte to more than 160 bytes.
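If you want to measure this spread in your own logs before deciding, a short script is enough. The sketch below assumes a tab-separated log with a header row; the sample entries and field names (`user_agent` and so on) are hypothetical, and your device's layout will differ.

```python
import statistics

# Hypothetical tab-separated sample; real proxy logs use device-specific layouts.
SAMPLE_LOG = [
    "time\tuser\turl\tuser_agent",
    "10:00:01\talice\thttp://example.com/\tSeaPort/3.0",
    "10:00:02\tbob\thttp://example.org/\t-",
    "10:00:03\tcarol\thttp://example.net/\tMozilla/5.0 (Windows NT 6.1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36",
]


def field_length_stats(lines, field):
    """Summarise the byte length of one named column across all entries."""
    header = lines[0].split("\t")
    column = header.index(field)
    lengths = [len(line.split("\t")[column].encode("utf-8")) for line in lines[1:]]
    return {
        "entries": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "total": sum(lengths),
    }


stats = field_length_stats(SAMPLE_LOG, "user_agent")
print(stats)
```

The `total` value is the one that matters for sizing: it is the number of bytes you would save across the sample by dropping the field.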

Since the user agent indicates which application or browser was used, it is useful for technical reporting but not necessarily for Human Resources reporting (although it can be useful to filter out background apps and focus purely on web browsing activity).

Excluding the field from logging will give you a notable size benefit without impacting your HR-centric reports, but from an auditing perspective, it may not be an acceptable loss of detail.

Of course, these choices are relative to how you and your organisation value the different fields of information.

Deciding What to Keep and Discard

As you can see, not all fields are equal with regard to size or value. Some fields have to be logged, while others are optional.

The key here is to understand what fields you need for your reporting requirements now, and also what might be useful in the future.

Imagine you have a security breach in your organisation and you need to do some forensic investigation to figure out what happened. The field that had no value for HR reporting suddenly becomes extremely important from a technical perspective.

You can never recover log data that was never logged, but you can exclude log data when importing and analysing it.

We can store and archive logs efficiently with compression and deduplication. A prudent log-retention and archiving strategy is also advised (and often required by legislation).
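As a rough illustration of why flat text logs archive so well, the sketch below compresses some synthetic entries with Python's standard gzip module. The entries are invented and far more repetitive than real traffic, so treat the ratio it prints as an upper bound rather than something to expect from your own logs.

```python
import gzip

# Synthetic proxy entries; only the timestamp and user vary between lines.
users = ["alice", "bob", "carol"]
lines = [
    f"2015-05-01 10:{i // 60 % 60:02d}:{i % 60:02d}\t{users[i % 3]}\tALLOWED\t"
    "http://example.com/index.html\tMozilla/5.0 (Windows NT 6.1)\n"
    for i in range(3000)
]

raw = "".join(lines).encode("utf-8")
packed = gzip.compress(raw)

# Compression is lossless: the archive restores to the identical bytes.
restored = gzip.decompress(packed)

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
print(f"compressed size is {len(packed) / len(raw):.1%} of the original")
```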

My recommendation is always to log more than you need, but only import into your reporting/analysis tool the data that is required for reporting.

WebSpy Vantage Storage Import

Generally speaking, the smaller a WebSpy Vantage storage is, the faster reports will run. When importing your raw log data into WebSpy Vantage storage, you have a few options to reduce what is being imported.

In this section I will cover the key differences between filtering and field selection. Both decrease the amount of data being imported into the storage, but they work very differently.

Field Selection

Think of Field Selection like selecting the columns in a log file, not the rows.

Selecting the fields you want to import into storage can only be done when you create the storage. After this has occurred, the fields are set and cannot be changed. If you would like to add an additional field, you need to re-import your logs from scratch, so it’s important to get the field selection right at the start.

Also keep in mind that when fields are excluded from import, they cannot be referenced in a report or alias.
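To make the "columns, not rows" idea concrete, here is a minimal sketch that is independent of WebSpy Vantage. It assumes a tab-separated log with a header row; the field names and the `select_fields` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical tab-separated log; the field names are illustrative only.
LOG = [
    "time\tuser\taction\tcategory\turl\tuser_agent",
    "10:00:01\talice\tALLOWED\tNews\thttp://example.com/\tMozilla/5.0 (Windows NT 6.1)",
    "10:00:02\tbob\tDENIED\tGames\thttp://example.org/\tSeaPort/3.0",
]


def select_fields(lines, keep):
    """Keep only the named columns, in the order given, for every line."""
    header = lines[0].split("\t")
    positions = [header.index(name) for name in keep]
    return ["\t".join(line.split("\t")[p] for p in positions) for line in lines]


trimmed = select_fields(LOG, ["time", "user", "category", "url"])
for line in trimmed:
    print(line)
```

Every entry is still there, but each one is shorter. Anything that needs `action` or `user_agent` can no longer be answered from the trimmed data.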

[Image: Field Selection]

Import Filtering

Think of Filtering like selecting the rows in a log file, not the columns.

Filtering can be done on any of the fields that are imported into storage. Filters can be added to or removed from a storage at any time, but the changes only affect future imports, not data that has already been imported.

More complex filters on import can slow down the import process, but if you are generating multiple reports from the same storage, it is a one-time pain for a repeating gain.

When filtering during Storage import, the filters should be specific enough to reduce the storage size without being excessively restrictive.

Keep in mind that you can apply more specific filters when running Reports.

Import Filter Examples

A good example of an import filter is to include only allowed traffic and to exclude traffic that failed or was denied.

Another filter that makes sense is specifying the destination network as external. This simple combination limits storage data to only traffic that was successfully retrieved from the Internet.
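Expressed as code, that combined filter is a simple row-level test. The sketch below is not WebSpy Vantage syntax; the field names (`action`, `destination`) and their values are assumptions chosen for the example, and your device will use its own.

```python
# Hypothetical parsed log entries; field names and values are illustrative only.
ENTRIES = [
    {"user": "alice", "action": "ALLOWED", "destination": "external", "url": "http://example.com/"},
    {"user": "bob", "action": "DENIED", "destination": "external", "url": "http://example.org/"},
    {"user": "carol", "action": "ALLOWED", "destination": "internal", "url": "http://intranet.example/"},
    {"user": "dave", "action": "FAILED", "destination": "external", "url": "http://example.net/"},
]


def passes_import_filter(entry):
    """Keep only allowed traffic whose destination is outside the network."""
    return entry["action"] == "ALLOWED" and entry["destination"] == "external"


imported = [entry for entry in ENTRIES if passes_import_filter(entry)]
print(f"{len(imported)} of {len(ENTRIES)} entries imported")
```

Each surviving entry keeps all of its fields; only the number of entries goes down.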

[Image: ImportFilter]

Combining Field Selection and Filtering

Each method can be very effective, but combining them can help reduce the size of your storage dramatically.

Below is a table showing the various effects of these different approaches. The same two log files were imported into four separate storages, each with different import settings. The original log files total 2 395 239 274 bytes (roughly 2.4 GB).

[Image: Stats]

You can see how excluding fields does not change the number of records, but results in less data imported per record. Filtering, on the other hand, does not affect the amount of data per record, but it imports fewer records. In all cases, the methods reduced the overall size of the storage and also reduced the time to import.
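The same pattern can be reproduced on a toy scale. The sketch below builds the four combinations from a handful of invented entries and reports the record count and an approximate size for each. It says nothing about how WebSpy Vantage stores data internally; the field names, values, and the `stored_bytes` estimate are all assumptions made for the illustration.

```python
# Hypothetical tab-separated log; field names and values are illustrative only.
LOG = [
    "time\tuser\taction\tdestination\turl\tuser_agent",
    "10:00:01\talice\tALLOWED\texternal\thttp://example.com/\tMozilla/5.0 (Windows NT 6.1)",
    "10:00:02\tbob\tDENIED\texternal\thttp://example.org/\tSeaPort/3.0",
    "10:00:03\tcarol\tALLOWED\tinternal\thttp://intranet.example/\tMozilla/5.0 (Windows NT 6.1)",
    "10:00:04\tdave\tALLOWED\texternal\thttp://example.net/\tSeaPort/3.0",
]

ALL_FIELDS = LOG[0].split("\t")
# The fields used by the filter stay selected; only user_agent is dropped.
KEPT_FIELDS = ["time", "user", "action", "destination", "url"]

records = [dict(zip(ALL_FIELDS, line.split("\t"))) for line in LOG[1:]]
filtered = [
    r for r in records
    if r["action"] == "ALLOWED" and r["destination"] == "external"
]


def stored_bytes(recs, fields):
    """Approximate size: the selected values joined by tabs, plus a newline."""
    return sum(len("\t".join(r[f] for f in fields).encode("utf-8")) + 1 for r in recs)


results = {
    "all fields, no filter": (len(records), stored_bytes(records, ALL_FIELDS)),
    "selected fields, no filter": (len(records), stored_bytes(records, KEPT_FIELDS)),
    "all fields, filtered": (len(filtered), stored_bytes(filtered, ALL_FIELDS)),
    "selected fields, filtered": (len(filtered), stored_bytes(filtered, KEPT_FIELDS)),
}

for name, (count, size) in results.items():
    print(f"{name}: {count} records, {size} bytes")
```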

Analysis and Reporting

You can now analyze the storages to see how they compare. Regardless of the filter you have applied, the available fields determine which summaries and aggregates are available.

Here is a summary analysis using all of the available fields.

[Image: Big-Summaries]

The next analysis was generated from the storage that contains only the selected fields.

[Image: Small-Summaries]

You can therefore think of it in the following terms: the fewer fields you have, the simpler your reports have to be. Having said that, just because a report is simple does not mean it does not contain enough information. By the same measure, a report’s value will not increase if it includes information that is not required.

If you are manually trimming down your logs, you need to keep in mind that WebSpy Vantage’s base templates may not work for you since they are typically configured with product defaults in mind.

Fortunately, WebSpy Vantage gives you control over editing and trimming the existing template. In this example, we had to trim out aggregates that are not available in the selective import.

[Image: EditNode]

This common report was run across all four storages created previously, and the results are:

[Image: Stats-Report]

As you would expect, the smallest storage took the least time to import, and also the least time to generate the report – and the report contains exactly the same information that the larger storage produced!

You can see a linear performance gain as we trim and optimise the log import process.

A word of caution: If you need just one report that requires a field not included in the trimmed storage, you would have to start all over again with a new storage and import from scratch. So don’t be too aggressive in trimming the log files to save a few minutes of import time.

Conclusion

By now you should be comfortable with the concepts and methods of safely reducing the amount of raw log data generated.

You also know how to selectively import only what you need into WebSpy Vantage and how this is specific to the devices in use, your log retention policies, and reporting requirements.

By using these strategies, you can improve the efficiency of your logging and reporting infrastructure carefully and systematically.

About the Author:

Based in Cape Town, South Africa, Etienne is an IT professional who builds, tests and maintains systems for a large national retail chain. Working in IT since 1996, he has experience in various environments and is certified by (ISC)2, CompTIA, Dell and Microsoft. Etienne is the technical blogger and primary technical consultant for FixMyITsystem.com, a solutions provider based in Cape Town with a global client base.