Splunk on a diet

How to get log ingestion under control and tame the monster that is licensing cost.

Why cut back on license cost?

An issue we’ve seen at many organizations that utilize Splunk is a growing licensing cost with little return on additional investment. Year-over-year, IT budgets get increased to match licensing cost while the value add remains the same. Every organization is different, but there seem to be a few main causes for this:

First, a new organization that is adding more assets may not factor in the added cost when 1,000 new machines begin sending heavy log volume to its indexers. Sure, the hardware ingesting the logs might get upgraded to keep up, but has the cost of that ingestion at the license level been considered?

Second, a growing organization may not have the resources to monitor what’s being ingested. Requests for access get approved, requests for inputs get approved, but the backend of Splunk only gets tuned when a fire needs to be put out. Between incidents, those administering Splunk don’t have the time to keep an eye on what’s being sent to the indexers. In a short amount of time, the organization goes from ingesting 1 GB/Day to 5 GB/Day. Then that 5 GB/Day becomes 10 GB/Day and no one has an idea who, or what, is responsible.

Third, a stable organization might not have a process or procedure to ensure that each log type being ingested has a defined, value-based purpose. Server Administrators are told where to point the logs and get their alerts back, but are all of the log types being sent actually used? Application Developers know they’re getting logs in and can hunt down the root of the issues they’re trying to fix, but how much noise is being ingested in the meantime? And all the while, Splunk Administrators have no policy- or procedure-based backing to make sure their environment is a well-trimmed, lean searching machine.

This guide has tips that will help any of these organizations to get their log ingestion under control. In your organization, the process should be as follows:

  1. Assess what the largest sources of log ingestion and noise are.
  2. Meet with stakeholders to sort the “wheat” logs from the “chaff” logs.
  3. Establish rigorous processes to keep your indexes lean and valuable.

Assessing the situation

In order to get a good idea of what’s going on, the best first step is to analyze which source types are bringing in the highest average ingestion volume per day. I would recommend using a dashboard or scheduled report to keep an eye on this, as the process of analyzing and meeting with stakeholders may take some time. Your Monitoring Console will have high-level statistics, but in order to dig into source statistics you may want to build a new dashboard and clone some panels from the “License Usage” dashboards to get you started.

You can find your license usage events in $SplunkInstallLocation/var/log/license_usage.log. These events may not be visible depending on your account’s permission level, so for the purposes of this guide, assume an Administrator-level role or higher is required. Below is a sample license usage event.

01-01-2020 12:12:12.240 -0400 INFO  LicenseUsage - type=Usage s="Fentron-Sample-License-Usage.log" st="License Usage" h="SampleHost" o="" idx="SampleIndex" i="9AAAA99-A99A-99A9-A999-9A9A9A9A9A9A" pool="Pool_Name" b=13734973 poolsz=524288000
Field Name    What it is
type          Log type within License Usage (all the same here)
s             Source file being indexed
st            Sourcetype being indexed
h             Host sending the indexed data
idx           Index receiving the data
i             Hash of the index receiving the data
pool          License pool receiving the data (usually QA vs. Prod)
b             Bytes counted toward the license
poolsz        License pool daily quota (in bytes)

Field names in the log and what they mean

Splunk’s built-in Data Models can provide some statistics using these events, but I prefer to run field extractions to have more control over the resulting statistics. Below is a regex you can use with the above format to extract the fields required for this step (a quick way to test it inline is sketched after the table below):

.*type=(?<type>[\w]{5})\ss="(?<indexed_source_file>[\S]*)"\sst="(?<indexed_sourcetype>[\s\S]*)"\sh="(?<indexed_host>[\S]*)"\so=""\sidx="(?<licensed_idx>[\S]*)"\si="(?<licensed_idx_hash>[\S]*)"\spool="(?<license_pool>[\S]*)"\sb=(?<bytes>[0-9]*)\spoolsz=(?<pool_size>[0-9]*)
Extracted Field Name    Sample Value
type                    Usage
indexed_source_file     Fentron-Sample-License-Usage.log
indexed_sourcetype      License Usage
indexed_host            SampleHost
licensed_idx            SampleIndex
licensed_idx_hash       9AAAA99-A99A-99A9-A999-9A9A9A9A9A9A
license_pool            Pool_Name
bytes                   13734973
pool_size               524288000

Extracted field names and their values from the sample event above
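If you want to sanity-check the extraction against live events before saving it as a permanent field extraction, you can run a trimmed-down version inline with the rex command. The search below is only a sketch: it extracts just the sourcetype and byte count, and it assumes the same placeholder log path used throughout this guide.

index=_internal source="/PATH/TO/license_usage.log" "type=Usage"
| rex field=_raw "\sst=\"(?<indexed_sourcetype>[^\"]*)\"\s"
| rex field=_raw "\sb=(?<bytes>\d+)\s"
| stats sum(bytes) as total_bytes by indexed_sourcetype
| sort -total_bytes

If the totals look sane, save the full regex as a field extraction so the fields above are available to every search that follows.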

Now that you have your relevant fields extracted, it’s time to run some functions. Using the Monitoring Console’s Search app, start your query with the following statement to return license usage events:

index=_internal source=/PATH/TO/license_usage.log 

From there, you can use eval and stats functions to return license information. For example, the following query calculates your license quota usage % by day, within a given time frame:

index=_internal source="/PATH/TO/license_usage.log"
| eval date=date_month+" "+date_mday+", "+date_year
| stats sum(bytes) as "TotalBytes" by date, pool_size
| eval %QuotaUsed=round((TotalBytes/pool_size)*100, 2)
| table date, %QuotaUsed
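
As another example, assuming the field extractions above have been saved, a similar sketch ranks sourcetypes by their average daily ingestion volume, which is exactly the list you’ll want in hand before scheduling meetings. The field names here are the ones extracted earlier; adjust the path and rounding to taste:

index=_internal source="/PATH/TO/license_usage.log"
| bin _time span=1d
| stats sum(bytes) as daily_bytes by _time, indexed_sourcetype
| stats avg(daily_bytes) as avg_daily_bytes by indexed_sourcetype
| eval avg_daily_GB=round(avg_daily_bytes/1024/1024/1024, 3)
| sort -avg_daily_GB
| table indexed_sourcetype, avg_daily_GB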

Use these as a starting point and modify them to pull back the information you need. Once you have an idea of which source types are sending the heaviest log volume, you can find the largest sources of log noise by using Splunk’s built-in Patterns tab. I’d advise starting with the largest setting, to find the broadest patterns across any given source type. Then, as you continue through Step 2, you can call back to this tab and aim for more granular patterns. This will not only help with finding the noisiest patterns, but will also help the individuals you’re meeting with to understand their own data better.

Some key points to note:

  • Average by day over a time frame that keeps pace with the organization’s development cycles. Using a small time frame (1 week or less) for an organization that pushes changes monthly or quarterly can lead to inaccurate or poorly targeted tuning.
  • Work from the index level, to the source level, to the sourcetype level. It’s best not to miss the forest for the trees. A sketch combining both of these points follows this list.
  • Examine your Production and QA environments with the same rigor. What happens in QA might eventually be pushed to Production, so you can think of QA as an alarm for unusually high log volume.
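
As a rough illustration of the first two points, the sketch below (again assuming the extracted fields and the placeholder path from earlier) charts daily volume per index over a 90-day window. Swap the time range to match your organization’s release cadence, and swap licensed_idx for indexed_source_file or indexed_sourcetype as you drill down:

index=_internal source="/PATH/TO/license_usage.log" earliest=-90d@d latest=@d
| eval GB=bytes/1024/1024/1024
| timechart span=1d sum(GB) by licensed_idx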

Value and Noise

You’ve got all the data; now comes the fun part. Start scheduling meetings as soon as you have the information you need, so that the data you’re presenting to stakeholders doesn’t shift under you with a new release or update. If team members are too busy to meet quickly, don’t worry: the dashboards and reports you set up in Step 1 will keep the picture current, and you can rerun the analysis at any time.

When meeting with the teams responsible for noisy logs, you’re looking to get a few pieces of information about the source logs:

  1. What purpose does the source log serve?
    • This will help to guide conversation and give you a better idea of what data the team is handling.
  2. What information are they looking to get out of the log file?
    • This will help you establish a shared goal and make the team more conscious of value-driven log analysis.
  3. What value do the noisiest log patterns provide, in relation to the information above?
    • If value is already provided, the log patterns should be left alone. Don’t try to fix what isn’t broken.
    • If value can be found, but the logs aren’t used – See if the team could benefit from alerts, dashboards, or reports being built around these logs.
    • If no value can be provided by the noisiest log patterns – Request that the team shift the noisy log patterns to a log file that isn’t being forwarded. Indexed logs with no value burden your search heads with unnecessary data to parse and burden your budget with unnecessary license cost.
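
If a team agrees a pattern is valueless but can’t easily change what their application writes, there is a Splunk-side fallback: events can be discarded before indexing with a props.conf and transforms.conf pair on the indexers (or heavy forwarders), and events routed to the nullQueue don’t count against your license. The stanzas below are only a sketch; the sourcetype name and pattern are hypothetical placeholders:

# props.conf (hypothetical sourcetype)
[sample:noisy:sourcetype]
TRANSFORMS-drop_noise = drop_noisy_pattern

# transforms.conf
[drop_noisy_pattern]
# Hypothetical pattern for the valueless events identified above
REGEX = heartbeat\scheck\sOK
DEST_KEY = queue
FORMAT = nullQueue

Treat this as a last resort; fixing the logging at the source keeps the noise off your forwarders and network as well.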

Processes prevent creep

All of the above steps are great for a one-time review, but in order to make a lasting impact on licensing cost, there need to be durable processes to ensure noise creep doesn’t set in after the first run-through. These processes should be created with value, leanness, and rigidity in mind.

Work with IT stakeholders to implement an approval process with the following points in mind:

  • Purpose: Approval should require a defined, tangible metric or purpose for inputting these logs into Splunk. Throwing logs into your environment without any initial goal will lead to a large amount of noise creep.
  • Specificity: If possible, verify the requested input log isn’t a generic system log that includes a large amount of valueless noise. Having a specific log, for purposeful events only, will reduce the amount of noise input into Splunk.
  • Review: After a log is approved, there should be a regular audit to run back through approvals and verify the logs are still needed in Splunk. Oftentimes, teams will continue sending logs long past their useful lifetime. Reviewing and removing these logs will help eliminate legacy noise from your environment; a starting-point search is sketched below.
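
As a starting point for that audit, Splunk’s metadata command can show which sourcetypes are still sending data and when they last did. The sketch below assumes you have access to the indexes in question; it lists each sourcetype with its event count and most recent event time so you can compare the results against your approval records:

| metadata type=sourcetypes index=*
| eval last_seen=strftime(recentTime, "%Y-%m-%d %H:%M:%S")
| table sourcetype, totalCount, last_seen
| sort -totalCount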

My hope is that at least one of these tips helps you and your organization save money and resources while speeding up your Splunk environment. May all your future Splunk searches be fast and relevant.

Thanks for reading!

If you’re curious about what Fentron can do for your organization, contact sales@fentron.com and we’ll get back to you as soon as possible.