Splunk – Cisco Logging Nirvana

Wow, was it really November 2009 when I last wrote one of these…

I thought it was about time that I put pen to paper about the world’s most essential IT tool, namely Splunk.

If you haven’t come across it yet, you absolutely have to check it out. People start out viewing Splunk as a syslog server, but it’s so much more. Hopefully over the next few posts I’ll be able to give some more detail on how to get the most out of the tool.

But for now, let’s look at it from a network perspective.

I’m currently doing work for a large sporting venue in the UK. They have a substantial network deployment, based predominantly on Cisco switching, so this is where we’ll start. Believe me, once you get into the realms of managing 6,000 switch ports across 42 rack locations, you need something to help you manage that deployment.

I’m not going to give you a guide on how to install Splunk; there’s plenty of info on their site regarding this. Just two initial points:

- Install it on Linux
- Install it on a 64-bit architecture

Both of these are key to getting the most performance out of the tool (IMHO).

Cisco switches record a wealth of information in their logs, so to start with, you need to get this heading in the direction of your Splunk server. This is an example of the config to use:

logging trap [level]
logging [ip of Splunk Server]

So something like:

logging trap 6
logging 10.1.1.1

The logging ‘level’ controls what type of messages you’ll receive. There are 8 different levels, numbered 0 to 7:

0 emergency
1 alerts
2 critical
3 errors
4 warnings
5 notifications
6 informational
7 debugging

The way this works is that you get log entries at the level you specify and at every level numerically below it (that is, anything more severe). So if you go for level 3, errors, you’ll also get critical, alerts and emergency. I’d suggest going for level 6: Splunk will crunch as much information as you give it, but you probably don’t want debug messages cluttering up the place.
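To make the ‘level you specify plus anything more severe’ rule concrete, here’s a small Python sketch. The function name `levels_sent` is just something I’ve made up for illustration; the level names are the ones from the list above.

```python
# Severity names from the list above, indexed by their numeric level (0-7).
SEVERITIES = [
    "emergency", "alerts", "critical", "errors",
    "warnings", "notifications", "informational", "debugging",
]


def levels_sent(trap_level):
    """Return the severity names forwarded for a given 'logging trap' level.

    A message is forwarded when its numeric severity is less than or
    equal to the configured trap level.
    """
    if not 0 <= trap_level <= 7:
        raise ValueError("trap level must be between 0 and 7")
    return SEVERITIES[: trap_level + 1]


print(levels_sent(3))  # ['emergency', 'alerts', 'critical', 'errors']
print(levels_sent(6))  # everything except 'debugging'
```

So ‘logging trap 6’ gives you seven of the eight levels, and only debugging is left out.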

So what sort of information will you get? This will depend entirely on what services your switches are providing. Have a look at this Message Guide for a Cisco 3750 and you’ll get the idea.

In the environment that I’m working in, we’re running a number of Cisco features, so we want to know about:

DHCP Snooping Events
Error Disable Events
Port Security Events
HSRP Events
Switch Stack Events (These are really important…)
Power Inline Errors

You may not need all of these, but let me run through a few to try and illustrate why you might…

DHCP Snooping

This is a feature that stops unauthorised DHCP servers from operating on your network. Whilst the feature stops rogue DHCP servers from sending packets, I still want to know that it’s happening and where, so that I can address the issue.

These are the sorts of messages that you’ll see:

%DHCP_SNOOPING-5-DHCP_SNOOPING_UNTRUSTED_PORT: DHCP_SNOOPING drop message on untrusted port, message type: DHCPNAK, MAC sa: 001b.2fee.3f4c

Error Disable Events

These relate to a switchport shutting down after an unwanted ‘network event’. A good example is a switch being connected to a port that’s configured for a single device. This is especially important if you’re bypassing the standard Spanning Tree states by using the Cisco ‘Portfast’ feature.

This is what you might see:

%PM-4-ERR_DISABLE: bpduguard error detected on Fa4/0/8, putting Fa4/0/8 in err-disable state (******)
%PM-4-ERR_DISABLE: link-flap error detected on Fa1/0/29, putting Fa1/0/29 in err-disable state

Port Security Events

The Port Security feature allows you to limit which MAC addresses, and how many of them, are seen on a switchport. We use this as another way to ensure that users aren’t connecting multiple devices through a hub. This is partly for security, but also to identify areas of our campus where we may need to provide additional outlets and capacity. You can also use it to stop people unplugging devices and replacing them with their own.

You would see events like these:

%PORT_SECURITY-2-PSECURE_VIOLATION: Security violation occurred, caused by MAC address 000f.b077.0086 on port FastEthernet1/0/34

HSRP Events

If you’re running in a resilient environment, you’ll probably want to know if one of your gateways changes state. Even on internal VLANs, HSRP events are a key indication that badness is happening. We’ve also seen issues where 3rd parties have configured static IPs that conflict with our gateways.

This is what you might see:

%HSRP-5-STATECHANGE: Vlan123 Grp 1 state Standby -> Active
%HSRP-4-BADAUTH: Bad authentication from 10.1.1.1, group 1, remote state Standby

Switch Stack Events

Now these are vital… If you’re running the type of switching that ‘stacks’, it will most likely have a single management IP. If a single switch inside the stack fails, reboots, etc, the management IP remains active as it migrates to another switch. Great for availability; really bad for monitoring software that uses IP as the test mechanism! So you must, must, must capture these events if you use ‘stacked’ switching.

This is the sort of thing:

%STACKMGR-4-SWITCH_REMOVED: Switch 3 has been REMOVED from the stack
%STACKMGR-4-STACK_LINK_CHANGE: Stack Port 2 Switch 1 has changed to state DOWN
%STACKMGR-4-STACK_LINK_CHANGE: Stack Port 2 Switch 1 has changed to state UP
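As a rough illustration of why these are worth capturing, here’s a Python sketch that picks the ‘switch removed’ events out of a batch of log messages. The function name `removed_stack_members`, the (host, message) pairs and the 10.1.1.2 address are all made up for the example; only the message text comes from the samples above.

```python
import re

# Matches the SWITCH_REMOVED sample above and captures the member number.
STACK_REMOVED = re.compile(
    r"%STACKMGR-4-SWITCH_REMOVED: Switch (\d+) has been REMOVED"
)


def removed_stack_members(events):
    """Return (host, member_number) for every stack member removal seen.

    `events` is an iterable of (host, message) pairs, where host is the
    stack's management IP as it appears in the log.
    """
    removed = []
    for host, message in events:
        match = STACK_REMOVED.search(message)
        if match:
            removed.append((host, int(match.group(1))))
    return removed


events = [
    ("10.1.1.2", "%STACKMGR-4-SWITCH_REMOVED: Switch 3 has been REMOVED from the stack"),
    ("10.1.1.2", "%STACKMGR-4-STACK_LINK_CHANGE: Stack Port 2 Switch 1 has changed to state DOWN"),
]
print(removed_stack_members(events))  # [('10.1.1.2', 3)]
```

A ping test against 10.1.1.2 would stay green through all of this, which is exactly the problem.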

Power Inline Errors

As PoE is becoming more prevalent, we needed a way to ensure that PoE devices are only connected to switchports where we’re expecting them. This is especially true if you’re using 48-port switches that only have partial PoE support, i.e. they won’t support the full 15.4 W on all 48 ports.

We’re looking out for these things:

%ILPOWER-5-INVALID_IEEE_CLASS: Interface Fa1/0/3: has detected invalid IEEE class: 7 device. Power denied

So. Now we’ve got a list of things to look out for. If you’ve got your devices logging to your Splunk server then you’re already halfway there.

The next step that will help you when it comes to querying is to start classifying your events into different ‘sourcetypes’. Cisco messages give us a nice standardised format to work with, which really helps with this task.

You’ll notice from the examples above that the messages are in this format:

%[Type]-[Syslog Level]-[Message]: [Message Information]

The important part here is that for 99% of the messages this format is consistent, which lends itself very well to regular expression (regex) extraction. For me, breaking events into their ‘Functional’ type is the most important aspect: for example, grouping together all the Inline Power events (ILPOWER) or all the Stack Manager events (STACKMGR).
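To show how neatly that format breaks apart, here’s a minimal Python sketch that splits a message into its four pieces. The pattern and the name `parse_cisco_message` are my own; I’m assuming the type and message names only contain upper-case letters, digits and underscores, which holds for all the samples above but may not hold for every message out there.

```python
import re

# %[Type]-[Syslog Level]-[Message]: [Message Information]
CISCO_MSG = re.compile(
    r"%(?P<type>[A-Z0-9_]+)"      # e.g. STACKMGR, ILPOWER
    r"-(?P<level>[0-7])"          # syslog level, 0 to 7
    r"-(?P<message>[A-Z0-9_]+)"   # e.g. SWITCH_REMOVED
    r":\s*(?P<info>.*)"           # free-text message information
)


def parse_cisco_message(line):
    """Split a Cisco log message into type, level, message and info.

    Returns a dict, or None if the line doesn't follow the format.
    """
    match = CISCO_MSG.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["level"] = int(fields["level"])
    return fields


sample = "%STACKMGR-4-SWITCH_REMOVED: Switch 3 has been REMOVED from the stack"
print(parse_cisco_message(sample))
```

The ‘type’ field is the piece used for grouping by function.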

In Splunk, the way to do this is to edit your ‘props.conf’ and ‘transforms.conf’ files. These are in the Splunk installation directory under:

/splunk/etc/system/local/

(If they don’t exist, then simply create them)

In the ‘props.conf’ file create an entry like this:

[source::udp:514]
TRANSFORMS-sourcetype-cisco = sourcetype-cisco

Now, there may already be a

[source::udp:514]

entry; if so, just add the second line to it. This is basically telling Splunk to apply the ‘sourcetype-cisco’ entry from the ‘transforms.conf’ file to the data as we receive it.

In the ‘transforms.conf’ file create an entry like this:

[sourcetype-cisco]
DEST_KEY = MetaData:Sourcetype
REGEX = :\s%([^-]*)-[0-7]{1}-[^:]*:
FORMAT = sourcetype::cisco-syslog-$1

Here we’re extracting the first part of the Cisco syslog message, appending it to the name ‘cisco-syslog-’, and writing this as the ‘sourcetype’ key.

(This might not be the ‘best’ regular expression – but I’m learning too and it works for me!)

So for example, this message:

%PORT_SECURITY-2-PSECURE_VIOLATION: Security violation occurred, caused by MAC address 000f.b077.0086 on port FastEthernet1/0/34

Will get categorised as

sourcetype = cisco-syslog-PORT_SECURITY
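If you want to try the regex outside Splunk first, here’s a quick Python sketch that applies the same pattern and builds the same ‘cisco-syslog-’ name. The function `derive_sourcetype` and the timestamp/host prefix on the sample line are made up for illustration; your raw events will have whatever prefix your switches and syslog setup produce. Note that the pattern starts with ‘:\s%’, so it only matches when a colon and a space come directly before the ‘%’, as they would after a timestamp.

```python
import re

# The REGEX from the transforms.conf entry above, unchanged.
TRANSFORM_REGEX = re.compile(r":\s%([^-]*)-[0-7]{1}-[^:]*:")


def derive_sourcetype(raw_event):
    """Return the sourcetype name the transform would build, or None.

    None means the regex didn't match, so the event would keep
    whatever sourcetype it already had.
    """
    match = TRANSFORM_REGEX.search(raw_event)
    if match is None:
        return None
    return "cisco-syslog-" + match.group(1)


# Hypothetical raw event: the prefix before the '%' is invented.
raw = (
    "Dec 10 07:27:01 10.1.1.2 1234: Dec 10 07:27:00: "
    "%PORT_SECURITY-2-PSECURE_VIOLATION: Security violation occurred, "
    "caused by MAC address 000f.b077.0086 on port FastEthernet1/0/34"
)
print(derive_sourcetype(raw))  # cisco-syslog-PORT_SECURITY

# Without a ': ' in front of the '%', the pattern doesn't match.
print(derive_sourcetype("%PORT_SECURITY-2-PSECURE_VIOLATION: Security violation occurred"))  # None
```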

Why is this important, I hear you ask? Again, it’s just a first step in the process of crunching through your data. A good example is that I can now get a table of my Cisco events and very easily see what type they are.

Here’s a quick query:

sourcetype=cisco-syslog-* | stats count by sourcetype | sort -count

Or even:

sourcetype=cisco-syslog-* | timechart count by sourcetype

The first query gives me a table of my Cisco events by volume, and because they’re categorised, I can see exactly what types are happening. All I’m really doing is making my life easier when it comes to querying my data. But isn’t that the point of a good IT tool: to make your life easier?

Categories: Cisco, Splunk
  1. Frank
    December 10, 2010 at 7:27 am | #1

    I stumbled upon this post, and I had to comment that it’s suboptimal. It’s not a search-time extraction. Splunk really discourages the use of index-time extractions because they are ultimately inflexible and they will bloat your index.

    Sourcetype is for the type of log (i.e. cisco-syslog). Eventtype should be used if you want to decorate a log, perhaps with Common Information Model (CIM) tags. For your purposes, it’s really not necessary because you can pull the pieces you need out with field extractions.

    Hope this helps,

    -Frank

  2. December 13, 2010 at 7:32 am | #2

    Hi Frank,

    Thanks for the feedback.

    The fact that there are numerous ways to accomplish the same task in Splunk is one of my favourite aspects of the product.

    I’d be interested to read and learn more about why Splunk discourages the use of index-time extractions. Are there any documents / links that you could point me towards?

    Thanks again,

    Graham.
