Splunk – Cisco Change Tracking

November 23, 2010 1 comment

One of the main challenges lots of companies have is Change Control and Change Tracking. This is especially true in a large Cisco deployment as you’ll most likely have more than one network admin capable of making changes.

I was looking at this on behalf of a client. They have a small permanent team, but also have to allow limited configuration access from 3rd parties during busy periods in the year.

Now, Cisco provides a very useful log message for whenever any configuration command is entered. This is:

%PARSER-5-CFGLOG_LOGGEDCMD

To enable these, you’ll need to enter the following into your device configuration (this is based on IOS 12 on a Catalyst 3750):

archive
 log config
  logging enable
  logging size 200
  notify syslog contenttype plaintext
  hidekeys

Check out the full Cisco article on ‘Configuration Change Notification and Logging’ for more detail.

Another command that you’ll need to add is this:

service sequence-numbers

This has the effect of adding a sequence number at the start of every log entry. (We’ll touch on why this is important later on.)

Now, to test that this is all working, enter configuration mode on your Cisco device and try a few commands. Exit out of this mode and do a ‘show log’. You should get entries like this:

000001: Nov 01 13:34:18.035 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:interface FastEthernet1/0/18
000002: Nov 01 13:34:28.218 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:description [WebCam]
000003: Nov 01 13:34:37.983 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:switchport access vlan 104
000004: Nov 01 13:35:01.933 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:no shutdown
000005: Nov 01 13:35:05.534 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:exit

You can also use this command to see the archive log: show archive log config all

idx sess user@line Logged command
283 21 gmor@vty0 |interface FastEthernet1/0/18
284 21 gmor@vty0 | description [WebCam]
285 21 gmor@vty0 | switchport access vlan 104
286 21 gmor@vty0 | no shutdown
287 21 gmor@vty0 | exit

So, if that’s working, then that’s half the battle won. This is great for single devices, but for a large estate, we need to be able to analyse results from multiple devices. This sounds like an excellent use for Splunk.

I’m going to make the assumption that you’ve got your Splunk instance up and running, with your Cisco devices logging to it. If not, have a quick read of this article first: Splunk – Cisco Logging Nirvana.

With all of my log messages going to Splunk, I can now do a quick search for all of my configuration syslog entries. Something like this should do the job and be specific enough:

%PARSER

(Remember in Splunk to always try to constrain your time range. There’s no point making Splunk search across ‘All time’ when you know you’re only looking for information from the last 24 hours.)

This should give me:

Aug 25 13:35:06 switchname.null.com 508: 000005: Aug 25 13:35:05.534 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:exit
Aug 25 13:35:02 switchname.null.com 507: 000004: Aug 25 13:35:01.933 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:no shutdown
Aug 25 13:34:38 switchname.null.com 506: 000003: Aug 25 13:34:37.983 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:switchport access vlan 104
Aug 25 13:34:28 switchname.null.com 505: 000002: Aug 25 13:34:28.218 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:description [WebCam]
Aug 25 13:34:18 switchname.null.com 504: 000001: Aug 25 13:34:18.035 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:interface FastEthernet1/0/18

The first thing that should be obvious is that the sequence is reversed. I tend to read from top to bottom, so it’s confusing to see ‘exit’ listed first, followed by ‘no shutdown’. No problem, this is an easy fix using the Splunk search language:

%PARSER | reverse

Now we should get this:

Aug 25 13:34:18 switchname.null.com 504: 000001: Aug 25 13:34:18.035 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:interface FastEthernet1/0/18
Aug 25 13:34:28 switchname.null.com 505: 000002: Aug 25 13:34:28.218 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:description [WebCam]
Aug 25 13:34:38 switchname.null.com 506: 000003: Aug 25 13:34:37.983 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:switchport access vlan 104
Aug 25 13:35:02 switchname.null.com 507: 000004: Aug 25 13:35:01.933 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:no shutdown
Aug 25 13:35:06 switchname.null.com 508: 000005: Aug 25 13:35:05.534 BST: %PARSER-5-CFGLOG_LOGGEDCMD: User:gmor logged command:exit

This is better, but it’s still difficult to read exactly what’s been entered and by whom. So the next step is to extract some of the key information into fields. For this, we’ll use the Splunk ‘rex’ command. This is the Regular Expression extractor. The key information that I want from each message is:

User
Command
Sequence #

To get each of these fields, I’ll use the following ‘rex’ extractions:

User = rex "User:(?<user>[\S]*)"
Command = rex "command:(?<command>[^\$]*)"
Sequence # = rex "\s[\d]*:\s(?<sequence>[\d]*)"
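If you want to sanity-check these three patterns outside Splunk, they can be run through Python’s re module. This is only a sketch: extract_fields is a made-up helper name, and Python spells named groups (?P<name>...) rather than the (?<name>...) form used with rex.

```python
import re

# The three rex patterns, rewritten with Python's (?P<name>...) syntax.
USER_RE = re.compile(r"User:(?P<user>[\S]*)")
COMMAND_RE = re.compile(r"command:(?P<command>[^\$]*)")
SEQUENCE_RE = re.compile(r"\s[\d]*:\s(?P<sequence>[\d]*)")


def extract_fields(line):
    """Return the user, command and sequence fields found in one log line.

    extract_fields is a hypothetical helper, not part of Splunk.
    """
    fields = {}
    for pattern in (USER_RE, COMMAND_RE, SEQUENCE_RE):
        match = pattern.search(line)
        if match:
            fields.update(match.groupdict())
    return fields


# One of the events shown earlier, as it arrives at the syslog server.
sample = (
    "Aug 25 13:34:38 switchname.null.com 506: 000003: "
    "Aug 25 13:34:37.983 BST: %PARSER-5-CFGLOG_LOGGEDCMD: "
    "User:gmor logged command:switchport access vlan 104"
)
print(extract_fields(sample))
```

Note that the sequence pattern relies on the syslog counter (506: in the sample) appearing just before the Cisco sequence number, so it is worth testing against a few of your own events.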

Combining this into a search string, I would now use:

%PARSER | rex "User:(?<user>[\S]*)" | rex "command:(?<command>[^\$]*)" | rex "\s[\d]*:\s(?<sequence>[\d]*)" | reverse

Now on the left-hand side of your results, you should see three new fields: user, command and sequence.

Whilst this search looks complicated, now that we know the field extractions work, we can actually move these expressions into Splunk’s ‘props.conf’ file.

If you’re categorising your Cisco syslog messages into sourcetypes (as described here), then you could create the following in your /splunk/etc/system/local/props.conf file:

[syslog-cisco-PARSER]
EXTRACT-user = User:(?<user>[\S]*)
EXTRACT-command = command:(?<command>[^\$]*)
EXTRACT-sequence = \s[\d]*:\s(?<sequence>[\d]*)

After doing this, restart Splunk to ensure that the new configuration is re-read.

Now we can go back to using our simple search expression:

%PARSER | reverse

And we should still be getting the ‘user, command and sequence’ fields on the left-hand side of the results.

So what now? Well, it’s entirely up to you. There are a number of reports that I think are very useful.
 

1. Table of Configuration Commands by Host

%PARSER | sort -host, +sequence | table _time host user sequence command

Here we’re sorting first by the ‘host’ field (descending), then by ‘sequence’ (ascending). The reason for using sequence numbers is that commands can get jumbled if you just rely on the ‘_time’ field. We found that when administrators pasted configs into devices, the one-second timestamp resolution in Splunk wasn’t enough to keep the commands in order.
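To illustrate the ordering problem, here is a small Python sketch. The host names and events are invented, and order_for_report is a hypothetical helper that only mimics what ‘sort -host, +sequence’ does to the rows.

```python
# Each event is (time, host, sequence, command). The times are whole
# seconds, so commands pasted in quick succession all tie on time.
events = [
    ("13:34:18", "switch-a", 3, "switchport access vlan 104"),
    ("13:34:18", "switch-a", 1, "interface FastEthernet1/0/18"),
    ("13:34:18", "switch-a", 2, "description [WebCam]"),
    ("13:34:18", "switch-b", 1, "interface FastEthernet1/0/2"),
]


def order_for_report(rows):
    """Mimic 'sort -host, +sequence': host descending, sequence ascending."""
    by_sequence = sorted(rows, key=lambda row: row[2])
    # sorted() is stable, so rows for the same host keep sequence order.
    return sorted(by_sequence, key=lambda row: row[1], reverse=True)


# Sorting on time alone cannot untangle the tie: arrival order is kept.
by_time = sorted(events, key=lambda row: row[0])

ordered = order_for_report(events)
for row in ordered:
    print(row)
```

Sorting on the time column leaves the jumbled arrival order untouched, whereas the sequence number puts each host’s commands back in the order they were typed.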

The Splunk search above should generate a table with one row per command, showing the time, host, user, sequence and command:


 

2. Timechart of Configuration Commands by Host

%PARSER | timechart span=1h limit=20 count by host

This is a great visual cue to put onto a dashboard, to see which devices are being changed the most:

(If the host names look slightly strange, it’s because they’ve been ‘scrubbed’ to protect the innocent…)
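For anyone curious about what ‘timechart span=1h count by host’ is roughly doing, here is a Python sketch of the same bucketing. The events and host names are invented, hourly_counts is a hypothetical helper, and the limit=20 option is not modelled.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, host) pairs standing in for %PARSER events.
events = [
    (datetime(2010, 8, 25, 13, 34, 18), "switch-a"),
    (datetime(2010, 8, 25, 13, 35, 5), "switch-a"),
    (datetime(2010, 8, 25, 13, 50, 0), "switch-b"),
    (datetime(2010, 8, 25, 14, 2, 41), "switch-a"),
]


def hourly_counts(rows):
    """Count events per (hour bucket, host), like span=1h count by host."""
    counts = Counter()
    for when, host in rows:
        # Truncate the timestamp to the start of its hour.
        bucket = when.replace(minute=0, second=0, microsecond=0)
        counts[(bucket, host)] += 1
    return counts


counts = hourly_counts(events)
for (bucket, host), count in sorted(counts.items()):
    print(bucket.strftime("%Y-%m-%d %H:00"), host, count)
```

Swapping the host for the user in each pair gives the per-user version of the same chart.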
 

3. Timechart of Configuration Commands by User

%PARSER | timechart span=1h limit=20 count by user

Again, another very simple way of seeing who is making changes. This one’s great for spotting people who shouldn’t be making changes as their names clearly stand out:


 

For the client in question, this approach gave them a whole new level of visibility into the who, what and when associated with running the network.
 

PS

For the adventurous amongst you, if you’re running ASA Firewalls, you can do something similar here too. The message that you’re looking for is:

%ASA-5-111008

So a search similar to this should give you what you need:

%ASA-5-111008 | rex "User\s'(?<user>[^']+)" | rex "executed\sthe\s'(?<command>[^']+)" | sort +_time | table _time host user command

(You’ll also notice that there’s no sequence number that you can use to order the results, so you may get a couple of commands out of sequence if they have the same timestamp)
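The two ASA patterns can be tried out in Python as well. In this sketch the sample line is illustrative only: it is written to match the patterns above, and the user name and command in it are made up. parse_asa_command is a hypothetical helper.

```python
import re

# The two rex patterns from the ASA search, in Python's named-group syntax.
USER_RE = re.compile(r"User\s'(?P<user>[^']+)")
COMMAND_RE = re.compile(r"executed\sthe\s'(?P<command>[^']+)")

# Illustrative line only; the user and command are invented.
sample = "%ASA-5-111008: User 'gmor' executed the 'write memory' command."


def parse_asa_command(line):
    """Return (user, command) from one message, or None if it doesn't match."""
    user = USER_RE.search(line)
    command = COMMAND_RE.search(line)
    if not (user and command):
        return None
    return user.group("user"), command.group("command")


print(parse_asa_command(sample))
```

Both patterns depend on straight single quotes around the user and the command, so check them against a real event from your own firewall before relying on them.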

Enjoy!

Splunk – Cisco Logging Nirvana

November 17, 2010 4 comments

Wow, was it really November 2009 when I last wrote one of these…

I thought it was about time that I put pen to paper about the world’s most essential IT tool, namely Splunk.

For those out there who haven’t come across this yet, then you absolutely have to check it out. People start out viewing Splunk as a syslog server, but it’s so much more. Hopefully over the next few posts I’ll be able to give some more detail on how to get the most out of the tool.

But for now, let’s look at it from a network perspective.

I’m currently doing work for a large sporting venue in the UK. They have a substantial network deployment, based predominantly on Cisco switching, so this is where we’ll start. Believe me, once you get into the realms of managing 6,000 switch ports across 42 rack locations, you need something to help you manage that deployment.

I’m not going to give you a guide on how to install Splunk, there’s plenty of info on their site regarding this. Just 2 initial points:

– Install it on Linux
– Install it on a 64-bit architecture

Both of these are key to getting the most performance out of the tool (IMHO).

Cisco switches record a wealth of information in their logs, so to start with, you need to get this heading in the direction of your Splunk server. This is an example of the config to use:

logging trap [level]
logging [ip of Splunk Server]

So something like:

logging trap 6
logging 10.1.1.1

The logging ‘level’ controls what type of messages you’ll receive. There are 8 different levels, numbered 0 to 7:

0 emergency
1 alerts
2 critical
3 errors
4 warnings
5 notifications
6 informational
7 debugging

The way this works is that you get log entries at the level you specify and at every lower-numbered (more severe) level. So if you go for level 3, errors, you’ll also get critical, alerts and emergency. I’d suggest going for level 6, as Splunk will crunch as much information for you as you need, whilst you probably don’t want debug messages cluttering up the place.
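The ‘this level and everything more severe’ rule can be written down in a few lines of Python. This is just a sketch to make the rule concrete; levels_received is a hypothetical helper and the names are the ones from the list above.

```python
# Severity names as listed above, indexed by their numeric level.
LEVELS = [
    "emergency",
    "alerts",
    "critical",
    "errors",
    "warnings",
    "notifications",
    "informational",
    "debugging",
]


def levels_received(trap_level):
    """Return the severities sent for 'logging trap <trap_level>'."""
    if not 0 <= trap_level <= 7:
        raise ValueError("trap level must be between 0 and 7")
    # Everything from level 0 up to and including the chosen level.
    return LEVELS[: trap_level + 1]


print(levels_received(3))
print(levels_received(6))
```

With level 6 you receive everything except the debugging messages.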

So what sort of information will you get? This will depend entirely on what services your switches are providing. Have a look at this Message Guide for a Cisco 3750 and you’ll get the idea.

In the environment that I’m working at, we’re running a number of Cisco features, so we want to know about:

DHCP Snooping Events
Error Disable Events
Port Security Events
HSRP Events
Switch Stack Events (These are really important…)
Power Inline Errors

You may not need all of these, but let me run through a few to try and illustrate why you might…

DHCP Snooping

This is a feature that stops unauthorised DHCP Servers from operating on your network. Whilst the feature stops rogue DHCP servers from sending packets, I still want to know that it’s happening and where, so that I can address the issue.

These are the sorts of messages that you’ll see:

%DHCP_SNOOPING-5-DHCP_SNOOPING_UNTRUSTED_PORT: DHCP_SNOOPING drop message on untrusted port, message type: DHCPNAK, MAC sa: 001b.2fee.3f4c

Error Disable Events

These relate to a switchport shutting down after an unwanted ‘network event’. A good example is a switch being connected to a port that’s configured for a single device. This is especially important if you’re bypassing the standard Spanning Tree states by using the Cisco ‘PortFast’ feature.

This is what you might see:

%PM-4-ERR_DISABLE: bpduguard error detected on Fa4/0/8, putting Fa4/0/8 in err-disable state (******)
%PM-4-ERR_DISABLE: link-flap error detected on Fa1/0/29, putting Fa1/0/29 in err-disable state

Port Security Events

The Port Security feature allows you to limit which MAC addresses, and how many of them, are seen on a switchport. We use this as another way to ensure that users aren’t connecting multiple devices through a hub, partly for security, but also to identify areas of our campus where we may need to provide additional outlets and capacity. You can also use it to stop people unplugging devices and replacing them with their own.

You would see events like these:

%PORT_SECURITY-2-PSECURE_VIOLATION: Security violation occurred, caused by MAC address 000f.b077.0086 on port FastEthernet1/0/34

HSRP Events

If you’re running in a resilient environment, you’ll probably want to know if one of your gateways changes state. Even on internal VLANs, HSRP events are a key indication that badness is happening. We’ve also seen issues where 3rd parties have configured static IPs that conflict with our gateways.

This is what you might see:

%HSRP-5-STATECHANGE: Vlan123 Grp 1 state Standby -> Active
%HSRP-4-BADAUTH: Bad authentication from 10.1.1.1, group 1, remote state Standby

Switch Stack Events

Now these are vital… If you’re running the type of switching that ‘stacks’, it will most likely have a single management IP. If a single switch inside the stack fails, reboots, etc, the management IP remains active as it migrates to another switch. Great for availability; really bad for monitoring software that uses IP as the test mechanism! So you must, must, must capture these events if you use ‘stacked’ switching.

This is the sort of thing:

%STACKMGR-4-SWITCH_REMOVED: Switch 3 has been REMOVED from the stack
%STACKMGR-4-STACK_LINK_CHANGE: Stack Port 2 Switch 1 has changed to state DOWN
%STACKMGR-4-STACK_LINK_CHANGE: Stack Port 2 Switch 1 has changed to state UP

Power Inline Errors

As PoE is becoming more prevalent, we needed a way to ensure that PoE devices are only connected to switchports where we’re expecting them. This is especially true if you’re using 48-port switches that only have partial PoE support – i.e. they won’t support the full 15.4 W on all 48 ports.

We’re looking out for these things:

%ILPOWER-5-INVALID_IEEE_CLASS: Interface Fa1/0/3: has detected invalid IEEE class: 7 device. Power denied

So. Now we’ve got a list of things to look out for. If you’ve got your devices logging to your Splunk server then you’re already halfway there.

The next step that will help you when it comes to querying is to start classifying your events into different ‘sourcetypes’. Cisco messages give us a nice standardised format to work with, which really helps with this task.

You’ll notice from the examples above that the messages are in this format:

%[Type]-[Syslog Level]-[Message]: [Message Information]

The important part here is that this format is consistent for 99% of the messages, which lends itself very well to regular expression (regex) extraction. For me, breaking events into their ‘Functional’ type is the most important aspect, for example grouping together all the Inline Power events (ILPOWER) or all the Stack Manager events (STACKMGR).

In Splunk, the way to do this is to edit your ‘props.conf’ and ‘transforms.conf’ files. These are in the Splunk installation directory under:

/splunk/etc/system/local/

(If they don’t exist, then simply create them)

In the ‘props.conf’ file create an entry like this:

[source::udp:514]
TRANSFORMS-sourcetype-cisco = sourcetype-cisco

Now, there may already be a

[source::udp:514]

entry, in which case just add the second line to it. This tells Splunk to look up the ‘sourcetype-cisco’ stanza in the ‘transforms.conf’ file and apply it to the data as it is received.

In the ‘transforms.conf’ file create an entry like this:

[sourcetype-cisco]
DEST_KEY = MetaData:Sourcetype
REGEX = :\s%([^-]*)-[0-7]{1}-[^:]*:
FORMAT = sourcetype::syslog-cisco-$1

Here we’re extracting the first part of the Cisco syslog message, appending it to the name ‘syslog-cisco-’, and writing this as the ‘sourcetype’ key.

(This might not be the ‘best’ regular expression – but I’m learning too and it works for me!)

So for example, this message:

%PORT_SECURITY-2-PSECURE_VIOLATION: Security violation occurred, caused by MAC address 000f.b077.0086 on port FastEthernet1/0/34

will get categorised as

sourcetype = syslog-cisco-PORT_SECURITY
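To see what the transform does to an event, the same REGEX and FORMAT can be mimicked in Python. This is a sketch, not how Splunk applies transforms internally: cisco_sourcetype is a hypothetical helper, and the sample line is stitched together from the message formats shown in these posts.

```python
import re

# REGEX copied from the [sourcetype-cisco] stanza in transforms.conf.
SOURCETYPE_RE = re.compile(r":\s%([^-]*)-[0-7]{1}-[^:]*:")


def cisco_sourcetype(raw_event):
    """Return the sourcetype the transform would assign, or None."""
    match = SOURCETYPE_RE.search(raw_event)
    if match is None:
        return None
    # Mirrors FORMAT = sourcetype::syslog-cisco-$1
    return "syslog-cisco-" + match.group(1)


# Illustrative full event, including the syslog header and timestamp.
sample = (
    "Aug 25 13:34:38 switchname.null.com 506: 000003: "
    "Aug 25 13:34:37.983 BST: %PORT_SECURITY-2-PSECURE_VIOLATION: "
    "Security violation occurred, caused by MAC address "
    "000f.b077.0086 on port FastEthernet1/0/34"
)
print(cisco_sourcetype(sample))
```

One thing this makes visible is that the pattern expects a colon and a space immediately before the % sign, which the timestamp normally provides. A message that starts directly with % would not match and would keep its original sourcetype.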

Why is this important I hear you ask? Again, it’s just a first step in the process of crunching through your data. A good example is that I can now get a table of my Cisco events and very easily see what type they are.

Here’s a quick query:

sourcetype=syslog-cisco-* | stats count by sourcetype | sort -count

Or even:

sourcetype=syslog-cisco-* | timechart count by sourcetype

As you can see, I now have a table of my Cisco events by volume. By categorising them, I can see at a glance exactly what types are happening. All I’m really doing is making my life easier when it comes to querying my data. But isn’t that the point of a good IT tool; to make your life easier..?

Categories: Cisco, Splunk

vSwitch Load Balancing – Stop Getting it Wrong!

November 3, 2009 Leave a comment

OK, so this is one of my pet-hates.

The subject of Network Load Balancing comes up all the time when I’m visiting clients and it’s always a cause of some pain.

James’ article on Spotting The Red Flags made me think about what triggers the alarm bells ringing in the network world.

For me, it often starts with the Switch configuration, when I see something like this:

interface GigabitEthernet0/1
description ESX Server NIC0
switchport mode access
channel-group 1 mode on

Some of you might be thinking, well, there’s nothing wrong with that. In fact you’d be right. “Technically” there’s nothing wrong with that IF the vSwitch on the ESX Server is using the right load-balancing method.

BUT, it’s a red flag, because 9 times out of 10, the vSwitch configuration is using the default ‘Route based on originating port ID’ method.

And listen closely folks, as I’m not going to say it again, this method DOES NOT REQUIRE SWITCH CONFIGURATION. (Sorry for shouting, but it really, really bugs me).

There is only one ESX vSwitch load-balancing method which requires switch-side configuration and that’s ‘Route based on IP hash’.

From page 7 of the ESX Configuration Guide:

“NOTE: IP-based teaming requires that the physical switch be configured with etherchannel. For all other options, etherchannel should be disabled.”

I get even more scared when I see this:

interface GigabitEthernet0/1
description ESX Server NIC0
switchport mode access
channel-group 1 mode [auto|desirable|active|passive]

This is because any of those channel-group modes imply the use of PAgP or LACP. Let’s put an end to this debate once and for all – “VMware ESX Server does not support dynamic PAgP or LACP.”

These dynamic link aggregation protocols are absolutely not supported on ESX, so don’t use them on the switch-side!
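The rules in this post are simple enough to encode as a quick cross-check between the two teams. The Python below is a sketch that only captures what is stated here: check_uplink and its messages are made up for illustration, and the policy names are the ones shown in the vSwitch settings.

```python
# Which negotiation protocol each Cisco channel-group mode implies.
CHANNEL_MODE_PROTOCOL = {
    "on": "static",
    "auto": "PAgP",
    "desirable": "PAgP",
    "active": "LACP",
    "passive": "LACP",
}

IP_HASH = "Route based on IP hash"
PORT_ID = "Route based on originating port ID"


def check_uplink(vswitch_policy, channel_mode=None):
    """Return a list of problems for one ESX uplink; empty means it looks OK.

    channel_mode is None when the switchport has no channel-group line.
    """
    problems = []
    if channel_mode is None:
        if vswitch_policy == IP_HASH:
            problems.append("IP hash needs a static EtherChannel on the switch")
        return problems
    protocol = CHANNEL_MODE_PROTOCOL.get(channel_mode)
    if protocol is None:
        problems.append("unknown channel-group mode: " + channel_mode)
    elif protocol != "static":
        problems.append(protocol + " is not supported by ESX")
    if vswitch_policy != IP_HASH:
        problems.append("EtherChannel configured but vSwitch is not using IP hash")
    return problems


print(check_uplink(PORT_ID, "on"))
print(check_uplink(IP_HASH, "active"))
print(check_uplink(IP_HASH, "on"))
```

The first call is the classic red flag from the top of this post: a channel-group on the switch with the default policy on the vSwitch.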

If you’re an ESX Admin or a Network Admin, please get together and make sure that you’re both on the same page. Time and time again, we find that there’s a disconnect between what each team thinks the other is doing.

I would strongly encourage everyone to read Ken Cline’s series of blog posts called ‘The Great vSwitch Debate’, especially if you’re new to the world of VMware and ESX. Part 3 focuses on the different load-balancing methods, so check it out.

For the Network folks, this is a great reference:

VMware Infrastructure 3 in a Cisco Network Environment

For the ESX folks, these are also handy:

ESX Server host requirements for link aggregation

Sample configuration of EtherChannel / Link aggregation with ESX 3.x and Cisco/HP switches

Have a read, build some understanding and stop getting it wrong! (Please…)

Categories: VMware

It’s the Bloody Defaults

October 16, 2009 2 comments

If you’re a DBA having problems with Oracle Data Guard dropping connections for no apparent reason, the first port of call is usually the Network Team.

Especially if you’re seeing errors like this:

“RFS[7]: Possible network disconnect with primary database
Dataguard log shipping is failing”

The Network Team will hopefully do all the investigations they can. Checking:

  • ICMP ping responses
  • ICMP trace routes
  • Routing
  • Interface speed/duplex mismatches on the hosts, switches, routers, firewall

They’ll then hopefully start looking at tcpdumps or packet captures from switches.

Then they’ll scratch their heads and go, “I can’t see anything wrong…”

Both teams will then start hypothesising on more convoluted possibilities. Is it an MTU issue? Is the traffic going over the Internet / VPN and it’s ‘just the way it is’?

…No

 …We’re all guilty.

  …We should all be fired.

The first thing to ask yourself is this, “Does the traffic pass through a Firewall?”

Then ask, “Is it a Cisco Firewall?”

If the answer is yes to both, then the place to start looking is Application Inspection. By default, both the Cisco PIX and Cisco ASA have SQL*Net Application Inspection turned on. This is great for SQL*Net traffic, but the problem is that Oracle Data Guard uses the same TCP Port – 1521

So that means that Data Guard traffic is also subject to the same Inspection and time and time again, this causes issues.

Look for this in your PIX 6.x configs:

fixup protocol sqlnet 1521

Or this in your PIX/ASA 7.x or 8.x configs:

policy-map global_policy
 class inspection_default
  inspect sqlnet

If you see either of these, then raise the red flag.
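If you have a pile of saved firewall configs to go through, a few lines of Python will find the offending lines for you. This is a sketch that looks only for the two forms shown above; find_sqlnet_inspection is a hypothetical helper and the sample config is a cut-down illustration.

```python
import re

# Lines showing SQL*Net inspection is enabled, per the two examples above.
SQLNET_PATTERNS = [
    re.compile(r"^\s*fixup protocol sqlnet\b"),  # PIX 6.x
    re.compile(r"^\s*inspect sqlnet\b"),         # PIX/ASA 7.x and 8.x
]


def find_sqlnet_inspection(config_text):
    """Return (line_number, line) pairs where SQL*Net inspection is enabled."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SQLNET_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# Cut-down illustrative config, not a complete policy.
sample_config = """policy-map global_policy
 class inspection_default
  inspect ftp
  inspect sqlnet
"""
print(find_sqlnet_inspection(sample_config))
```

Because the patterns are anchored to the start of the line, a disabled entry such as ‘no fixup protocol sqlnet 1521’ is not reported.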

Now, you could disable the Inspection, but this is usually a global change. So the simple answer:

Always run Oracle Data Guard on a different TCP port

Life should then be good again.

Sometimes I really hate the Cisco defaults…

Categories: Cisco

ASA Quick’n’Dirty Web Filtering

October 7, 2009 Leave a comment

A subject that comes up again and again with some of our smaller clients is that of Web Filtering.

Whilst there are a whole host of solutions out there, the client’s requirements are often very straightforward:

“I don’t want my staff wasting time on the Internet!”

I’m sure we could all spend hours (or days) debating the pros and cons of allowing unrestricted Internet access. But, let’s be honest, we’ve all spent some of our work hours browsing when we should have been working (sorry James!).

Whilst researching the options, I came across this excellent article by René Jorissen:

Cisco ASA: DNS Reply Filtering

I love this solution.

It tackles the problem at the right point in the stack; the beginning, with the initial DNS request.

If all you’re looking to do is stop users (and it would be all users) from accessing specific sites or services, then this could be the solution for you. By dropping DNS replies for the specific sites, you knee-cap the connection at the start.

Yes, it’s not perfect.

There’s no flexibility in relation to specific users, groups, machines or times of day. But that’s not the point. This is a simple solution to meet a simple requirement.

So when a client comes to you and says, “I want to block that bloody [insert current social networking fad of the month] site,” you can now make it happen.

No cost, no fuss.

I like these kinds of solutions…

Categories: Cisco

VMware Fusion 2.0.6 Released

October 2, 2009 Leave a comment

Just a quick one.

VMware have released the latest update for Fusion, version 2.0.6.

This is free to existing licensed users.

You can find the Release Notes here: http://www.vmware.com/support/fusion2/doc/releasenotes_fusion_206.html

  • Fixes multiple issues when running VMware Fusion 2.0.x on Mac OS X Snow Leopard (32-bit kernel mode)
  • Provides improved 3D performance on Macs with NVIDIA graphics cards running Mac OS X 10.6
  • Contains fixes for more than 20 bugs

So, if you’re an OS X Snow Leopard user (like myself) check it out!

Categories: VMware

State of the Nation

October 2, 2009 2 comments

For me, one of the important aspects of managing a VMware environment is being able to quickly ‘see’ what’s happening and react accordingly.

Specifically, I want to know what my Guest VMs are up to. Luckily ESX provides this information in the form of ‘State Transition’ events.

In ESX 4 these are logged to the hostd log (/var/log/vmware/hostd.log). They look a bit like this:

[2009-09-28 11:22:02.048 F64D2B90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/LNX-TEST03/LNX-TEST03.vmx'] State Transition (VM_STATE_ON -> VM_STATE_OFF)
[2009-09-28 11:22:02.340 F64D2B90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/LNX-TEST03/LNX-TEST03.vmx'] State Transition (VM_STATE_OFF -> VM_STATE_RECONFIGURING)
[2009-09-28 11:22:02.554 F64D2B90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/LNX-TEST03/LNX-TEST03.vmx'] State Transition (VM_STATE_RECONFIGURING -> VM_STATE_OFF)
[2009-09-28 11:22:07.670 F5A6DB90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/LNX-TEST03/LNX-TEST03.vmx'] State Transition (VM_STATE_OFF -> VM_STATE_RECONFIGURING)
[2009-09-28 11:22:07.954 F5A6DB90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/LNX-TEST03/LNX-TEST03.vmx'] State Transition (VM_STATE_RECONFIGURING -> VM_STATE_OFF)

This is a ‘power off’ sequence for shutting down a VM Guest.

Or more fun things like:

[2009-09-28 11:18:26.399 F6450B90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/VM-CACTI01/VM-CACTI01.vmx'] State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)
[2009-09-28 11:18:52.732 F634CB90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/VM-CACTI01/VM-CACTI01.vmx'] State Transition (VM_STATE_EMIGRATING -> VM_STATE_OFF)
[2009-09-28 11:18:52.744 F634CB90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/VM-CACTI01/VM-CACTI01.vmx'] State Transition (VM_STATE_OFF -> VM_STATE_UNREGISTERING)
[2009-09-28 11:18:53.034 F634CB90 info 'vm:/vmfs/volumes/4ac2ea2f-95f7957e/VM-CACTI01/VM-CACTI01.vmx'] State Transition (VM_STATE_UNREGISTERING -> VM_STATE_GONE)

Which was a VMotion event from the perspective of the ESX server that was the original host. (I love the terminology by the way – ‘EMIGRATING’ / ‘GONE’).
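These lines are regular enough to pick apart with a single regular expression. The Python below is a sketch under a couple of assumptions: that the log file uses plain straight quotes around the vm: path, and that the VM name can be taken from the .vmx file name. parse_transition is a hypothetical helper.

```python
import re

TRANSITION_RE = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\S+ info 'vm:(?P<vmx>[^']+)'\] "
    r"State Transition \((?P<old>\w+) -> (?P<new>\w+)\)"
)


def parse_transition(line):
    """Return the fields of one State Transition line, or None."""
    match = TRANSITION_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: the VM name is the .vmx file name without its extension.
    name = fields["vmx"].rsplit("/", 1)[-1]
    if name.endswith(".vmx"):
        name = name[: -len(".vmx")]
    fields["vm"] = name
    return fields


sample = (
    "[2009-09-28 11:18:26.399 F6450B90 info "
    "'vm:/vmfs/volumes/4ac2ea2f-95f7957e/VM-CACTI01/VM-CACTI01.vmx'] "
    "State Transition (VM_STATE_ON -> VM_STATE_EMIGRATING)"
)
print(parse_transition(sample))
```

Once the old and new states are separate fields, it is easy to pick out just the transitions you care about, such as anything ending in VM_STATE_OFF.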

Now, connecting to your ESX Servers and searching for these specific events would be a bit tedious. So one approach is to send them all to a remote Syslog server.

There are already a number of articles on the web on how to set up remote syslogging for ESX, so I won’t re-invent the wheel. Here’s what worked for me:

  • SSH to your ESX Server
  • Edit the end of the /etc/syslog.conf file to include the line
    • *.*      @[Your remote syslog server name / IP]
  • Restart the Syslog service using the command:
    • /etc/init.d/syslog restart
  • Allow outbound Syslog through the ESX Firewall
    • esxcfg-firewall -o 514,udp,out,syslog
  • Restart the ESX Firewall
    • esxcfg-firewall -l

This should be enough to get the basic syslog information going where you want it to. But, it won’t give you information from the /var/log/vmware/hostd.log file.

To get this, I did the following:

  • SSH to your ESX Server
  • Edit the file /etc/vmware/hostd/config.xml to look like this:

    <log>
    <directory>/var/log/vmware/</directory>
    <name>hostd</name>
    <outputToConsole>true</outputToConsole>
    <level>info</level>
    </log>

  • Restart the ESX Management Agents with this command:
    • /etc/init.d/mgmt-vmware restart

Pay close attention to the ‘level’ tag. By default it reads ‘verbose’. I tried this first and managed to generate about 4000 messages a minute. This was a bit more than I needed…
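If you manage several hosts, a quick way to confirm the level is to read it back from the XML. This Python sketch parses only the fragment shown above; hostd_log_level is a hypothetical helper, and in the real file the log element sits inside a larger document, which the code allows for.

```python
import xml.etree.ElementTree as ET

# The <log> fragment from above, as a string.
fragment = """
<log>
  <directory>/var/log/vmware/</directory>
  <name>hostd</name>
  <outputToConsole>true</outputToConsole>
  <level>info</level>
</log>
"""


def hostd_log_level(xml_text):
    """Return the text of the first <log><level> element, or None."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works whether <log>
    # is the top-level element or nested inside a larger document.
    for log in root.iter("log"):
        level = log.findtext("level")
        if level is not None:
            return level.strip()
    return None


level = hostd_log_level(fragment)
if level == "verbose":
    print("Warning: verbose logging will flood your syslog server")
else:
    print("hostd log level:", level)
```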

So what next?

Well, that’s really up to you. Now that you can capture these events, use your favourite Syslog server to either alert or report on whatever you’re interested in.

This sort of information is vital in environments where there are multiple teams administering the VMware infrastructure. It’s not always easy to know what your colleagues are working on, so capturing this information can provide a quick overview so that everyone is kept in the picture.

For example, once you’ve got the right Transition events, you can map these to a timeline, to very quickly see what’s happening to your VMs:

VM Power Off / On Timeline

Or summarise them into a table:

VM Power Off / On Table Summary

Or make the names on the table a little more readable:

VM Power On / Off Summary Table Tweaked

Once you’ve captured the information, it’s really up to you!

And if your Syslog server doesn’t let you produce reports like the ones above, then you need to start Splunking.

G.
(More information on how to use the exceptional Splunk Server in the next few weeks)

Also, check out these links for a bit more depth on how to configure the Syslog component in VMware:

Categories: VMware