While source code is the lifeblood of software systems, outside the developer community most team members are more familiar with system components than with the code behind them. A component could be a shell script, Perl, Python, Java...and while many managers and team leads couldn't point you at the source tree for any given system component, they can usually give you a process name (or a snippet of one), and that alone can get you rolling on your investigation.
Let's say your big-bossman points you to a data processing feed called spooler1 that is currently running on the system;
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.7 us, 2.0 sy, 0.5 ni, 91.4 id, 0.4 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8167120 total, 781680 free, 1680440 used, 5705000 buff/cache
KiB Swap: 8385532 total, 8371072 free, 14460 used. 5928716 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29905 user 20 0 14008 4400 2708 R 20.0 0.1 0:00.22 spooler1
Not a lot to go on, but you may be surprised what you can glean from a running process.
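If a name snippet is all you have, turning it into a PID is a one-liner. A minimal sketch (the 'spooler' pattern here is just our example's name):
$ pgrep -a spooler              # -a prints the PID along with the full command line
$ ps -ef | grep -i spooler      # the old standby works just as well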
lsof -- list open files
Let's start our investigative journey with the lsof utility. It does precisely what the name says...it lists all of the files the process currently has open.
$ lsof -p 29905
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
spooler1 29905 user cwd DIR 8,33 4096 70975491 /home/user/blog/SystemForensics
spooler1 29905 user rtd DIR 8,1 4096 2 /
spooler1 29905 user txt REG 8,1 1037528 41418771 /bin/bash
spooler1 29905 user mem REG 8,1 2981280 20186616 /usr/lib/locale/locale-archive
spooler1 29905 user mem REG 8,1 1868984 45355547 /lib/x86_64-linux-gnu/libc-2.23.so
spooler1 29905 user mem REG 8,1 14608 45355429 /lib/x86_64-linux-gnu/libdl-2.23.so
spooler1 29905 user mem REG 8,1 167240 45355621 /lib/x86_64-linux-gnu/libtinfo.so.5.9
spooler1 29905 user mem REG 8,1 162632 45355450 /lib/x86_64-linux-gnu/ld-2.23.so
spooler1 29905 user mem REG 8,1 26258 20449837 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
spooler1 29905 user 0u CHR 136,20 0t0 23 /dev/pts/20
spooler1 29905 user 1w REG 8,33 46450 53624834 /home/user/blog/SystemForensics/spooler.log
spooler1 29905 user 2u CHR 136,20 0t0 23 /dev/pts/20
spooler1 29905 user 3r FIFO 0,12 0t0 76487063 pipe
spooler1 29905 user 255r REG 8,33 857 53624833 /home/user/blog/SystemForensics/spooler1
There is a host of information available from this command, but for now let's focus on two key things: the current working directory (the cwd entry) and the log file (the file open for writing on descriptor 1). While not relevant to our example above, lsof will also show linked libraries and open network ports. For example, if you observe open network ports and the process linking in CORBA libraries...you get a gauge of what you're in for. A good starting point when trying to understand a system element is understanding the 'goes-intas and the goes-outtas', so library dependencies and input/output files can set the stage for your investigation.
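If you suspect network activity, lsof can narrow the listing to just the sockets. A minimal sketch using the same PID, plus the port we'll see later in the log (the -a flag ANDs the -p and -i selections):
$ lsof -a -p 29905 -i       # only the network connections held by PID 29905
$ lsof -i :9000             # or come at it from the port side: who has port 9000 open?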
In our example, we now know where this utility is run from and that it generates a log file.
While not every process generates a log file, if one exists you've likely stumbled upon a plethora of information. No amount of source code browsing compares to time spent with a quality log file when you're trying to learn what's going on. Chances are you're looking into a system component because it isn't operating as expected, and hopefully the log file will provide evidence of that and hint at where to focus your next steps.
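Before crunching any numbers it's often worth simply watching the log roll by for a minute or two; the path comes straight from the lsof output above:
$ tail -f /home/user/blog/SystemForensics/spooler.log     # follow new entries as they arrive (Ctrl-C to stop)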
Simple Log File Analysis
If you're fortunate enough to have access to sophisticated log file analysis tools, by all means...use them. Many of us are stuck with the general system utilities, and in many cases you can go surprisingly far with them. I've applied these techniques to some pretty sophisticated log files on real systems, but most companies would consider those logs proprietary, so we'll use a sample log file format for our purposes.
If we peek at the first few lines of our log file we can get a feel for what we can do with it;
$ head spooler.log
2020-06-10 15:04:02.85|processing Table-E record
2020-06-10 15:04:04.45|processing Table-C record
2020-06-10 15:04:05.15|processing Table-D record
2020-06-10 15:04:05.75|processing Table-C record
2020-06-10 15:04:06.96|processing Table-D record
2020-06-10 15:04:07.66|processing Table-E record
2020-06-10 15:04:08.96|processing Table-D record
2020-06-10 15:04:09.86|Spooler::init() Connection established with peer pub.server.org port 9000
2020-06-10 15:04:11.06|processing Table-B record
2020-06-10 15:04:12.46|processing Table-E record
A well-planned log file will have a well-defined format, hopefully with a timestamp and a series of events. In our example there are six categories of events (a connection event plus record-processing events for five tables), but only a fraction of them are evident from the head of the log file. Let's see how we can get a comprehensive list of event types.
Since the timestamp changes on every line, the first step is to strip it off, extract the rest of the line, and see what we have.
$ cat spooler.log | cut -f 2- -d '|' | head -10
processing Table-E record
processing Table-C record
processing Table-D record
processing Table-C record
processing Table-D record
processing Table-E record
processing Table-D record
Spooler::init() Connection established with peer pub.server.org port 9000
processing Table-B record
processing Table-E record
Whelp, that shows promise. Let's sort the result and drop the duplicates, and see if that gives us what we want;
$ cat spooler.log | cut -f 2- -d '|' | sort -u
processing Table-A record
processing Table-B record
processing Table-C record
processing Table-D record
processing Table-E record
Spooler::init() Connection established with peer pub.server.org port 9000
BAM! Looks good. Now what? The frequency of an event can provide some insight into how the system is behaving. For example, repeated 'Connection established...' instances may hint at an unreliable network connection or an unstable companion process. Let's look at how many instances there are of each type of event. We can do this by sorting the events and then counting the instances of each type; the sort+uniq utilities will do the trick;
$ cat spooler.log | cut -f 2- -d '|' | sort | uniq -c
1422 processing Table-A record
3593 processing Table-B record
1427 processing Table-C record
3572 processing Table-D record
3625 processing Table-E record
761 Spooler::init() Connection established with peer pub.server.org port 9000
From this we can see that the B, D, and E records provide the highest volume of entries, the A and C records a smaller volume, and there are ~760 instances of the connection being (re)established.
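With only six event types the ranking is easy to eyeball, but on a busier log a numeric sort on the counts makes the heavy hitters jump out; a small variation on the same pipeline:
$ cat spooler.log | cut -f 2- -d '|' | sort | uniq -c | sort -rn    # most frequent events first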
While this provides a nice summary of the types and quantities of events coming in, a time-based summary is often essential. Perhaps we're expecting burst transfers in the off-hours and a smaller volume during the day. We can bring the timestamps back into the equation for a look.
Data analysis is a discovery process; you'll often find yourself repeatedly refining your search criteria as you go.
Let's breakdown a log entry so we can see how we can perform some time-based categorization;
2020-06-10 19:04:23.10|processing Table-B record
date  -- char [1..10]
time  -- char [12..22]
event -- char [24..]
The cut utility allows us to extract specific character ranges (or fields) from each log line. Let's say we're interested in the number of events organized by hour; we'd want to preserve the hour plus the event type, like this:
$ cat spooler.log | cut -b 12-14,24-
15:processing Table-E record
15:processing Table-C record
...
A bit ugly, but by extracting the hour + ':' + event type we've got the beginnings of what we want. I purposely left the colon that follows the hour in place as a delimiter between the hour and the event type;
$ cat spooler.log | cut -b 12-14,24- | sort | uniq -c
326 15:processing Table-A record
847 15:processing Table-B record
340 15:processing Table-C record
831 15:processing Table-D record
829 15:processing Table-E record
180 15:Spooler::init() Connection established with peer pub.server.org port 9000
347 16:processing Table-A record
914 16:processing Table-B record
332 16:processing Table-C record
896 16:processing Table-D record
932 16:processing Table-E record
174 16:Spooler::init() Connection established with peer pub.server.org port 9000
377 17:processing Table-A record
895 17:processing Table-B record
350 17:processing Table-C record
876 17:processing Table-D record
893 17:processing Table-E record
202 17:Spooler::init() Connection established with peer pub.server.org port 9000
348 18:processing Table-A record
868 18:processing Table-B record
373 18:processing Table-C record
904 18:processing Table-D record
912 18:processing Table-E record
190 18:Spooler::init() Connection established with peer pub.server.org port 9000
24 19:processing Table-A record
69 19:processing Table-B record
32 19:processing Table-C record
65 19:processing Table-D record
59 19:processing Table-E record
15 19:Spooler::init() Connection established with peer pub.server.org port 9000
We can now see how the event counts change from hour to hour. Bursty processing loads become evident after spending some time with this raw data, or better yet, plotting the quantities over time can provide real insight into your system. A high volume of atypical events can be a clue to problems; for example, a burst of retries/restarts in the wee hours of the morning may indicate controlled restarts or point at a real issue.
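If you'd rather look at a picture than a column of numbers, the same pipeline will feed a plot with very little ceremony. A minimal sketch, assuming gnuplot is available (hourly.dat is just a scratch file name):
$ cat spooler.log | cut -b 12-13 | sort | uniq -c | awk '{print $2, $1}' > hourly.dat
$ gnuplot -e "set terminal dumb; set xlabel 'hour'; plot 'hourly.dat' using 1:2 with boxes title 'events/hour'"
Even the crude ASCII chart from the dumb terminal is enough to spot an unexpected lull or burst.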
Let's prepare to power down on this post. With a handful of lesser-known utilities we can start with a process id, locate an available log file, and pull some high-level metrics from it in relatively short order. I've found these utilities particularly useful these past few months and have acquired a new-found proficiency with them out of pure necessity. Hopefully they're useful to you fellow readers as well.
Cheers.