While a good deal of my career has revolved around green-field development projects, recently I've been more heavily involved with existing systems. Along the way I've acquired a new set of skills for learning how a system actually operates. As a contractor, I'm frequently handed assignments of the form "we need to understand this component of the system....go research it". More often than not, the subject-matter experts are long gone, so you're largely on your own. Luckily, there is a series of *nix tools that can assist in this discovery process. Let's hit on a few:
While source code is the life-blood of software systems, outside the developer community most team members are more familiar with running system components than with the code behind them. The component could be a shell script, Perl, Python, Java,.....and while many managers/team-leads couldn't aim you toward the source tree of any given component, they can usually give you a process name (or a snippet of one), and that alone can get you rolling on your investigation.
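When a name snippet is all you have, a quick search of the process table will turn it into a process id. A minimal sketch (substitute whatever fragment you were given for 'spooler'):
$ pgrep -af spooler
$ ps -ef | grep -i spooler | grep -v grep
Either form hands you the PID that the utilities below will need.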
Let's say your big-bossman points you to a data processing feed called spooler1 that is currently running on the system:
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.7 us, 2.0 sy, 0.5 ni, 91.4 id, 0.4 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8167120 total, 781680 free, 1680440 used, 5705000 buff/cache
KiB Swap: 8385532 total, 8371072 free, 14460 used. 5928716 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29905 user 20 0 14008 4400 2708 R 20.0 0.1 0:00.22 spooler1
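For reference, you don't need an interactive top session to grab a snapshot like the one above; once you have the PID, either of these standard invocations works:
$ top -b -n 1 -p 29905
$ ps -o pid,user,%cpu,%mem,etime,cmd -p 29905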
Not a lot to go on, but you may be surprised what you can glean from a running process.
lsof -- list open files
Let's start our investigative journey with the lsof utility. It does precisely what the name says....it lists all of the files the process currently has open.
$ lsof -p 29905
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
spooler1 29905 user cwd DIR 8,33 4096 70975491 /home/user/blog/SystemForensics
spooler1 29905 user rtd DIR 8,1 4096 2 /
spooler1 29905 user txt REG 8,1 1037528 41418771 /bin/bash
spooler1 29905 user mem REG 8,1 2981280 20186616 /usr/lib/locale/locale-archive
spooler1 29905 user mem REG 8,1 1868984 45355547 /lib/x86_64-linux-gnu/libc-2.23.so
spooler1 29905 user mem REG 8,1 14608 45355429 /lib/x86_64-linux-gnu/libdl-2.23.so
spooler1 29905 user mem REG 8,1 167240 45355621 /lib/x86_64-linux-gnu/libtinfo.so.5.9
spooler1 29905 user mem REG 8,1 162632 45355450 /lib/x86_64-linux-gnu/ld-2.23.so
spooler1 29905 user mem REG 8,1 26258 20449837 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
spooler1 29905 user 0u CHR 136,20 0t0 23 /dev/pts/20
spooler1 29905 user 1w REG 8,33 46450 53624834 /home/user/blog/SystemForensics/spooler.log
spooler1 29905 user 2u CHR 136,20 0t0 23 /dev/pts/20
spooler1 29905 user 3r FIFO 0,12 0t0 76487063 pipe
spooler1 29905 user 255r REG 8,33 857 53624833 /home/user/blog/SystemForensics/spooler1
There is a host of information available from this command, but for now let's focus on two key things: the current working directory (the cwd entry) and the log file open for writing on file descriptor 1 (the 1w entry). While not relevant to our example above, lsof will also show which libraries a process links in and which network ports it has open. For example, if you observe open network ports and the process linking in Corba libraries....you get a gauge of what you're in for. A good starting point when trying to understand a system element is understanding the 'goes-intas' and the 'goes-outtas', so library dependencies, input files, and output files can set the stage for your investigation tasks.
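As an aside, when a component is network-facing, lsof can narrow the view to just its sockets. Nothing network-related shows up for spooler1 at the moment, but these standard invocations are a quick way to check (9000 is just an example port):
$ lsof -i -a -p 29905
$ lsof -i :9000
The first lists only the network files belonging to our process; the second shows every process currently holding the given port.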
In our example, we now know where this utility is run from and that it generates a log file.
While not every process generates a log file, if a log file exists you've likely stumbled upon a plethora of information. No amount of source code browsing will ever compare to spending time with a quality log file when it comes to learning what's going on. Likely you're looking into a system component because it isn't operating as expected, and hopefully the log file will provide evidence of that and give you a hint as to where you should focus your next steps.
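A couple of quick first passes I like to take on a newly-discovered log, before any real analysis (the keywords below are guesses....adjust them to the application at hand):
$ tail -f spooler.log
$ grep -Eic 'error|fail|retry' spooler.log
The first follows the log live so you can watch the component work; the second counts lines containing likely trouble words.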
Simple Log File Analysis
If you're fortunate enough to have access to sophisticated log file analysis tools, by all means....use them. Many of us are stuck with the general system utilities, and in many cases you can go far with them. I've applied these techniques to some pretty sophisticated log files on existing systems, but most companies would consider those logs proprietary, so we'll use a sample log file format for our purposes.
If we peek at the first few lines of our log file, we can get a feel for what we can do with it:
$ head spooler.log
2020-06-10 15:04:02.85|processing Table-E record
2020-06-10 15:04:04.45|processing Table-C record
2020-06-10 15:04:05.15|processing Table-D record
2020-06-10 15:04:05.75|processing Table-C record
2020-06-10 15:04:06.96|processing Table-D record
2020-06-10 15:04:07.66|processing Table-E record
2020-06-10 15:04:08.96|processing Table-D record
2020-06-10 15:04:09.86|Spooler::init() Connection established with peer pub.server.org port 9000
2020-06-10 15:04:11.06|processing Table-B record
2020-06-10 15:04:12.46|processing Table-E record
A well-planned log file will have a well-defined format, hopefully with a timestamp and a series of events. In our example there are 6 categories of events (a connection event plus record-processing events for 5 tables), but only a fraction of them are evident from the head of the log file. Let's see how we can get a comprehensive list of event types.
Since the timestamp is ever-changing, we need to strip it off, extract the rest of each line, and see what we have.
$ cat spooler.log | cut -f 2- -d '|' | head -10
processing Table-E record
processing Table-C record
processing Table-D record
processing Table-C record
processing Table-D record
processing Table-E record
processing Table-D record
Spooler::init() Connection established with peer pub.server.org port 9000
processing Table-B record
processing Table-E record
Whelp, that shows promise. Let's sort the result uniquely and see if that gives us what we want:
$ cat spooler.log | cut -f 2- -d '|' | sort -u
processing Table-A record
processing Table-B record
processing Table-C record
processing Table-D record
processing Table-E record
Spooler::init() Connection established with peer pub.server.org port 9000
BAM! Looks good. Now what? The frequency of an event can provide some insight into how the system is behaving. For example, repeated 'Connection established...' instances may hint at an unreliable network connection or an unstable companion process. Let's look at how many instances of each event type we have. We can do this by sorting the events and then counting the instances of each type; the sort+uniq utilities will do the trick:
$ cat spooler.log | cut -f 2- -d '|' | sort | uniq -c
1422 processing Table-A record
3593 processing Table-B record
1427 processing Table-C record
3572 processing Table-D record
3625 processing Table-E record
761 Spooler::init() Connection established with peer pub.server.org port 9000
From this we can see that the B, D, and E records provide the highest volume of entries, the A and C records a smaller volume, and there are ~760 connection-establishment events....likely reconnects/retries.
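As an aside, on a log with dozens of distinct event types it helps to rank them; tacking a numeric reverse sort onto the same pipeline puts the busiest events on top:
$ cat spooler.log | cut -f 2- -d '|' | sort | uniq -c | sort -rn | head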
While this provides a nice summary of the types of events coming in as well as the quantities, often time-based summaries are essential. Perhaps we're expecting burst-transfers in the off-hours and a smaller volume during the day. We can bring the timestamps back into the equation for a look.
Data analysis is a discovery process; you'll often find yourself repeatedly refining your search criteria from your starting point.
Let's break down a log entry so we can see how to perform some time-based categorization:
2020-06-10 19:04:23.10|processing Table-B record
date -- char [1..10]
time -- char [12..22]
event -- char [24..]
The cut utility allows us to extract specific byte ranges from each log line. Let's say we are interested in the number of events organized by hour; we'd want to preserve the hour + event type like this:
$ cat spooler.log | cut -b 12-14,24-
15:processing Table-E record
15:processing Table-C record
...
A bit ugly, but by extracting the hour + ':' + event type, you've got the beginnings of what we want. I purposely left the colon from the HH:MM delimiter in place to separate the hour from the event type:
$ cat spooler.log | cut -b 12-14,24- | sort | uniq -c
326 15:processing Table-A record
847 15:processing Table-B record
340 15:processing Table-C record
831 15:processing Table-D record
829 15:processing Table-E record
180 15:Spooler::init() Connection established with peer pub.server.org port 9000
347 16:processing Table-A record
914 16:processing Table-B record
332 16:processing Table-C record
896 16:processing Table-D record
932 16:processing Table-E record
174 16:Spooler::init() Connection established with peer pub.server.org port 9000
377 17:processing Table-A record
895 17:processing Table-B record
350 17:processing Table-C record
876 17:processing Table-D record
893 17:processing Table-E record
202 17:Spooler::init() Connection established with peer pub.server.org port 9000
348 18:processing Table-A record
868 18:processing Table-B record
373 18:processing Table-C record
904 18:processing Table-D record
912 18:processing Table-E record
190 18:Spooler::init() Connection established with peer pub.server.org port 9000
24 19:processing Table-A record
69 19:processing Table-B record
32 19:processing Table-C record
65 19:processing Table-D record
59 19:processing Table-E record
15 19:Spooler::init() Connection established with peer pub.server.org port 9000
We can now see how event counts change from hour to hour. Bursty processing loads become evident by spending some time with this raw data, or better yet, by plotting quantities over time, which can provide real insight into your system. A high volume of atypical events can also give you a clue to problems; for example, a spike in retries/restarts in the wee hours of the morning may imply controlled restarts or a genuine issue.
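If you want to go a step beyond eyeballing the raw counts, a single event type can be plotted per hour with a couple of commands. A rough sketch, assuming gnuplot is available (tableB_per_hour.dat is just a scratch file name):
$ grep 'Table-B' spooler.log | cut -b 12-13 | sort | uniq -c | awk '{print $2, $1}' > tableB_per_hour.dat
$ gnuplot -p -e "set xlabel 'hour'; set ylabel 'Table-B records'; plot 'tableB_per_hour.dat' using 1:2 with boxes notitle"
The same recipe works for any of the event types above; swap the grep pattern and you have a quick visual of how that event trends through the day.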
Let's prepare to power down on this post. With a handful of lesser-known utilities we can start with a process id, locate its log file, and pull some high-level metrics from it in relatively short order. I've found these utilities particularly useful these past few months and have acquired a new-found proficiency with them out of pure necessity. Hopefully they prove useful to you, fellow readers, as well.
Cheers.