Monday, July 13, 2020

Software System Forensics

While a good deal of my career has revolved around green-field development projects, recently I've been more heavily involved in existing systems.  Along the way I've acquired a new set of skills for discovering how a system operates.  As a contractor, I frequently find myself given assignments of the form "we need to understand this component of the system....go research that".  More often than not, the subject-matter experts are long gone, so you're largely on your own.  Luckily, there is a series of *nix tools that can assist in this discovery process.  Let's hit on a few;

While source code is the life-blood of software systems, outside the developer community most team members are more familiar with system components and likely couldn't point you to the source code.  It could be a shell script, Perl, Python, Java,.....and while many managers/team-leads couldn't aim you toward the source tree of any given component, they likely can refer you to a process name (or a snippet of one), and that alone can get you rolling on your investigation.

Let's say your big-bossman points you to a data processing feed called spooler1 that is currently running on the system;

Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.7 us,  2.0 sy,  0.5 ni, 91.4 id,  0.4 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8167120 total,   781680 free,  1680440 used,  5705000 buff/cache
KiB Swap:  8385532 total,  8371072 free,    14460 used.  5928716 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                   
29905 user  20   0   14008   4400   2708 R  20.0  0.1   0:00.22 spooler1

Not a lot to go on, but you may be surprised what you can glean from a running process.

lsof -- list open files

Let's start our investigational journey with the lsof utility.  This utility does precisely what the name suggests: it lists all open files belonging to a running process.

$ lsof -p 29905
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
spooler1 29905 user  cwd    DIR   8,33     4096 70975491 /home/user/blog/SystemForensics
spooler1 29905 user  rtd    DIR    8,1     4096        2 /
spooler1 29905 user  txt    REG    8,1  1037528 41418771 /bin/bash
spooler1 29905 user  mem    REG    8,1  2981280 20186616 /usr/lib/locale/locale-archive
spooler1 29905 user  mem    REG    8,1  1868984 45355547 /lib/x86_64-linux-gnu/
spooler1 29905 user  mem    REG    8,1    14608 45355429 /lib/x86_64-linux-gnu/
spooler1 29905 user  mem    REG    8,1   167240 45355621 /lib/x86_64-linux-gnu/
spooler1 29905 user  mem    REG    8,1   162632 45355450 /lib/x86_64-linux-gnu/
spooler1 29905 user  mem    REG    8,1    26258 20449837 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
spooler1 29905 user    0u   CHR 136,20      0t0       23 /dev/pts/20
spooler1 29905 user    1w   REG   8,33    46450 53624834 /home/user/blog/SystemForensics/spooler.log
spooler1 29905 user    2u   CHR 136,20      0t0       23 /dev/pts/20
spooler1 29905 user    3r  FIFO   0,12      0t0 76487063 pipe
spooler1 29905 user  255r   REG   8,33      857 53624833 /home/user/blog/SystemForensics/spooler1

There is a host of information available from this command, but for now let's focus on two key things: the current working directory (the cwd entry) and the log file (the file open for writing on descriptor 1).  Beyond our example, lsof will also provide evidence of relevant libraries and network ports.  For example, if you observe open network ports and see the process linking in CORBA libraries, you can get a gauge of what you're in for.  A good starting point when trying to understand a system element is understanding the 'goes-intas and the goes-outtas', so library dependencies, input files, and output files can set the stage for your investigation tasks.

In our example, we now know where this utility is run from and that it generates a log file.

While not every process generates a log file, if one exists you've likely stumbled upon a plethora of information.  No amount of source code browsing will ever compare to spending time with a quality log file when it comes to learning what's going on.  Likely you're looking into a system component because it isn't operating as expected; hopefully the log file will provide evidence of that and hint at where you should focus your next steps.
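A reasonable first pass over any newly-discovered log is to watch it live and to grep for trouble words.  A minimal sketch (the sample log lines here are hypothetical, mimicking the format we'll see below):

```shell
# Fabricate a tiny sample log so the commands are reproducible.
printf '%s\n' \
  '2020-06-10 15:04:02.85|processing Table-E record' \
  '2020-06-10 15:04:09.86|Spooler::init() Connection established with peer port 9000' \
  > sample.log

# Follow the live log while the process runs:
#   tail -f spooler.log
# And a first pass for anything that smells like trouble, with line numbers:
grep -inE 'error|fail|retry|init' sample.log
```

The pattern list is just a starting point; you'll refine it once you've seen what this particular log calls its bad days.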

Simple Log File Analysis
If you're fortunate enough to have access to sophisticated log file analysis tools, by all means....use them.  Many of us are stuck with the general system utilities, and in many cases you can go far with these simple tools.  I've applied these techniques to some pretty sophisticated log files on existing systems, but many companies would consider those logs proprietary, so we'll use a sample log file format for our purposes.

If we peek at the first few lines of our log file we can get a feel for what we can do with it;
$ head spooler.log 
2020-06-10 15:04:02.85|processing Table-E record
2020-06-10 15:04:04.45|processing Table-C record
2020-06-10 15:04:05.15|processing Table-D record
2020-06-10 15:04:05.75|processing Table-C record
2020-06-10 15:04:06.96|processing Table-D record
2020-06-10 15:04:07.66|processing Table-E record
2020-06-10 15:04:08.96|processing Table-D record
2020-06-10 15:04:09.86|Spooler::init() Connection established with peer port 9000
2020-06-10 15:04:11.06|processing Table-B record
2020-06-10 15:04:12.46|processing Table-E record

A well-planned log file will have a well-defined format, hopefully with a timestamp and a series of events.  In our example we have 6 categories of events {connection events, and 5 types of table-record processing events}, but only a fraction are evident from the head of the log file.  Let's see how we can get a comprehensive list of event types.

With a focus on discovering a comprehensive list of event types, we first note that the timestamp is ever-changing, so we need to strip the timestamp, extract the rest of the line, and see what we have.

$ cat spooler.log | cut -f 2- -d '|' | head -10
processing Table-E record
processing Table-C record
processing Table-D record
processing Table-C record
processing Table-D record
processing Table-E record
processing Table-D record
Spooler::init() Connection established with peer port 9000
processing Table-B record
processing Table-E record

Whelp, that shows promise.  Let's apply a unique sort to the result and see if that gives us what we want;

$ cat spooler.log | cut -f 2- -d '|' | sort -u
processing Table-A record
processing Table-B record
processing Table-C record
processing Table-D record
processing Table-E record
Spooler::init() Connection established with peer port 9000

BAM!  Looks good.  Now what?  The frequency of an event can provide some insight into how the system is behaving.  For example, repeated 'Connection established...' instances may hint at an unreliable network connection or an unstable companion process.  Let's look at how many instances of each type of event we have.  We can do this by sorting the events and then counting the instances of each type.  The sort+uniq utilities will do the trick;
$ cat spooler.log | cut -f 2- -d '|' | sort | uniq -c
   1422 processing Table-A record
   3593 processing Table-B record
   1427 processing Table-C record
   3572 processing Table-D record
   3625 processing Table-E record
    761 Spooler::init() Connection established with peer port 9000

From this we can see that the B, D, and E records provide the highest volume of entries, the A and C records a smaller volume, and there are ~700 instances of connection establishment (likely retries).
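When the event list runs longer than our six categories, it helps to have the summary sorted most-frequent-first rather than alphabetically; a trailing numeric reverse sort does it.  A self-contained sketch with a few hypothetical sample lines:

```shell
# Fabricate a tiny sample log so the pipeline is reproducible.
printf '%s\n' \
  '2020-06-10 15:04:02.85|processing Table-E record' \
  '2020-06-10 15:04:04.45|processing Table-C record' \
  '2020-06-10 15:04:05.15|processing Table-E record' > sample.log

# Same count-by-event pipeline as above, then sort the counts high-to-low.
cut -f 2- -d '|' sample.log | sort | uniq -c | sort -rn
```

Here Table-E (2 instances) lands on the first output line, Table-C (1 instance) on the second, so the noisiest events surface to the top.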

While this provides a nice summary of the types of events coming in as well as the quantities, often time-based summaries are essential.  Perhaps we're expecting burst-transfers in the off-hours and a smaller volume during the day.  We can bring the timestamps back into the equation for a look.

Data analysis is a discovery process; you often find a need to repeatedly refine your search criteria from your starting point.
Let's break down a log entry so we can see how to perform some time-based categorization;
2020-06-10 19:04:23.10|processing Table-B record

date  -- char[1..10]
time  -- char[12..22]
event -- char[24..]

The cut utility allows us to extract specific byte ranges from each log line.  Let's say we are interested in the number of events organized by hour; we'd want to preserve the hour + event type like this:
$ cat spooler.log | cut -b 12-14,24- | head -2
15:processing Table-E record
15:processing Table-C record

A bit ugly, but by extracting the hour+':'+event type, you've got the beginnings of what we want.  I purposely left the HH:MM colon delimiter in place to separate the hour from the event type;
$ cat spooler.log | cut -b 12-14,24-  | sort | uniq -c
    326 15:processing Table-A record
    847 15:processing Table-B record
    340 15:processing Table-C record
    831 15:processing Table-D record
    829 15:processing Table-E record
    180 15:Spooler::init() Connection established with peer port 9000
    347 16:processing Table-A record
    914 16:processing Table-B record
    332 16:processing Table-C record
    896 16:processing Table-D record
    932 16:processing Table-E record
    174 16:Spooler::init() Connection established with peer port 9000
    377 17:processing Table-A record
    895 17:processing Table-B record
    350 17:processing Table-C record
    876 17:processing Table-D record
    893 17:processing Table-E record
    202 17:Spooler::init() Connection established with peer port 9000
    348 18:processing Table-A record
    868 18:processing Table-B record
    373 18:processing Table-C record
    904 18:processing Table-D record
    912 18:processing Table-E record
    190 18:Spooler::init() Connection established with peer port 9000
     24 19:processing Table-A record
     69 19:processing Table-B record
     32 19:processing Table-C record
     65 19:processing Table-D record
     59 19:processing Table-E record
     15 19:Spooler::init() Connection established with peer port 9000

We can now see how event counts change over the course of the hours.  Bursty processing loads would become evident by spending some time with this raw data, or better yet, plotting quantities over time can provide real insight into your system.  A high volume of atypical events can give you a clue to problems; for example, a high volume of retries/restarts in the wee hours of the morning may imply controlled restarts or potential problems.
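If you want per-hour totals regardless of event type (a plot-ready shape), both cut and awk get you there.  A sketch on a few hypothetical sample lines in our log's format:

```shell
# Fabricate a tiny sample log: two events in hour 15, one in hour 16.
printf '%s\n' \
  '2020-06-10 15:04:02.85|processing Table-E record' \
  '2020-06-10 15:10:04.45|processing Table-C record' \
  '2020-06-10 16:01:05.15|processing Table-D record' > sample.log

# Total events per hour: take chars 12-13 (the hour) and count occurrences.
cut -b 12-13 sample.log | sort | uniq -c

# The same totals, emitted as "hour<TAB>count" -- easy to feed to gnuplot
# or a spreadsheet.  Field 1 (before the '|') is "date time"; split it on
# the space and take the first two chars of the time.
awk -F'|' '{ split($1, t, " "); n[substr(t[2], 1, 2)]++ }
           END { for (h in n) printf "%s\t%d\n", h, n[h] }' sample.log
```

Run against the full spooler.log, the second form gives you one row per hour that a plotting tool can consume directly.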

Let's prepare to power down on this post.  With a handful of lesser-known utilities we can start with a process id, locate an available log file, and perform some high-level metrics on it in relatively short order.  I've found these utilities particularly useful these past few months and have acquired a newfound proficiency with them out of pure necessity.  Hopefully it is useful to you, fellow readers.

