Monday, December 17, 2018

Bash - Extracting Metrics from Unstructured Logs


Lately, the most common thing I need to do is extract metrics from semi-structured debug logs.  Depending on the complexity, the tool from the toolbox is either a quick bash script or, if a bigger hammer is needed, Python.  This post focuses on quick bash scripts and commands.

'Structured' can mean any number of things, but for now let's define a structured file as one where each line holds a predictable number of substrings separated by a unique delimiter.

For example, consider the following file snippet:

Fri Nov 30 23:06:01 CST 2018;some user log message;10072
Fri Nov 30 23:06:01 CST 2018;some user log message;1908
Fri Nov 30 23:06:01 CST 2018;some user log message;26583
Fri Nov 30 23:06:01 CST 2018;some user log message;22197
Fri Nov 30 23:06:01 CST 2018;some user log message;14374
Fri Nov 30 23:06:01 CST 2018;some user log message;1545
Fri Nov 30 23:06:01 CST 2018;some user log message;31080
Fri Nov 30 23:06:01 CST 2018;some user log message;18157
Fri Nov 30 23:06:01 CST 2018;some user log message;1606
Fri Nov 30 23:06:01 CST 2018;some user log message;19883

If our objective is to extract the last element from each line (i.e. the number), we can treat the file as simply structured, since every line has a fixed number of elements separated by a unique delimiter (i.e. ';').  Extracting the 3rd element, delimited by ';', is as simple as:

$ cat file.txt | cut -f 3 -d ';'

10072
1908
26583
22197
14374
1545
31080
18157
1606
19883
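
The same field can also be pulled with 'awk', which is handy when the value feeds a later calculation.  A minimal sketch, assuming 'awk' is installed; the function name third_field is made up for illustration and the sample lines are copied from the snippet above.

```shell
# Hypothetical helper: print the 3rd ';'-separated field of each line on stdin.
third_field() {
  awk -F';' '{ print $3 }'
}

# First three lines of the sample file, fed in through a here-document.
third_field <<'EOF'
Fri Nov 30 23:06:01 CST 2018;some user log message;10072
Fri Nov 30 23:06:01 CST 2018;some user log message;1908
Fri Nov 30 23:06:01 CST 2018;some user log message;26583
EOF
# prints 10072, 1908 and 26583, one per line
```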

But what if the number of fields to the left varies rather than staying fixed?

Fri Nov 30 23:08:23 CST 2018;some user log message;something else;18689
Fri Nov 30 23:08:23 CST 2018;some user log message;31685
Fri Nov 30 23:08:23 CST 2018;some user log message;something else;27534
Fri Nov 30 23:08:23 CST 2018;some user log message;17393
Fri Nov 30 23:08:23 CST 2018;some user log message;something else;14007
Fri Nov 30 23:08:23 CST 2018;some user log message;13763
Fri Nov 30 23:08:23 CST 2018;some user log message;something else;11165
Fri Nov 30 23:08:23 CST 2018;some user log message;28675
Fri Nov 30 23:08:23 CST 2018;some user log message;something else;28553
Fri Nov 30 23:08:23 CST 2018;some user log message;6573

Counting from the left, the number of elements ahead of the value varies from 2 to 3 depending on the line, so extracting the 3rd element as before won't work.  Counting from the right, however, the position is fixed, so if we can extract the right-most field we've got exactly what we need.

Surprisingly, the 'rev' command makes this an easy lift.  The 'rev' command takes each line and simply reverses it character-by-character.


$ echo "Easy Peezy" | rev

yzeeP ysaE



Reversing a reversed string results in the original string.  Obvious, for sure, but it is exactly what we need: reverse each line, cut the first field (originally the right-most), then reverse the result to restore it:

$ cat file.txt | rev | cut -f 1 -d ';' | rev
18689
31685
27534
17393
14007
13763
11165
28675
28553
6573
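
If 'awk' is on hand, there is another way to get the right-most field without reversing anything: 'awk' keeps the number of fields on the current line in NF, so $NF is always the last one.  A rough sketch, not a replacement for the trick above; the function name last_field is made up for illustration.

```shell
# Hypothetical helper: print the last ';'-separated field of each line on stdin,
# regardless of how many fields come before it.
last_field() {
  awk -F';' '{ print $NF }'
}

# Two sample lines from above: one with four fields, one with three.
last_field <<'EOF'
Fri Nov 30 23:08:23 CST 2018;some user log message;something else;18689
Fri Nov 30 23:08:23 CST 2018;some user log message;31685
EOF
# prints 18689 and 31685, one per line
```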


Cool, so we've got tricks for extracting specific fields for left or right-justified lines.  But what if we've got a more complicated file with even less structure?  The 'top' utility presents a pretty good example of unstructured log contents.

$ top -b -n 10 -d 1 > /tmp/top.out

Say we want to extract the idle metric, which is present on only some of the lines, in lines that are considerably unstructured from the left as well as the right.

The 'grep' utility helps out considerably here: the -o option returns only the portion of each line that matches the specified regular expression, and -h suppresses file name prefixes.


$ cat /tmp/top.out | grep -oh "[0-9]*\.[0-9] id,"
93.6 id,
79.7 id,
97.5 id,
96.8 id,
97.8 id,
97.5 id,
97.8 id,
77.2 id,
98.0 id,
97.3 id,


Pair it with an appropriate 'cut' command and we're gold.

$ cat /tmp/top.out | grep -oh "[0-9]*\.[0-9] id," | cut -f 1 -d ' '
93.6
79.7
97.5
96.8
97.8
97.5
97.8
77.2
98.0
97.3
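
To try the grep-and-cut pairing without running 'top', here is a small sketch wrapped in a function.  The function name idle_values is made up, and the input lines are hand-written to look like top's batch output; they are not captured from a real run.

```shell
# Hypothetical helper: print the idle percentage from each CPU summary line on stdin.
idle_values() {
  # -o keeps only the matched text; \. matches a literal decimal point.
  grep -o '[0-9][0-9]*\.[0-9] id,' | cut -d ' ' -f 1
}

# Made-up lines shaped like top's batch output (an assumption, not real data).
idle_values <<'EOF'
top - 23:10:01 up 1 day,  2:03,  1 user,  load average: 0.10, 0.05, 0.01
%Cpu(s):  5.1 us,  1.0 sy,  0.0 ni, 93.6 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
%Cpu(s): 17.9 us,  2.1 sy,  0.0 ni, 79.7 id,  0.2 wa,  0.0 hi,  0.1 si,  0.0 st
EOF
# prints 93.6 then 79.7
```

Lines without an "id," value, such as the header and the process table, simply produce no output.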


Again, cool, but what if we need to perform some simple statistics, like calculating the average?  Pairing with 'awk' will crush this issue.



$ cat /tmp/top.out | grep -oh "[0-9]*\.[0-9] id," | cut -f 1 -d ' ' | awk '{SUM+=$1;} END {print SUM/NR}'
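93.32

How about the median?  Sort the values numerically and pick the middle one (the lower of the two middle values when the count is even):

$ cat /tmp/top.out | grep -oh "[0-9]*\.[0-9] id," | cut -f 1 -d ' ' | sort -n | awk '{count[NR]=$1}END{print count[int((NR+1)/2)]}'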
97.3
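
A common definition of the median averages the two middle values when the count is even.  Here is a sketch of that variant as a reusable function; the name median is made up, and the sample numbers are the ten idle values extracted earlier.

```shell
# Hypothetical helper: print the median of the numbers on stdin (one per line).
median() {
  sort -n | awk '
    { v[NR] = $1 }                     # remember every value in sorted order
    END {
      if (NR == 0) exit 1              # no input: print nothing, signal failure
      if (NR % 2) print v[(NR + 1) / 2]                     # odd count: middle value
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2     # even count: mean of the two middle values
    }'
}

printf '%s\n' 93.6 79.7 97.5 96.8 97.8 97.5 97.8 77.2 98.0 97.3 | median
# prints 97.4, the mean of the two middle values 97.3 and 97.5
```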


Python, Smython.....bash prevails.
