Tuesday, June 30, 2020

Visualization of Sorting Algorithms

Some time back a trending video made its way into my life, it provided a graphical representation of common sorting algorithms.

This simple video demonstrated the value of visualizing complex algorithms, hopefully something that is incorporated into CS coursework.  The value of divide-n-conquer techniques can be more easily understood with a side-by-side comparison.

We can generate an equivalent video by monitoring the list of numbers, generate a graph at each step and render the sequence of graphs into a video.

In order to generate each graph, we could modify the sorting algorithm to create and store a graph image.  This approach however would require modification to the source code.  An alternative, use a debugger to monitor the list of numbers and print out the list of numbers whenever it is modified.  Then, post-run, we could use the debugger output to create individual graphs and stitch them together.

So, our approach will take the form:
1) create C++ sorting algorithm
2) run w/debugger and monitor the state of the input array
3) parse the debugger log, creating a graphical representation of the input array at each step
4) stitch each frame into a video

1 - Sorting Algorithm
Let's create a simple C++ example with a variety of sorting algorithms.  We'll create a randomized list of integers and prepare to send it through a sorting routine.
$ cat -n main.cpp 
     1 #include <stdio.h>
     2 #include <stdlib.h>
     3 #include <vector>
     4
     5 #ifndef N
     6 #define N 100
     7 #endif
     8
     9 void swap(int& a, int& b)
    10 {
    11   const int tmp=a;
    12   a=b;
    13   b=tmp;
    14 }
    15
    16 void display(int list[], int size)
    17 {
    18   for(int i=0; i<size; i++)
    19   {
    20     printf("%d\n",list[i]);
    21   }
    22 }
    23
    24 //============Selection Sort============
    25 void selectionSort(int list[], int size)
    26 {
    27   for(int i=0; i<size-1; i++)
    28   {
    29     for(int j=i+1; j<size; j++)
    30     {
    31       if (list[j] < list[i])
    32       {
    33         swap(list[j],list[i]);
    34       }
    35     }
    36   }
    37 }
    38 //======================================
    39
    40 //==============Quick Sort==============
    41 int partition(int *a,int start,int end)
    42 {
    43   int pivot=a[end];
    44   //P-index indicates the pivot value index
    45
    46   int P_index=start;
    47   int i,t; //t is temporary variable
    48
    49   //Here we will check if array value is
    50   //less than pivot
    51   //then we will place it at left side
    52   //by swapping
    53
    54   for(i=start;i<end;i++)
    55   {
    56       if(a[i]<=pivot)
    57       {
    58           t=a[i];
    59           a[i]=a[P_index];
    60           a[P_index]=t;
    61           P_index++;
    62       }
    63    }
    64    //Now exchanging value of
    65    //pivot and P-index
    66     t=a[end];
    67     a[end]=a[P_index];
    68     a[P_index]=t;
    69
    70    //at last returning the pivot value index
    71    return P_index;
    72 }
    73
    74 void Quicksort(int *a,int start,int end)
    75 {
    76   if(start<end)
    77   {
    78     int P_index=partition(a,start,end);
    79     Quicksort(a,start,P_index-1);
    80     Quicksort(a,P_index+1,end);
    81   }
    82 }
    83
    84 void quickSort(int list[], int size)
    85 {
    86   Quicksort(list,0,size);
    87 }
    88
    89 void insertionSort(int list[], int size) {
    90
    91   int valueToInsert;
    92   int holePosition;
    93   int i;
    94  
    95   // loop through all numbers 
    96   for(i = 1; i < size; i++) { 
    97
    98      // select a value to be inserted. 
    99      valueToInsert = list[i];
   100
   101      // select the hole position where number is to be inserted 
   102      holePosition = i;
   103
   104      // check if previous no. is larger than value to be inserted 
   105      while (holePosition > 0 && list[holePosition-1] > valueToInsert) {
   106         list[holePosition] = list[holePosition-1];
   107         holePosition--;
   108         printf(" item moved : %d\n" , list[holePosition]);
   109      }
   110
   111      if(holePosition != i) {
   112         printf(" item inserted : %d, at position : %d\n" , valueToInsert,holePosition);
   113         // insert the number at current hole
   114         list[holePosition] = valueToInsert;
   115      }
   116
   117      printf("Iteration %d#:",i);
   118   } 
   119 }
   120
   121 void sort(int list[], int size)
   122 {
   123 //selectionSort(list,size);
   124 //quickSort(list,size);
   125   insertionSort(list,size);
   126 }
   127
   128 int main(int argc, char* argv[])
   129 {
   130   printf("(%s:%d) main process initializing\n",__FILE__,__LINE__);
   131   std::vector<int> L;
   132
   133   for(int i=0; i<N; ++i)
   134   {
   135     L.push_back(rand()%100);
   136   }
   137   sort(L.data(), L.size());
   138  
   139   display(L.data(), L.size());
   140
   141   printf("(%s:%d) main process terminating\n",__FILE__,__LINE__);
   142 }
 
On line 5-8, we declare N which represents the size of the array size, then on lines 131-136 we allocate a list of integers and populate them with randomized numbers.  Line 137 we provide the list as an argument to the sort() method.  

2 - Run w/Debugger
Macros are evil so let me justify the use of N in the C++ source file.  Typically, such a macro would be replaced with a static constant allowing more type-safe implementation.  In this example though the size of the array needs to be known by the GDB commands as well.  The makefile assigns the value of N, passing it in to the compiler as a pre-compiler directive and creates the gdb command file which also needs to know the value of N.
$ cat -n Makefile 
     1 CC=g++
     2 MAIN=main
     3 N=100
     4
     5 all: video.mp4
     6
     7 ${MAIN}: ${MAIN}.cpp
     8 ${CC} -D N=${N} -g $< -o $@
     9
    10 run: ${MAIN}
    11 ${SH} ./${MAIN}
    12
    13 plot: gdb.log
    14 ${SH} ./plotter $<
    15
    16 gdb.cmd:
    17 ${SH} echo "set pagination off" > $@
    18 ${SH} echo "break sort" >> $@
    19 ${SH} echo "commands" >> $@
    20 ${SH} echo "  watch *(int(*)[${N}])(list)" >> $@
    21 ${SH} echo "  commands" >> $@
    22 ${SH} echo "    cont" >> $@
    23 ${SH} echo "  end" >> $@
    24 ${SH} echo "  cont" >> $@
    25 ${SH} echo "end" >> $@
    26 ${SH} echo "run" >> $@
    27 ${SH} echo "quit" >> $@
    28
    29 gdb.log: gdb.cmd ${MAIN}
    30 ${SH} gdb -x gdb.cmd ${MAIN} > $@
    31
    32 video.mp4: plot
    33 ${SH} ffmpeg -i frame%05d.png $@
    34
    35 clean:
    36 ${RM} ${MAIN} *.o gdb.log gdb.cmd
    37 ${RM} *.png *.mp4

Notice on line 3 we define our array size as 100.  On line 8 we pass in the value when compiling the C++ source.  On line 20 we create a gdb command file that we will use to run with the debugger.  I've recently found it useful for makefiles to create such input files as part of the build process rather than author manually and preserve in version control.  The gdb.cmd file is generated as part of the build process and in this case allows the makefile to preserve a consistent value for N in the gdb command file and the C++ source file.

Notice on line 29, we define a  gdb.log target, which runs the executable with the necessary gdb command script and generates a gdb log file.  Line 20 is the key to the gdb script, it attaches a watchpoint to array, when the array changes it spits out the old/new value of the array.  

3 - Parse the Debugger Log
The gdb log file takes the form of:
$ cat -n gdb.log | head -50
     1 GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.3) 7.7.1
     2 Copyright (C) 2014 Free Software Foundation, Inc.
     3 License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
     4 This is free software: you are free to change and redistribute it.
     5 There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
     6 and "show warranty" for details.
     7 This GDB was configured as "x86_64-linux-gnu".
     8 Type "show configuration" for configuration details.
     9 For bug reporting instructions, please see:
    10 <http://www.gnu.org/software/gdb/bugs/>.
    11 Find the GDB manual and other documentation resources online at:
    12 <http://www.gnu.org/software/gdb/documentation/>.
    13 For help, type "help".
    14 Type "apropos word" to search for commands related to "word"...
    15 Reading symbols from main...done.
    16 Breakpoint 1 at 0x400dcc: file main.cpp, line 125.
    17
    18 Breakpoint 1, sort (list=0x604270, size=100) at main.cpp:125
    19 125   insertionSort(list,size);
    20 Watchpoint 2: *(int(*)[100])(list)
    21 Watchpoint 2: *(int(*)[100])(list)
    22
    23 Old value = {83, 86, 77, 15, 93, 35, 86, 92, 49, 21, 62, 27, 90, 59, 63, 26, 40, 26, 72, 36, 11, 68, 67, 29, 82, 30, 62, 23, 67, 35, 29, 2, 22, 58, 69, 67, 93, 56, 11, 42, 29, 73, 21, 19, 84, 37, 98, 24, 15, 70, 13, 26, 91, 80, 56, 73, 62, 70, 96, 81, 5, 25, 84, 27, 36, 5, 46, 29, 13, 57, 24, 95, 82, 45, 14, 67, 34, 64, 43, 50, 87, 8, 76, 78, 88, 84, 3, 51, 54, 99, 32, 60, 76, 68, 39, 12, 26, 86, 94, 39}
    24 New value = {83, 86, 86, 15, 93, 35, 86, 92, 49, 21, 62, 27, 90, 59, 63, 26, 40, 26, 72, 36, 11, 68, 67, 29, 82, 30, 62, 23, 67, 35, 29, 2, 22, 58, 69, 67, 93, 56, 11, 42, 29, 73, 21, 19, 84, 37, 98, 24, 15, 70, 13, 26, 91, 80, 56, 73, 62, 70, 96, 81, 5, 25, 84, 27, 36, 5, 46, 29, 13, 57, 24, 95, 82, 45, 14, 67, 34, 64, 43, 50, 87, 8, 76, 78, 88, 84, 3, 51, 54, 99, 32, 60, 76, 68, 39, 12, 26, 86, 94, 39}

When the list array is modified the watchpoint breakpoint is tripped which results in lines 23-24, outputting the old/new values.  

4 - Stitch Frames into Video
Last bit, we need a simple Python script that is capable of parsing the gdb log file and generate a image representative of the state of the array.
$ cat -n plotter 
     1 #!/usr/bin/python
     2 import sys;
     3 import re;
     4 import matplotlib.pyplot as plt;
     5
     6 def plot(fileName):
     7   print "plotting %s"%(fileName);
     8   with open(fileName,'r') as fp:
     9     C=fp.read();
    10
    11   i=0;
    12   for line in C.split('\n'):
    13     m=re.match('New value = {(.*)}.*',line);
    14     if m:
    15       L=[int(el) for el in m.group(1).split(',')];
    16       print L;
    17       plt.ylim(0,100);
    18       plt.bar(range(len(L)),L,label='');
    19       plt.savefig('frame%05d.png'%(i));
    20       plt.close();
    21       i+=1;
    22   pass;
    23
    24 #---main---
    25 plot(sys.argv[1]);

This script will parse the 'New value =' lines and generate a bar graph that represents the order of the array.  For example, an initial value of a 10 element array could look like this;
Generating a series of images, during each step of the sorting routine and stitching them together into a video using FFmpeg.

Final results, i give you;

Selection Sort

Insertion Sort

Quick Sort



Sunday, June 7, 2020

Pay Now or Pay Later

These past months have been unprecedented, having placed an unusual amount of load on unsuspecting systems.  These are the days that reveal the advantage of robust systems and the consequences of fragile systems.  Robust system engineering behave as a young willow tree when the winds of demand blow; 'bend, don't break'.  Resilient systems will survive the storm and in some cases thrive in it.  Properly designed systems prevail and demonstrate their true value in times like these. 

Like a high performance engine, or a professional athlete, properly engineered computer systems require an investment in building (time and money) as well as continuous maintenance.  Lack of maintaining these systems can have dramatic consequences; like this:

https://www.cnbc.com/2020/04/06/new-jersey-seeks-cobol-programmers-to-fix-unemployment-system.html

Unemployment systems built in the 1960's provide the digital heart to the social programs for entire states.  Departments who financed the building of such systems fall prey to complacency in maintaining them and are publically pants'ed and humiliated when the consequences of their actions hit the headlines. 

No bones about it, maintaining systems is expensive it's simply a manner of how you will pay.  The cost of tech debt, or system maintenance can be viewed like that of a mortgage payment....pay it monthly, or pay it in one solid balloon payment at the end of the term; either way, you'll be paying it.  New Jersey backed into a balloon payment after years of neglect.

There are no real easy answers to these decisions, modernizing a 60 year old software-intensive system is a difficult and complex task and it's easy to be seduced to think the right call is to simply stay-fast with the existing system and I can only suspect that's what NJ did. 

The decision to not modernize isn't necessarily the lower cost option either.  Mainframes are dinosaurs in this technology era and the maintenance of these systems likely fell to single-source high-cost specialty engineering houses like IBM billing out at upwards of $200/hr; pay now and pay later.

If NJ could roll back the clock and play out the ideal narrative the hardware system would be replaced to more modern architected systems and the near-mummified COBOL would be replaced with a more modern language.  Then, single-sourced high-cost hardware would be replaced with common hardware and engineering teams well-versed in modern programming languages would open the field to all kinds of talented personnel.   This is frankly the only course of action to preserve a future of this (and other like) system.  Changes like this don't happen overnight, it's a significant time and financial investment but it requires being done, postponement costs only add to the overall cost of system replacement.