Monday, January 28, 2019

Superset Visualization - Get Your Data On

Big Data can quickly become a big problem, and one of the challenges is simply making sense of massive volumes of information.  A good visualization tool is key, and that's the topic of this lil' post.

From the creators at Airbnb (yes, the same Airbnb that brings you the concept of couch surfing on a stranger's chaise lounge) comes an open-source tool that's worth a serious look.  It's gone through a few names: originally Panoramix, renamed to Caravel, and renamed again to Superset a few months later, so be on the lookout for alternative name references when doing your own reading.  Here is a good place to start: https://www.youtube.com/watch?v=3Txm_nj_R7M

Let's walk through the installation process (performed on Ubuntu 16.04 LTS):


$ sudo apt-get install -y build-essential libssl-dev libffi-dev python-dev python-pip libsasl2-dev libldap2-dev

$ pip install --upgrade setuptools pip

$ sudo pip install superset



$ fabmanager create-admin --app superset

$ superset db upgrade

$ superset load_examples

$ superset init


At this stage, Superset should be fully installed along with a number of examples.  Superset provides a web interface that is accessible after running the server (as follows):

$ superset runserver


Point your browser to localhost:8088 and bask in your tech glory.  The interface provides a number of dashboards that demonstrate the visualization capabilities.
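If you'd rather not keep a terminal tied up while you explore, backgrounding the server with nohup works fine (the log path here is just a suggestion):

$ nohup superset runserver > /tmp/superset.log 2>&1 &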


That's all for now.  Follow-on posts will explore Superset's capabilities.  In the meantime, feel free to reference Superset's main documentation here: http://airbnb.io/projects/superset/




Friday, January 18, 2019

FFMpeg Time Machine


We installed a security camera at our house and, like most, it has the ability to capture video based on motion.  Unfortunately, robust motion detection tends to introduce latency as it accrues sufficient motion to determine that the event is significant and not something noisy, like a leaf blowing across the lawn.  The trouble is that the time leading up to the motion is often lost from the video capture and you're left with only part of the event.  For example, it's not uncommon for a video of the mail-person delivering a package to start when the person is already well within the scene, rather than capturing them as they enter it.

Ideally, what you'd want from a security system is a robust motion detection algorithm that, once a motion event has been detected, provides video leading up to the motion, say 10 seconds back and forward.  This could be accomplished by buffering video and bundling that buffer into the captured video.

This is surprisingly easy with FFMpeg and is the focus of this post.  Read on, ye seeker of FFMpeg sexiness.

Let's break down a simple implementation:

  1. capture video from a camera into 10 second segments, with a common naming convention that includes an incrementing numeric (making each file name unique)
  2. a simulated trigger event which responds by grabbing the last X segments and concatenating them into a final video file

Capture Video Segments (i.e. Buffers)

Our video source will be our USB camera.  In the interest of posterity, and to verify that our concatenation of the video segments is seamless and in-order, we'll overlay the current time on the video.  The segment muxer automagically creates video segments of the specified length, and you can specify a segment file naming convention as well.
The following example captures the camera video, applies a time-stamp overlay and generates files of the form /tmp/capture-000.mp4, /tmp/capture-001.mp4, ..., /tmp/capture-999.mp4:




$ ffmpeg -i /dev/video0 -vf "drawtext=fontfile=/usr/share/fonts/truetype/droid/DroidSans.ttf:text='%{localtime\:%T}'" -f segment -segment_time 10 -segment_format mp4 "/tmp/capture-%03d.mp4"


Concatenate Video Segments Into Final Video

Three key things need to be done in order to concatenate the video segments into the final video:

  1. determine which video files to concatenate
  2. order the video files by capture time
  3. concatenate them into the final video


The following script does precisely that.  The find command looks for files that are less than 60 seconds old and sorts them by epoch time.  Each file name is appended to a temp file, giving us an in-order list of the video segments.  FFMpeg's concat demuxer takes this list of files and concatenates them, in order, into the final video file.


$ ./grabCamEvent /tmp/foo.mp4

The above command results in a 60-70 second video file starting approximately 60 seconds ago; approximately, because the video segment length comes into play here.



$ cat grabCamEvent

#!/bin/bash
# Stitch the most recent capture segments into a single output file.
outFile=$1
tmpFile=/tmp/temp-$(date +%s)

# Find segments modified in the last 60 seconds, sort by epoch mtime,
# and append each to the concat list in capture order.
for f in $(find /tmp/ -name "capture*mp4" -newermt '60 seconds ago' -printf "%T@ %p\n" | sort -n | cut -f 2 -d ' '); do
  echo "file '$f'" >> $tmpFile
done

# Concatenate without re-encoding (-safe 0 allows the absolute paths in
# the list on ffmpeg builds that would otherwise reject them).
ffmpeg -y -f concat -safe 0 -i $tmpFile -vcodec copy $outFile
rm $tmpFile


This is primarily the foundation for a proof-of-concept.  A proper solution would also periodically delete old video files and write the video segments in a manner that doesn't burn out your hard-drive, perhaps by replacing the destination with a ram-disk.
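As a sketch of that idea, the segment muxer's -segment_wrap option recycles the segment file names once the index reaches a limit (30 files of 10 seconds each keeps roughly the last 5 minutes), and pointing the output at a tmpfs mount like /dev/shm keeps the constant writes off the physical disk.  Same capture command as before, with the drawtext overlay omitted for brevity:

$ ffmpeg -i /dev/video0 -f segment -segment_time 10 -segment_wrap 30 -segment_format mp4 "/dev/shm/capture-%03d.mp4"

The find in grabCamEvent would of course need to point at /dev/shm rather than /tmp.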

I'm genuinely puzzled, given the ease of this solution, why more security systems don't employ such a feature.

¯\_(ツ)_/¯






Sunday, January 13, 2019

Bash Printf -- Pretty, Pretty Numbers

A need I've repeatedly encountered is formatting a number with leading zeros, similar to the common printf form used in C.  Typically, I take an over-complicated approach of comparing the number to >100 or >10 and pre-pending leading zeros.

After investigating alternative approaches with good ol' Google, the better approach is shown below:


$ cat /tmp/go 
#!/bin/bash

for i in `seq 100`; do
  N=$(printf %03d $i)
  echo $N
done
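For a one-off at the prompt, the same format specifier works directly:

$ printf "%03d\n" 7
007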

Cheers.

Sunday, January 6, 2019

Poor Man's Parallelism with Make



For the past couple of years a good deal of my work has involved parsing large text datasets with Python and extracting business logic.

It always starts the same: dig through a couple days' worth of text data for something interesting.  Then process a week's worth, a month's worth, a year's worth....processing time grows linearly 'til it becomes a real burden.  In some cases you have a distributed system (like a Hadoop cluster), but more often you have a multi-core Linux system.  Today's multi-core systems often have substantial power, but only if you make use of the available cores.  This post will focus on utilizing 'make' to parallelize the sequential processing of a dataset.

The fork-join model takes a sequence of jobs, executes them in parallel, awaits their completion, and then joins the outputs into a final result.

A common case, and the one we'll be focusing on, is processing a series of files with a single utility, each file independent of the rest.

Let's first set the stage with the sequential base case.  We have a utility, processFile, which takes as command line arguments a list of files to process and outputs to stdout some result for each file (note: the result needs to be independent for each file, not a cumulative result).

$ processFile file1.txt file2.txt file3.txt ... fileN.txt

Running sequentially like this will likely execute on a single core and, depending on your system, you may find your processor and memory underutilized as a result.  Processing time grows from a cup of coffee, to a lunch break, to overnight...etc. as your file list grows.
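By the way, if you'd like to try this end-to-end without a real workload, a throwaway stand-in for processFile is enough to exercise everything that follows (hypothetical; it just emits a word count per file):

$ cat processFile
#!/bin/bash
# Hypothetical stand-in: print a word count for each file argument.
for f in "$@"; do
  echo "$f $(wc -w < "$f")"
done

$ chmod +x processFile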

In order to parallelize this form of execution we need to:
1) define our file list
2) split the file list into sublists
3) execute multiple instances of the processFile utility on each sublist simultaneously
4) wait for all sublists to be processed
5) join the output results into a single output

The following Makefile accomplishes these steps and we'll step through them.  But first, if you're not already aware: specifying parallel jobs in make is done by supplying the '-j' flag with a numeric representing the number of preferred jobs.  For example, make -j2 specifies a desire to run 2 simultaneous jobs.  This will provide our interface for specifying the number of parallel tasks.


$ cat -n Makefile
     1 .PHONY: split all clean
     2 
     3 PID := $(shell cat /proc/$$$$/status | grep PPid | awk '{print $$2}')
     4 JOBS := $(shell ps -p ${PID} -f | tail -n1 | grep -oP '\-j *\d+' | sed 's/-j//')
     5 ifeq "${JOBS}" ""
     6 JOBS := 1
     7 endif
     8 
     9 SPLIT=$(wildcard x??)
    10 OUTFILES=$(addsuffix .out,$(SPLIT))
    11 
    12 all: split
    13  ${MAKE} run.out
    14 
    15 run.out: ${OUTFILES}
    16  ${SH} cat ${OUTFILES} >> $@
    17 
    18 %.out: %
    19  $(SH) stdbuf -o0 processFile `cat $?` > $@
    20 
    21 split: flist.txt
    22  $(SH) split -l $$((`wc -l flist.txt | cut -f 1 -d ' '` / ${JOBS} )) $<
    23 
    24 flist.txt:
    25  ${SH} find /var/tmp/data -name "*.txt" > $@
    26 
    27 clean:
    28  ${RM} flist.txt x?? *.out


For purposes that should become clear, our makefile creates an flist.txt file that contains the file list we will then provide to the processFile utility.  The flist.txt target specified on line 24 is populated by issuing a find request for all *.txt files in the /var/tmp/data directory.  The result is a list of files separated by newline characters.  This accomplishes our define our file list step.  Easy peasy.  Saddle up, our next step is a bit more complicated.

In order to accomplish our split the file list into sublists step, the objective is to take the flist.txt file and divide it into J sublists, where J represents the number of parallel tasks.  We could simply hard-code this variable within the makefile (e.g. JOBS=2), which would have made the solution easier but certainly less robust.  Lines 3-7 accomplish the assignment of the JOBS variable.  Unfortunately, because the '-j' argument value isn't made directly available by make, getting the value is done in a bit of a round-about manner.  Specifically, we grab make's process id (via the parent of the $(shell ...) subprocess) and pull the jobs numeric from its command line with ps, defaulting to 1 if no jobs numeric is provided.

With the JOBS value assigned, we next want to take the flist.txt file and split it into sublists.  This is done by the split target on lines 21-22.  The syntax is a bit complicated, but all it's really doing is finding the number of lines in flist.txt and dividing it by JOBS.  Specifying 'make -j 2' will split flist.txt into 2 files using the split command.  The output files from split are generated as a series of x?? named files (e.g. xaa, xab).  The use of the '-l' argument to split ensures we preserve the integrity of the lines, no splits mid-line.  The result with JOBS=2 should be 2 files, but may be 3 if the number of lines in the file is odd.  It matters little; it's common to get a smaller final file.
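To make that rounding concrete, here's a standalone run of the same split arithmetic against a hypothetical 1001-line flist.txt:

$ wc -l flist.txt
1001 flist.txt
$ split -l $((1001 / 2)) flist.txt
$ wc -l x??
   500 xaa
   500 xab
     1 xac
  1001 total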

Two steps down...in the final stretch.

Lines 9-10 and 18-19 take care of our execute multiple instances of the processFile utility step.  Lines 9-10 list the x?? sublist file names and append a .out suffix to each.  On line 15, this results in a dependency for the run.out target of the form xaa.out xab.out ... xzz.out.  The make dependency engine then satisfies the generation of these files by executing lines 18-19 in parallel.  Make awaits the completion of all the x??.out files, which satisfies our wait for all sublists to be processed objective.

Finally, line 16 joins all the x??.out files into a single run.out file.  This satisfies our final join the output results into a single output step.
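Under the hood, with -j 2 and two sublists, the net effect is roughly the shell equivalent of doing the fork-join by hand (assuming split produced xaa and xab):

$ processFile `cat xaa` > xaa.out &
$ processFile `cat xab` > xab.out &
$ wait
$ cat xaa.out xab.out >> run.out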

Now you've got a means of parallel processing on a Ramen noodle budget.  Simply specify the number of jobs/processes you want via the -j command line arg.

$ make -j 1 all
$ make -j 2 all
...
$ make -j 16 all

Ahh, one last thing.  You may have noticed the sub-make call on line 13.  Since the split command generates the x?? files as part of executing make, the files required on lines 9-10 won't exist until that rule is executed, hence the recursive call.  I attempted the use of secondary expansion to get around the subcall but couldn't find a workable solution.

I've been using this means of cheap parallelism for a few days now on 16-core servers with wonderful results.  Processing tasks that once took 2 hours are now done in 15 minutes if I'm greedy with the cpu/core usage.