Sunday, December 29, 2019

YOLO - Computer Vision

I recently stumbled upon the You Only Look Once (YOLO) computer vision algorithm, which shows some remarkable results.  This post will give a brief introduction to the system and a few examples of use from the limited time I've spent with it recently.

YOLO takes the stance of using a classifier as a detector.  In short, the algorithm splits a frame into an SxS grid of subimages and processes each subimage under the premise that it may contain an object centered within it.  It performs some image processing to determine a series of candidate bounding boxes, then runs classifiers on each bounding box.  Each classifier returns a confidence metric, say 0-100.  So, suppose a bounding box contains a dog: the algorithm would run a 'cat' classifier on the box and get a low confidence score, then a 'bowling ball' classifier and also get a low score, ..., then a 'dog' classifier and get a high score.  The subimage tile would then be tagged as containing a dog.  The algorithm assumes each subimage tile contains no more than one object; the highest confidence metric wins.

The rest of this post will focus on quickly setting up YOLO and running it on a series of test images.  Essentially, 3 steps: 1) download and build darknet (the open-source neural network framework), 2) download the pre-trained YOLO weights, 3) run YOLO on a series of images.  Let's get started.

Install Darknet

$ git clone https://github.com/pjreddie/darknet
Cloning into 'darknet'...
remote: Enumerating objects: 5901, done.
remote: Total 5901 (delta 0), reused 0 (delta 0), pack-reused 5901
Receiving objects: 100% (5901/5901), 6.16 MiB | 4.44 MiB/s, done.
Resolving deltas: 100% (3915/3915), done.
Checking connectivity... done.
$ cd darknet; make
...

Download YOLO Weights

$ wget https://pjreddie.com/media/files/yolov3.weights -O darknet/yolov3.weights
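
The weights file is sizable (the YOLOv3 weights come in somewhere north of 200 MB), so it's worth a quick sanity check that the download completed;

$ ls -lh darknet/yolov3.weights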

Run on Images

$ cd darknet
$ ./darknet detect cfg/yolov3.cfg yolov3.weights ~/Photos/image01.jpg
$ display predictions.jpg

The predictions image will surround detected objects with bounding boxes and labels, like this:


Running YOLO on the above photo results in the following output and predictions image; 
/home/lipeltgm/Downloads/nature-cats-dogs_t800.jpg: Predicted in 76.252667 seconds.
dog: 95%
cat: 94%
person: 99%
person: 99%

YOLO found four objects, each with high confidence: one cat, one dog, and two people;

Running YOLO against my existing personal photos and ad hoc reviewing the results also looks extremely promising.  Without any pre-processing or prep, I pointed the detector at my personal archive of photos, some 6400 images of vacations, camping trips, weddings, and so on.  The process took a couple of days; I launched the darknet detect process individually for each photo, so the weights were reloaded for every image, which significantly slowed things down, but I wasn't really interested in performance as much as in the detections themselves.
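
For the curious, the batch run itself needs nothing fancier than a shell loop appending each run's output to a log; a minimal sketch, assuming the photos all sit directly under ~/Photos:

$ cd darknet
$ for f in ~/Photos/*.jpg; do ./darknet detect cfg/yolov3.cfg yolov3.weights "$f" >> bigrun.log; done   # weights reload on every image -- slow, but simple

(Each run also overwrites predictions.jpg, so you'd want to copy it aside inside the loop if you care about keeping the annotated images.)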

Here are the types of objects found in my photos:
lipeltgm@kaylee:~$ grep "^.*:" ./blog/YOLO/darknet/bigrun.log | grep -v Predic | cut -f 1 -d ':' | sort | uniq -c | sort -n
      1 baseball glove
      1 broccoli
      1 hot dog
      1 kite
      1 scissors
      2 donut
      2 mouse
      2 parking meter
      3 apple
      3 banana
      3 orange
      3 pizza
      3 skateboard
      3 snowboard
      3 zebra
      4 sandwich
      4 toothbrush
      5 train
      6 bus
      6 skis
      6 stop sign
      7 baseball bat
      7 fork
      8 giraffe
      8 toilet
      9 cow
      9 knife
      9 microwave
      9 spoon
      9 surfboard
     10 frisbee
     10 remote
     12 tennis racket
     14 aeroplane
     15 elephant
     16 oven
     18 motorbike
     18 sink
     20 wine glass
     21 vase
     22 fire hydrant
     26 bicycle
     26 sheep
     29 cake
     31 refrigerator
     34 cat
     34 suitcase
     36 teddy bear
     38 sports ball
     40 horse
     43 laptop
     49 cell phone
     54 traffic light
     64 bear
     70 bowl
     77 bed
     77 clock
     88 pottedplant
    103 bird
    106 backpack
    112 handbag
    124 sofa
    124 tvmonitor
    129 umbrella
    156 dog
    170 bottle
    187 book
    194 diningtable
    204 tie
    237 bench
    275 truck
    304 cup
    355 boat
   1029 chair
   1333 car
  14683 person
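
Since the log pairs each "Predicted" line with the detection lines that follow it (see the sample output earlier in this post), a little awk can map any detected class back to the photos that contain it.  A rough sketch, assuming that log format, using zebra as the example class:

$ awk '/Predicted/{img=$1; sub(/:$/, "", img)} /^zebra:/{print img}' ./blog/YOLO/darknet/bigrun.log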


Gotta say, pretty cool, and it located a number of random objects I didn't realize I had photos of.  Who knew I had a photo of zebras, but in fact I really do.  DisneyWorld is amazing:


Have fun with it!!


Sunday, December 22, 2019

My Journey with Computer Vision


The spring of 1997 was a particularly interesting semester in my academic career; I was immersed in two challenging yet complementary classes: Computer Graphics and Computer Vision.  This was my first introduction to the concept of computer vision, and while I'm far from an authority, I do have a recurring history of dabbling in it ever since.  This week I stumbled upon a newish object detection algorithm, and once again the computer vision mistress has set her seductive grip on me.

That newish algorithm will be the focus of a future post.  In the meantime, I wanted to spend some time pondering the general topic of computer vision; consider it a retrospective of what I learned that semester, how the focus of the technology has shifted since, and the things I wish I had known about the subject in college.

In the '90s, the subject of computer vision was heavily based on simple image processing.  'Simple' may be misleading and is in no way meant to be condescending or judgmental; rather, simple in terms of what algorithms were achievable given the constraints of the processing power of the era.

At the core of the course was this book;
http://www.cse.usf.edu/~r1k/MachineVisionBook/MachineVision.pdf

I include it because it sets the stage for the state of the discipline at the time.  In that era, computer vision was mostly image processing, with a concentration on finding object silhouettes and features and then trying to match a silhouette to a known 'good'.  This two-phased approach (detecting features, then comparing features) continues to be at the core of vision systems.  At the time, feature detection was at the forefront, with limited understanding of how to effectively compare the found features.  I'd argue that the era was primarily video/image processing rather than what we've grown to know as computer vision.  The discipline was in its primordial stage of evolution; feature detection needed to be solved before classification, and again, the resources of the time were less bountiful than what we have by today's standards.

So, followers of computer vision concentrated on image/video processing fundamentals.  We searched for ways to process each image pixel and draw relationships based on the connectivity of neighboring pixels.  We implemented various means of thresholding and a variety of filters with the objective of generating meaningful binary images or grayscale models.  Binary and/or grayscale models in hand, you were met with an unsatisfying cliff-hanger, much like the ending of The Sopranos, simply because the development of classification mechanisms was just beginning.

This brings me to the retrospective promised in the introduction: I wish I had understood the *true* reason there was such a focus on image processing, because that reason later revolutionized the course of the discipline.

Take this furry little buddy;
The course was primarily focused on generating something like this;

Something readily done today by ImageMagick;
$ convert ~/Downloads/download.jpg -canny 0x1+10%+30% /tmp/foo.jpg 

Take a minute and look at the binary image above and ask yourself....what is the purpose of that image?  Really....take a minute.....I'll wait.

If you said "to get a series of lines/features/silhouettes that represent a cat" then you'd be in lock-step with the discipline at the time.  You'd focus on generating a series of models representing a cat, take that series of pixels and find a way to calculate a confidence metric that it's truly a cat.

What if you took the same approach with this image;


A wee bit tougher now?  But that's where an alternative answer to 'why do we look for lines/features/silhouettes' propelled the course of computer vision.  The features could tell you where to look, and that realization revolutionized the study.  The traditional process was detection => classification, but what if you viewed classification as the detector?  What if we could reduce the group of cats into a series of cropped images, each with one cat, and ran a classifier on each subimage?

Take another look at the first binary image of the cat.  Draw a bounding box around the lines and what you have is an area to concentrate your attention on.  Looking at the top right of the image will get you precisely squat; the bounding box tells you where to concentrate your computer vision algorithms.  The same goes for the group of cats: with an intelligent means of grouping, you can distinguish the four regions, each containing a cat.  Run your classifier on each region and you're far more likely to detect the presence of a cat.
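
ImageMagick can even hand you that bounding box.  Here's a rough sketch, assuming the edge image from the earlier convert command was written to /tmp/foo.jpg and the original photo is ~/Downloads/download.jpg; the "%@" format escape reports the trim bounding box as WxH+X+Y, and /tmp/roi.jpg is just a placeholder name for the cropped region of interest:

$ BOX=$(convert /tmp/foo.jpg -fuzz 10% -format "%@" info:)   # bounding box enclosing the edge pixels
$ convert ~/Downloads/download.jpg -crop "$BOX" +repage /tmp/roi.jpg   # crop that region from the original photo

Hand /tmp/roi.jpg to your classifier and you're no longer wasting cycles on the empty corners of the frame.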

The computer vision algorithms evolved into a slightly different process: 1) define a series of bounding boxes, 2) run a classifier on each box.  

A future post will focus on the YOLO (You Only Look Once) algorithm, which is based on this idea.  While the concept of a classifier-based detection system pre-dates YOLO, the paper made it clear that the industry had changed and I hadn't been paying attention.

Cheers


Sunday, December 15, 2019

FFMpeg Transitions -- Part 4



The past few posts have been focusing on dynamic filter transition effects.  This post will bookend this series.  Let's finish up.

Circular Transitions

Let's dust off our high school geometry notes; we can calculate the distance of a position (X,Y) from the center of the image (W/2, H/2) using the following:

dist = sqrt((X - W/2)^2 + (Y - H/2)^2)

If we compare this distance to a linearly increasing threshold based on time, we can apply an expanding or contracting circular boundary.  In this case, we use max(W,H) to ensure the circular geometry ultimately covers the full dimensions of the video (width and/or height);



$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(gte(sqrt((X-W/2)*(X-W/2)+(H/2-Y)*(H/2-Y)),(T/1*max(W,H))),A,B)'" circleOpen.mp4


$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(lte(sqrt((X-W/2)*(X-W/2)+(H/2-Y)*(H/2-Y)),(max(W,H)-(T/1*max(W,H)))),A,B)'" circleClose.mp4


Expanding Rectangular Window

Combining the horizontal and vertical opening effects, we can get an expanding window effect;

$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(between(X,(W/2-T/1*W/2),(W/2+T/1*W/2))*between(Y,(H/2-T/1*H/2),(H/2+T/1*H/2)),B,A)'" expandingWindow.mp4
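
The reverse, a closing window that shrinks toward the center, should simply be a matter of starting the rectangle at the full frame and collapsing it over the same 1-second span.  An untested sketch along the same lines (closingWindow.mp4 is just an output name):

$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(between(X,(T/1*W/2),(W-T/1*W/2))*between(Y,(T/1*H/2),(H-T/1*H/2)),A,B)'" closingWindow.mp4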

While there are countless other possible effects, this is a decent crack at a representative sampling of such effects.

Whelp, that's it; the knowledge bank is empty.  Hope you found this series of posts of some value, perhaps just a momentary read from your palace of solitude.

Cheers.

Saturday, December 7, 2019

FFMpeg Transitions -- Part 3

Our third post concerning video transitions.  Be sure to read our previous posts; I suggest reading them once for knowledge, a second time purely for fun, and periodically thereafter for continued inspiration.

The previous posts focused primarily on the overlay filter.  This post will focus on the blend filter applied dynamically with respect to time or position.

Cross Fade Effect

The blend filter is most typically used for cross-fading from one video to another.  Briefly discussed in the first post, the general idea is to provide the filter with two video frames and a fractional weight to apply to each.  For example, a 50/50 split gives equal weight to each video.  These weights can change dynamically with respect to time.  The cross-fade effect begins by applying a weight of 1.0 to the first video and 0.0 to the second, then linearly decreases the weight of the first video while simultaneously increasing the weight of the second, ending with a weight of 0.0 for the first video and 1.0 for the second.  Easy, Peasy.

Let's take a look at the full filter example;

$ ffmpeg -i image01.mp4 -i image02.mp4  -filter_complex "[0:v][1:v]blend=all_expr='A*(1-min(T/1,1))+B*(min(T/1,1))'" blend.mp4


Note: as in past posts, the denominator in (T/1) implies that the transition will take 1 second.  Playing with that value will speed up or slow down the transition.
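
For instance, a shell variable makes the duration easy to tweak; here's a quick sketch of the same cross fade stretched to 3 seconds (DUR and blend3s.mp4 are purely illustrative names):

$ DUR=3   # transition length in seconds
$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]blend=all_expr='A*(1-min(T/$DUR,1))+B*(min(T/$DUR,1))'" blend3s.mp4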

Location-Based Blend

The previous blend filter is applied uniformly to the entire frame.  Using conditionals and the (X,Y) pixel location, we can base the blending factors on position.  Consider a simple blend condition where the left-most half of the image uses the first video and the right-most half uses the second video.  Conceptually, the blend expression would look something like this;
'if(lte(X,W/2),A,B)'

When X is to the left of the frame's horizontal center, apply A, otherwise B. 

This example isn't particularly useful on its own, but imagine applying a moving threshold to Y rather than a fixed split through the middle of the frame.  As that threshold moves from bottom to top, the effect emulates a rising curtain;


$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(lte(Y,(H-T/1*H)),A,B)'" curtainUp.mp4

Notice that rather than a fixed (H/2), the threshold starts at H and progresses linearly to 0 over the 1-second duration.

Similarly, you can perform a curtain down effect like this;
$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(gte(Y,(T/1*H)),A,B)'" curtainDown.mp4



Using the center point as a start or end point and mirroring the effect on each half of the frame opens the door to other effects, like;
$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(between(X,(W/2-T/1*W/2),(W/2+T/1*W/2)),B,A)'" verticalOpen.mp4


$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]blend=all_expr='if(between(Y,(H/2-T/1*H/2),(H/2+T/1*H/2)),B,A)'" horizontalOpen.mp4


Sunday, December 1, 2019

FFMpeg Transitions -- Part 2


This post continues on from last week's post.  Be sure to read that post to establish a foundation for the content here.

Our journey will continue to focus on creating scene transitions of the form;

In this post we will focus on wipe transitions: up, down, left, right, and the diagonals.  In general, these wipes are created by applying an overlay at a time-based position.  We will begin with left/right/up/down, which can then be combined to create the diagonals.

Wipe Right

$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]overlay=x='min(0,-W+(t/1)*W)':y=0[out]" -map "[out]" -y wipeRight.mp4

Wipe Left

$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]overlay=x='max(0,W-(t/1)*W)':y=0[out]" -map "[out]" -y wipeLeft.mp4

Wipe Down

$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]overlay=x='0':y='min(0,-H+(t/1)*H)'[out]" -map "[out]" -y wipeDown.mp4

Wipe Up

$ ffmpeg -i image01.mp4 -i image02.mp4 -filter_complex "[0:v][1:v]overlay=x='0':y='max(0,H-(t/1)*H)'[out]" -map "[out]" -y wipeUp.mp4

Diagonals

Now that we have the equations to manipulate the X or Y locations, the diagonals are created simply by applying position changes to both X and Y.

$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]overlay=x='min(0,-W+(t/1)*W)':y='min(0,-H+(t/1)*H)'[out]" -map "[out]" -y wipeRightDown.mp4
$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]overlay=x='max(0,W-(t/1)*W)':y='min(0,-H+(t/1)*H)'[out]" -map "[out]" -y wipeLeftDown.mp4
$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]overlay=x='max(0,W-(t/1)*W)':y='max(0,H-(t/1)*H)'[out]" -map "[out]" -y wipeLeftUp.mp4
$ ffmpeg -i image02.mp4 -i image01.mp4 -filter_complex "[0:v][1:v]overlay=x='min(0,-W+(t/1)*W)':y='max(0,H-(t/1)*H)'[out]" -map "[out]" -y wipeRightUp.mp4
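
To review the results back to back, the individual clips can be stitched together with ffmpeg's concat demuxer; a quick sketch (wipes.txt and allWipes.mp4 are just placeholder names):

$ for f in wipeRightDown wipeLeftDown wipeLeftUp wipeRightUp; do echo "file '$f.mp4'"; done > wipes.txt
$ ffmpeg -f concat -i wipes.txt -c copy allWipes.mp4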

Next week we will spend some time on the blend filter.

Cheers