The spring of 1997 was a particularly interesting semester in my academic career: I was immersed in two challenging yet complementary classes, Computer Graphics and Computer Vision. That was my first introduction to computer vision, and while I'm far from an authority, I've dabbled in it on and off ever since. This week I stumbled upon a newish object detection algorithm, and once again the computer vision mistress has tightened her seductive grip on me.
That newish algorithm will be the focus of a future post. In the meantime, I wanted to spend some time pondering the general topic of computer vision. Consider it a retrospective of what I learned that semester, how the technology's focus has shifted since, and the things I wish I had known about the subject in college.
In the '90s, computer vision was heavily based on simple image processing. "Simple" may be misleading; it isn't meant to be condescending or judgmental, but rather simple in terms of the algorithms achievable given the constraints of the era's processing power.
At the core of the course was this book:
http://www.cse.usf.edu/~r1k/MachineVisionBook/MachineVision.pdf
I include it because it sets the stage for the state of the discipline at the time. In that era, computer vision was mostly image processing, with a concentration on finding object silhouettes and features and then trying to match those silhouettes against a known 'good'. This two-phased approach (detecting features and comparing features) continues to be at the core of vision systems. At the time, feature detection was at the forefront, with limited understanding of how to effectively compare the found features. I'd argue that the era was primarily video/image processing rather than what we've grown to know as computer vision. The discipline was in its primordial stage of evolution: feature detection needed to be solved before classification, and the resources of the time were far less bountiful than what we have by today's standards.
So, followers of computer vision concentrated on image/video processing fundamentals. We searched for ways to process each pixel and draw relationships from the connectivity between neighboring pixels. We implemented various means of thresholding and a variety of filters, with the objective of generating meaningful binary or grayscale models. With those models in hand, you were met with an unsatisfying cliffhanger, much like the ending of The Sopranos, simply because the development of classification mechanisms was just beginning.
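As a rough illustration of the kind of processing I mean, here's a minimal global-thresholding sketch in Python with OpenCV. The tooling is my choice for this post, not what we used in 1997, and the 128 cutoff and file names are arbitrary assumptions.

# Minimal global-thresholding sketch: grayscale in, binary image out.
# OpenCV is assumed to be installed; the 128 cutoff is an arbitrary choice.
import cv2

gray = cv2.imread("cat.jpg", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)  # pixel > 128 -> 255, else 0
cv2.imwrite("cat_binary.png", binary)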
Which brings me to the retrospective I promised in the introduction: I wish I had understood the *true* reason for such a heavy focus on image processing, because that reason is what ultimately revolutionized the discipline.
Take this furry little buddy:
The course was primarily focused on generating something like this:
Something readily done today with ImageMagick:
$ convert ~/Downloads/download.jpg -canny 0x1+10%+30% /tmp/foo.jpg
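For what it's worth, here's a roughly equivalent sketch in Python with OpenCV. The 100/200 hysteresis thresholds are my guesses, not a tuned match for the ImageMagick parameters above.

# Rough OpenCV counterpart to the ImageMagick Canny command above.
# The 100/200 thresholds are assumptions, not a calibrated equivalent.
import cv2

img = cv2.imread("download.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 100, 200)   # lower/upper hysteresis thresholds
cv2.imwrite("foo.jpg", edges)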
Take a minute, look at the binary image above, and ask yourself... what is the purpose of that image? Really... take a minute... I'll wait.
If you said "to get a series of lines/features/silhouettes that represent a cat" then you'd be in lock-step with the discipline at the time. You'd focus on generating a series of models representing a cat, take that series of pixels and find a way to calculate a confidence metric that it's truly a cat.
What if you took the same approach with this image:
A wee bit tougher now? That's where an alternative answer to 'why do we look for lines/features/silhouettes?' propelled the course of computer vision. The features could tell you where to look, and this revolutionized the field. The traditional process was detection => classification, but what if you viewed classification as the means of detection? What if we could simplify the group of cats into a series of cropped images, each with one cat, and run a classifier on each subimage?
Take another look at the first binary image of the cat. Draw a bounding box around the lines, and what you have is an area on which to concentrate your attention. Looking at the top right of the image will get you precisely squat; the bounding box tells you where to concentrate your computer vision algorithms. The same goes for the group of cats: with an intelligent means of grouping, you can distinguish the four regions, each containing a cat. Run your classifier on each region and you're far more likely to detect the presence of a cat.
The computer vision algorithms evolved into a slightly different process: 1) define a series of bounding boxes, 2) run a classifier on each box.
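A minimal sketch of that two-step flow in Python with OpenCV follows. This is an illustration under assumptions, not anyone's production pipeline: is_cat() is a hypothetical stand-in for a real classifier, the 500-pixel area cutoff is arbitrary, and the return signature of findContours assumes OpenCV 4.

# Sketch of the two-step process: 1) propose bounding boxes from edges,
# 2) run a classifier on each cropped region.
# is_cat() is a hypothetical placeholder; the 500-pixel area cutoff is an
# arbitrary assumption used to skip noise. Assumes OpenCV 4's findContours.
import cv2

def is_cat(crop):
    # Stand-in for a real classifier (e.g. a trained model); always says no here.
    return False

img = cv2.imread("cats.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)

contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w * h < 500:                      # ignore tiny regions that are likely noise
        continue
    crop = img[y:y + h, x:x + w]          # the subimage the classifier actually sees
    if is_cat(crop):
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", img)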
A future post will focus on the YOLO (You Only Look Once) algorithm, which builds on this idea. While the concept of a classifier-based detection system predates YOLO, the paper made it clear that the industry had changed and that I hadn't been aware of it.
Cheers