Saturday, March 13, 2021

Processing Large Quantities of Files with Find/Exec

Photo by Markus Winkler from Pexels

 

I've always found the find command incredibly useful, but combining it with its -exec option has been powerful yet frustrating and confusing.  Often patience runs thin, and rather than take the time to learn how to use find/exec effectively for complex problems, I bunt and go back to writing a tailor-made, one-off bash script.  In the end, mission accomplished, but I always feel disappointed to fall back on a custom script when I know in my heart of hearts the job could be done quickly if I only knew how.

Today is the day.  I'm gonna spend some time getting a better handle on find/exec for a few types of problems that keep coming up.

Let's start with an easy case, one that's less common, but really easy to accomplish.  We'll build from there.

Copying Files To New File Name (Prepending/Appending)

Let's say you have a set of files that you want to copy to new names by prepending or appending a substring.  For instance, say you have a hierarchy of directories with image files that you wish to copy to a *.backup filename:

$ find . -name "*.jpg" -exec cp {} {}.backup \;

The above command finds every file whose name matches *.jpg and, for each one, executes 'cp <filename> <filename>.backup'.  In other words, on finding image01.jpg in the current directory, the exec command would be cp ./image01.jpg ./image01.jpg.backup.  This is done for every file that matches the -name pattern (a shell glob, not a regex).
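Before turning a command like this loose on real photos, it can help to try it in a throwaway directory.  The following is only a sketch; the directory layout and file names are made up for illustration.

```shell
# Build a scratch tree with a couple of hypothetical image files.
backup_demo=$(mktemp -d)
mkdir -p "$backup_demo/2021/march"
touch "$backup_demo/image01.jpg" "$backup_demo/2021/march/image02.jpg"

# Same copy command as above, pointed at the scratch tree.
# Note: replacing {} inside a longer argument ({}.backup) works with GNU find,
# but POSIX only guarantees replacement when {} is an argument by itself.
find "$backup_demo" -name "*.jpg" -exec cp {} {}.backup \;

# Show what was created.
find "$backup_demo" -name "*.backup" | sort
```

If the listing shows one .backup file next to each .jpg, the same command should behave the same way on the real tree.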

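Prepending takes a little more care.  find hands -exec the path, not just the file name (e.g. ./photos/image01.jpg), so backup-{} would expand to backup-./photos/image01.jpg, a path that normally does not exist.  Splitting the path with dirname and basename inside an inline shell command gets around this:

$ find . -name "*.jpg" -exec sh -c 'cp "$0" "$(dirname "$0")/backup-$(basename "$0")"' {} \;

In this case ./photos/image01.jpg would be copied to ./photos/backup-image01.jpg.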

Simple, fast, but not particularly useful if you're particular about the destination file names.

Replacing File Extension 

A more practical scenario is changing file extensions.  For example, say you really prefer *.jpg but have a series of files named *.jpeg.

This one is a bit trickier and takes a little more expertise, but it can readily be accomplished and understood with a bit of time.

$ find . -name "*.jpeg" -exec sh -c 'mv "$0" "${0%.jpeg}.jpg"' {} \;

The simple {} substitution on its own doesn't cut it here because we wish to manipulate the filename.  So we inline a shell command, one that can use the incoming file name as is (i.e. $0) and also manipulate it (changing .jpeg to .jpg).  That's the brief; let's dig in a bit to better understand what's going on.

Every filename handed to the inline script ends in .jpeg (guaranteed by the find -name pattern).  The filename comes in as a parameter ($0) to the shell script, so the first half of the move command would be 'mv file01.jpeg ...'.  It may be worth pointing out: those who write shell scripts will know $0 as the script name and $1 as the first argument, but with sh -c the first argument after the command string is assigned to $0, which is how we are using it.
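To see the $0 versus $1 behaviour for yourself, here is a small sketch.  The second form, which passes a placeholder word (here "sh") for $0 so the file lands in $1, is a common convention rather than anything this post depends on; the variable and file names are made up.

```shell
# The first word after the command string becomes $0.
arg_zero=$(sh -c 'printf "%s" "$0"' image01.jpeg)
echo "as \$0: $arg_zero"

# With a placeholder for $0, the file name arrives as $1 instead.
arg_one=$(sh -c 'printf "%s" "$1"' sh image01.jpeg)
echo "as \$1: $arg_one"

# The same rename written with the placeholder convention, in a scratch directory.
rename_demo=$(mktemp -d)
touch "$rename_demo/image01.jpeg"
find "$rename_demo" -name "*.jpeg" -exec sh -c 'mv "$1" "${1%.jpeg}.jpg"' sh {} \;
ls "$rename_demo"
```

Either form works; the placeholder version just keeps $0 free for the shell's own error messages.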

How about the second half of the shell command?  While it looks like Snoopy dropping curse words, it genuinely is meaningful.


The "${0%.jpeg}.jpg" is made up of two parts.  The first part, ${0%.jpeg}, is a parameter expansion that strips a matching suffix:

${var%Pattern} Remove from $var the shortest part of $Pattern that matches the back end of $var

refer to this for details: https://tldp.org/LDP/abs/html/parameter-substitution.html

Simply put, take $0 (the filename) and keep everything up to .jpeg, so image01.jpeg expands to image01.

The second half of the expression simply appends .jpg, so for image01.jpeg the whole expression yields image01.jpg.  With the existing and new file names in hand, pair them with a mv command and you're in business.
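The suffix-stripping expansion is easy to experiment with outside of find.  A minimal sketch, using a made-up file name with two dots to show how % (shortest match) differs from %% (longest match):

```shell
sample="holiday.2021.jpeg"

stem="${sample%.jpeg}"        # strip the literal .jpeg suffix
short_cut="${sample%.*}"      # shortest suffix matching .* (just the last extension)
long_cut="${sample%%.*}"      # longest suffix matching .* (everything from the first dot)
renamed="${sample%.jpeg}.jpg" # the expression used in the mv command

printf '%s\n' "$stem" "$short_cut" "$long_cut" "$renamed"
```

These expansions are part of the standard shell, so they work under plain sh as well as bash.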

Removal of Spaces in File Names

 Ugh, I'd rather step in dog shit barefoot than have spaces in my file names.  I know, it's an irrational hatred, but there it is.  Spaces in filenames are the bane of scripting; while they can be handled, a simple 3 line shell script quickly becomes immensely more complex when it has to deal with them.  But, like bedbugs, any exposure to the outside world will likely bring them into your system.  So you need to be prepared either to live with them or to have a quick means of removing them.  My latest headache was downloading a series of video files from a MOOC, resulting in files of the form 'index 1.mp4'.  So we will extend our example above, but use bash (rather than sh) to gain some substitution features.  The "${0// /_}" (a space between the '//' and the '/') means replace all instances of a space with '_'; with a single slash, "${0/ /_}" would replace only the first one.

        $ find . -name "* *.mp4" -exec bash -c 'mv "$0" "${0// /_}"' {} \;

The -name "* *.mp4" pattern limits the command to .mp4 files that actually contain a space.  Because the substitution applies to the whole path, this assumes the directory names themselves contain no spaces.
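If directory names may contain spaces too, a more defensive variant is possible.  The sketch below is an assumption-laden illustration, not a drop-in tool: it renames only the last path component, and uses -depth so that a directory's contents are handled before the directory itself is renamed.  The scratch tree and names are made up.

```shell
# Scratch tree with a space in both a directory name and a file name.
space_demo=$(mktemp -d)
mkdir -p "$space_demo/my videos"
touch "$space_demo/my videos/index 1.mp4"

# -depth: visit children before their parent directory.
# dirname/basename: replace spaces only in the final path component.
find "$space_demo" -depth -name "* *" -exec bash -c '
  dir=$(dirname "$1")
  base=$(basename "$1")
  mv "$1" "$dir/${base// /_}"
' bash {} \;

# List the result.
find "$space_demo" | sort
```

Note that mv will overwrite an existing target with the same underscored name, so a real run may want mv -i or a check first.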

Massive Media Conversion in One Command

While we've been focusing on shell scripts using 'cp' or 'mv', we aren't limited to easy commands.  Let's say we have a hierarchical folder structure of AVI files that we want to re-encode as MP4 files.

$ find . -name "*.avi" -exec sh -c 'ffmpeg -i "$0" -acodec copy "${0%.avi}.mp4"' {} \;

Cut that puppy loose on your computer and come back to a newly created list of MP4 files.
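Since a conversion like this can run for hours, a dry run that only prints the planned ffmpeg commands may be worth the extra minute.  This is a sketch under a few assumptions: the scratch files are empty placeholders, the "{} +" form hands many files to one shell, and skipping outputs that already exist is my own addition.

```shell
# Scratch tree with placeholder .avi files (empty, for illustration only).
convert_demo=$(mktemp -d)
mkdir -p "$convert_demo/clips"
touch "$convert_demo/a.avi" "$convert_demo/clips/b c.avi"

# Print, rather than run, one ffmpeg command per file.
conversion_plan=$(find "$convert_demo" -name "*.avi" -exec sh -c '
  for src in "$@"; do
    dst="${src%.avi}.mp4"
    # Skip files whose output already exists.
    [ -e "$dst" ] && continue
    printf "ffmpeg -i \"%s\" -acodec copy \"%s\"\n" "$src" "$dst"
  done
' sh {} +)

echo "$conversion_plan"
```

Once the printed commands look right, swapping the printf line for the real ffmpeg call turns the dry run into the actual conversion.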

 

Hope this helps some of you.  I feel I understand the use of exec better having worked through this.  Cheers.
