Saturday, March 13, 2021

Archiving Paper Records

As part of the 'quarantine housecleaning', I've found myself digging out old paper material from years past.  Literally hundreds of pounds of archives that have made their way hundreds of miles, through multiple residences, only to gather dust for the past 15 years in the same location.  These papers have proven useful on rare occasions, just often enough to prevent me from simply tossing them in the recycling bin.

With a surplus of time available and limited ways to spend it, I've spent the last few days scanning these documents for archival.  Terabytes are cheap, at the ready, and small in physical size...an obvious way to make a bit more room in the old office.

Single-sided archives in bundles that readily fit in the scanner hopper are easy: slide them in, hit scan and wait for them to complete.  Double-sided archives and/or bundles too large for the hopper require scanning in smaller bundles and reassembling thereafter.  I'll spend a bit of time on how I've cleared those hurdles.

Ubuntu provides a couple of PDF utilities that I used, in particular pdfseparate and pdfunite (both part of the poppler-utils package).


Let's say you have a single-sided document of a couple hundred pages and need to scan it in bundles [1..10].  These bundles can be assembled (or concatenated) into a final PDF as follows:

$ pdfunite bundle01.pdf bundle02.pdf bundle03.pdf...bundleN.pdf final.pdf

The last argument is the destination file.  The bundles are concatenated into it in the order they appear on the command line.  Listing each file by hand makes that ordering explicit, but generating the list with the following command has eaten my lunch more than once.

$ pdfunite `ls bundle*pdf` final.pdf

By default ls sorts names lexicographically (character by character), not numerically, and `ls | sort -n` can be fickle depending on how the files are named.  Reader beware: don't be surprised when files like bundle1 and bundle10 arrive adjacently, ahead of bundle2, and really screw up your final page order.

What does work well, however, is asking ls to list the files one per line in natural numeric order with ls -1v (the -v option is specific to GNU ls, which is what Ubuntu ships), for example:

$ pdfunite `ls -1v bundle*pdf` final.pdf
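
To see the difference without touching real scans, here is a minimal sketch using empty placeholder files in a scratch directory.  The file names and the demo_dir variable are made up for illustration, LC_ALL=C is set only to make the default ordering predictable, and -v assumes GNU ls.

```shell
# Scratch directory with empty placeholder files (names are illustrative only).
demo_dir=$(mktemp -d)
for n in 1 2 3 10 11; do
    touch "$demo_dir/bundle$n.pdf"
done

# Default ordering, pinned to the C locale so names compare byte by byte:
# bundle10 and bundle11 sort ahead of bundle2.
plain_order=$(cd "$demo_dir" && LC_ALL=C ls bundle*pdf | paste -sd ' ' -)
echo "ls:     $plain_order"

# Natural (version) sort via GNU ls -v: embedded numbers compare as numbers.
natural_order=$(cd "$demo_dir" && ls -1v bundle*pdf | paste -sd ' ' -)
echo "ls -1v: $natural_order"

rm -r "$demo_dir"
```

Running the same two listings against your real bundle files before calling pdfunite is a cheap way to confirm the order you are about to commit to.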


Double-sided material is a constant drag, especially for large stacks.  The process: scan the front pages of the entire bundle first, then flip them and scan the back pages...in order.  Then separate the front and back PDFs into single pages, named so that they interleave, and reassemble.


It's not uncommon to have a 100+ page double-sided stack.  We'll split it into 3 bundles, scan the fronts, flip them, scan the backs and reassemble, something like this:

$ pdfunite SCN_0001.pdf SCN_0002.pdf SCN_0003.pdf front.pdf

$ pdfunite SCN_0004.pdf SCN_0005.pdf SCN_0006.pdf back.pdf

$ pdfseparate front.pdf page%d-01.pdf
$ pdfseparate back.pdf page%d-02.pdf

This sequence should result in N front-sourced pages (page*-01.pdf) and N back-sourced pages (page*-02.pdf), and our naming convention allows them to be listed front/back in order.

We can reassemble as follows:
$ pdfunite `ls -1v page*pdf` final.pdf

These utilities ease the effort, but two pages stuck together during any stage of the scanning can really ruin your night as you sift through which page(s) are out of order.
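
One cheap guard, sketched here with empty placeholder files rather than real scans: compare the number of front and back pages, and preview the merge order before running pdfunite.  The count check is my own addition and not something the tools do for you; work_dir and the 12-sheet stack are made up for the example, and -v again assumes GNU ls.

```shell
# Empty placeholder files standing in for pdfseparate output
# (a hypothetical 12-sheet stack; work_dir is a scratch directory).
work_dir=$(mktemp -d)
for n in $(seq 1 12); do
    touch "$work_dir/page$n-01.pdf" "$work_dir/page$n-02.pdf"
done

# Fronts and backs should match; a mismatch hints at a double feed.
front_count=$(ls "$work_dir"/page*-01.pdf | wc -l)
back_count=$(ls "$work_dir"/page*-02.pdf | wc -l)
if [ "$front_count" -ne "$back_count" ]; then
    echo "page count mismatch: $front_count fronts, $back_count backs" >&2
fi

# Preview the order pdfunite would receive before building final.pdf.
merge_order=$(cd "$work_dir" && ls -1v page*pdf)
echo "$merge_order" | head -n 4

rm -r "$work_dir"
```

A matching count won't catch every misfeed, but a mismatch tells you to go looking before the pages are merged.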

So far, I'm on my way to filling the 2nd 64-gallon recycling bin, making a good deal of room for future hoarding.

