two discoveries currently worth mentioning:
-
ocrmypdf is a worthwhile wrapper for tesseract that makes quite a few tasks significantly easier
-
unpaper at first glance looks like something i've been looking for for a goddamn MINUTE for processing scans, let's see how it is
-
taking a crack at applying unpaper to earthchild:
-
'sheet' and 'page' are separate concepts: eg, a spread is two 'pages' on one 'sheet'
-
so, we have spreads, and want images of 1 page
-
--input-pages 1
--output-pages 2
...is what we want, because despite the option names saying 'pages', the counts refer to files: 1 input file per sheet, 2 output files per sheet
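rough sketch of how the file numbering should shake out with those two flags. assumptions, not checked against a real run: unpaper numbers the output files consecutively, so sheet N gives files 2N-1 and 2N, and the left page comes first. the spread/page file names are made up for illustration.

```shell
# Assumption: with --input-pages 1 --output-pages 2, sheet N (one spread)
# produces output files 2N-1 and 2N. File name patterns are hypothetical.
output_files_for_sheet() {
  sheet=$1
  left=$(( sheet * 2 - 1 ))    # first output file for this sheet
  right=$(( sheet * 2 ))       # second output file for this sheet
  # printf expands %03d to a zero-padded 3-digit number, like the
  # %03d placeholder in unpaper's file name patterns
  printf 'spread%03d.pgm -> page%03d.pgm page%03d.pgm\n' "$sheet" "$left" "$right"
}

# Print the expected mapping for the first three spreads
for sheet in 1 2 3; do
  output_files_for_sheet "$sheet"
done
```

handy as a sanity check after a real run: if the number of output files isn't exactly double the number of spreads, something got skipped.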
the word i would use to describe the prose of unpaper's documentation is 'turgid'
unpaper --layout double input%03d.pbm output%03d.pbm
unpaper --layout double (...options...) \
--output-pages 2 \
doublepage%03d.pgm singlepage%03d.pgm
need to set 'mask' for proper processing, and we would certainly like masks to be automatically detected...
this 'mask scan detection' algorithm, though, doesn't sound like it'll do what i want.