𐂠 Diary 02 17 2025

two discoveries currently worth mentioning:

  • ocrmypdf is a worthwhile wrapper for tesseract that makes quite a few tasks significantly easier

  • unpaper at first glance looks like something i've been looking for for a goddamn MINUTE for processing scans, let's see how it is

  • taking a crack at applying unpaper to earthchild:

    • 'sheet' and 'page' are separate concepts: e.g., a spread is two 'pages' on one 'sheet'

    • so, we have spreads (one sheet per input file), and want one image per page

--input-pages 1
--output-pages 2

...is what we want, because despite the option names saying 'pages', the numbers count image files per sheet: one file in, two files out


the word i would use to describe the prose of unpaper's documentation is 'turgid'

unpaper --layout double input%03d.pbm output%03d.pbm
unpaper --layout double (...options...) \
  --output-pages 2 \
  doublepage%03d.pgm singlepage%03d.pgm
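
quick sanity sketch of what --output-pages 2 should mean for file naming. this assumes unpaper numbers its output files consecutively starting at 1, so spread N lands in files 2N-1 and 2N (i haven't verified that against a real run). pages_for_sheet is a made-up helper, not part of unpaper, and it only prints names, it doesn't run anything:

```shell
#!/bin/sh
# Hypothetical helper (not part of unpaper): print the two output file names
# expected for spread number $1, ASSUMING outputs are numbered consecutively
# from 1 and named with the singlepage%03d.pgm pattern.
pages_for_sheet() {
  sheet=$1
  left=$(( sheet * 2 - 1 ))   # first page cut from the spread
  right=$(( sheet * 2 ))      # second page cut from the spread
  printf 'singlepage%03d.pgm singlepage%03d.pgm\n' "$left" "$right"
}

pages_for_sheet 1    # singlepage001.pgm singlepage002.pgm
pages_for_sheet 12   # singlepage023.pgm singlepage024.pgm
```

if the real output doesn't match this after a test run on a couple of earthchild spreads, the numbering assumption is wrong, not the scans.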

need to set 'mask' for proper processing, and we would certainly like masks to be automatically detected...

this 'mask scan detection' algorithm, though, doesn't sound like it'll do what i want.