GuidesOcr

tesseract

dumping ocr text

pdf -> images

convert -density 200 porn.pdf -quality 100 o%03d.png

(convert is an imagemagick command)
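
rough sketch of what -density 200 works out to in pixels, assuming the pdf page size is known in points (1/72 inch). px_for_points is a made-up helper name, not part of imagemagick:

```shell
# px_for_points POINTS DENSITY
# pixels along one side of a page that is POINTS long (1 pt = 1/72 inch)
# when rasterized at DENSITY dots per inch; integer division rounds down
px_for_points() {
    echo $(( $1 * $2 / 72 ))
}

# us letter is 612 x 792 pt
px_for_points 612 200
px_for_points 792 200
```

so a letter page at -density 200 comes out at roughly 1700 x 2200 px; higher density gives tesseract more pixels per glyph at the cost of bigger files.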

images -> ocr

install tesseract

for file in *.png; do tesseract "$file" "${file%.png}"; done
cat o???.txt > all.txt
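
if page boundaries matter later, a sketch of the same cat step that also marks where each page starts. join_pages and the ==== marker line are made up for illustration:

```shell
# join_pages DIR OUT
# concatenate DIR/o???.txt into OUT in glob (page) order,
# writing a marker line before each page's text
join_pages() {
    jp_dir=$1
    jp_out=$2
    : > "$jp_out"
    for jp_txt in "$jp_dir"/o???.txt; do
        # an unmatched glob stays literal, so skip names that do not exist
        [ -e "$jp_txt" ] || continue
        printf '==== %s ====\n' "$(basename "$jp_txt" .txt)" >> "$jp_out"
        cat "$jp_txt" >> "$jp_out"
    done
}
```

usage would be something like: join_pages . all.txt
the zero-padded names from o%03d.png are what keep the glob in page order.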

multiple languages

e.g. eng and rus for 'leaf of spring': join the language codes with + and pass them to -l
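
a small sketch for building the -l value from a list of language codes. lang_arg is a made-up helper name:

```shell
# lang_arg CODE...
# join language codes with '+', the form tesseract's -l option takes
# (the body runs in a subshell so the IFS change does not leak out)
lang_arg() (
    IFS=+
    printf '%s\n' "$*"
)

lang_arg eng rus   # prints eng+rus
```

tesseract --list-langs shows which language data is actually installed. a call would then look like: tesseract o000.png o000 -l "$(lang_arg eng rus)"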

ocr a pdf in-place

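tesseract o000.png leaf_of_spring -l eng+rus pdf

(tesseract reads images, not pdfs, so feed it a page image made with convert; the output base name comes second, and the trailing pdf makes it write a searchable leaf_of_spring.pdf)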
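
for a whole pdf, a sketch that chains the two steps: rasterize with convert, then hand tesseract a text file listing the page images (it accepts such a list in place of a single image). write_page_list, ocr_pdf and the temp-dir layout are made up for illustration, and nothing runs until ocr_pdf is called:

```shell
# write_page_list DIR LIST
# record DIR/o???.png in LIST, one path per line, in page order
write_page_list() {
    wl_dir=$1
    wl_list=$2
    : > "$wl_list"
    for wl_png in "$wl_dir"/o???.png; do
        [ -e "$wl_png" ] || continue
        printf '%s\n' "$wl_png" >> "$wl_list"
    done
}

# ocr_pdf IN.pdf OUTBASE LANGS
# rasterize IN.pdf into a temp dir, then OCR every page into OUTBASE.pdf
ocr_pdf() {
    op_in=$1
    op_base=$2
    op_langs=$3
    op_work=$(mktemp -d) || return 1
    convert -density 200 "$op_in" -quality 100 "$op_work/o%03d.png" || return 1
    write_page_list "$op_work" "$op_work/pages.txt"
    tesseract "$op_work/pages.txt" "$op_base" -l "$op_langs" pdf
    op_status=$?
    rm -rf "$op_work"
    return "$op_status"
}
```

e.g. ocr_pdf leaf_of_spring.pdf leaf_of_spring_ocr eng+rus would write leaf_of_spring_ocr.pdf next to the original instead of overwriting it.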

pdftotext -layout images/toc.pdf -
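
pdftotext also works as a quick check that a pdf ended up with a text layer. a sketch, assuming 'has at least one letter or digit' is a good enough test. has_text and pdf_has_text are made-up names:

```shell
# has_text: succeed if stdin contains at least one letter or digit
has_text() {
    grep -q '[[:alnum:]]'
}

# pdf_has_text FILE.pdf: succeed if pdftotext extracts any text from FILE.pdf
pdf_has_text() {
    pdftotext -layout "$1" - | has_text
}
```

e.g. pdf_has_text images/toc.pdf && echo searchable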
