tesseract
dumping ocr text
pdf -> images
convert -density 200 porn.pdf -quality 100 o%03d.png
(convert is an imagemagick command)
images -> ocr
install tesseract
for file in *.png; do tesseract "$file" "${file%.png}"; done
cat o???.txt > all.txt
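the two steps above can be wrapped in one function. a rough sketch only: ocr_dir and its three arguments are made-up names for illustration, and it assumes tesseract is on the PATH and the page images are *.png in one directory.

```shell
# Sketch: OCR every PNG in a directory and join the text into one file.
# ocr_dir and its arguments are illustrative names, not part of tesseract.
ocr_dir() {
    dir=${1:-.}          # directory holding the page images
    out=${2:-all.txt}    # name of the combined text file, written inside $dir
    lang=${3:-eng}       # tesseract language code(s), e.g. eng+rus

    : > "$dir/$out"      # start from an empty combined file
    for file in "$dir"/*.png; do
        [ -e "$file" ] || continue              # no PNGs: skip the unexpanded glob
        base=${file%.png}                       # o000.png -> o000
        tesseract "$file" "$base" -l "$lang" || return 1   # writes $base.txt
        cat "$base.txt" >> "$dir/$out"          # globs expand sorted, so pages stay in order
    done
}
```

usage: ocr_dir . all.txt eng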
multiple languages
e.g. eng and rus for 'leaf of spring': join the language codes with + after -l
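each language needs its traineddata installed. a small sketch for checking that before a long run: have_langs is a made-up helper name, and it assumes `tesseract --list-langs` prints one language code per line after a header line (stderr is merged in case a build prints the list there).

```shell
# Sketch: check that every code in a string like eng+rus is installed.
# have_langs is an illustrative helper, not a tesseract command.
have_langs() {
    installed=$(tesseract --list-langs 2>&1) || return 1
    # split eng+rus into separate codes
    for code in $(printf '%s' "$1" | tr '+' ' '); do
        # -x: whole-line match, -F: treat the code as a fixed string
        if ! printf '%s\n' "$installed" | grep -qxF "$code"; then
            echo "missing tesseract language: $code" >&2
            return 1
        fi
    done
}
```

usage: have_langs eng+rus && echo ok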
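ocr a pdf in-place
(tesseract reads images, not pdfs: convert the pages to images first, as above;
the second argument is the output base, and the trailing pdf makes tesseract
write a searchable leaf_of_spring.pdf)
tesseract leaf_of_spring.png leaf_of_spring -l eng+rus pdf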
pdftotext -layout images/toc.pdf -
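for a multi-page pdf, one possible approach is to render the pages with convert and give tesseract a text file listing the page images (it accepts such a list as input). a sketch under assumptions: searchable_pdf is a made-up name, and it writes NAME_ocr.pdf next to the source instead of overwriting it, which is a choice made here, not something tesseract requires.

```shell
# Sketch: pdf -> page images -> one searchable pdf.
# searchable_pdf and the _ocr suffix are illustrative choices.
searchable_pdf() {
    src=$1                    # input pdf, e.g. leaf_of_spring.pdf
    lang=${2:-eng}            # language code(s), e.g. eng+rus
    out=${src%.pdf}_ocr       # output base; tesseract appends .pdf
    work=$(mktemp -d) || return 1

    # render pages, list them one per line, then OCR the whole list
    convert -density 200 "$src" -quality 100 "$work/o%03d.png" &&
    printf '%s\n' "$work"/o*.png > "$work/pages.txt" &&
    tesseract "$work/pages.txt" "$out" -l "$lang" pdf
    status=$?

    rm -r "$work"             # remove the temporary page images
    return $status
}
```

usage: searchable_pdf leaf_of_spring.pdf eng+rus, then check the text layer with pdftotext -layout leaf_of_spring_ocr.pdf -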