The trouble with pdffonts is that sometimes it returns nothing but the header, like this:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
And sometimes it returns this:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none]                               Type 3            yes no  no     266  0
[none]                               Type 3            yes no  no       9  0
[none]                               Type 3            yes no  no     297  0
[none]                               Type 3            yes no  no     341  0
[none]                               Type 3            yes no  no     381  0
[none]                               Type 3            yes no  no     394  0
[none]                               Type 3            yes no  no     428  0
[none]                               Type 3            yes no  no     441  0
[none]                               Type 3            yes no  no     451  0
[none]                               Type 3            yes no  no     480  0
[none]                               Type 3            yes no  no     492  0
[none]                               Type 3            yes no  no     510  0
[none]                               Type 3            yes no  no     524  0
[none]                               Type 3            yes no  no     560  0
[none]                               Type 3            yes no  no     573  0
[none]                               Type 3            yes no  no     584  0
[none]                               Type 3            yes no  no     593  0
[none]                               Type 3            yes no  no     601  0
[none]                               Type 3            yes no  no     644  0
With that in mind, let's write a little text tool to get all the fonts from a pdf:
pdffonts my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq
If your pdf is not OCR'ed, this will output nothing or [none].
If you want it to run faster, use the -l flag to only analyze, say, the first 5 pages:
pdffonts -l 5 my-doc.pdf | tail -n +3 | cut -d' ' -f1 | sort | uniq
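To see what each stage of that pipeline does without a PDF at hand, here is a small sketch that runs the same tail/cut/sort steps over two rows copied from the Type 3 listing above. The function name font_names is my own invention, not part of pdffonts, and sort -u is used as shorthand for sort | uniq.

```shell
#!/bin/bash
# font_names: hypothetical helper. Reads pdffonts output on stdin and
# prints each distinct value of the first column (the font name).
font_names() {
    # tail -n +3 : skip the two header lines
    # cut -f1    : keep everything before the first space
    # sort -u    : sort and drop duplicates
    tail -n +3 | cut -d' ' -f1 | sort -u
}

# Sample input: the header plus two rows from the listing above.
sample='name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
[none] Type 3 yes no no 266 0
[none] Type 3 yes no no 9 0'

# Both rows collapse into a single [none] line.
printf '%s\n' "$sample" | font_names
```

Feeding it only the two header lines prints nothing, which is the other "not OCR'ed" case.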
Now wrap it in a bash script, e.g. is-pdf-ocred.sh:
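#!/bin/bash
# Collect the distinct font names from the first 5 pages of the PDF given as $1.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only unnamed ([none]) fonts, is treated as not OCR'ed.
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "NOT OCR'ed: $1"
else
    echo "$1 is OCR'ed."
fi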
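If you would rather have the script report through its exit status than through a message, one possible variant is sketched below. It applies the same rule (no names, or only [none], means not OCR'ed), but the helper names has_real_fonts and is_ocred are made up for this example.

```shell
#!/bin/bash
# has_real_fonts: hypothetical helper. Reads font names on stdin, one per
# line, and succeeds if at least one is non-empty and not exactly "[none]".
has_real_fonts() {
    # First grep drops lines that are exactly [none];
    # second grep succeeds only if a non-empty line is left.
    grep -v -x -F '[none]' | grep -q .
}

# is_ocred: hypothetical wrapper around the pipeline used in this answer.
is_ocred() {
    pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | has_real_fonts
}

# When run as a script with a file argument, the exit status is the answer:
# 0 means named fonts were found, non-zero means none were.
if [ "$#" -gt 0 ]; then
    is_ocred "$1"
fi
```

Saved under some name of your choosing, a script like this can be tested directly by find: putting ! -exec /path/to/that/script '{}' \; in front of -print lists only the files for which it fails, that is, the ones with no named fonts.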
Finally, we want to be able to search for pdfs. The find command does not know about your aliases or functions in .bashrc, so we need to give it the path to the script.
Run it in your directory of choice like so:
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \;
I'm assuming that the pdf files end in .pdf, although this is not always an assumption you can make. Note also that -name is case-sensitive, so a file ending in .PDF would be skipped.
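If the file names cannot be trusted, one workaround is to look at the content instead. The sketch below lists every regular file whose first five bytes are %PDF-, which is how PDF files normally begin. It assumes the header sits at the very start of the file, and the function name find_pdfs_by_header is only illustrative.

```shell
#!/bin/bash
# find_pdfs_by_header: hypothetical helper. Prints the path of every regular
# file under the given directory (default: .) that starts with "%PDF-".
find_pdfs_by_header() {
    # find hands batches of paths to a small inline sh script, which
    # sidesteps the problem of find not seeing shell functions.
    find "${1:-.}" -type f -exec sh -c '
        for f do
            if [ "$(head -c 5 "$f" 2>/dev/null)" = "%PDF-" ]; then
                printf "%s\n" "$f"
            fi
        done' sh {} +
}
```

Each path it prints can then be passed to is-pdf-ocred.sh. Expect this to be slower than matching on the name, since every file has to be opened.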
You will probably want to pipe it to less or output it into a text file:
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; | less
find . -type f -name "*.pdf" -exec /path/to/my/script/is-pdf-ocred.sh '{}' \; > pdfs.txt
I was able to do about 200 pdfs in a little more than 10 seconds using the -l 5 flag.