Several years ago I wrote a very popular post on this site about how to automatically OCR documents with Hazel and PDFpen. It occurs to me that I’ve changed my method in the years since that post, so it is time for an update.
Optical Character Recognition (OCR) is a magical thing. Normally, when you scan something it is processed simply as a flat image. OCR is the process of converting scanned images of typed (or sometimes printed) text into machine-readable text. Once a PDF document is readable by the computer, there are any number of actions the computer can take on that document (see the article above). Thus the OCR is a critical first step.
I’ve set up a Hazel rule to monitor my downloads folder for any PDF that is not already OCR’ed. I determine whether the file has already been OCR’ed based on whether it was created by ScanSnap (since I OCR all my scans) and whether its contents include a vowel. If those criteria are not met, Hazel runs an AppleScript that kicks off PDFpen Pro to OCR the document and then quits PDFpen when finished. The file is then tagged as having been OCR’ed so Hazel doesn’t keep repeating the same action on it. The Hazel rule looks like this:
Here’s the embedded script, courtesy of Greg Scown, one of the developers of PDFpen. Note that if you use PDFpen instead of PDFpen Pro, you’ll have to alter the script accordingly.
tell application "PDFpenPro"
    open theFile as alias
    -- does the document need to be OCR'd?
    get the needs ocr of document 1
    if result is true then
        tell document 1
            ocr
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
        --In PDFpen, when no documents are open, window 1 is "Preferences"
        --If other documents are open, do not close the App.
        if name of window 1 is "Preferences" then
            tell application "PDFpenPro"
                quit
            end tell
        end if
    else
        -- Scan Doc was previously OCR'd or is already a text type PDF.
        tell document 1
            close without saving
        end tell
        --In PDFpen, when no documents are open, window 1 is "Preferences"
        --If other documents are open, do not close the App.
        if name of window 1 is "Preferences" then
            tell application "PDFpenPro"
                quit
            end tell
        end if
    end if
end tell
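For anyone on PDFpen rather than PDFpen Pro, the alteration is probably just the application name. Here is a minimal, untested sketch of what that might look like. It assumes the non-Pro app is scriptable as "PDFpen" and understands the same `needs ocr`, `ocr`, and `performing ocr` terms. It also folds the duplicated quit check into one place, which is a restructuring of my own and not part of Greg's script. As before, `theFile` is the variable Hazel hands to an embedded script.

```applescript
-- Sketch only. Assumption: the app is scriptable under the name "PDFpen"
-- and uses the same scripting terms as PDFpen Pro.
-- theFile is supplied by Hazel when this runs as an embedded script.
tell application "PDFpen"
	open theFile as alias
	tell document 1
		if needs ocr then
			-- Start OCR and wait until PDFpen reports it has finished.
			ocr
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		else
			-- Already OCR'd, or already a text-type PDF: leave it untouched.
			close without saving
		end if
	end tell
	-- When no documents are open, window 1 is "Preferences";
	-- only quit if nothing else is open.
	if name of window 1 is "Preferences" then quit
end tell
```

If the script fails to compile in Hazel, check the app's scripting dictionary in Script Editor to confirm the application name and terms for your installed version.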