Automatically OCR PDFs with Hazel and PDFpen (2017 Edition)

Several years ago I wrote a very popular post on this site about how to automatically OCR documents with Hazel and PDFpen. I’ve changed my method in the years since, so it’s time for an update.

Optical Character Recognition (OCR) is a magical thing. Normally, when you scan something, it is processed simply as a flat image. OCR is the process of converting scanned images of typed (or sometimes printed) text into machine-readable text. Once a PDF document is readable by the computer, there are any number of actions the computer can take on that document (see the earlier post mentioned above). That makes OCR a critical first step.

I’ve set up a Hazel rule to monitor my Downloads folder for any PDF that has not already been OCR’ed. I decide whether a file has already been OCR’ed based on whether it was created by ScanSnap (since I OCR all my scans) and whether its contents include a vowel. If those criteria are not met, Hazel runs an AppleScript that kicks off PDFpen Pro to OCR the document, quits PDFpen when finished, and tags the file as OCR’ed so Hazel doesn’t keep repeating the same action on it. The Hazel rule looks like this:

[Screenshot: the Hazel rule, captured 2017-01-14]

Here’s the embedded script, courtesy of Greg Scown, one of the developers of PDFpen. Note that if you use PDFpen instead of PDFpen Pro, you’ll have to alter the script accordingly.

tell application "PDFpenPro"
    open theFile as alias
    -- does the document need to be OCR'd?
    get the needs ocr of document 1
    if result is true then
        tell document 1
            ocr
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
        --In PDFpen, when no documents are open, window 1 is "Preferences"
        --If other documents are open, do not close the App.
        if name of window 1 is "Preferences" then
            tell application "PDFpenPro"
                quit
            end tell
        end if
    else
        -- Scan Doc was previously OCR'd or is already a text type PDF.
        tell document 1
            close without saving
        end tell
        --In PDFpen, when no documents are open, window 1 is "Preferences"
        --If other documents are open, do not close the App.
        if name of window 1 is "Preferences" then
            tell application "PDFpenPro"
                quit
            end tell
        end if
    end if
end tell
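
For readers on the standard edition, here is a rough sketch of what "altering the script accordingly" might look like. It is not Greg's script and I haven't tested it against every PDFpen version. It assumes the standard app is scriptable under the name "PDFpen" and exposes the same terms (needs ocr, ocr, performing ocr) as PDFpen Pro. It also moves the "quit if nothing else is open" check to a single place at the end, which should behave the same as the two copies in the original.

```applescript
-- Sketch only. Assumptions: the app is addressable as "PDFpen" and its
-- scripting dictionary matches the PDFpen Pro terms used above.
-- theFile is supplied by Hazel when this runs as an embedded script.
tell application "PDFpen"
    open theFile as alias
    -- does the document need to be OCR'd?
    if (needs ocr of document 1) then
        tell document 1
            ocr
            -- wait for recognition to finish before saving
            repeat while performing ocr
                delay 1
            end repeat
            delay 1
            close with saving
        end tell
    else
        -- previously OCR'd or already a text-type PDF: leave it untouched
        tell document 1
            close without saving
        end tell
    end if
    -- As in the original: when no documents are open, window 1 is
    -- "Preferences". If other documents are open, do not quit the app.
    if name of window 1 is "Preferences" then
        quit
    end if
end tell
```

If the script fails to compile in Hazel's script editor, the first thing to check is the application name in the tell line, since that is the part most likely to differ between editions.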