OCR

An image-only PDF is typically created by scanning a hardcopy document with a document scanner. It contains only picture(s) of the scanned page(s); there is no text layer.

A text-searchable PDF, by contrast, contains both the picture(s) of the scanned page(s) and a text layer containing every word shown on each page. That text layer allows words to be searched for or copied from the page.

The OCR pre-processor generates OCR text from the pages of a PDF document. The data in that text layer can be passed into RIA Fields, or passed into an OpenAI model for analysis and data extraction.

The OCR screen looks like this:

Stage Settings

Process Subject File

Always enabled.

Process Attachments

Enable this option if you want to OCR attachment pages.

Process Generated

Usually enabled.

OCR Resolution

Leave at 300 DPI.

Skip Pages With Text

Enable this option if your PDF pages are already text searchable and you want to keep the existing text layer rather than re-OCR each page to create a new one.
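The effect of this option can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name `pages_to_ocr` is hypothetical, and it assumes a page counts as "having text" when its existing text layer contains any non-whitespace characters.

```python
from typing import List


def pages_to_ocr(page_texts: List[str], skip_pages_with_text: bool) -> List[int]:
    """Return the 1-based page numbers that would be sent to OCR.

    page_texts holds the text already present in each page's text layer;
    an empty or whitespace-only string stands for an image-only page.
    (Assumption: any non-whitespace text means the page has a text layer.)
    """
    selected = []
    for number, text in enumerate(page_texts, start=1):
        has_text_layer = bool(text.strip())
        if skip_pages_with_text and has_text_layer:
            continue  # keep the existing text layer untouched
        selected.append(number)
    return selected


# Hypothetical three-page document: page 2 is an image-only scan.
existing_text = ["Invoice 1001", "", "Terms and conditions"]
print(pages_to_ocr(existing_text, skip_pages_with_text=True))   # [2]
print(pages_to_ocr(existing_text, skip_pages_with_text=False))  # [1, 2, 3]
```

With the option enabled, only the image-only page is OCRed; with it disabled, every page is OCRed and any existing text layer is replaced.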

Remove Blank Pages

Enable this option if you want to remove blank document pages during the file converter processing.

Rotate

Enable this option if you want the OCR engine to rotate the page to the correct orientation (portrait or landscape) based on the OCR text orientation.

Deskew

Enable this option if you want the OCR engine to deskew (straighten) the page so that it is as close to 100% straight as possible. This is mainly used to straighten documents generated by scanning hardcopy paper documents, where the images may be skewed to the left or right as the pages are scanned.

Include OCR Text Layer

Usually enabled.

PDF Optimisation

When enabled, may help to reduce the size of the PDF file.

Concurrent Threads

Defaults to 1 OCR thread.

A single thread takes between 3 and 6 seconds to OCR a single page.
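As a rough worked example, the per-page figure can be turned into a time estimate for a whole document. The sketch below is an assumption-laden illustration: the function name `estimate_ocr_seconds` is hypothetical, and it assumes one document is processed by one thread at a steady 3 to 6 seconds per page, which real documents and hardware will not match exactly.

```python
from typing import Tuple

# Per-page range quoted for a single OCR thread.
SECONDS_PER_PAGE_MIN = 3
SECONDS_PER_PAGE_MAX = 6


def estimate_ocr_seconds(pages: int) -> Tuple[int, int]:
    """Return a (best case, worst case) estimate in seconds for one document.

    Assumes a single thread works through the document page by page at a
    constant rate; this is a simplification, not a measured guarantee.
    """
    if pages < 0:
        raise ValueError("pages must be zero or greater")
    return pages * SECONDS_PER_PAGE_MIN, pages * SECONDS_PER_PAGE_MAX


for page_count in (3, 500):
    best, worst = estimate_ocr_seconds(page_count)
    print(f"{page_count} pages: {best} to {worst} seconds")
```

Under these assumptions a 3-page document takes roughly 9 to 18 seconds, while a 500-page document takes 1500 to 3000 seconds, or about 25 to 50 minutes.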

Make sure your server is deployed on hardware with a high CPU clock frequency. The higher the CPU clock rate, the faster the OCR.

Pages with little or no text take less time to OCR than pages with hundreds of words to OCR.

Pages with graphics and shading will slow down the OCR process.

Images scanned at 300 DPI are optimal for OCR processing.

Performing OCR on a small document with 1-3 pages is fairly quick.

Performing OCR on a large document with 500 pages will cause a massive OCR bottleneck: other documents are stuck waiting for that document to finish before their OCR can run.

Consider limiting the size of files sent to OCR, running OCR on large files only between 7pm and 5am, or enabling more OCR threads.
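One way to picture the off-peak suggestion is a small routing rule that lets small documents through immediately and holds large ones until the 7pm to 5am window. This is a hypothetical sketch rather than a product feature: the names `should_ocr_now` and `in_off_peak_window`, the use of page count as the size measure, and the 100-page threshold are all assumptions chosen for illustration.

```python
from datetime import time

# Hypothetical threshold; only the 7pm-5am window comes from the guidance above.
LARGE_DOCUMENT_PAGES = 100
WINDOW_START = time(19, 0)  # 7pm
WINDOW_END = time(5, 0)     # 5am


def in_off_peak_window(now: time) -> bool:
    """True when `now` falls between 7pm and 5am (the window crosses midnight)."""
    return now >= WINDOW_START or now < WINDOW_END


def should_ocr_now(pages: int, now: time) -> bool:
    """Small documents run immediately; large ones wait for the off-peak window."""
    if pages < LARGE_DOCUMENT_PAGES:
        return True
    return in_off_peak_window(now)


print(should_ocr_now(3, time(12, 0)))     # True: small document, any time
print(should_ocr_now(500, time(12, 0)))   # False: large document at midday
print(should_ocr_now(500, time(23, 30)))  # True: large document inside the window
```

Because the window spans midnight, the check uses "after 7pm or before 5am" rather than a simple between comparison.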