HOW TO: Use OCR AutoPreprocessing With Image-Over-Text PDFs

#1 Posted : Friday, July 12, 2019 4:55:12 PM(UTC)

Joe Kerrigan


Previous how-to posts have outlined the process of creating a searchable PDF using the LEADTOOLS OCR toolkit. Using the PDF image-over-text feature, we can ensure that resultant searchable documents are visually identical to their input documents. It may, however, be advantageous to apply some image processing prior to running OCR. In particular, IOcrPage.AutoPreprocess() can improve OCR results by rotating, deskewing, or inverting parts of the document. Other image processing steps such as binarization (using AutoBinarizeCommand) can also improve OCR results in some circumstances.

Steps like binarization, however, alter the image in ways that might not be desirable in the output, even if they boost OCR accuracy. What if we could use one version of the document for OCR and another for display in the final PDF? The snippet below does exactly this: as it processes a document, it keeps two copies of each page. One is a "working copy" to which we can apply whichever image processing commands improve recognition. The other is the "presentable" version, which is left more-or-less untouched throughout the whole process and is the one included in the final PDF.

The solution has to handle one final issue introduced by IOcrPage.AutoPreprocess. Depending on the flags passed to AutoPreprocess, the image might be rotated slightly before the OCR engine recognizes it. If we do not apply the same rotation to our "presentable" image, the recognized text boxes will be misaligned with the visible text in the final PDF. To account for this, we can call IOcrPage.GetPreprocessValues. The OcrPageAutoPreprocessValues object it returns describes the transformation applied to the image before it was passed in for OCR.
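As a side illustration of why the misalignment happens, here is a minimal sketch that does not use LEADTOOLS at all. It assumes a quarter-turn clockwise rotation (the simplest case to follow by hand) and shows that a text box found on the rotated page sits at different coordinates on the unrotated page. The Box type and the UndoClockwise90 helper are hypothetical names made up for this example.

```csharp
using System;

// Hypothetical helper type for this illustration; it is not part of LEADTOOLS.
struct Box
{
    public int Left, Top, Width, Height;

    public Box(int left, int top, int width, int height)
    {
        Left = left; Top = top; Width = width; Height = height;
    }

    public override string ToString()
    {
        return string.Format("left={0}, top={1}, {2}x{3}", Left, Top, Width, Height);
    }
}

static class RotationDemo
{
    // Maps a box found on a page that was rotated 90 degrees clockwise
    // back into the coordinate space of the unrotated page.
    // originalHeight is the page height before the rotation.
    public static Box UndoClockwise90(Box rotated, int originalHeight)
    {
        return new Box(
            rotated.Top,
            originalHeight - rotated.Left - rotated.Width,
            rotated.Height,
            rotated.Width);
    }

    static void Main()
    {
        // Assume an 850x1100 page that was rotated clockwise before recognition.
        Box found = new Box(100, 50, 200, 30);
        Box onOriginal = UndoClockwise90(found, 1100);

        Console.WriteLine("Box on rotated page:    " + found);
        Console.WriteLine("Same text on original:  " + onOriginal);
    }
}
```

Because the two boxes differ, laying the recognized text over an unrotated image would put it in the wrong place. The snippet in this post takes the simpler route of rotating the presentable image to match, rather than remapping every box.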

Code:


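// Namespaces the snippet relies on (assuming the LEADTOOLS 20 assemblies are referenced)
using System.IO;
using Leadtools;
using Leadtools.Document;
using Leadtools.Document.Writer;
using Leadtools.ImageProcessing;
using Leadtools.ImageProcessing.Core;
using Leadtools.Ocr;

string filename = @""; // TODO: ensure this path points to a valid document/image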
LoadDocumentOptions opts = new LoadDocumentOptions();

LEADDocument doc = DocumentFactory.LoadFromFile(filename, opts);

IOcrEngine engine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false);
engine.Startup(null, null, null, @"C:\LEADTOOLS 20\Bin\Common\OcrLEADRuntime"); // TODO: ensure this path points to a directory containing OCR runtime files

IOcrDocument ocrDoc = engine.DocumentManager.CreateDocument();
foreach (DocumentPage page in doc.Pages)
{
   RasterImage image = page.GetImage().Clone(); // this image is our "working copy" that we can alter however we want

   AutoBinarizeCommand binarize = new AutoBinarizeCommand(); // binarize the page (we don't care about text color, the presentable image will remain in full color)
   binarize.Run(image);

   // TODO: add whichever image processing commands you want to improve OCR results

   IOcrPage ocrPage = engine.CreatePage(image, OcrImageSharingMode.AutoDispose);
   ocrDoc.Pages.Add(ocrPage); // create the page, add it to the document

   ocrPage.AutoPreprocess(OcrAutoPreprocessPageCommand.All, null); // preprocess the page however necessary

   ocrPage.AutoZone(null); // try and detect zones for this page

   if (ocrPage.Zones != null && ocrPage.Zones.Count > 0)
   {
      ocrPage.Recognize(null); // if zones exist, run recognize on them
   }

   RasterImage presentable = page.GetImage().Clone(); // create another clone for the presentable image--what we'll actually see in the final result

   OcrPageAutoPreprocessValues values = ocrPage.GetPreprocessValues(); // get preprocess values for rotation
   if (values.RotationAngle != 0)
   {
      RotateCommand rotate = new RotateCommand(values.RotationAngle, RotateCommandFlags.Bicubic, RasterColor.White); // rotate the presentable image to match the text
      rotate.Run(presentable); // must target the presentable clone, not the working copy that was handed to the OCR page
   }

   ocrPage.GetRasterImage().Dispose(); // dispose the old image; it is no longer needed after recognition has occurred
   ocrPage.SetRasterImage(presentable); // set the image for this page to the presentable clone
}

PdfDocumentOptions pdfOpts = ocrDoc.DocumentWriterInstance.GetOptions(DocumentFormat.Pdf) as PdfDocumentOptions;
pdfOpts.ImageOverText = true;
ocrDoc.DocumentWriterInstance.SetOptions(DocumentFormat.Pdf, pdfOpts);

string outpath = Path.Combine(Path.GetDirectoryName(filename), Path.GetFileNameWithoutExtension(filename) + "_ocr.pdf");

ocrDoc.Save(outpath, DocumentFormat.Pdf, null);

ocrDoc.Dispose();
doc.Dispose();
engine.Dispose();

Attached is a sample project containing the snippet above. After you supply it with the path to a document, it loads the document, preprocesses it, and generates a searchable PDF that uses the images from the original document (rotated to account for IOcrPage.AutoPreprocess).

File Attachment(s):

PdfImageOverTextExample.zip (3kb)

Joe Kerrigan
Intern
LEAD Technologies, Inc.

