HOW TO: Use OCR AutoPreprocessing With Image-Over-Text PDFs
Posted: Friday, July 12, 2019 4:55:12 PM (UTC)
Previous how-to posts have outlined the process of creating a searchable PDF using the LEADTOOLS OCR toolkit. Using the PDF image-over-text feature, we can ensure that resultant searchable documents are visually identical to their input documents. It may, however, be advantageous to apply some image processing prior to running OCR. In particular, IOcrPage.AutoPreprocess() can improve OCR results by rotating, deskewing, or inverting parts of the document. Other image processing steps such as binarization (using AutoBinarizeCommand) can also improve OCR results in some circumstances.
Steps like binarization, however, alter the image in ways that might not be desirable in the final output, even if they boost OCR accuracy. What if we could use one version of the document for OCR and another for display in the final PDF? The snippet below does exactly this: as it processes a document, it keeps two copies of each page. One is a "working copy" to which we can apply whichever image processing commands improve recognition. The other is the "presentable" version, which is left more or less untouched and is the image included in the final PDF.
The solution faces one final issue introduced by IOcrPage.AutoPreprocess. Depending on flags passed to AutoPreprocess, the image might be rotated slightly before being recognized by the OCR engine. If we do not rotate our own "presentable" image, the text boxes will be misaligned with their corresponding text in the final PDF. To account for this, we can call IOcrPage.GetPreprocessValues. The OcrPageAutoPreprocessValues returned describe the transformation applied to the image before it was passed in for OCR.
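To get a feel for how much misalignment a small rotation causes, here is a minimal, self-contained sketch that does not use LEADTOOLS at all. It rotates a point about the page center and measures how far the point moves. The class and method names are hypothetical, and the sketch assumes the angle is expressed in hundredths of a degree (the unit RotateCommand takes); check the documentation for the unit of OcrPageAutoPreprocessValues.RotationAngle in your version.

```csharp
using System;

// Hypothetical helper for illustration only; not part of LEADTOOLS.
static class RotationOffsetSketch
{
    // Rotates (x, y) about (cx, cy). The angle is in hundredths of a degree
    // (an assumption, matching the unit RotateCommand uses).
    public static (double X, double Y) RotateAbout(
        double x, double y, double cx, double cy, int angleHundredths)
    {
        double radians = angleHundredths / 100.0 * Math.PI / 180.0;
        double dx = x - cx;
        double dy = y - cy;
        double rx = cx + dx * Math.Cos(radians) - dy * Math.Sin(radians);
        double ry = cy + dx * Math.Sin(radians) + dy * Math.Cos(radians);
        return (rx, ry);
    }

    // Distance in pixels between a point and its rotated position.
    // The result does not depend on the direction of rotation.
    public static double Displacement(
        double x, double y, double cx, double cy, int angleHundredths)
    {
        var rotated = RotateAbout(x, y, cx, cy, angleHundredths);
        double ex = rotated.X - x;
        double ey = rotated.Y - y;
        return Math.Sqrt(ex * ex + ey * ey);
    }

    static void Main()
    {
        // A 2550 x 3300 pixel page (US Letter at 300 DPI), rotated by 1 degree.
        double shift = Displacement(0, 0, 1275, 1650, 100);
        Console.WriteLine("Corner displacement in pixels: " + shift);
    }
}
```

For a page of that size, a one-degree rotation moves a corner by roughly 36 pixels, which is more than enough to push the invisible text layer off the words it belongs to. This is why the presentable image needs the same rotation as the recognized one.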
Code:
// Namespaces assumed for LEADTOOLS 20; adjust them to match your version
using System.IO;
using Leadtools;
using Leadtools.Document;
using Leadtools.Document.Writer;
using Leadtools.ImageProcessing;
using Leadtools.ImageProcessing.Core;
using Leadtools.Ocr;

string filename = @""; // TODO: ensure this path points to a valid document/image
LoadDocumentOptions opts = new LoadDocumentOptions();
LEADDocument doc = DocumentFactory.LoadFromFile(filename, opts);
IOcrEngine engine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false);
engine.Startup(null, null, null, @"C:\LEADTOOLS 20\Bin\Common\OcrLEADRuntime"); // TODO: ensure this path points to a directory containing OCR runtime files
IOcrDocument ocrDoc = engine.DocumentManager.CreateDocument();
foreach (DocumentPage page in doc.Pages)
{
    RasterImage image = page.GetImage().Clone(); // this image is our "working copy" that we can alter however we want
    AutoBinarizeCommand binarize = new AutoBinarizeCommand(); // binarize the page (we don't care about text color, the presentable image will remain in full color)
    binarize.Run(image);
    // TODO: add whichever image processing commands you want to improve OCR results
    IOcrPage ocrPage = engine.CreatePage(image, OcrImageSharingMode.AutoDispose);
    ocrDoc.Pages.Add(ocrPage); // create the page, add it to the document
    ocrPage.AutoPreprocess(OcrAutoPreprocessPageCommand.All, null); // preprocess the page however necessary
    ocrPage.AutoZone(null); // try to detect zones for this page
    if (ocrPage.Zones != null && ocrPage.Zones.Count > 0)
    {
        ocrPage.Recognize(null); // if zones exist, run recognize on them
    }
    RasterImage presentable = page.GetImage().Clone(); // create another clone for the presentable image--what we'll actually see in the final result
    OcrPageAutoPreprocessValues values = ocrPage.GetPreprocessValues(); // get preprocess values for rotation
    if (values.RotationAngle != 0)
    {
        RotateCommand rotate = new RotateCommand(values.RotationAngle, RotateCommandFlags.Bicubic, RasterColor.White);
        rotate.Run(presentable); // rotate the presentable image (not the working copy) to match the recognized text
    }
    ocrPage.GetRasterImage().Dispose(); // dispose the old image; it is no longer needed after recognition has occurred
    ocrPage.SetRasterImage(presentable); // set the image for this page to the presentable clone
}
PdfDocumentOptions pdfOpts = ocrDoc.DocumentWriterInstance.GetOptions(DocumentFormat.Pdf) as PdfDocumentOptions;
pdfOpts.ImageOverText = true;
ocrDoc.DocumentWriterInstance.SetOptions(DocumentFormat.Pdf, pdfOpts);
string outpath = Path.Combine(Path.GetDirectoryName(filename), Path.GetFileNameWithoutExtension(filename) + "_ocr.pdf");
ocrDoc.Save(outpath, DocumentFormat.Pdf, null);
ocrDoc.Dispose();
doc.Dispose();
engine.Dispose();
Attached is a sample project containing the snippet above. After supplying it with the path to a document, it will load the document, preprocess it, and generate a searchable PDF that uses images from the original document (rotated to account for IOcrPage.AutoPreprocess).
Joe Kerrigan
Intern
LEAD Technologies, Inc.