LEADTOOLS Support
General
LEADTOOLS SDK Examples
HOW TO: Use OCR AutoPreprocessing With Image-Over-Text PDFs
Previous how-to posts have outlined the process of creating a searchable PDF using the LEADTOOLS OCR toolkit. Using the PDF image-over-text feature, we can ensure that resultant searchable documents are visually identical to their input documents. It may, however, be advantageous to apply some image processing prior to running OCR. In particular, IOcrPage.AutoPreprocess() can improve OCR results by rotating, deskewing, or inverting parts of the document. Other image processing steps such as binarization (using AutoBinarizeCommand) can also improve OCR results in some circumstances.
Steps like binarization, however, alter the image in ways that might not be desirable in the final output, even if they boost OCR accuracy. What if we could use one version of the document for OCR and another for display in the final PDF? The snippet below does exactly this: as it processes a document, it keeps two copies of each page. One is a "working copy" to which we can apply whichever image processing commands improve results. The other is the "presentable" version, which is left more or less untouched throughout the process and is the one included in the final PDF.
The solution faces one final issue introduced by IOcrPage.AutoPreprocess. Depending on flags passed to AutoPreprocess, the image might be rotated slightly before being recognized by the OCR engine. If we do not rotate our own "presentable" image, the text boxes will be misaligned with their corresponding text in the final PDF. To account for this, we can call IOcrPage.GetPreprocessValues. The OcrPageAutoPreprocessValues returned describe the transformation applied to the image before it was passed in for OCR.
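Code:
// Namespaces below assume the LEADTOOLS v20 assemblies.
using System.IO;
using Leadtools;
using Leadtools.Document;
using Leadtools.Document.Writer;
using Leadtools.ImageProcessing;
using Leadtools.ImageProcessing.Core;
using Leadtools.Ocr;

string filename = @""; // path to the input document
LoadDocumentOptions opts = new LoadDocumentOptions();
LEADDocument doc = DocumentFactory.LoadFromFile(filename, opts);
IOcrEngine engine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false);
engine.Startup(null, null, null, @"C:\LEADTOOLS 20\Bin\Common\OcrLEADRuntime");
IOcrDocument ocrDoc = engine.DocumentManager.CreateDocument();
foreach (DocumentPage page in doc.Pages)
{
    // Working copy: free to be altered in whatever way helps OCR.
    RasterImage image = page.GetImage().Clone();
    AutoBinarizeCommand binarize = new AutoBinarizeCommand();
    binarize.Run(image);

    // The OCR page takes ownership of the working copy (AutoDispose).
    IOcrPage ocrPage = engine.CreatePage(image, OcrImageSharingMode.AutoDispose);
    ocrDoc.Pages.Add(ocrPage);
    ocrPage.AutoPreprocess(OcrAutoPreprocessPageCommand.All, null);
    ocrPage.AutoZone(null);
    if (ocrPage.Zones != null && ocrPage.Zones.Count > 0)
    {
        ocrPage.Recognize(null);
    }

    // Presentable copy: a fresh clone of the original page image.
    RasterImage presentable = page.GetImage().Clone();

    // Apply the rotation that AutoPreprocess used to the presentable copy
    // (not the working copy) so the recognized text lines up with it.
    OcrPageAutoPreprocessValues values = ocrPage.GetPreprocessValues();
    if (values.RotationAngle != 0)
    {
        RotateCommand rotate = new RotateCommand(values.RotationAngle, RotateCommandFlags.Bicubic, RasterColor.White);
        rotate.Run(presentable);
    }

    // Swap the working copy out for the presentable copy before saving.
    ocrPage.GetRasterImage().Dispose();
    ocrPage.SetRasterImage(presentable);
}

// Enable image-over-text so the page image is drawn over the recognized text.
PdfDocumentOptions pdfOpts = ocrDoc.DocumentWriterInstance.GetOptions(DocumentFormat.Pdf) as PdfDocumentOptions;
pdfOpts.ImageOverText = true;
ocrDoc.DocumentWriterInstance.SetOptions(DocumentFormat.Pdf, pdfOpts);

// Write "<name>_ocr.pdf" next to the input file.
string outpath = Path.Combine(Path.GetDirectoryName(filename), Path.GetFileNameWithoutExtension(filename) + "_ocr.pdf");
ocrDoc.Save(outpath, DocumentFormat.Pdf, null);

ocrDoc.Dispose();
doc.Dispose();
engine.Dispose();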
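To get a feel for why the presentable image has to follow the same rotation as the OCR image, here is a small sketch of the geometry involved. It does not use LEADTOOLS at all. The page size, the sample point, the angles, and the names RotationDrift, RotateAboutCenter, and Drift are all made up for illustration. It only measures how far a point on the page moves when the page is rotated about its center.

```csharp
using System;

public static class RotationDrift
{
    // Rotates the point (x, y) about the center of a width x height page
    // by angleDegrees, using the standard 2D rotation formula.
    public static (double X, double Y) RotateAboutCenter(
        double x, double y, double width, double height, double angleDegrees)
    {
        double radians = angleDegrees * Math.PI / 180.0;
        double cx = width / 2.0, cy = height / 2.0;
        double dx = x - cx, dy = y - cy;
        return (cx + dx * Math.Cos(radians) - dy * Math.Sin(radians),
                cy + dx * Math.Sin(radians) + dy * Math.Cos(radians));
    }

    // Distance in pixels between a point and its rotated position.
    public static double Drift(
        double x, double y, double width, double height, double angleDegrees)
    {
        var (rx, ry) = RotateAboutCenter(x, y, width, height, angleDegrees);
        return Math.Sqrt((rx - x) * (rx - x) + (ry - y) * (ry - y));
    }

    public static void Main()
    {
        // Hypothetical page: 8.5 x 11 inches scanned at 300 DPI.
        const double width = 2550, height = 3300;
        foreach (double angle in new[] { 0.0, 0.5, 1.0, 2.0 })
        {
            // A word near the top-left corner of the page.
            double drift = Drift(100, 100, width, height, angle);
            Console.WriteLine($"{angle:F1} degrees -> {drift:F1} px of drift");
        }
    }
}
```

On a page of this size, a point near a corner moves by tens of pixels for a rotation of a single degree. That is enough to push a text box off the word it belongs to if only one of the two images is rotated.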
Attached is a sample project containing the snippet above. Given the path to a document, it loads the document, preprocesses it, and generates a searchable PDF that uses images from the original document (rotated to account for IOcrPage.AutoPreprocess).
Joe Kerrigan
Intern
LEAD Technologies, Inc.
