Parsing Text with the Document Library

The LEADDocument class supports parsing the text of any page using LEADTOOLS SVG or OCR technologies. This allows applications to perform actions such as full-text search, highlighting text on the document, and creating text-based annotation review objects. The LEADTOOLS Document Viewer Library and the Document Viewer Demo are examples of this.

Text can be parsed in one of two ways:

If the document type supports SVG (Scalable Vector Graphics), then the text can be parsed from the SVG data directly. This approach is fast, 100% accurate, supports any language, and excludes logos and other graphics items from the text results.

Searchable PDF and PDF/A files, Microsoft Office Documents (DOC/DOCX, XLS/XLSX, PPT/PPTX), SVG, CAD files (DWG, DXF, DWF), AFP MODCA and PTOCA are just some examples of the file formats that can be parsed by LEADTOOLS using the SVG engine.

If the document type does not support SVG, then the LEADTOOLS OCR engine can be used to parse the text. The LEADDocument class will perform the recognition operation internally using the OCR settings provided by the user (such as which languages and spell check engine to use) to parse the text and return it.

Raster PDF files, TIFF, JPEG and PNG are examples of such formats. These are raster image formats that do not contain any text data. However, OCR can be used to recognize and read any text from the image.

It is preferable to extract the text data using the SVG engine for 100% accuracy and maximum speed. If the SVG data is not available, then OCR should be used. The LEADDocument class provides support for performing the above automatically while hiding all the internal details. The user of the class will obtain the text data in the same manner regardless of whether SVG or OCR was used.

The text can be obtained per page using the DocumentPage.GetText method. This will return a DocumentPageText object that contains information on each character found on the page including its location, size and code. This information is unified regardless of whether SVG or OCR was used. The class also contains helper methods to organize these characters into words, lines or a simple string object. For more information, refer to DocumentPageText.
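The idea of organizing per-character data into words, lines, or a single string can be illustrated with a minimal sketch. This is not the LEADTOOLS API: the PageChar type, its ends_word and ends_line flags, and both helper functions are hypothetical names chosen for illustration, and the sketch assumes each character already records whether it closes a word or a line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageChar:
    """Hypothetical stand-in for one parsed character (not a LEADTOOLS type)."""
    code: str            # the character itself
    x: float             # left edge of the bounding box
    y: float             # top edge of the bounding box
    width: float
    height: float
    ends_word: bool = False   # assumed flag: last character of a word
    ends_line: bool = False   # assumed flag: last character of a line


def build_words(chars: List[PageChar]) -> List[str]:
    """Group characters into words using the end-of-word/end-of-line flags."""
    words: List[str] = []
    current: List[str] = []
    for ch in chars:
        current.append(ch.code)
        if ch.ends_word or ch.ends_line:
            words.append("".join(current))
            current = []
    if current:  # trailing characters without a closing flag
        words.append("".join(current))
    return words


def build_text(chars: List[PageChar]) -> str:
    """Flatten characters into one string, with spaces and newlines inserted."""
    parts: List[str] = []
    for ch in chars:
        parts.append(ch.code)
        if ch.ends_line:
            parts.append("\n")
        elif ch.ends_word:
            parts.append(" ")
    return "".join(parts).rstrip()


sample = [
    PageChar("H", 0, 0, 8, 12),
    PageChar("i", 8, 0, 4, 12, ends_word=True),
    PageChar("y", 16, 0, 8, 12),
    PageChar("o", 24, 0, 8, 12),
    PageChar("u", 32, 0, 8, 12, ends_line=True),
    PageChar("o", 0, 14, 8, 12),
    PageChar("k", 8, 14, 8, 12, ends_line=True),
]
```

Because the same character records are produced whether the page was parsed from SVG or recognized with OCR, helpers like these do not need to know which engine ran.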

If caching is used with the document, then subsequent calls to DocumentPage.GetText fetch the data from the cache instead of parsing the page again, which speeds up the operation.
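The caching behavior described above can be pictured as a lookup keyed by document and page that only parses on a miss. The following is a conceptual sketch, not how LEADTOOLS implements its cache: PageTextCache, its parse_page callback, and the parse_calls counter are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Tuple


class PageTextCache:
    """Hypothetical per-page text cache (illustration only, not LEADTOOLS code)."""

    def __init__(self, parse_page: Callable[[str, int], str]) -> None:
        self._parse_page = parse_page            # the expensive SVG/OCR parse step
        self._store: Dict[Tuple[str, int], str] = {}
        self.parse_calls = 0                     # counts how often parsing really ran

    def get_text(self, document_id: str, page_number: int) -> str:
        key = (document_id, page_number)
        if key not in self._store:
            # Cache miss: parse once and remember the result.
            self.parse_calls += 1
            self._store[key] = self._parse_page(document_id, page_number)
        # Cache hit (or freshly stored value): no parsing needed.
        return self._store[key]


cache = PageTextCache(lambda doc_id, page: f"text of {doc_id} page {page}")
first = cache.get_text("report", 1)
second = cache.get_text("report", 1)   # served from the cache
```

In this sketch the second call returns the stored value without invoking the parse step again.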

When DocumentPage.GetText is called, the LEADDocument object will use the options set in DocumentText to determine how the text is parsed. These settings are in the LEADDocument.Text property and are global to all the pages of the document. These include:

DocumentText.TextExtractionMode: This is set to DocumentTextExtractionMode.Auto by default, which uses SVG if the document type supports it and falls back to OCR otherwise. You can change this value to disable SVG or to disable OCR if your application requires it. Note that if you set the value to a mode that is not available for the document (for example, DocumentTextExtractionMode.SvgOnly when the document type does not support SVG), then DocumentPage.GetText will succeed but will return an empty DocumentPageText object.
DocumentText.ImagesRecognitionMode: This is set to DocumentTextImagesRecognitionMode.Auto by default and indicates how to treat the image elements encountered in the SVG representation of this page during text extraction.
DocumentText.RecognizeGlyphs: This is set to true by default and will automatically try to recognize glyphs found in SVG files using the OCR engine. This is only valid if the value of ImagesRecognitionMode is DocumentTextImagesRecognitionMode.Always.
DocumentText.RemoveRubbishOcrZones: This is set to true by default and will automatically ignore OCR results with very low confidence (such as noise or rubbish), and not insert them into the DocumentPageText.
DocumentText.OcrEngine: This is an instance of any LEADTOOLS IOcrEngine that will be used when OCR is invoked by the document. Internally, the engine will create an IOcrPage object for the image of the page, call IOcrPage.Recognize and then parse the results into a DocumentPageText object. Set the DocumentText.StoreOcrPageCharacters property to instruct the engine to store the original OCR object used to create the document characters inside the DocumentPageText object. This makes additional OCR information available, such as the character color, confidence, and baseline values.

In all cases, a DocumentPageText object will be returned to the user with the same exact information regardless of the extraction mode.
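The effect of the text extraction mode can be summarized as a small decision function. The sketch below only models the behavior described in this topic and is not LEADTOOLS code: the ExtractionMode enum, the extract_page_text function, and the parse_svg and run_ocr callbacks are hypothetical names, and the OCR-only member is an assumption standing in for the "disable SVG" case.

```python
from enum import Enum
from typing import Callable


class ExtractionMode(Enum):
    """Hypothetical mirror of the extraction modes discussed above."""
    AUTO = "auto"          # SVG if supported; otherwise OCR
    SVG_ONLY = "svg_only"  # never fall back to OCR
    OCR_ONLY = "ocr_only"  # assumed name for the "SVG disabled" case


def extract_page_text(
    mode: ExtractionMode,
    supports_svg: bool,
    parse_svg: Callable[[], str],
    run_ocr: Callable[[], str],
) -> str:
    """Return the page text, or an empty result if the chosen mode is unavailable."""
    if mode is ExtractionMode.SVG_ONLY:
        # The call still succeeds when SVG is unsupported; the result is just empty.
        return parse_svg() if supports_svg else ""
    if mode is ExtractionMode.OCR_ONLY:
        return run_ocr()
    # AUTO: prefer SVG for accuracy and speed, fall back to OCR otherwise.
    return parse_svg() if supports_svg else run_ocr()


searchable_pdf = extract_page_text(
    ExtractionMode.AUTO, True, lambda: "from svg", lambda: "from ocr"
)
scanned_tiff = extract_page_text(
    ExtractionMode.AUTO, False, lambda: "from svg", lambda: "from ocr"
)
```

The caller receives the same kind of result in every branch, which mirrors how the document hides whether SVG or OCR produced the text.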