Parsing Text with the Documents Library

Summary

The Document class supports parsing the text of any page using LEADTOOLS SVG or OCR technologies. This allows applications to perform actions such as full text searching, highlighting text on the document, and creating text-based annotation review objects. The LEADTOOLS Document Viewer Library and the Document Viewer Demo are examples of such applications.

Text can be parsed in one of two ways:

  • If the document type supports SVG (Scalable Vector Graphics), then the text can be parsed from the SVG data directly. This is fast and 100% accurate, supports any language, and excludes logos and other graphic items from the text result.

    Searchable PDF files, Microsoft Office documents (DOC/DOCX, XLS/XLSX, PPT/PPTX), HTML, ePub, Text, SVG, CAD files (DWG, DXF, DWF), and IOCA/MODCA are just a few of the file formats that LEADTOOLS can parse using the SVG engine.

  • If the document type does not support SVG, then the LEADTOOLS OCR engine can be used to parse the text. The Document class will perform the recognition operation internally using the OCR settings provided by the user (such as the languages to use, the spell check engine and so forth) to parse the text and return it.

    Raster PDF files, TIFF, JPEG and PNG are examples of such formats. These are raster image formats that do not contain any text data. However, OCR can be used to recognize and read any text from the image.

It is preferable to extract the text data using the SVG engine for 100% accuracy and maximum speed. If the SVG data is not available, then OCR should be used. The Document class provides support for performing the above automatically while hiding all the internal details. The user of the class will obtain the text data in the same manner regardless of whether SVG or OCR was used.

The text can be obtained per page using GetText. This will return a DocumentPageText object that contains information on each character found on the page including its location, size and code. This information is uniform regardless of whether SVG or OCR was used. The class also contains helper methods to organize these characters into words, lines or a simple string object. Refer to DocumentPageText for more information.
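As a rough illustration of how per-character data can be organized into words and a plain string, the sketch below groups a flat character list by an end-of-word flag. The record shape (`code`, `bounds`, `isEndOfWord`) and the `buildWords` and `unionBounds` helpers are assumptions made for this example and are not the actual DocumentPageText members; refer to DocumentPageText for the real helper methods.

```javascript
// Hypothetical character records: a code, a bounding box, and an end-of-word flag.
// This shape is assumed for illustration only; it is not the LEADTOOLS data model.
const characters = [
  { code: "H", bounds: { x: 10, y: 10, width: 8, height: 12 }, isEndOfWord: false },
  { code: "i", bounds: { x: 18, y: 10, width: 4, height: 12 }, isEndOfWord: true },
  { code: "a", bounds: { x: 30, y: 10, width: 7, height: 12 }, isEndOfWord: false },
  { code: "l", bounds: { x: 37, y: 10, width: 3, height: 12 }, isEndOfWord: false },
  { code: "l", bounds: { x: 40, y: 10, width: 3, height: 12 }, isEndOfWord: true },
];

// Return the smallest rectangle that contains both input rectangles.
function unionBounds(a, b) {
  const left = Math.min(a.x, b.x);
  const top = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.width, b.x + b.width);
  const bottom = Math.max(a.y + a.height, b.y + b.height);
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Group a flat character list into words, closing a word at each end-of-word flag.
function buildWords(chars) {
  const words = [];
  let current = null;
  for (const ch of chars) {
    if (current === null) {
      current = { value: ch.code, bounds: { ...ch.bounds } };
    } else {
      current.value += ch.code;
      current.bounds = unionBounds(current.bounds, ch.bounds);
    }
    if (ch.isEndOfWord) {
      words.push(current);
      current = null;
    }
  }
  // Flush a trailing word that was never explicitly terminated.
  if (current !== null) {
    words.push(current);
  }
  return words;
}

const words = buildWords(characters);
const text = words.map((w) => w.value).join(" ");
console.log(text, words.map((w) => w.bounds));
```

Because each word keeps the union of its character rectangles, the same structure can drive both text search (over `value`) and highlighting (over `bounds`).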

If caching is used with the document, then subsequent calls to GetText will fetch the data from the cache instead of parsing the page again, which speeds up the operation.
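The effect of caching can be pictured as simple memoization. The sketch below is only an analogy using an in-memory `Map`; `createCachedGetText` and `parsePage` are hypothetical names, and the actual document cache is configured through the library, not implemented like this.

```javascript
// Wrap an expensive page parser so that each page is parsed at most once.
// createCachedGetText and parsePage are hypothetical names for this sketch.
function createCachedGetText(parsePage) {
  const cache = new Map(); // page number -> parsed page text
  return function getText(pageNumber) {
    if (!cache.has(pageNumber)) {
      cache.set(pageNumber, parsePage(pageNumber));
    }
    return cache.get(pageNumber);
  };
}

// Stand-in for SVG or OCR parsing; counts how often real parsing happens.
let parseCount = 0;
function parsePage(pageNumber) {
  parseCount += 1;
  return { pageNumber: pageNumber, text: "text of page " + pageNumber };
}

const getText = createCachedGetText(parsePage);
const first = getText(1);  // parsed
const second = getText(1); // served from the cache, not parsed again
getText(2);                // a different page is parsed once
console.log("pages parsed:", parseCount);
```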

When GetText is called, the Document object will use the options set in DocumentText to determine how the text is parsed. These settings are in the Text property and are global to all the pages of the document. These include:

  • DocumentText.TextExtractionMode: This is set to DocumentTextExtractionMode.Auto by default, which means use SVG if supported and otherwise use OCR. You can change this value to disable SVG or to disable OCR if required by your application. Note that if you set a mode that is not available for the document, for example DocumentTextExtractionMode.SvgOnly when the document type does not support SVG, then GetText will succeed but will return an empty DocumentPageText object.

    In all cases, a DocumentPageText object will be returned to the user with the same exact information regardless of the extraction mode.

  • The OCR engine instance set in the service: This is an instance of any LEADTOOLS OCR Engine that will be used when OCR is invoked by the document. Internally, the engine will create an OCR page for the image of the page, call recognize, and then parse the results into a DocumentPageText object.
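The selection rules described in the options above can be summarized in a small decision function. This is a sketch of the described behavior under stated assumptions, not LEADTOOLS code: `Mode`, `pickEngine`, and `getPageText` are hypothetical names, the `ocrOnly` mode name is assumed, and both "engines" are stubs that return the same result shape.

```javascript
// Mode names follow the text above; "ocrOnly" is an assumed name for disabling SVG.
const Mode = Object.freeze({ auto: "auto", svgOnly: "svgOnly", ocrOnly: "ocrOnly" });

// Decide which engine would parse a page: "svg", "ocr", or null if none applies.
function pickEngine(mode, supportsSvg, hasOcrEngine) {
  const svgAllowed = mode !== Mode.ocrOnly;
  const ocrAllowed = mode !== Mode.svgOnly;
  if (svgAllowed && supportsSvg) return "svg"; // preferred: exact and fast
  if (ocrAllowed && hasOcrEngine) return "ocr"; // fallback: recognize from the image
  return null;
}

// Stub extraction: both engines yield the same shape, so callers need not care
// which one ran. An unavailable mode yields an empty result, not an error.
function getPageText(mode, page, hasOcrEngine) {
  const engine = pickEngine(mode, page.supportsSvg, hasOcrEngine);
  if (engine === null) {
    return { characters: [] };
  }
  return { characters: Array.from(page.content).map((c) => ({ code: c })) };
}

// Hypothetical pages: one from a DOCX (SVG available), one from a TIFF (raster only).
const docxPage = { supportsSvg: true, content: "Hello" };
const tiffPage = { supportsSvg: false, content: "Scan" };

console.log(pickEngine(Mode.auto, docxPage.supportsSvg, true));
console.log(pickEngine(Mode.auto, tiffPage.supportsSvg, true));
console.log(getPageText(Mode.svgOnly, tiffPage, true).characters.length);
```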

LEADTOOLS OCR and SVG technologies are completely thread safe, and the user can parse any number of pages at the same time from any number of threads.

For an example, refer to GetText.

© 1991-2017 LEAD Technologies, Inc. All Rights Reserved.
LEADTOOLS HTML5 JavaScript