
Parsing Text with the Document Library

The LEADDocument class supports parsing the text of any page using LEADTOOLS SVG or OCR technologies. This allows applications to perform actions such as full-text searching, highlighting text on the document, and creating text-based annotation review objects. The LEADTOOLS Document Viewer Library and the Document Viewer Demo are examples of such applications.

Text can be parsed in one of two ways: from the page's SVG data or with OCR.

It is preferable to extract the text data using the SVG engine for 100% accuracy and maximum speed. If the SVG data is not available, then OCR should be used. The LEADDocument class provides support for performing the above automatically while hiding all the internal details. The user of the class will obtain the text data in the same manner regardless of whether SVG or OCR was used.
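The SVG-first, OCR-fallback behavior described above can be sketched as follows. This is a minimal illustration and not the LEADTOOLS implementation: `extractFromSvg`, `extractWithOcr`, `getPageText`, `fakeOcr`, and the shape of the page and character objects are all hypothetical names invented for this example.

```javascript
// Hypothetical sketch: none of these names are LEADTOOLS APIs.
// A page is modeled as a plain object that may or may not carry SVG text data.

function extractFromSvg(page) {
  // Text stored in the SVG data is already exact, so it is only copied.
  return page.svgChars.map((c) => ({ ...c }));
}

function extractWithOcr(page, ocrEngine) {
  // Fallback path: recognize the characters from the raster image.
  return ocrEngine.recognize(page.image);
}

function getPageText(page, ocrEngine) {
  const hasSvg = Array.isArray(page.svgChars) && page.svgChars.length > 0;
  const characters = hasSvg
    ? extractFromSvg(page)
    : extractWithOcr(page, ocrEngine);
  // The caller receives the same shape whichever path ran.
  return { characters, text: characters.map((c) => c.code).join("") };
}

// Stand-in OCR engine, used only to make the sketch runnable.
const fakeOcr = {
  recognize: (image) =>
    [...image.hiddenText].map((ch, i) => ({
      code: ch, x: i * 10, y: 0, width: 10, height: 12,
    })),
};

const svgPage = {
  svgChars: [
    { code: "H", x: 0, y: 0, width: 10, height: 12 },
    { code: "i", x: 10, y: 0, width: 4, height: 12 },
  ],
};
const scannedPage = { image: { hiddenText: "Ok" } };

const svgResult = getPageText(svgPage, fakeOcr);
const ocrResult = getPageText(scannedPage, fakeOcr);
console.log(svgResult.text, ocrResult.text);
```

The point of the design is that calling code never branches on the extraction method; only `getPageText` knows which engine produced the characters.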

The text can be obtained per page using GetText. This will return a DocumentPageText object that contains information about each character found on the page including its location, size, and code. This information is uniform regardless of whether SVG or OCR is used. The class also contains helper methods to organize these characters into words, lines, or a simple string object. Refer to DocumentPageText for more information.
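To illustrate how per-character data can be organized into words, lines, and a plain string, here is a small standalone sketch. It does not use or reproduce the DocumentPageText API: the character fields (`code`, `x`, `y`, `width`, `height`) and the functions `makeLine`, `buildWords`, and `buildLines` are assumptions made for this example, and characters are assumed to arrive in reading order with one `y` value per line.

```javascript
// Hypothetical character records; the field names are assumptions.
function makeLine(text, y) {
  return [...text].map((code, i) => ({ code, x: i * 8, y, width: 8, height: 12 }));
}

const characters = [...makeLine("Hello world", 0), ...makeLine("second line", 20)];

// Group consecutive non-space characters on the same line into words,
// tracking each word's bounding box as it grows.
function buildWords(chars) {
  const words = [];
  let current = null;
  for (const ch of chars) {
    const isSpace = ch.code.trim() === "";
    const newLine = current !== null && ch.y !== current.y;
    if (current !== null && (isSpace || newLine)) {
      words.push(current);
      current = null;
    }
    if (isSpace) continue;
    if (current === null) {
      current = { value: "", x: ch.x, y: ch.y, width: 0, height: ch.height };
    }
    current.value += ch.code;
    current.width = ch.x + ch.width - current.x;
    current.height = Math.max(current.height, ch.height);
  }
  if (current !== null) words.push(current);
  return words;
}

// Group words that share a vertical position into lines of text.
function buildLines(words) {
  const lines = new Map();
  for (const w of words) {
    if (!lines.has(w.y)) lines.set(w.y, []);
    lines.get(w.y).push(w.value);
  }
  return [...lines.values()].map((parts) => parts.join(" "));
}

const words = buildWords(characters);
const lines = buildLines(words);
const text = lines.join("\n");
console.log(words.map((w) => w.value), lines);
```

Because each word keeps a bounding box, the same data can drive both text search and on-page highlighting.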

If caching is used with the document, then subsequent calls to GetText fetch the data from the cache instead of parsing the text again, which speeds up the operation.
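The parse-once behavior can be pictured with the following sketch. It is an in-memory illustration only and does not represent the LEADTOOLS cache: `CachedPageTextReader`, `parsePage`, and `parseCount` are hypothetical names.

```javascript
// Hypothetical sketch of parse-once caching; not the LEADTOOLS cache API.
class CachedPageTextReader {
  constructor(parsePage) {
    this.parsePage = parsePage; // expensive parser (SVG or OCR in practice)
    this.cache = new Map();     // page number -> parsed text data
    this.parseCount = 0;        // counts how often the parser actually ran
  }

  getText(pageNumber) {
    if (this.cache.has(pageNumber)) {
      // Cache hit: return the stored result without parsing again.
      return this.cache.get(pageNumber);
    }
    this.parseCount += 1;
    const result = this.parsePage(pageNumber);
    this.cache.set(pageNumber, result);
    return result;
  }
}

const reader = new CachedPageTextReader((n) => ({ text: `text of page ${n}` }));
const first = reader.getText(1);  // parsed
const second = reader.getText(1); // served from the cache
reader.getText(2);                // parsed
console.log(reader.parseCount);
```

Here the parser runs once per page, no matter how many times the text of that page is requested.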

When GetText is called, the LEADDocument object uses the options set in DocumentText to determine how to parse the text. These settings are in the Text property and are global to all the pages of the document. Refer to DocumentText for the available settings.

LEADTOOLS OCR and SVG technologies are completely thread-safe and any number of pages can be parsed at the same time from any number of threads.

For an example, refer to GetText.

Help Version 20.0.2020.4.3
© 1991-2020 LEAD Technologies, Inc. All Rights Reserved.

LEADTOOLS HTML5 JavaScript