Parsing Text with the Documents Library

The Document class supports parsing the text of any page using LEADTOOLS SVG or OCR technologies. This allows applications to perform actions such as full-text search, highlighting text on the document, and creating text-based annotation review objects. The LEADTOOLS Document Viewer Library and the Document Viewer Demo are examples of applications that use this feature.

Text can be parsed in one of two ways:

If the document type supports SVG (Scalable Vector Graphics), then the text can be parsed from the SVG data directly. This provides 100% accuracy and high speed, supports any language, and excludes logos and other graphics items from the text result.

Searchable PDF and PDF/A files, Microsoft Office documents (DOC/DOCX, XLS/XLSX, PPT/PPTX), SVG, CAD files (DWG, DXF, DWF), AFP MODCA and PTOCA are just some examples of the file formats that LEADTOOLS can parse using the SVG engine.

If the document type does not support SVG, then the LEADTOOLS OCR engine can be used to parse the text. The Document class will perform the recognition operation internally, using the OCR settings provided by the user (such as the languages to use, the spell check engine and so forth), to parse the text and return it.

Raster PDF files, TIFF, JPEG and PNG are examples of such formats. These are raster image formats that do not contain any text data. However, OCR can be used to recognize and read any text from the image.

It is preferable to extract the text data using the SVG engine for 100% accuracy and maximum speed. If the SVG data is not available, then OCR should be used. The Document class provides support for performing the above automatically while hiding all the internal details. The user of the class will obtain the text data in the same manner regardless of whether SVG or OCR was used.

The text can be obtained per page using the DocumentPage.GetText method. This will return a DocumentPageText object that contains information on each character found on the page including its location, size and code. This information is unified regardless of whether SVG or OCR was used. The class also contains helper methods to organize these characters into words, lines or a simple string object. For more information, refer to DocumentPageText.
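
The idea of organizing per-character data into words can be illustrated with a minimal sketch. This is a standalone Python model, not the LEADTOOLS API: the Char and Word types, the end_of_word flag and the build_words function are all hypothetical names chosen for illustration. It assumes each character carries its code, its bounding rectangle and a flag marking the last character of a word.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


@dataclass
class Char:
    """Hypothetical stand-in for one parsed character (not a LEADTOOLS type)."""
    code: str                  # the character itself
    bounds: Rect               # location and size on the page
    end_of_word: bool = False  # assumed flag: last character of a word


@dataclass
class Word:
    text: str
    bounds: Rect


def union(a: Rect, b: Rect) -> Rect:
    """Return the smallest rectangle that contains both rectangles."""
    x = min(a[0], b[0])
    y = min(a[1], b[1])
    right = max(a[0] + a[2], b[0] + b[2])
    bottom = max(a[1] + a[3], b[1] + b[3])
    return (x, y, right - x, bottom - y)


def build_words(chars: List[Char]) -> List[Word]:
    """Group consecutive characters into words, merging their bounds."""
    words: List[Word] = []
    text = ""
    box: Optional[Rect] = None
    for ch in chars:
        text += ch.code
        box = ch.bounds if box is None else union(box, ch.bounds)
        if ch.end_of_word:
            words.append(Word(text, box))
            text, box = "", None
    if text and box is not None:
        # Flush a trailing word that had no end-of-word flag.
        words.append(Word(text, box))
    return words


chars = [
    Char("H", (10, 20, 8, 12)),
    Char("i", (18, 20, 4, 12), end_of_word=True),
    Char("y", (30, 20, 7, 12)),
    Char("o", (37, 20, 7, 12), end_of_word=True),
]
words = build_words(chars)
print([w.text for w in words])
```

Because each word keeps the union of its character rectangles, a viewer could use the word bounds to highlight search hits. The same grouping applies whether the characters came from SVG or from OCR.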

If caching is used with the document, then subsequent calls to DocumentPage.GetText fetch the data from the cache instead of parsing the page again, which speeds up the operation.
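
The caching behavior can be pictured with a small sketch. This is a plain Python model of "parse once, then serve from the cache", not the LEADTOOLS cache API: PageTextCache, get_text, fake_parse and parse_count are hypothetical names used only for illustration.

```python
from typing import Callable, Dict


class PageTextCache:
    """Hypothetical per-page text cache; models the behavior only."""

    def __init__(self, parse_page: Callable[[int], str]) -> None:
        self._parse_page = parse_page
        self._cache: Dict[int, str] = {}
        self.parse_count = 0  # how many times a page was actually parsed

    def get_text(self, page_number: int) -> str:
        # Parse only on a cache miss; later calls reuse the stored result.
        if page_number not in self._cache:
            self.parse_count += 1
            self._cache[page_number] = self._parse_page(page_number)
        return self._cache[page_number]


def fake_parse(page_number: int) -> str:
    """Stand-in for an expensive SVG or OCR parse of one page."""
    return f"text of page {page_number}"


cache = PageTextCache(fake_parse)
first = cache.get_text(1)   # parsed
second = cache.get_text(1)  # served from the cache
print(first == second, cache.parse_count)
```

The second call returns the stored result without invoking the parser, which matters most when the page text comes from OCR, the slower of the two extraction paths.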

When DocumentPage.GetText is called, the Document object will use the options set in DocumentText to determine how the text is parsed. These settings are in the Document.Text property and are global to all the pages of the document. These include:

DocumentText.TextExtractionMode: This is set to DocumentTextExtractionMode.Auto by default, which means SVG is used if supported and OCR is used otherwise. You can change this value to disable SVG or OCR if required by your application. Note that if you set the value to a mode that is not available (for example, DocumentTextExtractionMode.SvgOnly when the document type does not support SVG), then DocumentPage.GetText will succeed but will return an empty DocumentPageText object.

In all cases, a DocumentPageText object is returned to the user with the same kind of information, regardless of the extraction mode.

DocumentText.OcrEngine: This is an instance of any LEADTOOLS IOcrEngine that will be used when OCR is invoked by the document. Internally, an IOcrPage object is created for the image of the page, IOcrPage.Recognize is called, and the results are parsed into a DocumentPageText object.
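
The way these two options interact can be summarized in a short decision sketch. This is a standalone Python model written from the description above, not LEADTOOLS code: ExtractionMode, choose_engine and both boolean parameters are hypothetical, OCR_ONLY is an assumed name for the mode that disables SVG, and treating a missing OCR engine as "OCR not available" is also an assumption.

```python
from enum import Enum
from typing import Optional


class ExtractionMode(Enum):
    """Hypothetical mirror of the extraction modes described above."""
    AUTO = "auto"          # SVG if supported, otherwise OCR
    SVG_ONLY = "svg_only"  # never fall back to OCR
    OCR_ONLY = "ocr_only"  # assumed name: never use SVG


def choose_engine(mode: ExtractionMode,
                  supports_svg: bool,
                  has_ocr_engine: bool) -> Optional[str]:
    """Return "svg", "ocr", or None when the requested mode is unavailable.

    None models the case where the call succeeds but the page text is empty.
    """
    svg_allowed = mode in (ExtractionMode.AUTO, ExtractionMode.SVG_ONLY)
    ocr_allowed = mode in (ExtractionMode.AUTO, ExtractionMode.OCR_ONLY)
    if svg_allowed and supports_svg:
        return "svg"  # preferred: exact text straight from the SVG data
    if ocr_allowed and has_ocr_engine:
        return "ocr"  # fallback: recognize text from the page image
    return None


# A searchable PDF in the default mode uses SVG.
print(choose_engine(ExtractionMode.AUTO, supports_svg=True, has_ocr_engine=True))
# A TIFF in the default mode falls back to OCR.
print(choose_engine(ExtractionMode.AUTO, supports_svg=False, has_ocr_engine=True))
# A TIFF with SVG-only extraction yields an empty result.
print(choose_engine(ExtractionMode.SVG_ONLY, supports_svg=False, has_ocr_engine=True))
```

Whichever branch is taken, the caller receives the same kind of result object, so application code does not need to know which engine produced the text.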

LEADTOOLS OCR and SVG technologies are completely thread safe, and the user can parse any number of pages at the same time from any number of threads.
