Output Text Format List: OCR Plus

The following table summarizes the output text formats available in the OCR engine. One of these can be specified as the format for the final output document. The file containing each output converter is also given.

Format constant

Format string

Description

DOC_TEXT_STANDARD

"Text - Standard"

Text output with a line break after each line. If a table is present, its cells are positioned by TABs. The output converter is stored in "R4R11T.DLN".

DOC_TEXT_SMART

"Text - Smart"

Text output with a line break after each line. The left margin is taken into account (with SPACEs). If a table is present, its cells are positioned by SPACEs. The output converter is stored in "R4R11T.DLN".

DOC_TEXT_STRIPPED

"Text - Stripped"

Text output with a line break after each paragraph. If a table is present, its cells are separated by TABs. The output converter is stored in "R4R11T.DLN".

DOC_TEXT_PLAIN

"Text - Plain"

Text output with a line break after each line. The left and upper margins are taken into account (with SPACEs and NEWLINEs). If a table is present, its cells are positioned by TABs. The output converter is stored in "R4R11T.DLN".

DOC_TEXT_COMMA_DELIMITED

"Text - Comma Delimited"

Comma-delimited text output. Line/cell contents are surrounded by quotes (""). The default delimiter (comma) can be overridden (see the REC_OUT.ini file). The output converter is stored in "R4R11T.DLN".

DOC_TEXT_TAB_DELIMITED

"Text - Tab Delimited"

TAB-separated text output. Line/cell contents are surrounded by quotes (""). The output converter is stored in "R4R11T.DLN".

DOC_REC_ASCII_FORMATTED

"Rec ASCII (Formatted)"

Text output that retains the layout by mimicking it with SPACEs. Line/cell contents are surrounded by quotes (""). The output converter is stored in "R4R07T.DLN".

DOC_REC_ASCII_STANDARD

"Rec ASCII (Standard)"

Text output allowing quick text conversion. Kept for compatibility reasons. The output converter is stored in "R4R04T.DLN".

DOC_REC_ASCII_STANDARDEX

"Rec ASCII (StandardEx)"

Text output allowing quick text conversion. Line break after each line and after each zone. Kept for compatibility reasons. The output converter is stored in "R4R08T.DLN".

DOC_GENERAL_WORD_PROCESSOR

"General Word Processor"

Text output allowing quick text conversion. Line break after each paragraph. Kept for compatibility reasons. The output converter is stored in "R4R01T.DLN".

DOC_PDF

"Adobe PDF"

Adobe PDF with text only. The text can be searched. The PDF file contains the recognized characters in the same positions as in the original. The original page image is not overlaid on top of the PDF document.

DOC_PDF_IMAGE_SUBSTITUTES

"Adobe PDF with image substitutes"

As above, but problematic recognition cases are handled by including smaller image snippets, taken from the original image, in the output file. Image snippets are also exported for the following cases:
- words containing suspect character(s) in the recognized text
- words not approved by the checking subsystem ("non-dictionary" words)
- words containing rejection symbol(s)
- words containing missing symbol(s)

DOC_PDF_IMAGE_ON_TEXT

"Adobe PDF with image on text"

The generated PDF file contains one image for each page in the document and also contains the recognized characters underneath. Displaying the generated PDF file in a PDF reader gives a result that looks very similar to the original document. The text can be searched.

DOC_PDF_IMAGEONLY

"Adobe PDF image only"

The generated PDF file contains one image for each page in the document. The file contains no characters at all, so the text cannot be searched.

DOC_PDF_EDITED

"Adobe PDF edited"

The text can be searched. The PDF file contains the recognized characters in the same positions as in the original. Use this output format only if the application has changed the recognition result through the L_DocGetRecognizedCharacters and L_DocSetRecognizedCharacters functions.

DOC_HTML_3_2

"HTML 3.2"

HTML output. HTML 3.2 is suited to export with the FORMAT_LEVEL_PART option. The output files are supported by both Internet Explorer and Netscape.

DOC_HTML_4_0

"HTML 4.0"

HTML output. HTML 4.0 can set the exact position and size of objects; use this output format with the FORMAT_LEVEL_FULL option.

DOC_WORD_97_2000_XP

"Word 97, 2000, XP"

Microsoft Word 97, Word 2000 and Word XP output format.

DOC_EXCEL_97_2000

"Excel 97, 2000"

Microsoft Excel 97 and Excel 2000 output format.

DOC_WORDPERFECT_8

"WordPerfect 8"

WordPerfect 8 format.

DOC_RTF

"Rich Text Format"

Quick conversion to Rich Text Format.

DOC_PPT_97_RTF

"PowerPoint 97 (RTF)"

Rich Text Format for PowerPoint 97

DOC_PUB_98_RTF

"Publisher 98 (RTF)"

Rich Text Format for Publisher 98

DOC_WORDPAD_RTF

"WordPad (RTF)"

Rich Text Format for WordPad

DOC_RTF_WORD_2000

"RTF Word 2000"

Rich Text Format for Word 2000

DOC_RTF_WORD_97

"RTF Word 97"

Rich Text Format for Word 97

DOC_RTF_WORD_6_95

"RTF Word 6.0/95"

Rich Text Format for Word 6.0/95

DOC_OPEN_EBOOK_1_0

"Open eBook 1.0"

Open eBook 1.0 format

DOC_XML

"XML"

XML output format

DOC_2G_TYPE_2

"2G Type 2"

Binary output of the recognition with a 16-byte long structure for each recognized character. 2G Type 2 structure output.

DOC_2G_TYPE_3

"2G Type 3"

Binary output of the recognition with a 16-byte long structure for each recognized character. 2G Type 3 structure output.

DOC_WORDPERFECT_9_10

"WordPerfect 9, 10"

WordPerfect 9, 10 format

DOC_MICROSOFT_READER

"Microsoft Reader"

Microsoft’s Open eBook format

DOC_MICROSOFT_WORD_2003

"Microsoft Word 2003 (WordML)"

Microsoft Word 2003 output format

DOC_REC_PDF_IMAGE_ON_TEXT

"Rec PDF (Image On Text)"

Quick output conversion that produces searchable PDF output similar to "Adobe PDF with image on text".

DOC_PDFA_IMAGE_ON_TEXT

"Adobe PDF/A Image On Text"

Adobe PDF/A Image On Text format
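
The text-format rows of the table can also be kept in a small lookup structure, for example when an application needs to check which converter file a chosen format string depends on. The following is a minimal sketch in C under stated assumptions, not part of the OCR API: the `FormatEntry` type, the `kTextFormats` array, and the `find_format_by_string` helper are hypothetical names introduced for illustration. The constant names are stored as strings because this section does not give their numeric values, and only the rows that list a converter file are included.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the table above (hypothetical type, for illustration only).
   Constant names are kept as strings because their numeric values are
   not listed in this section. */
typedef struct {
    const char *constant_name;   /* e.g. "DOC_TEXT_STANDARD" */
    const char *format_string;   /* e.g. "Text - Standard"   */
    const char *converter_file;  /* e.g. "R4R11T.DLN"        */
} FormatEntry;

/* Rows copied from the table; only formats with a listed converter file. */
static const FormatEntry kTextFormats[] = {
    {"DOC_TEXT_STANDARD",          "Text - Standard",        "R4R11T.DLN"},
    {"DOC_TEXT_SMART",             "Text - Smart",           "R4R11T.DLN"},
    {"DOC_TEXT_STRIPPED",          "Text - Stripped",        "R4R11T.DLN"},
    {"DOC_TEXT_PLAIN",             "Text - Plain",           "R4R11T.DLN"},
    {"DOC_TEXT_COMMA_DELIMITED",   "Text - Comma Delimited", "R4R11T.DLN"},
    {"DOC_TEXT_TAB_DELIMITED",     "Text - Tab Delimited",   "R4R11T.DLN"},
    {"DOC_REC_ASCII_FORMATTED",    "Rec ASCII (Formatted)",  "R4R07T.DLN"},
    {"DOC_REC_ASCII_STANDARD",     "Rec ASCII (Standard)",   "R4R04T.DLN"},
    {"DOC_REC_ASCII_STANDARDEX",   "Rec ASCII (StandardEx)", "R4R08T.DLN"},
    {"DOC_GENERAL_WORD_PROCESSOR", "General Word Processor", "R4R01T.DLN"},
};

/* Hypothetical helper: return the row whose format string matches exactly
   (case-sensitive), or NULL if the string is not in the list. */
static const FormatEntry *find_format_by_string(const char *format_string)
{
    size_t count = sizeof(kTextFormats) / sizeof(kTextFormats[0]);
    size_t i;

    assert(format_string != NULL);
    for (i = 0; i < count; ++i) {
        if (strcmp(kTextFormats[i].format_string, format_string) == 0) {
            return &kTextFormats[i];
        }
    }
    return NULL;
}
```

Calling `find_format_by_string("Rec ASCII (Formatted)")` returns the row whose `converter_file` is "R4R07T.DLN", and an unknown string returns NULL, so a caller can report an unsupported format before attempting any conversion.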