Output Text Format List: OCR Professional
The following table summarizes the output text formats available in the OCR engine. One of these can be specified as the format for the final output document. The file containing each output converter is also given.
Format constant |
Format string |
Description |
DOC2_TEXT |
"Text" |
Writes the recognized text into a simple text file that can be read by most text editors and word processors. |
DOC2_UTEXT |
"Unicode Text" |
Same as Text, but using two-byte Unicode characters. |
DOC2_FORMATTED_TEXT |
"Formatted Text" |
Writes the recognized text into a text file, but tries to retain the layout of the page by inserting extra spaces. |
DOC2_UFORMATTED_TEXT |
"Unicode Formatted Text" |
Same as Formatted Text, but using two-byte Unicode characters. |
DOC2_TEXT_LINEBREAKS |
"Text with linebreaks" |
Same as Text, but inserts a line break at the end of every line, not only at the end of each paragraph. |
DOC2_UTEXT_LINEBREAKS |
"Unicode Text with linebreaks" |
Same as Text with linebreaks, but using two-byte Unicode characters. |
DOC2_TEXT_CSV |
"Comma Separated Text" |
Writes the recognized text into a tabular, comma-delimited text file that can be read by Excel. The "List Separator" character separates the cells, and NL (the new line character) separates the rows of the table. |
DOC2_TEXT_UCSV |
"Unicode Comma Separated Text" |
Same as Comma Separated Text, but using two-byte Unicode characters. |
DOC2_PDF |
"PDF" |
Adobe PDF with text only. The text can be searched. The PDF file contains the recognized characters in the same positions as in the original. The original page image is not overlaid on top of the PDF document. |
DOC2_PDF_IMAGE_SUBSTITUTES |
"PDF with image substitutes" |
A special PDF converter in which suspect words are covered by their images, cut out of the original image. |
DOC2_PDF_IMAGE_ON_TEXT |
"PDF with image on text" |
A PDF converter in which the original (input) image is retained in the foreground, with the recognized text hidden in the background (but in the correct position). Well suited for archiving and indexing documents. |
DOC2_PDF_EDITED |
"PDF edited" |
This PDF converter does not rely on the position of the recognized characters, so it can be used even after inserting large new text portions in the editor. |
DOC2_XML |
"XML" |
An XML file format. |
DOC2_HTML_3_2 |
"HTML 3.2" |
The HTML 3.2 format is a clear, small but usable HTML format that is supported by ‘all’ HTML interpreters (unlike HTML 4.0). |
DOC2_HTML_4_0 |
"HTML 4.0" |
The HTML 4.0 format is not as clear as HTML 3.2, but Cascading Style Sheets (CSS) can be used for box-like, absolutely positioned objects, for styles, and for manipulating all paragraph and character attributes. |
DOC2_RTF_6 |
"RTF Word 6.0/95" |
Rich Text Format converter based on version 1.3 of the RTF Specification. The generated files can be interpreted by almost all RTF readers. The downside is that the output files can be considerably larger than those generated by the later RTF converters. |
DOC2_RTF_97 |
"RTF Word 97" |
This RTF converter uses some new features that can only be interpreted by Microsoft Word 97 and up (or by readers with compatible capabilities). |
DOC2_RTF_2000 |
"RTF 2000" |
Similar to the RTF Word 97 converter, but using newer features only available in Microsoft Word 2000 and up. |
DOC2_RTF_WORD_2000 |
"RTF 2000 Exact Word" |
This converter is based on RTF Word 2000. It loads the resulting file into Microsoft Word, and tries to correct the pagination errors by slight modifications to spacing values. |
DOC2_WORD_2000 |
"Microsoft Word 2000, XP" |
The same as RTF Word 2000. |
DOC2_WORD_97 |
"Microsoft Word 97" |
The same as RTF Word 97. |
DOC2_EXCEL_97 |
"Microsoft Excel 97" |
Generates Microsoft Excel 97 binary files (.xls). |
DOC2_EXCEL_2000 |
"Microsoft Excel 97" |
Generates Microsoft Excel 2000 binary files (.xls). |
DOC2_PPT_97 |
"Microsoft PowerPoint 97" |
An RTF-based converter that generates a plain and simple RTF file, which can be interpreted by Microsoft PowerPoint. |
DOC2_PUB_98 |
"Microsoft Publisher 98" |
An RTF-based converter that generates a plain and simple RTF file, which can be interpreted by Microsoft Publisher. |
DOC2_MICROSOFT_READER |
"Microsoft Reader" |
Converter for Microsoft Reader ebook format (.lit files). |
DOC2_WORDML |
"Microsoft Word WordML" |
A converter for the XML-based file format of Microsoft Word 2003. Its features, capabilities and layout retention quality are practically the same as in the RTF Word 2000 converter. |
DOC2_WORDPERFECT_8 |
"WordPerfect 8" |
Almost the same as the WordPerfect 9, 10 converters. A few minor features (connected with layout retention) are disabled as they are not supported by WordPerfect 8. |
DOC2_WORDPERFECT_10 |
"WordPerfect 9, 10" |
WordPerfect binary file format for WordPerfect 9 and up. |
DOC2_WORDPAD |
"WordPad" |
An RTF-based converter that generates a plain and simple RTF file, which can be interpreted by Microsoft WordPad (and other simple RTF readers). |
DOC2_INFOPATH |
"Microsoft InfoPath" |
A Microsoft InfoPath converter. It supports saving various recognized form elements, such as checkboxes and input lines. |
DOC2_EBOOK |
"eBook" |
Open Ebook Specification 1.0 XML converter. |
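
As an illustration of how the constant and string columns of the table relate, the following Python sketch keeps a few rows in a lookup table and resolves one column from the other. Only the constant names and format strings come from the table. The helper names `format_string` and `constant_for` are hypothetical and are not part of the OCR engine API, and the case-insensitive matching is a convenience of this sketch, not documented engine behavior.

```python
# A few rows copied from the table above: format constant -> format string.
# This dictionary and the helpers below are illustrative only; they are
# not part of the OCR engine API.
FORMAT_STRINGS = {
    "DOC2_TEXT": "Text",
    "DOC2_UTEXT": "Unicode Text",
    "DOC2_TEXT_CSV": "Comma Separated Text",
    "DOC2_PDF": "PDF",
    "DOC2_PDF_IMAGE_ON_TEXT": "PDF with image on text",
    "DOC2_HTML_4_0": "HTML 4.0",
    "DOC2_RTF_2000": "RTF 2000",
    "DOC2_WORD_2000": "Microsoft Word 2000, XP",
}


def format_string(constant):
    """Return the format string listed for a format constant name."""
    try:
        return FORMAT_STRINGS[constant]
    except KeyError:
        raise ValueError(f"unknown output format constant: {constant!r}") from None


def constant_for(format_str):
    """Return the constant name for a format string.

    Matching ignores case and surrounding whitespace (a choice made for
    this sketch only).
    """
    wanted = format_str.strip().casefold()
    for constant, name in FORMAT_STRINGS.items():
        if name.casefold() == wanted:
            return constant
    raise ValueError(f"unknown output format string: {format_str!r}")


if __name__ == "__main__":
    print(format_string("DOC2_PDF_IMAGE_ON_TEXT"))
    print(constant_for("rtf 2000"))
```

Rejecting unknown names with an error, rather than falling back to a default format, keeps a misspelled format string from silently producing the wrong kind of output file.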