Confidence Reporting: OCR Plus

The OCR Plus engine can provide confidence information for the correctness of the recognized text in two different ways:

The Engine's output-marking feature (see: the MARKOPTIONS structure) enables the L_DocSaveResultsToMemory function to place a user-defined character sequence into the final output document before suspiciously recognized characters and/or words. Alternatively, the suspicious characters and/or words can be set to a particular color in the output document to indicate recognition results with low confidence.
In another approach, the Engine can generate output that consists of structured data for each recognized character, with one structure or record per character. The character code of the recognized entity is the primary field. Other fields include the coordinates of the character on the image, the zone to which the character belongs, the font information for the character, and the confidence information.

The output-marking feature is supported by most of the output converters. Marking low confidence recognition with color requires the selection of an output format (e.g. MS Word) that supports colored text.

For more information on these converters, see the Output converter formatting properties topic.

A possible output for the marking feature might be as follows:

"We would like to ask you some questions, ta*king around 15 minutes"

The text extract above was generated using the output-marking feature, in which the asterisk ('*') character was set to mark the suspiciously recognized characters in the output.
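
An application that post-processes marked output can locate the flagged characters by scanning for the marker. The following C sketch assumes that the marker is a single character placed immediately before each suspicious character, as in the extract above. The function name find_marked_positions is hypothetical and is not part of the toolkit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a toolkit function.
 * Scans text for the marker character and stores the index of each
 * character that directly follows a marker. At most max_positions
 * indexes are stored; the return value is the total number found. */
size_t find_marked_positions(const char *text, char marker,
                             size_t *positions, size_t max_positions)
{
    size_t count = 0;
    size_t i;

    assert(text != NULL);
    assert(positions != NULL || max_positions == 0);

    for (i = 0; text[i] != '\0'; ++i) {
        if (text[i] == marker && text[i + 1] != '\0') {
            if (count < max_positions)
                positions[count] = i + 1;
            ++count;
            ++i; /* step over the flagged character itself */
        }
    }
    return count;
}
```

This simple scan cannot distinguish a marker from the same character occurring naturally in the recognized text, so a marker sequence that is unlikely to appear in the document is a safer choice.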

Structured recognized output can be produced through one of the two special output converters (e.g. "2G Type 3"). Even more information can be retrieved directly into application memory by calling L_DocGetRecognizedCharacters immediately after L_DocRecognize. The L_DocGetRecognizedCharacters call provides the most detailed information about the recognized data: it returns a 36-byte RECOGCHARS structure for each recognized character. Whereas the 2G Type output formats provide second guesses for each character, the RECOGCHARS structure provides three.

For some applications, it may be important to know the reliability of the recognized text generated by the engine. These applications may require having additional confidence information for the recognized characters and/or words.

Two fields in the RECOGCHARS structure provide character recognition confidence information: nConfidence and nSpaceErr.

The RECOGCHARS.nConfidence field is a combined value. Its most significant bit is used to express the certainty/uncertainty of the word. (If this bit is set to 1, the word is uncertain.) The remaining bits express the certainty of the character recognition: ranging from 0 to 100.
A value of 0 means that the Engine recognized the character with high confidence. In some cases a word may contain some or all characters that are individually suspicious, yet the word bit does not mark the word as uncertain. This is usually a result of language or user dictionary checking, meaning that the word was validated by the checking module.
If only User-written checking or the User Dictionary is enabled on a zone and the section name is specified, the characters of the non-dictionary words are assigned a value of 100 in their RECOGCHARS.nConfidence field.
If a zone enables only User Dictionary, and the section name is specified, the non-dictionary words are replaced with similar dictionary ones.
The nSpaceErr member ranges between 0 and 100, and it expresses the confidence of the value in the space field of the structure, i.e. whether the Engine is certain regarding the space estimation in front of the recognized character.
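
The split of the combined nConfidence value into its word bit and its character value can be sketched in C as follows. This is a minimal illustration, not toolkit code: it assumes the combined value occupies 8 bits, so that the most significant bit is 0x80 and the remaining 7 bits hold the 0 to 100 character value. The text above does not state the field width, so check the RECOGCHARS declaration in your headers before relying on these masks. The names DecodedConfidence and decode_confidence are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: the combined value is 8 bits wide. Adjust both masks
 * if the actual nConfidence field is wider. */
#define WORD_UNCERTAIN_MASK 0x80u
#define CHAR_VALUE_MASK     0x7Fu

/* Hypothetical result type for the decoded parts. */
typedef struct {
    bool     word_uncertain; /* true if the word bit is set to 1 */
    unsigned char_value;     /* character value from the low bits */
} DecodedConfidence;

/* Hypothetical helper: separates the word bit from the character value. */
DecodedConfidence decode_confidence(unsigned combined)
{
    DecodedConfidence d;

    assert(combined <= 0xFFu); /* follows from the 8-bit assumption */

    d.word_uncertain = (combined & WORD_UNCERTAIN_MASK) != 0;
    d.char_value     = combined & CHAR_VALUE_MASK;
    return d;
}
```

Keeping the two parts separate lets an application apply one policy to uncertain words and another to individual characters.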

Applications that examine the character confidence information can use a threshold value, at or above which the character value is treated as a suspicious result. A value of 64 is recommended for this purpose. A value less than 64 indicates that the character was recognized with high confidence. A value of 64 or greater marks the character as suspicious.

Note: The value 64 is also used internally in the same manner, when the output-marking feature for suspicious characters in the output text is enabled.
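
The threshold test can be sketched in C as follows. This is an illustration under stated assumptions rather than toolkit code: it assumes the combined confidence value is 8 bits wide, so the word bit is masked off with 0x7F before the comparison. The names is_char_suspicious and count_suspicious are hypothetical, and the caller is assumed to have already copied the confidence values out of the recognized character records.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUSPICIOUS_THRESHOLD 64u   /* recommended value from the text */
#define CHAR_VALUE_MASK      0x7Fu /* ASSUMPTION: 8-bit combined value */

/* Hypothetical helper: true if the character value, with the word bit
 * removed, is at or above the recommended threshold. */
bool is_char_suspicious(unsigned combined)
{
    unsigned char_value = combined & CHAR_VALUE_MASK;
    return char_value >= SUSPICIOUS_THRESHOLD;
}

/* Hypothetical helper: counts the suspicious entries in an array of
 * combined confidence values. */
size_t count_suspicious(const unsigned *values, size_t n)
{
    size_t count = 0;
    size_t i;

    assert(values != NULL || n == 0);

    for (i = 0; i < n; ++i) {
        if (is_char_suspicious(values[i]))
            ++count;
    }
    return count;
}
```

An application that needs stricter or looser review can replace SUSPICIOUS_THRESHOLD with its own value, although the results will then no longer match the Engine's own output marking.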

Note: The confidence reporting system works best when all three recognition modules are used in the voting scheme (RECOGMODULE_OMNIFONT_PLUS3W), but this is not the default value. If other machine print recognition modules are used (RECOGMODULE_OMNIFONT_PLUS2W, RECOGMODULE_MTEXT_OMNIFONT, etc.), confidence information is still available, but the system reports it less accurately. The result is a higher rate of both false negatives and false positives when flagging suspicious recognition results.