OCR Confidence Reporting

Some applications need to know how reliable the recognized text generated by the engine is. Such applications may require additional confidence information for the recognized characters and/or words.

The engine can provide confidence information for the correctness of the recognized text in two different ways:

The Engine's output-marking feature (see: OCR Engine Specific Settings) enables the IOcrDocument.Save, IOcrDocument.SaveXml or IOcrPage.RecognizeText methods to place a user-defined character sequence into the final output document before suspiciously recognized characters and/or words. Alternatively, the suspicious characters and/or words can be set to a particular color in the output document to indicate recognition results with low confidence.

In another approach, the Engine can generate output that consists of structured data for each recognized character. In this output there is one structure or record for each character. The character code of the recognized entity is the primary field. Other fields include the coordinates of the character on the image, the zone to which the character belongs, the font information for the character, and the confidence information.

The output-marking feature is supported by most of the output converters. Marking low confidence recognition with color requires the selection of an output format (e.g. MS Word) that supports colored text.

A possible output for the marking feature might be as follows:

"We would like to ask you some questions, ta*king around 15 minutes"

The previous text extract was generated using the output-marking feature, in which the asterisk ('*') character was set to mark the suspiciously recognized characters in the output.
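The effect of the marking feature on the text above can be illustrated with a small sketch. This is not the Engine's implementation: the (character, is_suspicious) pair format and the mark_suspicious helper are hypothetical names introduced only to show where the marker lands relative to a suspicious character.

```python
def mark_suspicious(chars, marker="*"):
    """Return text with `marker` placed before each suspicious character.

    `chars` is a list of (character, is_suspicious) pairs. This pair format is
    a hypothetical stand-in, not the Engine's own data structure.
    """
    parts = []
    for ch, suspicious in chars:
        if suspicious:
            # The marker goes in front of the suspicious character.
            parts.append(marker)
        parts.append(ch)
    return "".join(parts)


# Invented flags: only the "k" is treated as suspicious here.
word = [("t", False), ("a", False), ("k", True),
        ("i", False), ("n", False), ("g", False)]
marked = mark_suspicious(word)
print(marked)  # ta*king
```

A different marker sequence can be passed through the marker argument, mirroring the user-defined character sequence described earlier.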

More information can be retrieved directly into application memory by a call to IOcrPage.GetRecognizedCharacters, just after calling IOcrPage.Recognize or IOcrPage.RecognizeText. The IOcrPage.GetRecognizedCharacters call provides the most detailed information about the recognized data. It results in an OcrCharacter structure for each recognized character.

Three properties of the OcrCharacter structure provide character recognition confidence information: OcrCharacter.Confidence, OcrCharacter.WordIsCertain, and OcrCharacter.LeadingSpacesConfidence.

The OcrCharacter.WordIsCertain property expresses the certainty or uncertainty of the word that this character is part of.

The OcrCharacter.Confidence property expresses the certainty of the recognition of the character, and ranges between 0 and 100. A value of 100 means that the Engine recognized the character with high confidence. In some cases a word may have some or all characters that are individually suspicious, but the characters are not marked suspicious in OcrCharacter.WordIsCertain. This is usually a result of language or user dictionary checking. It means the word was validated by the checking subsystem.

The OcrCharacter.LeadingSpacesConfidence property ranges between 0 and 100, and it expresses the confidence of the value in the OcrCharacter.LeadingSpaces property of the structure, i.e. whether the Engine is certain regarding the space estimation in front of the recognized character.
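To make the relationship between the three properties concrete, here is a minimal sketch that models them in a stand-in record and groups the records into words. CharRecord and summarize_words are hypothetical names, not part of the SDK; the field names only mirror the OcrCharacter properties described above, all values are invented, and the word-splitting rule (a record with leading spaces starts a new word) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CharRecord:
    """Hypothetical stand-in mirroring the OcrCharacter properties above."""
    code: str                       # recognized character
    confidence: int                 # 0..100, per-character value
    word_is_certain: bool           # certainty of the containing word
    leading_spaces: int             # estimated spaces before the character
    leading_spaces_confidence: int  # 0..100, confidence of that estimate


def summarize_words(records):
    """Group records into words and collect their confidence information.

    Assumption: a record with leading_spaces > 0 starts a new word.
    """
    words = []
    for rec in records:
        if not words or rec.leading_spaces > 0:
            words.append([])
        words[-1].append(rec)
    return [
        {
            "text": "".join(r.code for r in w),
            "word_is_certain": all(r.word_is_certain for r in w),
            "confidences": [r.confidence for r in w],
            "space_confidence": w[0].leading_spaces_confidence,
        }
        for w in words
    ]


# Invented sample: the "s" has a per-character value that differs sharply
# from its neighbors, yet the word as a whole is still flagged as certain
# (for example after dictionary validation).
records = [
    CharRecord("a", 95, True, 0, 100),
    CharRecord("s", 40, True, 0, 100),
    CharRecord("k", 90, True, 0, 100),
    CharRecord("y", 97, True, 1, 80),
    CharRecord("o", 96, True, 0, 100),
    CharRecord("u", 98, True, 0, 100),
]
summary = summarize_words(records)
for entry in summary:
    print(entry)
```

Keeping the per-character values next to the word-level flag lets an application decide which signal to trust when the two disagree.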

Applications that examine the character confidence information can use a threshold value, at or above which the character is treated as a suspicious result. A value of 64 is recommended for this purpose. A value less than 64 indicates that the character was recognized with high confidence. A value of 64 or greater marks the character as suspicious.

Note:

This value (64) is also used internally in the same manner when the output-marking feature for suspicious characters in the output text is enabled.
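The threshold rule above can be sketched as a simple comparison. This example takes the rule exactly as worded (64 or greater is suspicious, below 64 is high confidence) and works on plain integers; is_suspicious and suspicious_indices are hypothetical helper names, and the sample values are invented. It is worth checking the direction of the comparison against values the Engine actually returns before relying on it.

```python
SUSPICIOUS_THRESHOLD = 64  # recommended value from the text above


def is_suspicious(value, threshold=SUSPICIOUS_THRESHOLD):
    """Literal reading of the rule: values at or above the threshold are suspicious."""
    return value >= threshold


def suspicious_indices(values, threshold=SUSPICIOUS_THRESHOLD):
    """Return the positions of the values that the rule flags as suspicious."""
    return [i for i, v in enumerate(values) if is_suspicious(v, threshold)]


# Invented sample values, chosen to show the boundary at 64.
values = [10, 63, 64, 90]
flagged = suspicious_indices(values)
print(flagged)  # [2, 3]
```

Passing the threshold as a parameter makes it easy to experiment with stricter or looser cut-offs than the recommended 64.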

IMPORTANT NOTE

The confidence reporting system works best when all three recognition modules are used in the voting scheme (OcrZoneRecognitionModule.OmniFontPlus3WayVoting). If other machine print recognition modules are used (OcrZoneRecognitionModule.OmniFontPlus2WayVoting, OcrZoneRecognitionModule.OmniFontMText, etc.), confidence information is still available, but the ability of the system to properly report confidence will be reduced. This will result in a higher level of false-negative and false-positive reporting of suspicious recognition results.

Confidence level is also reported for OMR zones. For more information, refer to Using OMR in LEADTOOLS .NET OCR.