enum
{
   L_OcrXmlOutputOptions_None = 0,
   L_OcrXmlOutputOptions_Characters = 1 << 0,
   L_OcrXmlOutputOptions_CharacterAttributes = 1 << 1,
};
typedef L_UINT L_OcrXmlOutputOptions;
Controls the format of the XML data obtained from L_OcrDocument_SaveXml.
Value | Meaning |
---|---|
L_OcrXmlOutputOptions_None | Default. Write the recognized word values in the result XML data. |
L_OcrXmlOutputOptions_Characters | Write the recognized character values instead of the word values in the result XML data. |
L_OcrXmlOutputOptions_CharacterAttributes | Only valid with L_OcrXmlOutputOptions_Characters. Write the character attributes (font, for example) in the result XML data. |
The various L_OcrDocument_SaveXml functions accept a combination of one or more of the L_OcrXmlOutputOptions enumeration members to control the format of the output XML data.
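Because the members are bit flags, they are combined with the bitwise OR operator. The following is a minimal sketch of building and checking a combination before passing it on. The enumeration is mirrored locally so the sketch compiles without the toolkit headers, `L_UINT` is assumed to be `unsigned int`, and `IsValidXmlOutputOptions` is a hypothetical helper, not a toolkit function. How the toolkit itself treats an invalid combination is not stated here.

```c
#include <stdbool.h>

/* Local mirror of the definitions at the top of this topic.
   Assumption: L_UINT is an unsigned int. */
typedef unsigned int L_UINT;
enum
{
   L_OcrXmlOutputOptions_None = 0,
   L_OcrXmlOutputOptions_Characters = 1 << 0,
   L_OcrXmlOutputOptions_CharacterAttributes = 1 << 1,
};
typedef L_UINT L_OcrXmlOutputOptions;

/* Hypothetical helper: returns true when the combination is consistent
   with the table above. */
bool IsValidXmlOutputOptions(L_OcrXmlOutputOptions options)
{
   const L_UINT known = L_OcrXmlOutputOptions_Characters |
                        L_OcrXmlOutputOptions_CharacterAttributes;

   /* Reject bits that are not defined by the enumeration. */
   if ((options & ~known) != 0)
      return false;

   /* CharacterAttributes is only valid together with Characters. */
   if ((options & L_OcrXmlOutputOptions_CharacterAttributes) != 0 &&
       (options & L_OcrXmlOutputOptions_Characters) == 0)
      return false;

   return true;
}
```

For example, `L_OcrXmlOutputOptions_Characters | L_OcrXmlOutputOptions_CharacterAttributes` passes the check, while `L_OcrXmlOutputOptions_CharacterAttributes` on its own does not.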
The format of the result XML data is as follows:
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<pages>
   <page>
      <zone>
         <paragraph>
            <line>
               <word>
                  <character/>
                  <character/>
               </word>
            </line>
         </paragraph>
      </zone>
   </page>
</pages>
The pages element appears once per document. It has no value and no additional attributes.
The page element is repeated for every page in the document (L_OcrDocument_GetPageCount). If this page has not been recognized or contains no zones, then the page element will not contain any child zone elements.
The page element has no value and contains the following additional attributes:
Attribute | Value |
---|---|
horizontal_resolution | Horizontal resolution of the page. The value is the page original bitmap handle X resolution. |
vertical_resolution | Vertical resolution of the page. The value is the page original bitmap handle Y resolution. |
width | Width of the page in pixels. The value is the page original bitmap handle width. |
height | Height of the page in pixels. The value is the page original bitmap handle height. |
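Since every position in the XML data is expressed in pixels, the page resolution attributes are what relate those positions to physical size. The following is a small sketch of that conversion. It assumes the resolution values are in dots per inch, which this topic does not state explicitly, and `PixelsToInches` is a hypothetical helper name.

```c
/* Hypothetical helper: converts a pixel measurement to inches.
   Assumption: the resolution attribute is expressed in dots per inch. */
double PixelsToInches(int pixels, int resolution)
{
   /* Guard against a missing or invalid resolution value. */
   if (resolution <= 0)
      return 0.0;

   return (double)pixels / (double)resolution;
}
```

Under that assumption, a page with width="2544" and horizontal_resolution="300" is 2544 / 300 = 8.48 inches wide.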
The zone element is repeated for every zone in the current page. The zone element has no value and contains the following additional attributes:
Attribute | Value |
---|---|
type | The zone type. Either "Text", "Graphic", "Table", "OMR" or "Micr". If the zone element is of type "Text", then it will contain zero or more paragraph child elements. If the zone is of type "Graphic", then it will not contain any other child elements. |
left | The zone left position in pixels. The value is L_OcrZone.Bounds.left converted to pixels. |
top | The zone top position in pixels. The value is L_OcrZone.Bounds.top converted to pixels. |
right | The zone right position in pixels. The value is L_OcrZone.Bounds.right converted to pixels. |
bottom | The zone bottom position in pixels. The value is L_OcrZone.Bounds.bottom converted to pixels. |
subtype | The zone type. The value is L_OcrZone.ZoneType. |
The paragraph element is repeated for every text paragraph in the current zone. If the zone has no recognition text, then the paragraph element will not contain any child line elements.
The paragraph element has no attributes and no value.
The line element is repeated for every line of text in the current paragraph. The line element has no value and contains the following additional attributes:
Attribute | Value |
---|---|
left | The line left position in pixels. |
top | The line top position in pixels. |
right | The line right position in pixels. |
bottom | The line bottom position in pixels. The values of left, top, right and bottom are calculated by combining the boundaries of all the words that make up this line into one enclosing rectangle. |
base | The position of the baseline of this line. The value is calculated from the summation of the baselines of all the words that make up this line. |
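In the sample output later in this topic, the line bounds equal the smallest rectangle that encloses the bounds of its words, and the word bounds relate to their characters in the same way. The following sketch reproduces that calculation. The `PixelRect` type and `UnionBounds` function are hypothetical names used only for illustration, and the sketch is an interpretation of the sample data rather than the engine's actual implementation.

```c
#include <stddef.h>

/* Hypothetical rectangle type holding the left, top, right and bottom
   attribute values of an element, in pixels. */
typedef struct
{
   int left;
   int top;
   int right;
   int bottom;
} PixelRect;

/* Returns the smallest rectangle that encloses all the given rectangles.
   Returns an all-zero rectangle when there is nothing to combine. */
PixelRect UnionBounds(const PixelRect *items, int count)
{
   PixelRect result = { 0, 0, 0, 0 };
   int i;

   if (items == NULL || count <= 0)
      return result;

   result = items[0];
   for (i = 1; i < count; i++)
   {
      if (items[i].left < result.left)     result.left = items[i].left;
      if (items[i].top < result.top)       result.top = items[i].top;
      if (items[i].right > result.right)   result.right = items[i].right;
      if (items[i].bottom > result.bottom) result.bottom = items[i].bottom;
   }

   return result;
}
```

With the two words from the sample output, (372, 371, 554, 409) and (570, 372, 830, 419), the result is (372, 371, 830, 419), which matches the attributes of the enclosing line element.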
The word element is repeated for every word of text in the current line. If L_OcrXmlOutputOptions_Characters was not specified in the generation options, then the word element will contain the value of the word as its value. Otherwise, the word element will contain no value.
The word element has the following attributes:
Attribute | Value |
---|---|
left | The word left position in pixels. |
top | The word top position in pixels. |
right | The word right position in pixels. |
bottom | The word bottom position in pixels. The values of left, top, right and bottom are calculated by combining the boundaries of all the characters that make up this word into one enclosing rectangle. |
base | The position of the baseline of this word. The value is calculated from the summation of the baselines of all the characters that make up this word. |
The character element is repeated for every character in the current word, but only if L_OcrXmlOutputOptions_Characters was specified in the generation options. Otherwise, the word element will contain no child character elements. When present, each character element contains the value of the character as its value.
The character element contains the following additional attributes:
Attribute | Value |
---|---|
left | The character left position in pixels. |
top | The character top position in pixels. |
right | The character right position in pixels. |
bottom | The character bottom position in pixels. The value of left, top, right and bottom is calculated from L_OcrCharacter.Bounds. |
base | The position of the baseline of this character. The value is L_OcrCharacter.Base. |
confidence | The confidence of this character. The value is L_OcrCharacter.Confidence. |
font_size | The font size in points. The value is L_OcrCharacter.FontSize. Only available if L_OcrXmlOutputOptions_CharacterAttributes is specified. |
proportional | "yes" if the character font is proportional; otherwise, "no". The value is calculated from L_OcrCharacter.FontStyles. Only available if L_OcrXmlOutputOptions_CharacterAttributes is specified. |
serif | "yes" if the character font is serif; otherwise, "no". The value is calculated from L_OcrCharacter.FontStyles. Only available if L_OcrXmlOutputOptions_CharacterAttributes is specified. |
bold | "yes" if the character font is bold; otherwise, "no". The value is calculated from L_OcrCharacter.FontStyles. Only available if L_OcrXmlOutputOptions_CharacterAttributes is specified. |
italic | "yes" if the character font is italic; otherwise, "no". The value is calculated from L_OcrCharacter.FontStyles. Only available if L_OcrXmlOutputOptions_CharacterAttributes is specified. |
underline | "yes" if the character font is underlined; otherwise, "no". The value is calculated from L_OcrCharacter.FontStyles. Only available if L_OcrXmlOutputOptions_CharacterAttributes is specified. |
The following is an example of the XML output when L_OcrXmlOutputOptions_None is specified:
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<pages>
   <page horizontal_resolution="300" vertical_resolution="300" width="2544" height="3294">
      <zone type="Text" left="371" top="370" right="831" bottom="420" subtype="Text" recognition_module="Auto" fill_method="Default">
         <paragraph>
            <line left="372" top="371" right="830" bottom="419" base="29">
               <word left="372" top="371" right="554" bottom="409" base="30">License</word>
               <word left="570" top="372" right="830" bottom="419" base="29">Agreement</word>
            </line>
         </paragraph>
      </zone>
   </page>
</pages>
Here is the same XML output when L_OcrXmlOutputOptions_Characters is specified:
<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<pages>
   <page horizontal_resolution="300" vertical_resolution="300" width="2544" height="3294">
      <zone type="Text" left="371" top="370" right="831" bottom="420" subtype="Text" recognition_module="Auto" fill_method="Default">
         <paragraph>
            <line left="372" top="371" right="830" bottom="419" base="29">
               <word left="372" top="371" right="554" bottom="409" base="30">
                  <character left="372" top="372" right="398" bottom="408" base="36" confidence="100">L</character>
                  <character left="402" top="371" right="409" bottom="408" base="37" confidence="100">i</character>
                  <character left="414" top="381" right="438" bottom="409" base="27" confidence="100">c</character>
                  <character left="442" top="381" right="468" bottom="409" base="27" confidence="100">e</character>
                  <character left="472" top="381" right="496" bottom="408" base="27" confidence="100">n</character>
                  <character left="501" top="381" right="525" bottom="408" base="27" confidence="100">s</character>
                  <character left="529" top="381" right="554" bottom="408" base="27" confidence="100">e</character>
               </word>
               <word left="570" top="372" right="830" bottom="419" base="29">
                  <character left="570" top="372" right="604" bottom="408" base="36" confidence="100">A</character>
                  <character left="607" top="381" right="633" bottom="419" base="27" confidence="100">g</character>
                  <character left="639" top="381" right="655" bottom="408" base="27" confidence="100">r</character>
                  <character left="657" top="381" right="682" bottom="408" base="27" confidence="100">e</character>
                  <character left="685" top="381" right="710" bottom="408" base="27" confidence="100">e</character>
                  <character left="715" top="381" right="753" bottom="408" base="27" confidence="100">m</character>
                  <character left="758" top="381" right="783" bottom="408" base="27" confidence="100">e</character>
                  <character left="788" top="381" right="812" bottom="408" base="27" confidence="100">n</character>
                  <character left="815" top="374" right="830" bottom="408" base="34" confidence="100">t</character>
               </word>
            </line>
         </paragraph>
      </zone>
   </page>
</pages>
You can use the L_OcrXmlOutputOptions_CharacterAttributes option along with L_OcrDocumentManager_GetFontName to obtain the font family name of each character. When performing OCR, the engine cannot distinguish between similar fonts such as Arial and Calibri. Instead, it determines whether the character has serifs and whether the font is proportional or fixed.
L_OcrDocumentManager_GetFontName returns a character string that depends on the L_OcrDocumentFontType value passed, as follows:
Font type | Description |
---|---|
L_OcrDocumentFontType_ProportionalSerif | The font used with proportional serif characters |
L_OcrDocumentFontType_ProportionalSansSerif | The font used with proportional sans-serif characters |
L_OcrDocumentFontType_FixedSerif | The font used with monospaced serif characters |
L_OcrDocumentFontType_FixedSansSerif | The font used with monospaced sans-serif characters |
L_OcrDocumentFontType_ICR | The font used with ICR (hand-written) characters |
L_OcrDocumentFontType_MICR | The font used with MICR (check font) characters |
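The proportional and serif attributes of a character element are enough to pick one of the first four font types in the table above. The following sketch shows one way to make that selection. It assumes the attribute values have already been read from the UTF-16 XML data and converted to narrow strings. `SketchFontType`, `AttributeIsYes` and `SelectFontType` are hypothetical names: the enumeration is a local stand-in for L_OcrDocumentFontType, whose actual definition comes from the toolkit headers, and the ICR and MICR types are not covered because they cannot be derived from these two attributes.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the first four L_OcrDocumentFontType members.
   The numeric values here are illustrative only. */
typedef enum
{
   SketchFontType_ProportionalSerif,
   SketchFontType_ProportionalSansSerif,
   SketchFontType_FixedSerif,
   SketchFontType_FixedSansSerif
} SketchFontType;

/* Returns true when an attribute value is exactly "yes". */
bool AttributeIsYes(const char *value)
{
   return value != NULL && strcmp(value, "yes") == 0;
}

/* Maps the "proportional" and "serif" attribute values of a character
   element to the matching font type. */
SketchFontType SelectFontType(const char *proportional, const char *serif)
{
   bool isProportional = AttributeIsYes(proportional);
   bool isSerif = AttributeIsYes(serif);

   if (isProportional)
      return isSerif ? SketchFontType_ProportionalSerif
                     : SketchFontType_ProportionalSansSerif;

   return isSerif ? SketchFontType_FixedSerif
                  : SketchFontType_FixedSansSerif;
}
```

The selected type would then be translated to the corresponding L_OcrDocumentFontType member and passed to L_OcrDocumentManager_GetFontName to obtain the font family name.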