LEADTOOLS Support
Document
Document SDK Questions
Problems Deleting Unwanted Output from ZoneCharacters
This topic and its replies were posted before the current version of LEADTOOLS was released and may no longer be applicable.
#1
Posted: Thursday, March 19, 2009 8:31:54 AM(UTC)
Groups: Registered
Posts: 15
v16/VS2008/C#
I figured out how to delete unwanted characters from the zoneCharacters collection after running OCR on an IOcrDocument. However, now that I can remove zone characters, I have two new issues:
1) I get run-on words where my program deleted invalid zone characters that were at the end of a line.
OCR Output Text File w/zone characters removed
FOR
GENERATOR
I now get in the text file:
FORGENERATOR
2) Removing the zone characters does not appear to remove the extra lines. This is more of an issue if I need to save the result in a non-ASCII text format.
OCR Output Text File w/zone characters removed
2400V
BUS
// <--- leaves blank lines where zone chars were removed
// " " "
---
480V STATION
SERVICE BUS NO 4 YIB
Here is the code fragment:
void IterateOcrResults()
{
    foreach (IOcrPage ocrPage in _document.Pages)
    {
        IOcrPageCharacters pageCharacters = ocrPage.GetRecognizedCharacters();
        foreach (IOcrZoneCharacters zoneCharacters in pageCharacters)
        {
            // collect per zone, so characters from one zone are never removed from another
            List<OcrCharacter> delZoneChars = new List<OcrCharacter>();
            ICollection<OcrWord> recogWords = zoneCharacters.GetWords(ocrPage.DpiX, ocrPage.DpiY, LogicalUnit.Pixel);
            foreach (OcrWord word in recogWords)
            {
                // IsBadWord is a placeholder for the "word is bad" test, which is not shown in this post
                if (IsBadWord(word))
                {
                    for (int i = word.FirstCharacterIndex; i <= word.LastCharacterIndex; i++)
                    {
                        OcrCharacter zoneCharacter = zoneCharacters[i];
                        delZoneChars.Add(zoneCharacter);
                    }
                }
            }
            // remove invalid zone chars
            foreach (OcrCharacter ocrChar in delZoneChars)
            {
                zoneCharacters.Remove(ocrChar);
            }
        }
        // write the edited characters back to the page
        ocrPage.SetRecognizedCharacters(pageCharacters);
    }
}
Do you have any suggestions how to resolve these issues?
Thank you!
Warren
[:)]
#2
Posted: Friday, March 20, 2009 11:06:34 AM(UTC)
Groups: Registered, Tech Support, Administrators
Posts: 764
The problem is most likely that you are deleting the OcrCharacter rather than just modifying it. The OcrCharacter structure has a Position property that flags whether the character is at the end of a line, the end of a paragraph, etc. You can delete the character, but you then need to update the preceding characters so that the Position information is preserved.
I would suggest simply changing the character code to a space, since that achieves nearly the same result and is much simpler to code. However, if that's not an option for you, you'll need to keep track of the most recent valid character so that, when you come upon a character you want to delete whose Position property is something other than None, you can go back and update that earlier character if necessary.
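To illustrate the second approach, here is a minimal, self-contained sketch of carrying position flags back to the last surviving character. It deliberately does not use the LEADTOOLS types: Pos, Ch, ZoneCleanup, RemoveRange and Render are hypothetical stand-ins invented for this example, and the real Position flags and text output may behave differently. Treat it as a model of the idea only.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a position flags enum (not the LEADTOOLS type).
[Flags]
enum Pos { None = 0, EndOfWord = 1, EndOfLine = 2 }

// Hypothetical stand-in for a recognized character (not the LEADTOOLS type).
struct Ch
{
    public char Code;
    public Pos Position;
    public Ch(char code, Pos position) { Code = code; Position = position; }
}

static class ZoneCleanup
{
    // Removes the characters in [first, last] and hands any position flags
    // they carried to the nearest surviving character before the range.
    public static void RemoveRange(List<Ch> chars, int first, int last)
    {
        Pos carried = Pos.None;
        for (int i = first; i <= last; i++)
            carried |= chars[i].Position;

        chars.RemoveRange(first, last - first + 1);

        // If the range started at index 0 there is no earlier character,
        // so the flags are simply dropped.
        if (carried != Pos.None && first > 0)
        {
            Ch prev = chars[first - 1];   // Ch is a struct, so this is a copy
            prev.Position |= carried;
            chars[first - 1] = prev;      // write the modified copy back
        }
    }

    // Toy text writer: a newline after EndOfLine, a space after EndOfWord.
    public static string Render(List<Ch> chars)
    {
        var sb = new StringBuilder();
        foreach (Ch c in chars)
        {
            sb.Append(c.Code);
            if ((c.Position & Pos.EndOfLine) != 0) sb.Append('\n');
            else if ((c.Position & Pos.EndOfWord) != 0) sb.Append(' ');
        }
        return sb.ToString();
    }
}
```

In this model, if the characters are F, O, R, a junk character flagged EndOfLine, then G through R of GENERATOR with the final R flagged EndOfLine, calling ZoneCleanup.RemoveRange(chars, 3, 3) moves the EndOfLine flag onto the R of FOR, so Render produces FOR and GENERATOR on separate lines instead of FORGENERATOR. When an entire junk line is removed, the preceding character already carries EndOfLine, so OR-ing the flag in again changes nothing and no blank line is left behind.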
#3
Posted: Friday, June 12, 2009 9:08:01 AM(UTC)
Groups: Registered
Posts: 3
Please refer to the latest OcrEditDemo in LEADTOOLS 16.5. It includes functionality that does exactly that.