#1
Posted: Wednesday, July 25, 2018 4:07:30 PM (UTC)
My application processes a lot of PDFs and I’m using the code below to extract the text from the PDFs.
// Assumes System.Collections.Generic and the LEADTOOLS document namespace are imported.
List<DocumentPageText> documentText = new List<DocumentPageText>();
LoadDocumentOptions documentOptions = new LoadDocumentOptions();
var inputDocument = DocumentFactory.LoadFromFile(imageFile, documentOptions);
// Use automatic text extraction mode.
inputDocument.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;
foreach (var page in inputDocument.Pages)
{
    // Get the page text, then build its word list and its text with character map.
    DocumentPageText pageText = page.GetText();
    pageText.BuildWords();
    pageText.BuildTextWithMap();
    documentText.Add(pageText);
}
Occasionally I come across a PDF where the text is not extracted quite right, even though opening the PDF in Acrobat, choosing Edit/Select All/Copy, and pasting into Notepad gives me the right result.
For example, in the PDF I have attached, the word “Total” at the bottom of the document is split into two words:
<word left="1741" top="2886" right="1790" bottom="2922">Tot</word>
<word left="1801" top="2886" right="1828" bottom="2922">al</word>
The same happens with the word “Balance”:
<word left="1741" top="3024" right="1788" bottom="3060">Bal</word>
<word left="1801" top="3024" right="1880" bottom="3060">ance</word>
I see this type of issue on seemingly random documents. Most other documents from this particular vendor don’t have it.
Do you have any suggestions on how this might be resolved so that “Total” and “Balance” are extracted as single words?
(I’ll send you the PDF via email when I get a reply.)
#2
Posted: Wednesday, July 25, 2018 4:29:48 PM (UTC)
Hello Judy,
That is interesting. I'm not entirely sure how we extract the text information from a PDF, but the text might actually be stored in the file the way we are returning it. You'll notice from the bounds listed that the two "words" in each pair are only 11 to 13 units apart (right=1790 to left=1801 for "Tot"/"al", right=1788 to left=1801 for "Bal"/"ance"), so visually they'll appear together. If you could send me the PDF via email, I could look into this issue further for you.
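In the meantime, one possible stopgap is to merge such fragments after extraction. The sketch below is only an illustration under stated assumptions: it does not use the LEADTOOLS API, and the `Word` class, `WordMerger.MergeFragments`, and the `maxGap` threshold of 15 units are hypothetical names and values chosen for this example. It joins consecutive words that share the same top and bottom bounds and whose horizontal gap is at most `maxGap`, using the coordinates from the first post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an extracted word; not a LEADTOOLS type.
public sealed class Word
{
    public string Text { get; }
    public int Left { get; }
    public int Top { get; }
    public int Right { get; }
    public int Bottom { get; }

    public Word(string text, int left, int top, int right, int bottom)
    {
        Text = text; Left = left; Top = top; Right = right; Bottom = bottom;
    }
}

public static class WordMerger
{
    // Joins consecutive words that sit on the same line (equal top and bottom)
    // and are separated horizontally by at most maxGap units.
    public static List<Word> MergeFragments(IEnumerable<Word> words, int maxGap)
    {
        var merged = new List<Word>();
        foreach (var word in words)
        {
            if (merged.Count > 0)
            {
                Word prev = merged[merged.Count - 1];
                bool sameLine = prev.Top == word.Top && prev.Bottom == word.Bottom;
                int gap = word.Left - prev.Right;
                if (sameLine && gap >= 0 && gap <= maxGap)
                {
                    // Replace the previous entry with the joined word and widened bounds.
                    merged[merged.Count - 1] = new Word(
                        prev.Text + word.Text, prev.Left, prev.Top, word.Right, prev.Bottom);
                    continue;
                }
            }
            merged.Add(word);
        }
        return merged;
    }
}

public static class Program
{
    public static void Main()
    {
        // Coordinates copied from the word output in the first post.
        var words = new List<Word>
        {
            new Word("Tot", 1741, 2886, 1790, 2922),
            new Word("al", 1801, 2886, 1828, 2922),
            new Word("Bal", 1741, 3024, 1788, 3060),
            new Word("ance", 1801, 3024, 1880, 3060),
        };

        foreach (Word w in WordMerger.MergeFragments(words, 15))
        {
            Console.WriteLine($"{w.Text} [{w.Left}, {w.Right}]");
        }
        // Prints:
        // Total [1741, 1828]
        // Balance [1741, 1880]
    }
}
```

Keep in mind that a fixed threshold is a heuristic: if genuine word spacing in a document is also narrower than `maxGap`, separate words would be joined as well, so the value would need tuning against real samples.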
Thanks,
Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.