PDF SVG text extraction

#1 Posted : Wednesday, July 25, 2018 4:07:30 PM(UTC)

jnethercutt

Groups: Registered
Posts: 26

Thanks: 3 times

My application processes a lot of PDFs and I’m using the code below to extract the text from the PDFs.

var documentText = new List<DocumentPageText>();
var documentOptions = new LoadDocumentOptions();
var inputDocument = DocumentFactory.LoadFromFile(imageFile, documentOptions);
inputDocument.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;
foreach (var page in inputDocument.Pages)
{
    // Extract the page text, then build the word list and the mapped text
    var pageText = page.GetText();
    pageText.BuildWords();
    pageText.BuildTextWithMap();
    documentText.Add(pageText);
}

Occasionally I come across a PDF where the text is not extracted quite right, even though opening the PDF in Acrobat, doing Edit/Select All/Copy, and pasting into Notepad gives me the right result.

For example, in the PDF I have attached, the word “Total” at the bottom of the document is split into two words:
<word left="1741" top="2886" right="1790" bottom="2922">Tot</word>
<word left="1801" top="2886" right="1828" bottom="2922">al</word>

As well as the word “Balance”:
<word left="1741" top="3024" right="1788" bottom="3060">Bal</word>
<word left="1801" top="3024" right="1880" bottom="3060">ance</word>

I see this type of issue on random documents. Most other documents from this particular vendor don’t have this issue.

Do you have any suggestions on how this might be resolved so that “Total” and “Balance” are extracted as single words?

(I’ll send you the PDF via email when I get a reply.)

#2 Posted : Wednesday, July 25, 2018 4:29:48 PM(UTC)

Anthony Northrup

Groups: Registered, Tech Support, Administrators
Posts: 199

Was thanked: 28 time(s) in 28 post(s)

Hello Judy,

That is interesting. I'm not entirely sure how we extract the text information from a PDF, but the text might actually be stored in the file the way we are returning it. You'll notice from the bounds listed that the two "words" are only 11 to 13 pixels apart, so visually they'll appear together. If you could send me the PDF via email, I could look into this issue further for you.
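In the meantime, one possible workaround is to post-process the extracted words and join fragments that sit close together on the same line. Below is a minimal sketch of that idea, not LEADTOOLS functionality: `WordBox`, `WordMerger`, `maxGap`, and `expectedTerms` are hypothetical names made up for illustration, and you would copy the text and bounds from the extracted words into `WordBox` yourself. Because a gap of 11 to 13 is roughly the width of a normal space at this text height, the sketch only merges when the joined text is also a term you expect (such as "Total" or "Balance"), so genuinely separate words are left alone. It also assumes fragments of one word share the same top and bottom, as they do in your sample. Requires C# 9 or later for `record`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one extracted word; not a LEADTOOLS type.
public record WordBox(string Text, int Left, int Top, int Right, int Bottom);

public static class WordMerger
{
    // Joins neighboring fragments that share the same top/bottom, are at most
    // maxGap units apart, and whose combined text is one of the expected terms.
    public static List<WordBox> MergeFragments(
        IEnumerable<WordBox> words, int maxGap, ISet<string> expectedTerms)
    {
        var result = new List<WordBox>();

        // Reading order: top to bottom, then left to right.
        foreach (var word in words.OrderBy(w => w.Top).ThenBy(w => w.Left))
        {
            if (result.Count > 0)
            {
                var prev = result[result.Count - 1];
                bool sameLine = prev.Top == word.Top && prev.Bottom == word.Bottom;
                int gap = word.Left - prev.Right;
                string joined = prev.Text + word.Text;

                if (sameLine && gap >= 0 && gap <= maxGap && expectedTerms.Contains(joined))
                {
                    // Replace the previous fragment with the merged word,
                    // widening its bounds to cover both fragments.
                    result[result.Count - 1] =
                        new WordBox(joined, prev.Left, prev.Top, word.Right, prev.Bottom);
                    continue;
                }
            }
            result.Add(word);
        }
        return result;
    }
}
```

With the four fragments from your post, a `maxGap` of 15, and expected terms "Total" and "Balance", this returns two words: "Total" with bounds 1741 to 1828 and "Balance" with bounds 1741 to 1880. The gap limit and the term list are assumptions you would need to tune for your own documents.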

Thanks,

Anthony Northrup
Developer Support Engineer
LEAD Technologies, Inc.

