PDF files are flexible and portable, but they are not always searchable, and parsing text from PDFs is a very common request. Luckily, the LEADTOOLS OCR Engine makes extracting searchable text from PDF files a breeze. LEAD’s AI-enhanced engine can accept any PDF (searchable or not) and extract the text from it, using OCR where necessary. After extraction, LEADTOOLS can save that information to a text file, a searchable PDF file, or any of our other 150+ supported document formats.
Below are a few outlines on how to get started reading text from PDFs in C#, VB, and Java.
C# – Get Text From PDF
The following is an outline for a C# console app that extracts the text of the first page of an input file, using OCR where needed, and prints it to the console.
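using System;
using System.IO;
using Leadtools.Codecs;
using Leadtools.Document;
using Leadtools.Document.Writer;
using Leadtools.Ocr;

public void DocumentPageGetTextExample()
{
    var options = new LoadDocumentOptions();
    // The using statements release the document, codecs, and OCR engine when the method exits
    using (var document = DocumentFactory.LoadFromFile(Path.Combine(LEAD_VARS.ImagesDir, "input.pdf"), options))
    using (var rasterCodecs = new RasterCodecs())
    using (var ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD))
    {
        // Start the OCR engine and attach it to the document so that
        // pages without a text layer can still be recognized
        var documentWriter = new DocumentWriter();
        ocrEngine.Startup(rasterCodecs, documentWriter, null, LEAD_VARS.OcrLEADRuntimeDir);
        document.Text.OcrEngine = ocrEngine;

        // Get the text of the first page
        var page = document.Pages[0];
        var pageText = page.GetText();
        if (pageText != null)
        {
            // Build the text string from the page's character data
            pageText.BuildText();
            var text = pageText.Text;
            Console.WriteLine(text);
        }
        else
        {
            Console.WriteLine("Failed!");
        }
    }
}

static class LEAD_VARS
{
    public const string ImagesDir = @"C:\Input_File_Path\";
    public const string OcrLEADRuntimeDir = @"C:\LEADTOOLS21\Bin\Common\OcrLEADRuntime";
}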
More information on the GetText Method can be found in LEAD’s documentation.
Visual Basic – Get Text From PDF
The following VB code extracts the text of the first page of an input file, using OCR where needed, and prints it to the console.
Imports System
Imports System.IO
Imports Leadtools.Codecs
Imports Leadtools.Document
Imports Leadtools.Document.Writer
Imports Leadtools.Ocr

Public Shared Sub DocumentPageGetTextExample()
    Dim options As New LoadDocumentOptions()
    Using document As Leadtools.Document.LEADDocument = DocumentFactory.LoadFromFile(Path.Combine(LEAD_VARS.ImagesDir, "input.pdf"), options)
        Using rasterCodecs As New RasterCodecs()
            Using ocrEngine As IOcrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD)
                ' Start the OCR engine and attach it to the document so that
                ' pages without a text layer can still be recognized
                Dim documentWriter As New DocumentWriter()
                ocrEngine.Startup(rasterCodecs, documentWriter, Nothing, LEAD_VARS.OcrLEADRuntimeDir)
                document.Text.OcrEngine = ocrEngine

                ' Get the text of the first page
                Dim page As Leadtools.Document.DocumentPage = document.Pages(0)
                Dim pageText As DocumentPageText = page.GetText()
                If pageText IsNot Nothing Then
                    ' Build the text string from the page's character data
                    pageText.BuildText()
                    Dim text As String = pageText.Text
                    Console.WriteLine(text)
                Else
                    Console.WriteLine("Failed!")
                End If
            End Using
        End Using
    End Using
End Sub

Public NotInheritable Class LEAD_VARS
    ' Same placeholder locations as in the C# example
    Public Const ImagesDir As String = "C:\Input_File_Path\"
    Public Const OcrLEADRuntimeDir As String = "C:\LEADTOOLS21\Bin\Common\OcrLEADRuntime"
End Class
More information on the GetText Method can be found in LEAD’s documentation.
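The C# and VB outlines above print only the first page's text to the console. If the goal is a text file covering every page, the per-page strings can be joined and written out with standard library calls. The following is a minimal sketch in Java (to match the next section), assuming the page text has already been extracted into plain strings. PageTextWriter, joinPages, and writePages are hypothetical names used for illustration and are not part of the LEADTOOLS API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not a LEADTOOLS class: works on plain strings only.
final class PageTextWriter {

    // Joins per-page text, adding a marker line so page boundaries survive in the output.
    static String joinPages(List<String> pageTexts) {
        String newline = System.lineSeparator();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            sb.append("--- Page ").append(i + 1).append(" ---").append(newline);
            // Assumption: a null entry stands for a page whose extraction failed
            sb.append(text == null ? "[no text extracted]" : text.trim());
            sb.append(newline);
        }
        return sb.toString();
    }

    // Writes the joined text to outputFile as UTF-8 and returns the path written.
    static Path writePages(List<String> pageTexts, Path outputFile) throws IOException {
        Files.write(outputFile, joinPages(pageTexts).getBytes(StandardCharsets.UTF_8));
        return outputFile;
    }
}
```

In a real application, the list would be filled by looping over the document's pages and collecting each page's text the same way the examples above do for the first page.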
Java – Get Text From PDF
LEADTOOLS can also store the extracted text in any of its 150+ supported file formats. The following Java example uses the document converter to turn an input file into a searchable PDF.
static void ConvertToDocument(String inputFile, DocumentConverter docConverter, OcrEngine ocrEngine)
{
    // Start the OCR engine with the document writer it will use to produce output
    DocumentWriter docWriter = new DocumentWriter();
    ocrEngine.startup(new RasterCodecs(), docWriter, null, null);
    String outputFile = "C:\\OutputFilePath\\searchablePDF.pdf";

    // Give the converter the same writer and OCR engine
    docConverter.setDocumentWriterInstance(docWriter);
    docConverter.setOcrEngineInstance(ocrEngine, true);

    // Describe the conversion (input, output, target format) and run it
    DocumentConverterJobData jobData = DocumentConverterJobs.createJobData(inputFile, outputFile, DocumentFormat.PDF);
    jobData.setJobName("DocumentConversion");
    DocumentConverterJob job = docConverter.getJobs().createJob(jobData);
    docConverter.getJobs().runJob(job);

    // Report any errors collected while the job ran
    if (job.getErrors().size() > 0)
    {
        for (DocumentConverterJobError error : job.getErrors())
        {
            System.out.println("\nError during conversion: " + error.getError().getMessage());
        }
    }
    else
    {
        System.out.println("Successfully converted file to " + outputFile);
    }
}
LEAD's documentation has a step-by-step guide to converting files with the document converter in Java and C#.
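The Java example above hard-codes the outputFile path, so every conversion would overwrite the same searchable PDF. One way around that is to derive the output name from the input file. The following is a minimal sketch using only java.nio.file; OutputPaths, searchablePdfFor, and the "_searchable" suffix are hypothetical choices for illustration and are not part of the LEADTOOLS API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not a LEADTOOLS class: builds an output path from an input path.
final class OutputPaths {

    // Returns outputDir/<input name without extension>_searchable.pdf
    static Path searchablePdfFor(Path inputFile, Path outputDir) {
        String name = inputFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // dot > 0 keeps names with no extension, or a leading dot only, intact
        String base = dot > 0 ? name.substring(0, dot) : name;
        return outputDir.resolve(base + "_searchable.pdf");
    }

    public static void main(String[] args) {
        Path out = searchablePdfFor(Paths.get("scans", "invoice.tiff"), Paths.get("converted"));
        System.out.println(out);
    }
}
```

The result of searchablePdfFor(...).toString() could then be passed as the outputFile argument to createJobData in place of the fixed string.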
Try for Free
Download the LEADTOOLS SDK for free. It’s fully functional, good for 60 days, and even comes with unlimited chat and email support.
But Wait! There’s More
Did you see our previous post on How To Convert PDF to DOC / DOCX? Stay tuned for more conversion examples to see how the LEADTOOLS document converter will easily fit into any workflow converting PDF files into other document files or images and back again. Need help in the meantime? Contact our support team for free technical support! For pricing or licensing questions, you can contact our sales team via email or call us at 704-332-5532.