Convert PDF to Text in C#, VB, and Java

Posted on 2021-04-05 by Zac Ferraresi

While PDF files are flexible and portable, unfortunately they are not always searchable. In fact, a very common request is for the ability to parse text from PDFs. Luckily, the LEADTOOLS OCR Engine makes extracting searchable text from PDF files a breeze. LEAD’s AI-enhanced engine can accept any PDF (searchable or not) and extract the text from it, using OCR where necessary. After extraction, LEADTOOLS can save that text to a text file, a searchable PDF file, or any of our other 150+ supported document formats.

Below are a few outlines on how to get started reading text from PDFs in C#, VB, and Java.

C# – Get Text From PDF

The following is an outline for a C# console app that loads input.pdf, runs OCR where needed, and prints the text of the first page to the console.

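// Namespaces this outline relies on
using System;
using System.IO;
using Leadtools.Codecs;
using Leadtools.Document;
using Leadtools.Document.Writer;
using Leadtools.Ocr;

public void DocumentPageGetTextExample()
{
 // Create the codecs and the LEAD OCR engine; the using blocks release both when done
 using (var rasterCodecs = new RasterCodecs())
 using (var ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD))
 {
  var documentWriter = new DocumentWriter();
  ocrEngine.Startup(rasterCodecs, documentWriter, null, LEAD_VARS.OcrLEADRuntimeDir);

  var options = new LoadDocumentOptions();
  using (var document = DocumentFactory.LoadFromFile(Path.Combine(LEAD_VARS.ImagesDir, "input.pdf"), options))
  {
   // Give the document an OCR engine to use for pages without embedded text
   document.Text.OcrEngine = ocrEngine;

   // Get the text of the first page
   var page = document.Pages[0];
   var pageText = page.GetText();
   if (pageText != null)
   {
    // Build the text string from the parsed page text
    pageText.BuildText();
    var text = pageText.Text;

    Console.WriteLine(text);
   }
   else
   {
    Console.WriteLine("Failed!");
   }
  }
 }
}

static class LEAD_VARS
{
 // Folder that contains input.pdf
 public const string ImagesDir = @"C:\Input_File_Path\";
 // Location of the LEAD OCR runtime files
 public const string OcrLEADRuntimeDir = @"C:\LEADTOOLS21\Bin\Common\OcrLEADRuntime";
}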

More information on the GetText Method can be found in LEAD’s documentation.
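
The outline above prints the page text to the console. If the text should land in a plain text file instead, that last step needs nothing beyond the standard library once the text is in a string. Here is a minimal sketch in Java, assuming the extracted text is already available as a String. The names ExtractedTextWriter and saveText and the placeholder text are hypothetical, and nothing here is a LEADTOOLS call.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExtractedTextWriter {
    // Hypothetical helper (not a LEADTOOLS API): writes already-extracted text as UTF-8,
    // creating the parent folder if needed and overwriting any existing file.
    public static Path saveText(String text, Path outputFile) throws IOException {
        Path parent = outputFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(outputFile, text.getBytes(StandardCharsets.UTF_8));
        return outputFile;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder standing in for the text returned by the OCR step
        String text = "Placeholder for text returned by the OCR step";
        Path saved = saveText(text, Files.createTempFile("page-text", ".txt"));
        System.out.println("Wrote " + Files.size(saved) + " bytes to " + saved);
    }
}
```

Writing UTF-8 explicitly keeps accented and other non-ASCII characters from the OCR result intact regardless of the platform's default encoding.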

Visual Basic – Get Text From PDF

The following VB code loads input.pdf, runs OCR where needed, and prints the text of the first page to the console.

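' Namespaces this outline relies on
Imports System.IO
Imports Leadtools.Codecs
Imports Leadtools.Document
Imports Leadtools.Document.Writer
Imports Leadtools.Ocr

Public Shared Sub DocumentPageGetTextExample()
 ' Create the codecs and the LEAD OCR engine; the Using blocks release both when done
 Using rasterCodecs As New RasterCodecs()
  Using ocrEngine As IOcrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD)
   Dim documentWriter As New DocumentWriter()
   ocrEngine.Startup(rasterCodecs, documentWriter, Nothing, LEAD_VARS.OcrLEADRuntimeDir)

   Dim options As New LoadDocumentOptions()
   Using document As Leadtools.Document.LEADDocument = DocumentFactory.LoadFromFile(Path.Combine(LEAD_VARS.ImagesDir, "input.pdf"), options)
    ' Give the document an OCR engine to use for pages without embedded text
    document.Text.OcrEngine = ocrEngine

    ' Get the text of the first page
    Dim page As Leadtools.Document.DocumentPage = document.Pages(0)
    Dim pageText As DocumentPageText = page.GetText()
    If pageText IsNot Nothing Then
     ' Build the text string from the parsed page text
     pageText.BuildText()
     Dim text As String = pageText.Text

     Console.WriteLine(text)
    Else
     Console.WriteLine("Failed!")
    End If
   End Using
  End Using
 End Using
End Sub

Public NotInheritable Class LEAD_VARS
 ' Folder that contains input.pdf
 Public Const ImagesDir As String = "C:\Input_File_Path\"
 ' Location of the LEAD OCR runtime files
 Public Const OcrLEADRuntimeDir As String = "C:\LEADTOOLS21\Bin\Common\OcrLEADRuntime"
End Class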

More information on the GetText Method can be found in LEAD’s documentation.

Java – Get Text From PDF

The LEADTOOLS engine can also store extracted text in any of over 150 supported file formats. The following Java method uses the document converter to OCR an input file and save the result as a searchable PDF.

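import leadtools.codecs.RasterCodecs;
import leadtools.document.converter.*;
import leadtools.document.writer.*;
import leadtools.ocr.OcrEngine;

static void ConvertToDocument(String inputFile, DocumentConverter docConverter, OcrEngine ocrEngine)
{
 // Start the OCR engine with a codecs instance and the document writer that produces the output
 DocumentWriter docWriter = new DocumentWriter();
 ocrEngine.startup(new RasterCodecs(), docWriter, null, null);

 String outputFile = "C:\\OutputFilePath\\searchablePDF.pdf";

 // Hand the writer and the OCR engine to the converter
 docConverter.setDocumentWriterInstance(docWriter);
 docConverter.setOcrEngineInstance(ocrEngine, true);

 // Describe the job: input file, output file, and output format (PDF)
 DocumentConverterJobData jobData = DocumentConverterJobs.createJobData(inputFile, outputFile, DocumentFormat.PDF);
 jobData.setJobName("DocumentConversion");

 // Create and run the conversion job
 DocumentConverterJob job = docConverter.getJobs().createJob(jobData);
 docConverter.getJobs().runJob(job);

 // Report any errors collected during the run
 if (job.getErrors().size() > 0)
 {
  for (DocumentConverterJobError error : job.getErrors())
  {
   System.out.println("\nError during conversion: " + error.getError().getMessage());
  }
 }
 else
 {
  System.out.println("Successfully converted file to " + outputFile);
 }
}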

LEAD's documentation has a step-by-step guide to converting files with the document converter in Java and C#.
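
The Java method above hard-codes its output path. When converting more than one file, it can be handy to derive the output name from the input name instead. Here is a minimal sketch in plain Java with no LEADTOOLS calls. The names OutputPaths and searchablePdfPathFor, the "_searchable" suffix, and the sample folders are all hypothetical choices made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class OutputPaths {
    // Hypothetical helper (not part of LEADTOOLS): returns "<outputDir>/<input name>_searchable.pdf".
    // Assumes inputFile names a file, so getFileName() is not null.
    public static Path searchablePdfPathFor(Path inputFile, Path outputDir) {
        String fileName = inputFile.getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        // Strip the extension if there is one; a leading dot (hidden file) is kept as part of the name
        String baseName = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return outputDir.resolve(baseName + "_searchable.pdf");
    }

    public static void main(String[] args) {
        Path output = searchablePdfPathFor(Paths.get("scans", "invoice.pdf"), Paths.get("output"));
        // The resulting string could be passed wherever an output file name is expected
        System.out.println(output);
    }
}
```

Calling output.toString() gives a value that could stand in for the outputFile string in the conversion method.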

Try for Free

Download the LEADTOOLS SDK for free. It’s fully functional, good for 60 days, and even comes with unlimited chat and email support.

But Wait! There’s More

Did you see our previous post on How To Convert PDF to DOC / DOCX? Stay tuned for more conversion examples to see how the LEADTOOLS document converter will easily fit into any workflow converting PDF files into other document files or images and back again. Need help in the meantime? Contact our support team for free technical support! For pricing or licensing questions, you can contact our sales team via email or call us at 704-332-5532.
