Perform a Word Search in Directories of Images and Documents

Posted on 2019-07-23 by Nick Villalobos

Following up on our last white paper about end-to-end eDiscovery with LEADTOOLS Document Imaging, this post walks through a .NET Core console application that searches every PDF in a given directory for a given word, running OCR on pages that have no extractable text. The application can handle both raster images and documents, so feel free to change the code to search more than PDFs. If the word the user submits is found in a file in the directory, the name of that file is written to the console.
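To search more than PDFs, the file list can be built from a set of extensions instead of a single "*.pdf" pattern. The following is a minimal sketch of that idea using only the .NET base class library. The class name SearchFileCollector and the extension list are assumptions made for illustration and are not part of the sample project; which formats actually load depends on the LEADTOOLS codec packages the project references.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SearchFileCollector
{
    // Hypothetical extension list (an assumption): adjust it to the formats
    // your referenced codec packages can load. Matching ignores case.
    public static readonly HashSet<string> DefaultExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".pdf", ".tif", ".tiff", ".png", ".jpg", ".docx"
        };

    // Returns every file directly inside 'directory' (no subdirectories)
    // whose extension is contained in 'extensions'.
    public static List<string> Collect(string directory, ISet<string> extensions)
    {
        return Directory.EnumerateFiles(directory)
            .Where(path => extensions.Contains(Path.GetExtension(path)))
            .ToList();
    }
}
```

The returned list could then be handed to the search loop in place of the result of a single Directory.GetFiles call, for example SearchFileCollector.Collect(searchDirectory, SearchFileCollector.DefaultExtensions).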

We understand that you could be going through large directories looking for specific words. This application uses nested Parallel.ForEach loops to process not only multiple documents, but also multiple pages within each document at the same time. The LEADTOOLS OCR SDK and the Document class are thread-safe, so no extra coding is required to allow multiple documents and multiple pages to be processed at the same time.
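The nested-loop pattern can be tried out without any imaging SDK. The sketch below is an illustration under stated assumptions, not the sample project's code: the class name ParallelSearchSketch is hypothetical, and each document is represented by an in-memory array of page strings standing in for extracted page text. Because several pages of one document can match at the same moment, it uses an Interlocked flag so that each document is reported only once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelSearchSketch
{
    // 'documents' maps a (hypothetical) file name to the text of each of its pages.
    // Returns the sorted names of the documents that contain 'word', ignoring case.
    public static List<string> FindMatches(IDictionary<string, string[]> documents, string word)
    {
        var matches = new ConcurrentBag<string>();

        // Outer loop: several documents are processed at the same time.
        Parallel.ForEach(documents, document =>
        {
            int reported = 0; // 0 = not reported yet, 1 = already reported

            // Inner loop: several pages of the same document at the same time.
            Parallel.ForEach(document.Value, (pageText, state) =>
            {
                if (pageText.IndexOf(word, StringComparison.OrdinalIgnoreCase) < 0)
                    return;

                // Only the first page to flip the flag reports the document.
                if (Interlocked.Exchange(ref reported, 1) == 0)
                    matches.Add(document.Key);

                // Ask the loop to stop starting further pages of this document.
                state.Stop();
            });
        });

        var result = new List<string>(matches);
        result.Sort(StringComparer.Ordinal);
        return result;
    }
}
```

Pages that are already running when Stop is called still finish, which is why the flag, and not the loop state alone, guards against duplicate reports.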

The core code of this project can be found below. This application uses the LEADTOOLS Document, Formats.Raster.Additional, Formats.Raster.Common, Formats.Raster.Vector, OCR, and PDF NuGet Packages.

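// Namespaces used by Start(): System, System.IO, System.Threading, System.Threading.Tasks,
// Leadtools.Document, Leadtools.Ocr
// files, searchDirectory, wordToSearch, OnFileFound, and FileObject belong to the
// surrounding project and are not shown here
public void Start()
{
    var po = new ParallelOptions();

    //create the LEAD OCR engine once and share it across all documents
    using (var ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false))
    {
        ocrEngine.Startup(null, null, null, null);

        //add all files from the directory to our list. This example uses PDF
        files.AddRange(Directory.GetFiles(searchDirectory, "*.pdf"));
        Parallel.ForEach(files, po, (file, outerState) =>
        {
            var options = new LoadDocumentOptions();
            using (var document = DocumentFactory.LoadFromFile(file, options))
            {
                document.Text.OcrEngine = ocrEngine;

                //set the extraction mode to auto so it will use SVG if it's available and OCR if not
                document.Text.TextExtractionMode = DocumentTextExtractionMode.Auto;

                //0 until a page of this document matches; set to 1 by the first match
                int reported = 0;

                //iterate over each page of each document
                Parallel.ForEach(document.Pages, po, (page, state) =>
                {
                    var pageText = page.GetText();
                    if (pageText == null)
                        return;

                    pageText.BuildText();
                    var text = pageText.Text;

                    //case-insensitive search for the word on this page
                    if (text != null && text.IndexOf(wordToSearch, StringComparison.OrdinalIgnoreCase) >= 0)
                    {
                        //we don't want to add the same document twice, so only the first matching page reports it
                        if (Interlocked.Exchange(ref reported, 1) == 0)
                            OnFileFound(new FileObject(Path.GetFileName(file)));

                        //stop processing the remaining pages of this document as soon as possible
                        state.Stop();
                    }
                });
            }
        });

        //release the OCR engine's resources once every document has been searched
        ocrEngine.Shutdown();
    }

    Console.WriteLine(Environment.NewLine);
    Console.WriteLine("Finished!");
    Console.ReadLine();
}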

Download Project!


To test this with the latest version of the LEADTOOLS NuGet Packages, download the free 30-day evaluation straight from our site. If you have any comments or questions about this project, feel free to comment on this post or contact our Support department at support@leadtools.com.
