LEADTOOLS Support
Document
Document SDK Examples
HOW TO: Extract and Redact Text from a File Based on a Regular Expression
#1 Posted: Thursday, June 21, 2018 3:48:55 PM (UTC)
Files often contain sensitive information, and when archiving digital files it is often important to remove that data (for example, social security numbers or the MICR information on checks). The attached demo, written in C# using V20 of the LEADTOOLS SDK, shows how to take an input file, extract all of its text, and then search for and redact any text that matches a regular expression. For the purposes of this demo, we search for any word containing LEAD or LEADTOOLS.
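To make the matching rule concrete, here is a minimal sketch that uses only System.Text.RegularExpressions and makes no LEADTOOLS calls. The RedactionPatterns class, its member names, and the SSN format are assumptions made for illustration, not part of the demo. It shows that an unanchored pattern flags any word that merely contains "lead" in any casing, and how a word-boundary variant or a social security number pattern could be swapped in instead.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the LEADTOOLS SDK.
public static class RedactionPatterns
{
    // Same pattern as the demo: matches LEAD or LEADTOOLS anywhere inside a word.
    public static readonly Regex Contains =
        new Regex("(LEADTOOLS|LEAD)", RegexOptions.IgnoreCase);

    // Stricter variant: LEAD or LEADTOOLS must stand alone as a word.
    // Trailing punctuation (as in "Leadtools,") is still allowed.
    public static readonly Regex WholeWord =
        new Regex(@"\b(LEADTOOLS|LEAD)\b", RegexOptions.IgnoreCase);

    // Assumed format for a US social security number: 123-45-6789.
    public static readonly Regex Ssn = new Regex(@"\b\d{3}-\d{2}-\d{4}\b");

    // Returns the words that the given pattern would mark for redaction.
    public static string[] Matches(Regex pattern, params string[] words) =>
        words.Where(w => pattern.IsMatch(w)).ToArray();
}
```

For the words "LEADTOOLS", "Leadtools,", "misleading", "lead" and "SDK", the Contains pattern selects the first four, while WholeWord leaves "misleading" alone. Which behavior is appropriate depends on how aggressive the redaction should be.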
The code:
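// Namespaces used by this snippet.
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using Leadtools;
using Leadtools.Annotations.Engine;
using Leadtools.Annotations.Rendering;
using Leadtools.Codecs;
using Leadtools.Document;
using Leadtools.Ocr;

// OcrEnginePath is defined elsewhere in the demo and points to the OCR runtime files.
string inputFile = @"C:\Users\Public\Documents\LEADTOOLS Images\leadtools.pdf";
string outputFile = Path.Combine(Path.GetDirectoryName(inputFile), Path.GetFileNameWithoutExtension(inputFile) + "-redacted.pdf");

// Regex to search for all instances of LEADTOOLS or LEAD (case-insensitive).
// Created once and shared; Regex instances are safe to use from multiple threads.
var rgx = new Regex("(LEADTOOLS|LEAD)", RegexOptions.IgnoreCase);

using (var ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false))
{
    ocrEngine.Startup(null, null, null, OcrEnginePath);
    var options = new LoadDocumentOptions();
    using (var document = DocumentFactory.LoadFromFile(inputFile, options))
    {
        document.IsReadOnly = false;
        document.Text.OcrEngine = ocrEngine;

        // Build a multipage RasterImage that holds every page of the document.
        RasterImage redactedDocument = document.Pages.First().GetImage();
        foreach (var page in document.Pages.Skip(1))
            redactedDocument.AddPage(page.GetImage());

        var pageLock = new object();
        Parallel.ForEach(document.Pages, (page) =>
        {
            // Extract the page text and split it into words.
            var pageText = page.GetText();
            pageText.BuildWords();

            // Collect a redaction object for every word that matches the pattern.
            var annotations = new ConcurrentBag<AnnRedactionObject>();
            Parallel.ForEach(pageText.Words, (word) =>
            {
                // IgnoreCase already handles casing, so ToLower() is not needed.
                if (rgx.IsMatch(word.Value))
                {
                    AnnRedactionObject redactionObject = new AnnRedactionObject();
                    redactionObject.Rect = word.Bounds;
                    redactionObject.Fill = AnnSolidColorBrush.Create("Black");
                    annotations.Add(redactionObject);
                }
            });

            // Add the redactions to a container and burn them into the page image.
            AnnContainer container = new AnnContainer();
            foreach (var annotation in annotations)
                container.Children.Add(annotation);
            var imagePage = page.GetImage();
            AnnWinFormsRenderingEngine renderingEngine = new AnnWinFormsRenderingEngine();
            renderingEngine.RenderOnImage(container, imagePage);

            // The output image is shared by all threads, so serialize page replacement.
            lock (pageLock)
                redactedDocument.ReplacePage(page.PageNumber, imagePage);
        });

        // Save the redacted pages as a raster PDF.
        using (RasterCodecs codecs = new RasterCodecs())
            codecs.Save(redactedDocument, outputFile, RasterImageFormat.RasPdfJpeg, 0);
        redactedDocument.Dispose();
        Console.WriteLine($"File has been successfully redacted, and saved to {outputFile}");
    }
}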
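The demo gathers matching words from a parallel loop into a ConcurrentBag and only afterwards adds them to the annotation container on a single thread. The sketch below isolates that pattern with plain .NET types so it can be tried without the SDK. Box, Word, and RedactionSearch.FindRedactionBoxes are hypothetical stand-ins for the SDK's rectangle and word types, not LEADTOOLS APIs.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical stand-in for a word's bounding rectangle.
public struct Box
{
    public double X, Y, Width, Height;
}

// Hypothetical stand-in for a recognized word.
public class Word
{
    public string Value;
    public Box Bounds;
}

public static class RedactionSearch
{
    // Scans the words in parallel and returns the bounds of every match.
    public static List<Box> FindRedactionBoxes(IEnumerable<Word> words, Regex pattern)
    {
        // ConcurrentBag accepts Add calls from several threads without extra locking.
        var hits = new ConcurrentBag<Box>();
        Parallel.ForEach(words, word =>
        {
            if (pattern.IsMatch(word.Value))
                hits.Add(word.Bounds);
        });

        // A ConcurrentBag has no defined order, so sort top-to-bottom, then
        // left-to-right to get a stable result.
        return hits.OrderBy(b => b.Y).ThenBy(b => b.X).ToList();
    }
}
```

Only the search runs in parallel here. Anything that mutates a shared object, such as a container or an output image, is left to the caller to do on one thread or under a lock.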
Edited by moderator Monday, February 3, 2020 3:22:55 PM (UTC) | Reason: Not specified
Duncan Quirk
Developer Support Engineer
LEAD Technologies, Inc.
#2 Posted: Monday, February 17, 2020 1:42:35 PM (UTC)
The attached project is a Visual Basic sample of the same redaction functionality.
It includes a LEADTOOLS sample PDF document and an example of the output file.
Chris Thompson
Developer Support Engineer
LEAD Technologies, Inc.