Specifies which options to use when parsing the objects of a PDF document.
[SerializableAttribute()]
[FlagsAttribute()]
public enum PDFParsePagesOptions
Value | Member | Description |
---|---|---|
0x00000000 | None | Do not parse any items. |
0x00000001 | Objects | Parse the objects of the page such as text items (characters), images, and rectangles. Specifying this member will populate the PDFDocumentPage.Objects collection with the objects found in the page. |
0x00000002 | Hyperlinks | Parse the hyperlinks found in the page. Specifying this member will populate the PDFDocumentPage.Hyperlinks collection with the hyperlinks found in the page. |
0x00000008 | IgnoreWhiteSpaces | Must be OR'ed with Objects (otherwise it is ignored). If specified, white space characters such as spaces or tabs are not returned as items in the PDFDocumentPage.Objects collection. Use PDFTextProperties.IsEndOfWord and PDFTextProperties.IsEndOfLine to reconstruct the page words and lines as needed. |
0x00000004 | Fonts | Parse the fonts used by the items in the page. |
0x00000010 | Annotations | Parse the annotations found in the page. Specifying this member will populate the PDFDocumentPage.Annotations collection with any annotations found in the page. |
0x00000020 | RTLOriginal | Parse characters right to left as they are stored in the page. |
0x00000040 | RTLFlipBrackets | Flip bracket characters for right to left text when parsing the page. |
0x00000080 | InternalLinks | Parse all internal links found in the page. This is the equivalent of calling PDFDocument.ParseDocumentStructure with the PDFParseDocumentStructureOptions.InternalLinks option. |
0x00000100 | FormFields | Parse the form fields found in the page. Specifying this member will populate the PDFDocumentPage.FormFields collection with the PDF form fields found in the page. |
0x00000200 | Signatures | Parse the digital signatures found in the page. Specifying this member will populate the PDFDocumentPage.Signatures collection with the PDF digital signatures found in the page. |
0x00000317 | All | Parse all objects with white spaces. This is the equivalent of OR'ing Objects, Hyperlinks, Fonts, Annotations, FormFields, and Signatures. |
0x0000031F | AllIgnoreWhiteSpaces | Parse all objects without white spaces. This is the equivalent of OR'ing Objects, Hyperlinks, Fonts, Annotations, FormFields, Signatures, and IgnoreWhiteSpaces. |
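The two composite values in the table can be checked with plain flag arithmetic. The sketch below uses a local stand-in enum, `ParseFlags` (a hypothetical name, not the LEADTOOLS type), that copies the member values listed above so that it runs without the SDK. The `ParseFlagsDemo` helper class is likewise only for illustration.

```csharp
using System;

// Local stand-in that mirrors the values in the table above.
// It is NOT the LEADTOOLS type; it only illustrates the flag arithmetic.
[Flags]
public enum ParseFlags
{
    None = 0x000,
    Objects = 0x001,
    Hyperlinks = 0x002,
    Fonts = 0x004,
    IgnoreWhiteSpaces = 0x008,
    Annotations = 0x010,
    RTLOriginal = 0x020,
    RTLFlipBrackets = 0x040,
    InternalLinks = 0x080,
    FormFields = 0x100,
    Signatures = 0x200
}

public static class ParseFlagsDemo
{
    // OR of the six members that the table lists for All.
    public static ParseFlags ComposeAll()
    {
        return ParseFlags.Objects | ParseFlags.Hyperlinks | ParseFlags.Fonts |
               ParseFlags.Annotations | ParseFlags.FormFields | ParseFlags.Signatures;
    }

    // The same set plus the white space filter.
    public static ParseFlags ComposeAllIgnoreWhiteSpaces()
    {
        return ComposeAll() | ParseFlags.IgnoreWhiteSpaces;
    }
}
```

`(int)ParseFlagsDemo.ComposeAll()` evaluates to 0x317 and `(int)ParseFlagsDemo.ComposeAllIgnoreWhiteSpaces()` to 0x31F, matching the table. The RTLOriginal, RTLFlipBrackets, and InternalLinks bits are not part of either composite, so they have to be OR'ed in explicitly when needed.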
The PDFParsePagesOptions enumeration is used as the type of the options parameter passed to the PDFDocument.ParsePages method.
When a PDFDocument object is created, the pages of the PDF document are already parsed and populated in the PDFDocument.Pages collection. Each page can contain other objects such as text items (characters), images, rectangles, hyperlinks, annotations, form fields, and digital signatures, as well as the fonts used in these items. These items are not parsed automatically for performance reasons. Instead, call the PDFDocument.ParsePages method with the page ranges you are interested in (or all pages), and the type of items to parse.
Initially, the values of the PDFDocumentPage.Objects, PDFDocumentPage.Hyperlinks, PDFDocumentPage.Annotations, PDFDocumentPage.FormFields, and PDFDocumentPage.Signatures lists of each PDFDocumentPage will be set to null. After the PDFDocument.ParsePages method is called, the corresponding list will be populated with the items found in the page.
Any type of item can be parsed. This is done through the options parameter of type PDFParsePagesOptions passed to PDFDocument.ParsePages. The different options and results are as follows:
If PDFParsePagesOptions.Objects is specified, then the PDFDocumentPage.Objects collection will be populated with a PDFObject object for each object item found in the page. These items can be text (characters), images, or rectangles. If no object items are found in the page, PDFDocumentPage.Objects will be initialized with an empty collection (PDFDocumentPage.Objects.Count will be 0).
If PDFParsePagesOptions.Hyperlinks is specified, then the PDFDocumentPage.Hyperlinks collection will be populated with a PDFHyperlink object for each hyperlink item found in the page. If no hyperlinks are found in the page, PDFDocumentPage.Hyperlinks will be initialized with an empty collection (PDFDocumentPage.Hyperlinks.Count will be 0).
If PDFParsePagesOptions.Annotations is specified, then the PDFDocumentPage.Annotations collection will be populated with a PDFAnnotation object for each annotation item found in the page. If no annotations are found in the page, PDFDocumentPage.Annotations will be initialized with an empty collection (PDFDocumentPage.Annotations.Count will be 0).
If PDFParsePagesOptions.FormFields is specified, then the PDFDocumentPage.FormFields collection will be populated with a PDFFormField object for each form field item found in the page. If no form fields are found in the page, PDFDocumentPage.FormFields will be initialized with an empty collection (PDFDocumentPage.FormFields.Count will be 0).
If PDFParsePagesOptions.Signatures is specified, then the PDFDocumentPage.Signatures collection will be populated with a PDFSignature object for each digital signature item found in the page. If no signatures are found in the page, PDFDocumentPage.Signatures will be initialized with an empty collection (PDFDocumentPage.Signatures.Count will be 0).
White space characters such as spaces or tabs are parsed by default and returned as individual objects. To suppress this behavior, OR the PDFParsePagesOptions.IgnoreWhiteSpaces enumeration member with PDFParsePagesOptions.Objects in the options parameter passed to PDFDocument.ParsePages. Note that the words and lines of text in the page can be reconstructed without white space characters by using the PDFTextProperties.IsEndOfWord and PDFTextProperties.IsEndOfLine properties. The example of PDFTextProperties shows how to do that.
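When white space items are ignored, the end-of-word and end-of-line flags carry the layout. The following is a minimal sketch of the reconstruction, not the SDK's own example. It uses a hypothetical `Glyph` stand-in type in place of a parsed text item (its `Code`, `IsEndOfWord`, and `IsEndOfLine` fields mirror PDFObject.Code, PDFTextProperties.IsEndOfWord, and PDFTextProperties.IsEndOfLine), so it runs without the SDK. The `TextRebuilder` class name is also an assumption.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a parsed text item; not a LEADTOOLS type.
public struct Glyph
{
    public char Code;
    public bool IsEndOfWord;
    public bool IsEndOfLine;

    public Glyph(char code, bool isEndOfWord, bool isEndOfLine)
    {
        Code = code;
        IsEndOfWord = isEndOfWord;
        IsEndOfLine = isEndOfLine;
    }
}

public static class TextRebuilder
{
    // Rebuilds lines of text from glyphs that contain no white space items.
    public static List<string> BuildLines(IEnumerable<Glyph> glyphs)
    {
        var lines = new List<string>();
        var current = new StringBuilder();

        foreach (Glyph g in glyphs)
        {
            current.Append(g.Code);

            if (g.IsEndOfLine)
            {
                // Close the line; no trailing space is added.
                lines.Add(current.ToString());
                current.Clear();
            }
            else if (g.IsEndOfWord)
            {
                // Re-insert the space that was skipped during parsing.
                current.Append(' ');
            }
        }

        // Keep any text that was not terminated by an end-of-line flag.
        if (current.Length > 0)
            lines.Add(current.ToString().TrimEnd());

        return lines;
    }
}
```

With real parsed data, only the text items of PDFDocumentPage.Objects would be fed into such a routine; images and rectangles would have to be skipped first.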
The values of PDFParsePagesOptions can be OR'ed together.
Note on using PDFParsePagesOptions.Signatures: PDFDocument.ParsePages automatically calls PDFDocument.GetDigitalSignatureSupportStatus to query whether PDF digital signatures can be read. If this method indicates that digital signatures are not available or not supported, then the PDFParsePagesOptions.Signatures flag is removed from the options and the signatures are not read.
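The effect described in this note amounts to masking one flag out of the requested options. Below is a sketch of that masking only, under stated assumptions: `ParseFlagsSubset` is a hypothetical local stand-in for a few of the enumeration members, and the `signaturesSupported` parameter stands in for whatever the support query reports (its actual return type is not shown in this topic).

```csharp
using System;

// Local stand-in with a few of the member values; not the LEADTOOLS type.
[Flags]
public enum ParseFlagsSubset
{
    None = 0x000,
    Objects = 0x001,
    Hyperlinks = 0x002,
    Signatures = 0x200
}

public static class SignatureFlagDemo
{
    // Returns the options that would effectively be used for parsing.
    public static ParseFlagsSubset EffectiveOptions(ParseFlagsSubset requested, bool signaturesSupported)
    {
        if (!signaturesSupported)
        {
            // Clear only the Signatures bit; all other flags are kept.
            requested &= ~ParseFlagsSubset.Signatures;
        }
        return requested;
    }
}
```

A practical consequence is that code which requests signatures may want to check the support status itself before relying on the contents of PDFDocumentPage.Signatures.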
using System;
using System.Collections.Generic;
using System.IO;
using Leadtools;
using Leadtools.Codecs;
using Leadtools.Controls;
using Leadtools.Pdf;
using Leadtools.Svg;
using Leadtools.WinForms;
public void PDFDocumentParsePagesExample()
{
    string pdfFileName = Path.Combine(LEAD_VARS.ImagesDir, @"Leadtools.pdf");
    string txtFileName = Path.Combine(LEAD_VARS.ImagesDir, @"LEAD_pdf.txt");

    // Open the document
    using (PDFDocument document = new PDFDocument(pdfFileName))
    {
        // Parse everything and for all pages
        PDFParsePagesOptions options = PDFParsePagesOptions.All;
        document.ParsePages(options, 1, -1);

        // Save the results to the text file for examining
        using (StreamWriter writer = File.CreateText(txtFileName))
        {
            foreach (PDFDocumentPage page in document.Pages)
            {
                writer.WriteLine("Page {0}", page.PageNumber);

                IList<PDFObject> objects = page.Objects;
                writer.WriteLine("Objects: {0}", objects.Count);
                foreach (PDFObject obj in objects)
                {
                    writer.WriteLine(" ObjectType: {0}", obj.ObjectType.ToString());
                    writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", obj.Bounds.Left, obj.Bounds.Top, obj.Bounds.Right, obj.Bounds.Bottom);
                    WriteTextProperties(writer, obj.TextProperties);
                    writer.WriteLine(" Code: {0}", obj.Code);
                    writer.WriteLine("------");
                }
                writer.WriteLine("---------------------");

                IList<PDFHyperlink> hyperlinks = page.Hyperlinks;
                writer.WriteLine("Hyperlinks: {0}", hyperlinks.Count);
                foreach (PDFHyperlink hyperlink in hyperlinks)
                {
                    writer.WriteLine(" Hyperlink: {0}", hyperlink.Hyperlink);
                    writer.WriteLine(" Bounds: {0}, {1}, {2}, {3}", hyperlink.Bounds.Left, hyperlink.Bounds.Top, hyperlink.Bounds.Right, hyperlink.Bounds.Bottom);
                    WriteTextProperties(writer, hyperlink.TextProperties);
                }
                writer.WriteLine("---------------------");
            }
        }
    }
}

private static void WriteTextProperties(StreamWriter writer, PDFTextProperties textProperties)
{
    writer.WriteLine(" TextProperties.FontHeight: {0}", textProperties.FontHeight.ToString());
    writer.WriteLine(" TextProperties.FontWidth: {0}", textProperties.FontWidth.ToString());
    writer.WriteLine(" TextProperties.FontIndex: {0}", textProperties.FontIndex.ToString());
    writer.WriteLine(" TextProperties.IsEndOfWord: {0}", textProperties.IsEndOfWord.ToString());
    writer.WriteLine(" TextProperties.IsEndOfLine: {0}", textProperties.IsEndOfLine.ToString());
    writer.WriteLine(" TextProperties.Color: {0}", textProperties.Color.ToString());
}

static class LEAD_VARS
{
    public const string ImagesDir = @"C:\LEADTOOLS22\Resources\Images";
}