Using LEADTOOLS Document Converters

The Document Converters allow conversion from one document type to another with a minimal amount of code.

The input and output documents can be in any file format supported by LEADTOOLS, including but not limited to:

Adobe Acrobat PDF and PDF/A
Microsoft Office DOCX/DOC, XLSX/XLS and PPTX/PPT
CAD formats such as DXF, DWG and DWF
TIFF, JPEG, PNG, EXIF, BMP and hundreds more raster image formats
AFP, MODCA and PTOCA

The DocumentConverter class analyzes the input and output document types and then automatically uses a combination of the LEADTOOLS Raster, SVG and OCR engines to convert the data with the best possible balance of accuracy and speed. Each conversion operation is called a Document Converter Job in the framework.

Input Document

DocumentConverter uses the LEADTOOLS Documents Library to obtain information on the input file. The Document class encapsulates the file format details and exposes a uniform set of functionality for reading the pages and parsing the data needed for the conversion job. This includes loading page data as RasterImage or SvgDocument objects, reading the table of contents and internal page links, and reading any annotation objects embedded in the file or stored in an associated file.

Output Document

The output document file format is divided into two categories:

Document File Formats. The generated output file will have all the text, images, shapes and any other objects found in the input document converted as is. Examples of these documents are searchable PDF and Microsoft Word DOCX files.

In this mode, the converter engine will use SvgDocument or IOcrEngine technologies to parse the text and objects from the input document regardless of its type. For example, SVG is used if the input file is also a document format (such as PDF) and OCR is used if the input file is a raster format (such as TIFF).

The main object used for creating these types of documents is DocumentWriter. Before running any conversion operations, you must set a new DocumentWriter instance in the converter using SetDocumentWriterInstance. You can get this value at any time using DocumentWriterInstance. Use this value to set up any extra document format options needed using DocumentWriter.SetOptions.

Raster File Formats. The generated output file will only have the raster image representation of the text, images, shapes and any other objects found in the input document. Examples of these documents are TIFF, JPEG, PNG or raster PDF files.

In this mode, the converter engine uses neither SVG nor OCR. Instead, it relies on the extensive file format support provided by LEADTOOLS to load the pages of the input document as raster images and save them directly into the output file.

The main object used for creating these types of documents is RasterCodecs.
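
As a minimal sketch of the document-format path described above, the following shows one way to configure a DocumentWriter for PDF/A output and attach it to the converter. The namespaces (version 19 style naming), the PdfDocumentOptions members and the helper name UsePdfAWriter are assumptions made for illustration; this topic does not confirm them, so check them against the SDK version in use.

```csharp
using Leadtools.Documents.Converters;   // assumed namespace (v19-style naming)
using Leadtools.Forms.DocumentWriters;  // assumed namespace for DocumentWriter

static class WriterSetup
{
    // Hypothetical helper: attaches a DocumentWriter configured for PDF/A output.
    public static void UsePdfAWriter(DocumentConverter converter)
    {
        var writer = new DocumentWriter();

        // Read the current PDF options, adjust them, then write them back.
        // PdfDocumentOptions and PdfDocumentType.PdfA are assumed names.
        var pdfOptions = (PdfDocumentOptions)writer.GetOptions(DocumentFormat.Pdf);
        pdfOptions.DocumentType = PdfDocumentType.PdfA;
        writer.SetOptions(DocumentFormat.Pdf, pdfOptions);

        // The writer must be set before any conversion job runs.
        converter.SetDocumentWriterInstance(writer);
    }
}
```

Raster output needs no writer configuration of this kind, since the pages are saved directly through RasterCodecs.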

Conversion Options

Document conversion is designed to run unattended. However, DocumentConverter provides many options to monitor and modify the operation and to customize the output document as needed. These include:

Built-in multi-threading support
Diagnostics and logging through standard .NET tracing
Extensive events to report job status and progress as well as to allow modification of the data on the fly
Pre-processing to clean up the images loaded from the input documents
Annotations support for both input and output documents
Error recovery and quarantine
Page numbering template

Starting Up: DocumentConverter class

The DocumentConverter class is the main entry point to the framework. Initialize an instance of this class to be used for converting one or more documents and then set these options:

Member	Description
SetOcrEngineInstance	IOcrEngine to use for parsing text and objects when SVG is not available in the input document.
SetDocumentWriterInstance	DocumentWriter to use when creating the output file when document format output is selected.
SetAnnRenderingEngineInstance	Optional rendering engine to use when the annotations are overlaid on top of images.
LoadDocumentOptions	Options to use when loading the input document.
Preprocessor	The pre-processing options to use for cleaning up the images of the input document.
Options	Additional options to use during the conversion, such as error recovery mode and page numbering template.
Diagnostics	Options for logging such as enabling standard .NET tracing.
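
The members in the table above might be wired together as in the following sketch. The namespaces and the specific property names under Preprocessor, Options and Diagnostics (Deskew, JobErrorMode, EnableTrace), as well as the second argument of SetOcrEngineInstance, are assumptions not stated in this topic; the helper name Create is hypothetical.

```csharp
using Leadtools.Documents.Converters;   // assumed namespace (v19-style naming)
using Leadtools.Forms.DocumentWriters;  // assumed namespace for DocumentWriter
using Leadtools.Forms.Ocr;              // assumed namespace for IOcrEngine

static class ConverterSetup
{
    // Hypothetical helper: returns a converter ready to create jobs.
    // The caller supplies an OCR engine that has already been started.
    public static DocumentConverter Create(IOcrEngine ocrEngine)
    {
        var converter = new DocumentConverter();

        // OCR is used when SVG is not available for the input document.
        // Second argument (assumed): false means the caller keeps ownership of the engine.
        converter.SetOcrEngineInstance(ocrEngine, false);

        // Required for document-format output such as PDF or DOCX.
        converter.SetDocumentWriterInstance(new DocumentWriter());

        // Assumed property names for the option groups listed in the table.
        converter.Preprocessor.Deskew = true;                                    // clean up scanned pages
        converter.Options.JobErrorMode = DocumentConverterJobErrorMode.Continue; // keep going on page errors
        converter.Diagnostics.EnableTrace = true;                                // log through .NET tracing

        return converter;
    }
}
```

The converter returned here can be reused for any number of jobs and should be disposed when the application no longer needs it.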

Creating Jobs

Once the DocumentConverter class is initialized, use the DocumentConverterJobs class (accessed through the DocumentConverter.Jobs property) to create new conversion jobs.

The parameters for a job are set in a DocumentConverterJobData structure. This contains the following members:

Member	Description
Document	Document object to be used as the input of the conversion. Either this or InputDocumentFileName is used.
InputDocumentFileName	Path to the input file for the conversion. Either this or Document is used.
InputAnnotationsFileName	Path to the file containing the annotations to be added to the output document. Optional.
InputDocumentFirstPageNumber	The number of the first page to be converted from the input document. Optional.
InputDocumentLastPageNumber	The number of the last page to be converted from the input document. Optional.
DocumentFormat	The output format when document conversion is used.
RasterImageFormat	The output format when raster conversion is used.
RasterImageBitsPerPixel	The bits per pixel of the output file when raster conversion is used.
OutputDocumentFileName	Name of the output file to be generated by this conversion.
OutputAnnotationsFileName	Name of the file that will contain the annotations parsed from the input document. Optional.
AnnotationsMode	Customizes how the annotations are saved in the output document.
JobName	Optional name of this job. Useful when tracing is enabled.
UserData	Optional user-defined object that can be used alongside the job events to pass application-specific data.

The DocumentConverterJobs.CreateJobData overloaded methods can also be used to quickly create jobs from common input and output options.

When all the options are set, the DocumentConverterJobs.CreateJob method is used to create an instance of the DocumentConverterJob class that holds the job options as well as its status. This object is then passed to DocumentConverterJobs.RunJob or DocumentConverterJobs.RunJobAsync to run the operation.
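
The two ways of filling in job data can be sketched as follows: one job uses a CreateJobData overload for document output, and the other sets the structure members directly for raster output. The namespaces, the exact CreateJobData overload, and the use of DocumentFormat.User to select raster conversion are assumptions; the helper names are hypothetical.

```csharp
using Leadtools;                        // assumed namespace for RasterImageFormat
using Leadtools.Documents.Converters;   // assumed namespace (v19-style naming)
using Leadtools.Forms.DocumentWriters;  // assumed namespace for DocumentFormat

static class JobSetup
{
    // Hypothetical helper: document-format output (searchable PDF).
    public static DocumentConverterJob CreatePdfJob(
        DocumentConverter converter, string inputFile, string outputFile)
    {
        // Assumed overload: input path, output path, document format.
        var jobData = DocumentConverterJobs.CreateJobData(inputFile, outputFile, DocumentFormat.Pdf);
        jobData.JobName = "to-pdf"; // shows up in trace output
        return converter.Jobs.CreateJob(jobData);
    }

    // Hypothetical helper: raster output (TIFF), setting the members directly.
    public static DocumentConverterJob CreateTiffJob(
        DocumentConverter converter, string inputFile, string outputFile)
    {
        var jobData = new DocumentConverterJobData
        {
            InputDocumentFileName = inputFile,
            OutputDocumentFileName = outputFile,
            DocumentFormat = DocumentFormat.User,      // assumption: "User" selects raster conversion
            RasterImageFormat = RasterImageFormat.Tif,
            RasterImageBitsPerPixel = 24,
            JobName = "to-tiff"
        };
        return converter.Jobs.CreateJob(jobData);
    }
}
```

Either job can then be passed to RunJob or RunJobAsync.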

Running Jobs

DocumentConverterJobs.RunJob or DocumentConverterJobs.RunJobAsync is used to run the job from the data created in the previous section. While the job is running, the DocumentConverterJobs.JobStarted (once), DocumentConverterJobs.JobOperation (multiple times) and DocumentConverterJobs.JobCompleted (once) events will fire to indicate the job progress.

The event data is of type DocumentConverterJobEventArgs and contains all the necessary information on the current job and its status:

Member	Description
Job	The actual job object that was passed to RunJob or RunJobAsync.
Status	The current status of the job and whether it is still running or has been aborted. The user can abort a running job by modifying this property.
Operation	Current operation being performed by the converter.
IsPostOperation	Whether this event is being fired before or after Operation.
InputDocumentPageNumber	Current page number in the input document.
OutputDocumentPageNumber	Current page number in the output document.
Document	The Document object being used by this conversion.
DocumentWriter	The DocumentWriter object being used by this operation if document conversion is used.
OcrDocument	The OCR document object being used if this operation is using OCR conversion.
OcrPage	The OCR page object being used if this operation is using OCR conversion.
SvgDocument	The SVG document being used if this operation is using SVG conversion.
OcrPageImage	The raster image object for the current page if this operation is using OCR conversion.
RasterImage	The raster image being used if this operation is using raster conversion.
AnnContainer	Annotation container being used if annotation conversion is used.
AnnotationsMode	Current annotations conversion mode.

For more information on these members and how they can be used or modified, refer to DocumentConverterJobOperation.

The InputDocumentPageNumber property can be used to show a progress bar indicator of the current conversion operation.
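
A rough sketch of event-based monitoring, including progress reporting and user-requested abort, might look like the following. The namespace, the Document.Pages.Count property used for the page total, the DocumentConverterJobStatus.Aborted value and the meaning of IsPostOperation (true after the operation) are assumptions; the Attach helper and its cancelRequested callback are hypothetical.

```csharp
using System;
using Leadtools.Documents.Converters;   // assumed namespace (v19-style naming)

static class JobMonitor
{
    // Hypothetical helper: subscribes console logging to the three job events.
    // cancelRequested is an application-supplied callback (for example, a Cancel button flag).
    public static void Attach(DocumentConverter converter, Func<bool> cancelRequested)
    {
        converter.Jobs.JobStarted += (sender, e) =>
            Console.WriteLine("Started: {0}", e.Job.JobData.JobName);

        converter.Jobs.JobOperation += (sender, e) =>
        {
            // Modifying Status is how a running job is aborted.
            if (cancelRequested())
            {
                e.Status = DocumentConverterJobStatus.Aborted; // assumed enum member
                return;
            }

            // Report progress once per finished operation that has page information.
            if (e.IsPostOperation && e.Document != null && e.Document.Pages.Count > 0)
            {
                int percent = Math.Min(100, 100 * e.InputDocumentPageNumber / e.Document.Pages.Count);
                Console.WriteLine("{0}: page {1} ({2}%)", e.Operation, e.InputDocumentPageNumber, percent);
            }
        };

        converter.Jobs.JobCompleted += (sender, e) =>
            Console.WriteLine("Completed: {0} ({1})", e.Job.JobData.JobName, e.Status);
    }
}
```

When jobs are run with RunJobAsync, these handlers may be invoked on worker threads, so a UI application would need to marshal progress updates back to the UI thread.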

Completing Jobs

The job is completed when the RunJob method returns. If RunJobAsync was used, then the JobCompleted event should be used to determine when the job is completed. In both cases, the DocumentConverterJob object passed will contain information on the status of the operation as follows:

Member	Description
Status	The job status. This can be success, success with errors, or aborted.
Errors	A list of any errors that might have occurred during the conversion.
JobData	The original options used to create this job.
DocumentConverter	The document converter object used to run the job.
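
Inspecting a finished job could be sketched as below. The namespace and the members of each error entry (Operation, InputDocumentPageNumber, Error) are assumptions not listed in this topic, and the Report helper is hypothetical.

```csharp
using System;
using Leadtools.Documents.Converters;   // assumed namespace (v19-style naming)

static class JobReport
{
    // Hypothetical helper: prints the outcome of a job and returns true on clean success.
    public static bool Report(DocumentConverterJob job)
    {
        if (job.Status == DocumentConverterJobStatus.Success)
        {
            Console.WriteLine("{0}: success", job.JobData.JobName);
            return true;
        }

        // Any other status: list whatever errors were collected during the run.
        Console.WriteLine("{0}: {1}, {2} error(s)", job.JobData.JobName, job.Status, job.Errors.Count);
        foreach (var error in job.Errors)
        {
            // Assumed error members: the failed operation, the page, and the exception.
            Console.WriteLine("  {0} at page {1}: {2}",
                error.Operation, error.InputDocumentPageNumber, error.Error.Message);
        }
        return false;
    }
}
```

A job that finished with errors may still have produced a usable output file, depending on the error recovery mode set in the converter options.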

Multi-Threading

DocumentConverter is thread-safe. The RunJobAsync method can be used to run multiple jobs at the same time, each in a separate thread. Internally, the converter uses the .NET Thread Pool exclusively for creating and managing threads.

RunJobAsync will perform a sanity check on the options, then start the job and return control to the user immediately. The JobStarted, JobOperation and JobCompleted events can be used to monitor the job status and to be notified when a job is completed. AbortAllJobs can be used at any time to abort all running jobs and cancel any pending ones.
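
One way to run several jobs concurrently and wait for all of them, with a timeout that falls back to AbortAllJobs, is sketched below. The namespace, the event delegate type and the location of AbortAllJobs on the Jobs object are assumptions; the RunAll helper is hypothetical, and the sketch assumes JobCompleted fires exactly once per started job.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using Leadtools.Documents.Converters;   // assumed namespace (v19-style naming)

static class ParallelRunner
{
    // Hypothetical helper: returns true if every job completed within the timeout.
    public static bool RunAll(
        DocumentConverter converter, IList<DocumentConverterJobData> jobs, TimeSpan timeout)
    {
        if (jobs.Count == 0)
            return true;

        int remaining = jobs.Count;
        // Intentionally not disposed: a late completion may still signal it after a timeout.
        var allDone = new ManualResetEventSlim(false);

        // JobCompleted may fire on thread pool threads, so count down atomically.
        // Assumed delegate type for the event.
        EventHandler<DocumentConverterJobEventArgs> onCompleted = (sender, e) =>
        {
            if (Interlocked.Decrement(ref remaining) == 0)
                allDone.Set();
        };

        converter.Jobs.JobCompleted += onCompleted;
        try
        {
            // RunJobAsync validates the options and returns immediately.
            foreach (var jobData in jobs)
                converter.Jobs.RunJobAsync(converter.Jobs.CreateJob(jobData));

            if (allDone.Wait(timeout))
                return true;

            // Timed out: abort running jobs and cancel pending ones.
            converter.Jobs.AbortAllJobs(); // assumed to live on the Jobs object
            return false;
        }
        finally
        {
            converter.Jobs.JobCompleted -= onCompleted;
        }
    }
}
```

The converter must stay alive until all jobs have finished, so it should not be disposed while asynchronous jobs are still running.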
