The Document Converter allows conversion from any type of document to another with a minimal amount of code.
Both input and output document types can be any file format supported by LEADTOOLS, which includes but is not limited to:
- Adobe Acrobat PDF and PDF/A
- Microsoft Office DOCX/DOC, XLSX/XLS, and PPTX/PPT
- CAD formats such as DXF, DWG, and DWF
- TIFF, JPEG, PNG, EXIF, BMP, and one hundred more raster image formats
- AFP, MODCA and PTOCA
The DocumentConverter class will analyze the input and output document types and then automatically use a combination of the LEADTOOLS Raster, SVG, and OCR engines to convert the data using the best possible combination of accuracy and speed. Each conversion operation is called a Document Converter Job in the framework.
The DocumentConverter uses the LEADTOOLS Document Library to obtain information about the input file. The LEADDocument class encapsulates the file format details and exposes a uniform set of functionality for reading the pages and parsing the data needed for the conversion job. This includes loading page data as RasterImage or SvgDocument objects, reading the table of contents and internal page links, and reading any annotation objects embedded in the file or stored in an associated file.
There are two types of output file formats:
Document File Formats. The generated output file will have all the text, images, shapes, and any other objects found in the input document converted as-is. For example, searchable PDF and Microsoft Word DOCX files are created with these characteristics.
In this mode, the converter engine will use SvgDocument or IOcrEngine to parse the text and objects from the input document regardless of its type. For example, SVG is used if the input file is also a document format (such as PDF) and OCR is used if the input file is a raster format (such as TIFF).
The main object used for creating these types of documents is DocumentWriter. Before running any conversion operations, you must set a new DocumentWriter instance in the converter using SetDocumentWriterInstance. You can get this value at any time using DocumentWriterInstance. Use this value to set up the extra document format options needed using DocumentWriter.SetOptions.
Raster File Formats. The generated output file will only have the raster image representation of the text, images, shapes, and any other objects found in the input document. Examples of these documents are TIFF, JPEG, PNG, or raster PDF files.
In this mode, the converter engine will use neither SVG nor OCR, and instead will rely on the extensive file format support provided by LEADTOOLS to load the pages of the input document as raster images and save them directly into the output file.
The main object used for creating these types of documents is RasterCodecs.
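The choice of engine described above can be summarized as a small decision function. The following is a minimal Python sketch, not LEADTOOLS code: the `Engine` enum and `choose_engine` function are hypothetical names introduced for illustration, and the rule is simplified to "raster input means OCR", whereas the real converter falls back to OCR whenever SVG is not available for the input.

```python
from enum import Enum


class Engine(Enum):
    """Hypothetical labels for the three conversion paths."""
    SVG = "svg"
    OCR = "ocr"
    RASTER = "raster"


def choose_engine(input_is_raster: bool, output_is_document: bool) -> Engine:
    """Pick a conversion path (illustrative simplification only)."""
    if not output_is_document:
        # Raster output: pages are loaded as raster images and saved directly.
        return Engine.RASTER
    # Document output: OCR for raster input, SVG for document input.
    return Engine.OCR if input_is_raster else Engine.SVG
```

Under these assumptions, a PDF converted to DOCX takes the SVG path, a TIFF converted to DOCX takes the OCR path, and any input converted to TIFF takes the raster path.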
The document conversion is designed to run unattended. However, the DocumentConverter provides many options to monitor and modify the operation and to customize the output document as needed. This includes the following options:
- Built-in multi-threading support
- Diagnostics and logging through standard .NET tracing
- Extensive events to report job status and progress as well as to allow modification of the data on the fly
- Pre-processing to clean up the images loaded from the input documents
- Annotations support for both input and output documents
- Error recovery and quarantine
- Page numbering template
The DocumentConverter class is the main entry point to the framework. Initialize an instance of this class to be used for converting one or more documents and then set the following options:
Member | Description |
---|---|
SetOcrEngineInstance | IOcrEngine to use for parsing text and objects when SVG is not available in the input document. |
SetDocumentWriterInstance | DocumentWriter to use when creating the output file when document format output is selected. |
SetAnnRenderingEngineInstance | Optional rendering engine to use when the annotations are overlaid on top of images. |
LoadDocumentOptions | Options to use when loading the input document. |
Preprocessor | The pre-processing options to use for cleaning up the images of the input document. |
Options | (Optional) Extra options to use during the conversion such as error recovery mode and page number template. |
Diagnostics | Options for logging such as enabling standard .NET tracing. |
Once the DocumentConverter class is initialized, use the DocumentConverterJobs class (accessed through the DocumentConverter.Jobs property) to create new conversion jobs.
The parameters for a job are set in a DocumentConverterJobData structure. This contains the following members:
Member | Description |
---|---|
Document | The LEADDocument object to be used as the input of the conversion. Either this or InputDocumentFileName is used. |
InputDocumentFileName | The path to the input file for the conversion. Either this or Document is used. |
InputAnnotationsFileName | (Optional) The path to the annotations file to be added to the output document. |
InputDocumentFirstPageNumber | (Optional) The number of the first page to be converted from the input document. |
InputDocumentLastPageNumber | (Optional) The number of the last page to be converted from the input document. |
DocumentFormat | The output format when document conversion is used. |
RasterImageFormat | The output format when raster conversion is used. |
RasterImageBitsPerPixel | The bits per pixel of the output file when raster conversion is used. |
OutputDocumentFileName | The name of the output file to be generated by this conversion. |
OutputAnnotationsFileName | (Optional) The name of the file that will contain the annotations parsed from the input document. |
AnnotationsMode | Customizes how the annotations are saved in the output document. |
JobName | (Optional) The name of this job. Useful when tracing is enabled. |
UserData | (Optional) The user-defined object that can be used alongside the job events to pass application-specified data. |
The DocumentConverterJobs.CreateJobData overloaded methods can also be used to quickly create jobs from common input and output options.
When all the options are set, the DocumentConverterJobs.CreateJob method is used to create an instance of the DocumentConverterJob class that holds the job options as well as its status. This object will then be passed to DocumentConverterJobs.RunJob or DocumentConverterJobs.RunJobAsync to run the operation.
The DocumentConverterJobs.RunJob and DocumentConverterJobs.RunJobAsync methods run the job created from the data in the previous section. While the job is running, the DocumentConverterJobs.JobStarted event fires once, the DocumentConverterJobs.JobOperation event fires more than once, and the DocumentConverterJobs.JobCompleted event fires once to indicate the job progress.
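To make the event ordering concrete, here is a toy Python model of a job run. It is only a sketch under stated assumptions: `JobEventArgs` and `run_job` are hypothetical stand-ins, not the LEADTOOLS types, and the model fires one operation event per page, which is a simplification of the real converter's operations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class JobEventArgs:
    """Hypothetical, pared-down stand-in for the job event data."""
    event: str                  # "JobStarted", "JobOperation", or "JobCompleted"
    page_number: Optional[int]  # models the current input page, if any
    status: str = "Success"     # a handler may set this to "Aborted"


Handler = Callable[[JobEventArgs], None]


def run_job(page_count: int, handler: Handler) -> str:
    """Fire the events in the documented order and return the final status."""
    started = JobEventArgs("JobStarted", None)
    handler(started)
    status = started.status
    page = 1
    while status != "Aborted" and page <= page_count:
        args = JobEventArgs("JobOperation", page, status)
        handler(args)
        status = args.status  # a handler can request an abort here
        page += 1
    # JobCompleted fires exactly once, whether or not the job was aborted.
    handler(JobEventArgs("JobCompleted", None, status))
    return status


# Usage: record the order in which events arrive for a three-page job.
seen: List[str] = []
final_status = run_job(3, lambda args: seen.append(args.event))
```

In this model a handler aborts the job by changing `status` on the event data, mirroring the role the Status member plays in the table that follows.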
The event data, of type DocumentConverterJobEventArgs, contains all the necessary information about the current job and its status:
Member | Description |
---|---|
Job | The actual job object that was passed to RunJob or RunJobAsync. |
Status | The current status of the job and whether it is still running or has been aborted. Use this property to abort any running jobs. |
Operation | The current operation being performed by the converter. |
IsPostOperation | A value that indicates whether this event is being fired before or after Operation. |
InputDocumentPageNumber | The current page number in the input document. |
OutputDocumentPageNumber | The current page number in the output document. |
Document | The LEADDocument object being used by this conversion. |
DocumentWriter | The DocumentWriter object being used by this operation if document conversion is used. |
OcrDocument | The OCR document object being used if this operation is using OCR conversion. |
OcrPage | The OCR page object being used if this operation is using OCR conversion. |
SvgDocument | The SVG document being used if this operation is using SVG conversion. |
OcrPageImage | The raster image object for the current page if this operation is using OCR conversion. |
RasterImage | The raster image being used if this operation is using raster conversion. |
AnnContainer | The annotation container being used if annotation conversion is used. |
AnnotationsMode | The current annotations conversion mode. |
For more information about these members and how they can be used or modified, refer to DocumentConverterJobOperation.
The InputDocumentPageNumber property can be used to show a progress bar indicator of the current conversion operation.
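As a rough illustration of that progress calculation, the Python sketch below turns a current page number into a percentage of the requested page range. The function name `progress_percent` is hypothetical, and the sketch assumes the caller has already resolved the first and last page of the range to actual 1-based page numbers.

```python
def progress_percent(current_page: int, first_page: int, last_page: int) -> int:
    """Percentage of the requested page range reached so far (1-based pages)."""
    if last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    total = last_page - first_page + 1
    # Clamp so that pages outside the range never yield less than 0 or more than 100.
    done = min(max(current_page - first_page + 1, 0), total)
    return done * 100 // total
```

A handler for the operation event could pass the reported input page number to this function and feed the result to a progress bar.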
The job is completed when the RunJob method returns. If RunJobAsync was used, then the JobCompleted event should be used to determine when the job is completed. In both cases, the DocumentConverterJob object passed will contain information about the status of this operation as follows:
Member | Description |
---|---|
Status | The job status. This can be Success, SuccessWithErrors or Aborted. |
Errors | A list of any errors that might have occurred during the conversion. |
JobData | The original options used to create this job. |
DocumentConverter | The document converter object used to run the job. |
The DocumentConverter is thread-safe. The RunJobAsync method can be used to run multiple jobs at the same time, each in a separate thread. Internally, the converter uses the .NET Thread Pool exclusively for creating and managing threads.
The RunJobAsync method performs a sanity check on the options, then starts the job and returns control to the caller immediately. The JobOperation and JobCompleted events can be used to monitor the job's status and to send notifications when a job is completed. AbortAllJobs can be called at any time to abort all running jobs and cancel any pending ones.
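The asynchronous pattern described here (submit, return immediately, get notified on completion) can be modeled with a thread pool. The sketch below uses Python's standard `concurrent.futures` as an analogy for the .NET Thread Pool; `ToyAsyncRunner` and `convert` are hypothetical names, and the conversion itself is a placeholder that always succeeds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def convert(name: str) -> str:
    """Placeholder for the real conversion work; returns a status string."""
    return f"{name}: Success"


class ToyAsyncRunner:
    """Toy model of an asynchronous job runner backed by a thread pool."""

    def __init__(self, on_completed: Callable[[str], None], max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._on_completed = on_completed

    def run_job_async(self, name: str) -> None:
        """Queue the job and return immediately; the callback plays the role of a completion event."""
        future = self._pool.submit(convert, name)
        future.add_done_callback(lambda f: self._on_completed(f.result()))

    def close(self) -> None:
        """Block until every queued job and its completion callback has finished."""
        self._pool.shutdown(wait=True)


# Usage: callbacks may arrive on worker threads, so guard shared state with a lock.
completed: List[str] = []
lock = threading.Lock()


def record(message: str) -> None:
    with lock:
        completed.append(message)


runner = ToyAsyncRunner(record)
for file_name in ["a.pdf", "b.tif", "c.docx"]:
    runner.run_job_async(file_name)
runner.close()
```

Completion order is not deterministic in this model, so code that consumes the results should not depend on the order in which jobs finish.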
DocumentConverter also supports document conversion with status updates. Refer to Status Document Job Converter for more information.