Loading Using LEADTOOLS Document Library

The LEADTOOLS Document library supports loading by creating a LEADDocument object. The data used can reside in a disk file, a remote URL, or data that was previously uploaded to the cache system.

Loading from Disk Files

To load a LEADDocument object from a disk file, create an instance of LoadDocumentOptions, and then pass it along with the file name to DocumentFactory.LoadFromFile:

var loadDocumentOptions = new LoadDocumentOptions();
// Initialize loadDocumentOptions as needed
LEADDocument document = DocumentFactory.LoadFromFile(fileName, loadDocumentOptions);

The following steps explain how this method works:

  1. The LoadDocumentOptions.UseCache is checked. If the value is true, then the application must have set a valid cache object in either LoadDocumentOptions.Cache or DocumentFactory.Cache. Otherwise, an exception is thrown. If the value is false, then the new document will not use caching.

  2. Caching is optional in this mode. It can be used to speed up obtaining document image data or text if the pages are revisited by the application, or to save the document to the cache before it is disposed of. Otherwise, all the data will be parsed from the original file as needed.

  3. If the value of LoadDocumentOptions.AnnotationsUri is not null, then it must contain the URL to a disk file as well. You can create a new Uri object from the physical path to the annotation file on disk and set it in this property. This will create a Uri object using the file:/// naming scheme. Any other scheme (such as http) will fail when using LoadFromFile.

  4. The factory will obtain information about the file format in fileName using RasterCodecs.GetInformation. If this fails (if it is an invalid file format or the required LEADTOOLS file format assembly is not found), then an exception is thrown.

  5. A LEADDocument object is created and the following members are initialized:

    Member Value
    DocumentId A unique identifier created for this document that can be used if the document is saved to the cache.
    Uri new Uri (fileName).
    IsReadOnly true.
    CacheUri null since the document has direct access to the physical file.
    Stream null.
    HasStream false.
    IsDownloaded false since the document was not downloaded.
    GetDocumentFileName Will return the same fileName passed to LoadFromFile.
    GetDocumentStream null.
    GetAnnotationsFileName Will return the same file name passed to LoadDocumentOptions.AnnotationsUri.
    GetAnnotationsStream null.
    HasAnnotationsStream false.
    DocumentType The document type.
    MimeType The MIME type of the document file format set during load.
    HasCache The same value as LoadDocumentOptions.UseCache. If the value is true, then GetDocumentFileName or GetDocumentStream can be used to obtain the original data. Otherwise, it will return the path to the temporary file.
    LastCacheSyncTime Random old date since the document has not yet been saved to the cache.
    CacheStatus DocumentCacheStatus.NotSynced since the document has not yet been saved to the cache.
    AutoDeleteFromCache true. Can be changed to false if the application will re-load this document from the cache at a later time using DocumentFactory.LoadFromCache.
    AutoSaveToCache false.
    InternalObject The internal LEADTOOLS object being used to parse the document data.
    UserData null.
    IsEncrypted false unless the document is encrypted. If it is encrypted, most of the document properties cannot be used before the document is decrypted. Refer to Loading Encrypted Files Using the Document Library for more information.
    IsDecrypted false.
    IsStructureSupported true or false based on the MIME type of the document.
    Metadata Ready to be used.
    Structure Ready to be used.
    Images Ready to be used.
    Text Ready to be used.
    Pages Ready to be used.
    Documents Empty collection since this is not a virtual document.
    HasDocuments false.
    AutoDisposeDocuments false.
    Annotations Ready to be used.
  6. LoadFromFile returns with this LEADDocument object, ready to be used.

  7. LEADDocument parses data from the original file on disk on demand. Therefore, the file named by fileName in the call to LoadFromFile must not be deleted while the LEADDocument is alive. Otherwise, errors will occur when accessing the document data.

For an example, refer to DocumentFactory.LoadFromFile.
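
The steps above can be sketched as follows. This is a minimal illustration, not the reference example: it assumes the LEADTOOLS license has already been set, that FileCache from Leadtools.Caching is the cache implementation in use, and that cacheDirectory, fileName, and annotationsPath are hypothetical values supplied by the caller.

```csharp
using System;
using System.IO;
using Leadtools.Caching;
using Leadtools.Document;

static class LoadFromFileSketch
{
    // cacheDirectory, fileName and annotationsPath are hypothetical inputs.
    public static LEADDocument Load(string cacheDirectory, string fileName, string annotationsPath)
    {
        // Assumption: a file-based cache; any valid cache object could be used instead.
        var cache = new FileCache { CacheDirectory = cacheDirectory };

        var options = new LoadDocumentOptions
        {
            // Step 1: with UseCache set to true, a valid cache must be available
            // here or in DocumentFactory.Cache, otherwise an exception is thrown.
            UseCache = true,
            Cache = cache,
            // Step 3: a Uri built from a full physical path uses the file:/// scheme.
            AnnotationsUri = annotationsPath != null
                ? new Uri(Path.GetFullPath(annotationsPath))
                : null
        };

        // Step 4 onward: throws if the file format cannot be identified.
        return DocumentFactory.LoadFromFile(fileName, options);
    }
}
```

The caller owns the returned LEADDocument and should dispose of it when finished. The source file must remain on disk for as long as the document is in use.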

Loading from a Remote URL

To create a LEADDocument object from a remote URL, create an instance of LoadDocumentOptions, and then pass it along with a URL object pointing to the remote location of the document file to DocumentFactory.LoadFromUri:

var loadDocumentOptions = new LoadDocumentOptions(); 
// Initialize loadDocumentOptions as needed 
LEADDocument document = DocumentFactory.LoadFromUri(uri, loadDocumentOptions);

The following steps explain how this method works:

  1. LoadDocumentOptions.UseCache is checked. If the value is true, then the application must have set a valid cache object in either LoadDocumentOptions.Cache or DocumentFactory.Cache. Otherwise, an exception is thrown. If the value is false, then the new document will not use caching.

  2. The cache is also optional in this mode. In addition to speeding up access to document image data or text from pages that were previously visited, the cache can be used to store the document file downloaded from uri, as explained below.

  3. If the uri passed to LoadFromUri has the special LEAD cache scheme (detected using IsUploadDocumentUri), then the factory assumes this is the URI to a document previously uploaded to the cache using DocumentFactory.BeginUpload, and the steps below are not performed and no data is downloaded. The data is already in the cache and the factory skips to step 8 below.

  4. If the value of LoadDocumentOptions.AnnotationsUri is not null, then it will be treated as a remote URL and the data is downloaded by the factory in the same manner used for the document file as explained below.

  5. The factory will check the value of LoadDocumentOptions.UseCache:

    • If the value is true, then the document data is downloaded from uri into the cache system.

    • If the value is false, then the document data is downloaded from uri to a temporary file name created on the machine.

  6. Similarly, if LoadDocumentOptions.AnnotationsUri is not null, it will be downloaded either to the cache system or to a temporary file based on cache usage.

  7. When downloading the data, the factory will use the WebClient object in LoadDocumentOptions if not null. Otherwise, it will create a new instance and dispose of it after it has been used. This allows the application to pass a custom WebClient with specific proxy or credential settings or to monitor the download progress.

  8. The factory will obtain information about the file format using RasterCodecs.GetInformation on the downloaded or temporary file or cache data. If this fails (if it is an invalid file format or the required LEADTOOLS file format assembly is not found), then the cache or downloaded data is deleted and an exception is thrown.

  9. A LEADDocument object is created and the following members are initialized:

    Member Value
    DocumentId A unique identifier created for this document that can be used if the document is saved to the cache.
    Uri Same uri passed to LoadFromUri.
    IsReadOnly true.
    CacheUri If the document was downloaded to the cache and if the cache system has virtual directory capabilities, then this property will contain a URI to the original document data (PDF, TIFF, DOCX, etc.). Otherwise, it is null.
    Stream null.
    HasStream false.
    IsDownloaded true since the document was downloaded.
    GetDocumentFileName Will return the path to the cache item or temporary file containing the downloaded data of the original document. If the cache does not have direct access to the file system then this will be null.
    GetDocumentStream If the cache does not have direct access to the file system then this will return a stream containing the original document. Otherwise, null.
    GetAnnotationsFileName Will return the path to the cache item or temporary file containing the downloaded data of the annotations. If the cache does not have direct access to the file system then this will be null.
    GetAnnotationsStream If the cache does not have direct access to the file system then this will return a stream containing the annotations. Otherwise, null.
    HasAnnotationsStream true or false depending on the above.
    DocumentType The document type.
    MimeType The MIME type of the document file format set during load.
    HasCache The same value as LoadDocumentOptions.UseCache. If the value is true, then GetDocumentFileName or GetDocumentStream can be used to obtain the original data. Otherwise, it will return the path to the temporary file.
    LastCacheSyncTime Random old date since the document has not yet been saved to the cache.
    CacheStatus DocumentCacheStatus.NotSynced since the document has not yet been saved to the cache.
    AutoDeleteFromCache true. Can be changed to false if the application will re-load this document from the cache at a later time using DocumentFactory.LoadFromCache.
    AutoSaveToCache false.
    InternalObject The internal LEADTOOLS object being used to parse the document data.
    UserData null.
    IsEncrypted false unless the document is encrypted. If it is encrypted, most of the document properties cannot be used before the document is decrypted. Refer to Loading Encrypted Files Using the Document Library for more information.
    IsDecrypted false.
    IsStructureSupported true or false based on the MIME type of the document.
    Metadata Ready to be used.
    Structure Ready to be used.
    Images Ready to be used.
    Text Ready to be used.
    Pages Ready to be used.
    Documents Empty collection since this is not a virtual document.
    HasDocuments false.
    AutoDisposeDocuments false.
    Annotations Ready to be used.
  10. LoadFromUri returns with this LEADDocument object ready to be used.

  11. The document will parse data from the downloaded data, therefore the original URL passed to LoadFromUri is never used again and the data it points to can be deleted right away.

  12. When the document is disposed of, the temporary files and cache items will be deleted unless they are saved to the cache first.

For an example, refer to DocumentFactory.LoadFromUri.

Loading From a Remote URL Asynchronously

LoadFromUri does not return control to the application until the document is downloaded and parsed. To create a LEADDocument object from a remote URL asynchronously, create an instance of LoadDocumentAsyncOptions. Pass LoadDocumentAsyncOptions, along with a URL object pointing to the remote location of the document file to DocumentFactory.LoadFromUriAsync:

var loadDocumentAsyncOptions = new LoadDocumentAsyncOptions(); 
// Initialize loadDocumentAsyncOptions as needed. Handling the Completed event is required:
loadDocumentAsyncOptions.Completed += (sender, e) => {
   // Completed: check e.Error first; if it is null, use e.Document
};
DocumentFactory.LoadFromUriAsync(uri, loadDocumentAsyncOptions); 

The following steps explain how this method works:

  1. LoadDocumentOptions.UseCache is checked. If the value is true, then the application must have set a valid cache object in LoadDocumentOptions.Cache or DocumentFactory.Cache. Otherwise, an exception is thrown. If the value is false, then the new document will not use caching.

  2. The cache is also optional in this mode. Caching speeds up obtaining document image data or text from pages that were previously visited, and can also be used to store the document file downloaded from uri, as explained below.

  3. If the uri value passed to LoadFromUriAsync is using the LEAD caching scheme, then the factory assumes this is the URI to a document previously uploaded to the cache using DocumentFactory.BeginUpload. The steps below are not performed and no data is downloaded. The data is already in the cache and the factory skips to step 10 below.

  4. A thread is created to handle loading the document, control is returned to the application, and the rest of these steps are performed in the thread procedure.

  5. If the value of LoadDocumentOptions.AnnotationsUri is not null, then it will be treated as a remote URL and the data is downloaded by the factory in the same manner used for the document file as explained below.

  6. The factory will check the value of LoadDocumentOptions.UseCache:

    • If the value is true, then the document data is downloaded from uri into the cache system.

    • If the value is false, then the document data is downloaded from uri to a temporary file created on the machine.

  7. Similarly, if LoadDocumentOptions.AnnotationsUri is not null, it will be downloaded either to the cache system or to a temporary file based on cache usage.

  8. When downloading the data, the factory will use the WebClient object in LoadDocumentOptions if not null. Otherwise, it will create a new instance and dispose of it after it has been used. This allows the application to pass a custom WebClient with specific proxy or credential settings.

  9. The WebClient.DownloadProgressChanged event is mapped to LoadDocumentAsyncOptions.Progress if the value is not null to allow the user to monitor the progress of the download.

  10. When WebClient.DownloadFileCompleted occurs, the factory will obtain information about the file format using RasterCodecs.GetInformation on the downloaded or temporary file. If this fails (if it is an invalid file format or the required LEADTOOLS file format assembly is not found), then the cache or downloaded data is deleted and LoadDocumentAsyncOptions.Completed is fired with the error object in LoadAsyncCompletedEventArgs.Error.

  11. Otherwise, a LEADDocument object is created and the following members are initialized:

    Member Value
    DocumentId A unique identifier created for this document that can be used if the document is saved to the cache.
    Uri Same uri passed to LoadFromUriAsync.
    Stream null.
    HasStream false.
    IsDownloaded true since the document was downloaded.
    GetDocumentFileName Will return the path to the cache item or temporary file containing the downloaded data of the original document. If the cache does not have direct access to the file system then this value will be null.
    GetDocumentStream If the cache does not have direct access to the file system then this will return a stream containing the original document. Otherwise, null.
    GetAnnotationsFileName Will return the path to the cache item or temporary file containing the downloaded data of the annotations. If the cache does not have direct access to the file system then this will be null.
    GetAnnotationsStream If the cache does not have direct access to the file system then this will return a stream containing the annotations. Otherwise, null.
    HasAnnotationsStream true or false depending on the above.
    DocumentType The document type.
    MimeType The MIME type of the document file format set during load.
    HasCache The same value as LoadDocumentOptions.UseCache. If the value is true, then GetDocumentFileName or GetDocumentStream can be used to obtain the original data. Otherwise, it will return the path to the temporary file.
    LastCacheSyncTime Random old date since the document has not yet been saved to the cache.
    CacheStatus DocumentCacheStatus.NotSynced since the document has not yet been saved to the cache.
    AutoDeleteFromCache true. Can be changed to false if the application will re-load this document from the cache at a later time using DocumentFactory.LoadFromCache.
    AutoSaveToCache false.
    InternalObject The internal LEADTOOLS object being used to parse the document data.
    UserData null.
    IsEncrypted false unless the document is encrypted. If it is encrypted, most of the document properties cannot be used before the document is decrypted. Refer to Loading Encrypted Files Using the Document Library for more information.
    IsDecrypted false.
    IsStructureSupported true or false based on the MIME type of the document.
    Metadata Ready to be used.
    Structure Ready to be used.
    Images Ready to be used.
    Text Ready to be used.
    Pages Ready to be used.
    Documents Empty collection since this is not a virtual document.
    HasDocuments false.
    AutoDisposeDocuments false.
    Annotations Ready to be used.
  12. The LoadDocumentAsyncOptions.Completed event is fired with the LEADDocument object in LoadAsyncCompletedEventArgs.Document. This LEADDocument object is now ready to be used.

  13. LEADDocument will parse data from the downloaded data, therefore the original URL passed to LoadFromUriAsync is never used again and the data it points to can be deleted right away.

  14. When LEADDocument is disposed of, the temporary files will be deleted unless it is saved to the cache first.

For an example, refer to DocumentFactory.LoadFromUriAsync.

Loading from a Stream

To create a LEADDocument object from a document stored in a stream, create an instance of LoadDocumentOptions, and then pass it along with the stream object to DocumentFactory.LoadFromStream:

var loadDocumentOptions = new LoadDocumentOptions(); 
// Initialize loadDocumentOptions as needed 
LEADDocument document = DocumentFactory.LoadFromStream(stream, loadDocumentOptions);

The following steps explain how this method works:

  1. The LoadDocumentOptions.UseCache is checked. If the value is true, then the application must have already set a valid cache object in LoadDocumentOptions.Cache or DocumentFactory.Cache. Otherwise, an exception is thrown. If the value is false, then the new document will not use caching.

  2. Caching is optional in this mode. It can be used to speed up obtaining document image data or text if the pages are revisited by the application, or to save the document to the cache before it is disposed of. Otherwise, all the data will be parsed from the original stream as needed.

  3. If the value of LoadDocumentOptions.AnnotationsUri is not null, then it must contain the URL to a disk file. You can create a new Uri object from the physical path to the annotation file on disk and set it in this property. This will create a Uri object using the file:/// naming scheme. Any other scheme (such as http) will fail when using LoadFromStream.

  4. The factory will obtain information on the file format in stream using RasterCodecs.GetInformation. If this fails (if it is an invalid file format or the required LEADTOOLS file format assembly is not found) then an exception is thrown.

  5. A LEADDocument object is created and the following members are initialized:

    Member Value
    DocumentId A unique identifier created for this document that can be used if the document is saved to the cache.
    Uri null.
    Stream The original stream passed to LoadFromStream.
    HasStream true.
    IsDownloaded false since the document was not downloaded.
    GetDocumentFileName null.
    GetDocumentStream null.
    GetAnnotationsFileName Will return the same file name passed to LoadDocumentOptions.AnnotationsUri.
    GetAnnotationsStream null.
    HasAnnotationsStream false.
    DocumentType The document type.
    MimeType The MIME type of the document file format. The value is set during load.
    HasCache The same value as LoadDocumentOptions.UseCache.
    LastCacheSyncTime Random old date since the document has not yet been saved to the cache.
    CacheStatus DocumentCacheStatus.NotSynced since the document has not yet been saved to the cache.
    AutoDeleteFromCache true. Can be changed to false if the application will re-load this document from the cache at a later time using DocumentFactory.LoadFromCache.
    AutoSaveToCache false.
    InternalObject The internal LEADTOOLS object being used to parse the document data.
    UserData null.
    IsEncrypted false unless the document is encrypted. If the document is encrypted, most of the document properties cannot be used before the document is decrypted. Refer to Loading Encrypted Files Using the Document Library for more information.
    IsDecrypted false.
    IsStructureSupported true or false based on the MIME type of the document.
    Metadata Ready to be used.
    Structure Ready to be used.
    Images Ready to be used.
    Text Ready to be used.
    Pages Ready to be used.
    Documents Empty collection since this is not a virtual document.
    HasDocuments false.
    AutoDisposeDocuments false.
    Annotations Ready to be used.
  6. LoadFromStream returns with this LEADDocument object ready to be used.

  7. LEADDocument parses data from the original stream on demand. Therefore, the stream passed to LoadFromStream must be kept alive by the user while the LEADDocument is alive. Otherwise, errors will occur when accessing the document data.

If the document is saved into the cache using SaveToCache, then the entire content of the stream is saved into the cache, and the stream is no longer used and can be safely disposed of by the user. When the document is later re-loaded from the cache using DocumentFactory.LoadFromCache, it is treated as if it was downloaded from an external resource and the stream functionality is not used (the value of Stream will be null).

For an example, refer to DocumentFactory.LoadFromStream.
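
The stream lifetime rule in step 7 can be illustrated with nested using blocks, so that the document is always disposed of before the stream it reads from. This is a sketch only: it assumes the LEADTOOLS license has already been set, and the helper name CountPages and the fileName parameter are hypothetical.

```csharp
using System.IO;
using Leadtools.Document;

static class LoadFromStreamSketch
{
    // Hypothetical helper: returns the page count of the document in fileName,
    // or 0 if the factory returned null (for example, a denied MIME type).
    public static int CountPages(string fileName)
    {
        // The outer using keeps the stream alive for the whole life of the document.
        using (Stream stream = File.OpenRead(fileName))
        {
            var options = new LoadDocumentOptions { UseCache = false };

            // The inner using disposes of the document first, then the stream is closed.
            using (LEADDocument document = DocumentFactory.LoadFromStream(stream, options))
            {
                if (document == null)
                    return 0;

                // Data is parsed from the stream on demand, so it must still be open here.
                return document.Pages.Count;
            }
        }
    }
}
```

Disposing of the stream before the document would reverse this order and lead to the errors described in step 7.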

Aborting Long Loading Operations

Complex document file formats such as DOCX and XLSX can require significantly more time to parse the file structure than simpler document file formats. The amount of time depends on the source file itself. Very complex document files (such as a very large XLSX spreadsheet with thousands or millions of rows) can take many seconds or even minutes. DocumentFactory.LoadFromFile or DocumentFactory.LoadFromUri will not return until all the file data is parsed.

For such documents, using TimeoutMilliseconds allows long-loading operations to be aborted if required. After the allocated timeout has passed, DocumentFactory will abort the load operation, and LoadFromFile or LoadFromUri will return null instead of a valid LEADDocument.
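
A minimal sketch of the timeout check follows. It assumes that TimeoutMilliseconds is set on LoadDocumentOptions (the text above does not name the owning class) and that the LEADTOOLS license has already been set. The helper name TryLoad and its parameters are hypothetical.

```csharp
using System;
using Leadtools.Document;

static class TimeoutSketch
{
    // Hypothetical helper: returns null when no document was produced.
    public static LEADDocument TryLoad(string fileName, int timeoutMilliseconds)
    {
        var options = new LoadDocumentOptions
        {
            // Assumption: the timeout is a property of LoadDocumentOptions.
            TimeoutMilliseconds = timeoutMilliseconds
        };

        LEADDocument document = DocumentFactory.LoadFromFile(fileName, options);

        if (document == null)
        {
            // No exception is thrown for a timed-out load; the result is simply null.
            Console.WriteLine("No document returned for " + fileName);
        }

        return document;
    }
}
```

Because a null result can have more than one cause, callers should treat it as "no document available" rather than assume the timeout was the reason.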

Document File Formats Speed-up using MemoryCache

Certain multipage file formats can be slow to load using the default implementation of DocumentFactory. This is especially true for document file formats such as DOCX/DOC, XLSX/XLS, RTF, PDF, and TXT.

Refer to DocumentMemoryCache for more information on how to speed up loading these types of files, especially in a client-server application.

MIME Type Whitelisting

If MIME type whitelisting is used, it is possible for the DocumentFactory load methods to return null as the resulting document if its MIME type was denied. Refer to DocumentMimeTypes for more information.

Cloning a Document

The following methods allow the user to create a clone (an exact copy) of a document stored in the cache:

Getting Document Information

The following methods can be used to quickly obtain information about a document without loading it. Information obtained includes the document name, MIME type, and number of pages:

Deleting Documents from the Cache

Documents are automatically deleted when they expire, as set up by the cache policies. The following method can be used to manually delete a document from the cache at any time:

Document User Tokens

Each LEADDocument can optionally be associated with a user token to restrict usage. For instance, when a document is first loaded from a URI into the cache using DocumentFactory.LoadFromUri, the value of LoadDocumentOptions.UserToken is checked and if it is not null, will be used as the user token associated with this document. Subsequent calls to DocumentFactory.LoadFromCache will fail if the value of LoadFromCacheOptions.UserToken does not match.

Similarly, a user token can be associated when a document is created from scratch using DocumentFactory.Create through CreateDocumentOptions.UserToken, and when a document is uploaded to the cache using DocumentFactory.BeginUpload through UploadDocumentOptions.UserToken. Attempts to then load these documents with DocumentFactory.LoadFromUri or DocumentFactory.LoadFromCache will fail if the same user token is not passed accordingly. The same behavior also occurs during DocumentFactory.DeleteFromCache, DocumentFactory.DownloadDocument, and DocumentFactory.DownloadAnnotations.

When using DocumentFactory.GetDocumentCacheInfo to obtain information about a document in the cache, the value of DocumentCacheInfo.HasUserToken will indicate if the document in the cache contains a user token and cannot be loaded or deleted if the correct user token is not used.

Refer to DocumentFactory.InvalidUserTokenException for more information on how to control the way a document fails to load when an invalid user token is used.
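
The user token flow described in this section can be sketched as follows. This is an illustration under stated assumptions, not the library's reference example: the LEADTOOLS license is assumed to be set, the cache object is assumed to be a valid ObjectCache from Leadtools.Caching, and the helper name LoadThenReload and its parameters are hypothetical.

```csharp
using System;
using Leadtools.Caching;
using Leadtools.Document;

static class UserTokenSketch
{
    // cache, uri and userToken are hypothetical inputs supplied by the caller.
    public static LEADDocument LoadThenReload(ObjectCache cache, Uri uri, string userToken)
    {
        string documentId;

        var loadOptions = new LoadDocumentOptions
        {
            UseCache = true,
            Cache = cache,
            // A non-null value is associated with the document in the cache.
            UserToken = userToken
        };

        using (LEADDocument document = DocumentFactory.LoadFromUri(uri, loadOptions))
        {
            // Keep the cache entry after this instance is disposed of.
            document.AutoDeleteFromCache = false;
            document.SaveToCache();
            documentId = document.DocumentId;
        }

        var cacheOptions = new LoadFromCacheOptions
        {
            Cache = cache,
            DocumentId = documentId,
            // Must match the token used above, otherwise the load fails.
            UserToken = userToken
        };

        return DocumentFactory.LoadFromCache(cacheOptions);
    }
}
```

The sketch does not handle the case where LoadFromUri returns null, which can happen when MIME type whitelisting denies the document.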

See Also

Document Library Features

Uploading Using the Document Library

Document Library Coordinate System

Loading Encrypted Files Using the Document Library

Parsing Text with the Document Library

Barcode Processing with the Document Library

Document Toolkit History Tracking

Document Page Transformation

Using LEADTOOLS Document Viewer

Using LEADTOOLS Document Converter

Document View and Convert Redaction

Help Version 23.0.2024.12.11
Products | Support | Contact Us | Intellectual Property Notices
© 1991-2024 LEAD Technologies, Inc. All Rights Reserved.
