Document memory cache support.
Certain multipage file formats can be slow to load using the default implementation of DocumentFactory. This is especially true for document file formats such as DOCX/DOC, XLSX/XLS, RTF, PDF, and TXT.
The "Using the OptimizedLoad Functions to Speed Loading Large Files" topic contains a detailed explanation of the optimization techniques LEADTOOLS uses to speed up parsing the images, SVG, and text data of the pages of these types of files.
As explained in the topic, CodecsOptimizedLoadData can contain managed or unmanaged memory, depending on the file format. If a format supports managed data (such as TXT and AFP), the DocumentFactory stores this data in the cache itself and re-uses it whenever a document is loaded from the cache.
If a format can only support unmanaged data (such as DOCX and XLSX), the DocumentFactory is unable to store this type of data in a cache for re-use. The default behavior is for the factory to re-create this data from scratch every time a document is loaded from the cache, which can degrade performance.
Performance is generally not a problem for desktop-based applications using the document library, such as the Windows Document Viewer or Document Converter. Here, the workflow is to load the document once from a file or URL, use the resulting LEADDocument, and only dispose of it when the user has closed the document.
This, however, can become a problem for applications that save the document into the cache and then re-load it (such as the Document Service). Here, the workflow is to load the document initially, generate its unmanaged optimized data, and then save it into the cache and dispose of the object. Subsequent calls to get the images, SVG, and text data of the pages must first load the document from the cache, and thus regenerate the data with every call.
DocumentMemoryCache can be used by the factory to greatly enhance the performance of such documents by keeping the unmanaged data alive in memory, independent of the document. Subsequent calls to load the document from the cache will then use the memory-cached data instead of re-generating it from scratch. The result is quicker loading and parsing of the data of the pages. Naturally, keeping unmanaged data in memory increases the application's resource usage, so careful consideration must be given to how and when this data is stored.
A typical client/server application such as the LEADTOOLS Document Viewer/Document Service works as follows (Document Service Workflow 1):
1. The Document Viewer loads or uploads a document from an external resource, and eventually calls DocumentFactory.LoadFromUri or DocumentFactory.LoadFromFile on the Document Service.
2. DocumentFactory calls RasterCodecs.StartOptimizedLoad and RasterCodecs.GetInformation to obtain information about the source document, such as format and number of pages. This can take some time if the document is very large and complex, especially for file formats such as XLSX/XLS.
3. A CodecsOptimizedLoadData object can be obtained at this point. If obtained, the factory stores it internally inside the LEADDocument object.
4. The service then saves this document information in the cache. If CodecsOptimizedLoadData supports managed data, it is saved into the cache as well. This is true for file formats such as TXT and AFP.
5. The service disposes of the document with all its data, including the CodecsOptimizedLoadData, if available.
6. The document ID in the cache is returned to the viewer.
7. The viewer builds the skeleton of empty pages and thumbnails.
8. The viewer sends one or more requests asynchronously to the service to obtain page images, SVG, or text data.
9. Each one of these requests results in loading the document information from the cache using DocumentFactory.LoadFromCache.
10. CodecsOptimizedLoadData is created for the document. If its managed data was stored in the cache, the managed data is loaded and used (the parsing of the complex document structure is not performed).
11. If CodecsOptimizedLoadData supports only unmanaged data, then the data was not stored in the cache. Parsing of the complex document structure must be performed again.
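The cost of this workflow can be illustrated with a small toy model. The sketch below is not the LEADTOOLS API: `ToyDocumentService`, `parses_for`, and the dictionary standing in for the cache are hypothetical names, and the "parse" is just a counter increment. It only shows why a format with managed data is parsed once, while a format with unmanaged data is parsed again on every page request.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocumentService:
    """Toy model of the workflow above. Hypothetical names, not the LEADTOOLS API."""

    # Formats whose optimized data is "managed" and can be persisted (per the text).
    managed_formats: frozenset = frozenset({"TXT", "AFP"})
    persistent_cache: dict = field(default_factory=dict)
    parse_count: int = 0  # how many times the expensive parse ran

    def _parse(self, fmt: str) -> str:
        # Stands in for the expensive structure parsing of steps 2-3.
        self.parse_count += 1
        return f"optimized-data-for-{fmt}"

    def load_from_uri(self, doc_id: str, fmt: str) -> str:
        data = self._parse(fmt)
        entry = {"format": fmt}
        if fmt in self.managed_formats:
            entry["optimized"] = data  # step 4: managed data is saved too
        self.persistent_cache[doc_id] = entry
        return doc_id  # steps 5-6: document disposed, ID returned

    def get_page(self, doc_id: str, page: int) -> str:
        entry = self.persistent_cache[doc_id]  # step 9
        data = entry.get("optimized")  # step 10
        if data is None:
            data = self._parse(entry["format"])  # step 11: parse again
        return f"page {page} from {data}"


def parses_for(fmt: str, page_requests: int) -> int:
    """Count the parses needed for one load plus a number of page requests."""
    service = ToyDocumentService()
    doc_id = service.load_from_uri("doc-1", fmt)
    for page in range(1, page_requests + 1):
        service.get_page(doc_id, page)
    return service.parse_count
```

In this model, `parses_for("TXT", 5)` parses once, whereas `parses_for("XLSX", 5)` parses six times: once for the initial load and once per page request.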
DocumentFactory supports caching of the unmanaged data described above in memory. When enabled, the workflow is modified as follows (Document Service Workflow 2):
1. The Document Viewer loads or uploads a document from an external resource, and eventually calls DocumentFactory.LoadFromUri or DocumentFactory.LoadFromFile on the Document Service.
2. DocumentFactory calls RasterCodecs.StartOptimizedLoad and RasterCodecs.GetInformation to obtain information about the source document, such as format and number of pages. This operation may take some time if the document is very large and complex, especially for file formats such as XLSX/XLS.
2.1 New behavior: The previous operation is timed and its duration is stored in LEADDocument.LoadDuration.
3. A CodecsOptimizedLoadData object may be obtained at this point. If obtained, the factory stores it internally inside the LEADDocument object.
4. The service then saves this document information in the cache. If CodecsOptimizedLoadData supports managed data, it is saved into the cache as well. This is true for file formats such as TXT and AFP.
4.1 New behavior: If CodecsOptimizedLoadData supports unmanaged data only (as with DOCX and XLSX/XLS), LoadDuration is compared against DocumentMemoryCacheStartOptions.MinimumLoadDuration. If it is greater, the unmanaged data is stored inside an internal memory cache associated with the document.
5. The service disposes of the document with all its data, including the CodecsOptimizedLoadData, if available.
6. The document ID in the cache is returned to the viewer.
7. The viewer builds the skeleton of empty pages and thumbnails.
8. The viewer sends one or more requests asynchronously to the service to obtain page images, SVG, or text data.
9. Each one of these requests results in loading the document information from the cache using DocumentFactory.LoadFromCache.
10. CodecsOptimizedLoadData is created for the document. If its managed data was stored in the cache, it is loaded and used (parsing of the complex document structure is not performed).
10.1 New behavior: Otherwise, the internal memory cache is queried for a CodecsOptimizedLoadData associated with this document. If found, it is used and parsing of the complex document structure is not performed again. Otherwise, step 11 applies.
11. If CodecsOptimizedLoadData supports only unmanaged data, then it was not stored in the cache. If it was also not found in the internal memory cache, parsing of the complex document structure must be performed again.
11.1 New behavior: If step 11 occurred, step 4.1 is repeated: LoadDuration is compared against MinimumLoadDuration and this CodecsOptimizedLoadData is potentially added to the internal memory cache. Subsequent calls to LoadFromCache will then use this data and regain the speed increase. This scenario happens when the data stored in the memory cache has expired after a specified time of inactivity.
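The lookup order in steps 10 through 11.1 can be sketched as a single function. This is a conceptual model under stated assumptions, not LEADTOOLS code: `load_optimized_data`, the two dictionaries, and the `parse` callable are hypothetical, and the 2-second threshold is the documented default for MinimumLoadDuration.

```python
from datetime import timedelta

# Default value documented for MinimumLoadDuration in this topic.
MINIMUM_LOAD_DURATION = timedelta(seconds=2)


def load_optimized_data(doc_id, persistent_cache, memory_cache, parse):
    """Return (data, source) following steps 10 to 11.1. All names are hypothetical.

    `parse` is a callable taking a format name and returning (data, duration).
    """
    entry = persistent_cache[doc_id]
    if "optimized" in entry:  # step 10: managed data was persisted
        return entry["optimized"], "persistent cache"
    if doc_id in memory_cache:  # step 10.1: unmanaged data kept in memory
        return memory_cache[doc_id], "memory cache"
    data, duration = parse(entry["format"])  # step 11: parse from scratch
    if duration > MINIMUM_LOAD_DURATION:  # step 11.1 repeats the check of 4.1
        memory_cache[doc_id] = data
    return data, "re-parsed"
```

With a parse that takes 10 seconds, the first call reports "re-parsed" and every later call reports "memory cache" until the entry is removed. A parse that takes 1 second is never memory-cached, because it does not exceed the threshold.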
Therefore, using the document memory cache will increase the performance of loading and parsing pages from complex and large document formats at the expense of keeping the unmanaged data in memory at the server side. LEADTOOLS internal testing showed an increase in speed of up to 50 times when loading very large XLSX files in the JavaScript document viewer if document memory cache is used.
DocumentFactory contains the static property DocumentFactory.DocumentMemoryCache that controls the usage of this feature in the document toolkit.
DocumentMemoryCache usage can be enabled as follows:
1. Create an instance of DocumentMemoryCacheStartOptions with the options to use.
2. Call DocumentMemoryCache.Start with these options.
This will initialize the internal memory cache and cleanup timer, and all subsequent calls to LoadFromUri, LoadFromFile, and LoadFromCache will switch to "Document Service Workflow 2" described above.
The internal cache contains CodecsOptimizedLoadData objects associated with document IDs. Each of these items also contains a timestamp to mark its last usage. The timestamps are updated whenever the document is "touched". If a specific amount of time passes without any activity on the saved data of a document, it is considered expired and is removed from the cache.
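Sliding expiration of this kind can be modeled in a few lines. The following is a minimal sketch, assuming a dictionary keyed by document ID and an injectable clock; `SlidingExpirationCache` and its methods are hypothetical names chosen for illustration, and the actual internal data structure used by the toolkit is not described in this topic.

```python
import time


class SlidingExpirationCache:
    """Illustrative sliding-expiration cache; not the LEADTOOLS implementation."""

    def __init__(self, sliding_expiration=60.0, clock=time.monotonic):
        self._ttl = sliding_expiration  # seconds of allowed inactivity
        self._clock = clock  # injectable so expiry can be tested
        self._items = {}  # doc_id -> (data, last_used timestamp)

    def put(self, doc_id, data):
        self._items[doc_id] = (data, self._clock())

    def get(self, doc_id):
        item = self._items.get(doc_id)
        if item is None:
            return None
        data, _ = item
        # "Touch" the entry: reading it resets its inactivity window.
        self._items[doc_id] = (data, self._clock())
        return data

    def has_document(self, doc_id):
        return doc_id in self._items

    def purge_expired(self):
        """Remove entries idle for longer than the sliding expiration."""
        now = self._clock()
        expired = [k for k, (_, used) in self._items.items() if now - used > self._ttl]
        for key in expired:
            del self._items[key]
        return expired
```

In a real service, `purge_expired` would be driven by a periodic timer. Here it is called explicitly so that the behavior is easy to follow.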
The engine will automatically keep the data of a document alive between the call to LoadFromCache and the disposal of the document, thus preventing long-running operations (such as converting a large document to a different format) from triggering the expiry of the data.
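How the engine keeps an entry alive is not described here. One plausible mechanism, shown purely as an assumption, is a pin count that blocks expiry while any loaded document still uses the data. `PinnedEntry` and its methods are hypothetical names.

```python
class PinnedEntry:
    """Cache entry that cannot expire while in use. An assumed mechanism only."""

    def __init__(self, data, now):
        self.data = data
        self.last_used = now
        self.pins = 0  # number of loaded documents currently using the data

    def acquire(self, now):
        # Called when a document is loaded from the cache.
        self.pins += 1
        self.last_used = now
        return self.data

    def release(self, now):
        # Called when the document is disposed; the idle window starts here.
        self.pins -= 1
        self.last_used = now

    def is_expired(self, now, sliding_expiration):
        if self.pins > 0:
            return False  # still in use, so never expired
        return now - self.last_used > sliding_expiration
```

With a 60-second expiration, an entry acquired at time 0 and released at time 1000 is not expired at time 1000, because the idle period is counted from the release rather than from the load.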
The engine will also automatically purge the data of a document when it is deleted from the cache.
DocumentMemoryCacheStartOptions contains the following options:
Member | Description |
---|---|
MinimumLoadDuration | The minimum amount of time the initial LoadFromUri/LoadFromFile of a document must take for the document to be considered for memory optimization. The default value is 2 seconds. |
MaximumItems | Maximum number of items to keep in the cache. The default value is 0, meaning there is no limit. |
SlidingExpiration | Duration within which the cache entry must be "touched" before it is deleted from the cache. The default value is 60 seconds. |
TimerInterval | Interval the timer uses to check for and remove expired items. The default value is 60 seconds. |
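To make the interaction between these options concrete, the defaults from the table can be mirrored in a small value type. This is an illustrative stand-in, not the real DocumentMemoryCacheStartOptions class: `StartOptions` and its two helper methods are hypothetical. The lifetime estimate assumes that expired entries are removed only when the periodic timer fires.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class StartOptions:
    """Mirror of the documented defaults. Hypothetical type for illustration."""

    minimum_load_duration: timedelta = timedelta(seconds=2)
    maximum_items: int = 0  # 0 means no limit
    sliding_expiration: timedelta = timedelta(seconds=60)
    timer_interval: timedelta = timedelta(seconds=60)

    def has_room(self, current_items: int) -> bool:
        # A limit of 0 disables the check entirely.
        return self.maximum_items == 0 or current_items < self.maximum_items

    def worst_case_idle_lifetime(self) -> timedelta:
        # Assumption: an entry is removed on the first timer tick after it
        # expires, so it can linger for up to one extra timer interval.
        return self.sliding_expiration + self.timer_interval
```

Under that assumption, the defaults let an untouched entry stay in memory for up to about 120 seconds. This topic does not say what happens when MaximumItems is reached, so the sketch only reports whether the limit has been hit.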
DocumentMemoryCache contains the following members:
Member | Description |
---|---|
IsStarted | Checks whether document memory caching support has started. |
Start | Starts document memory caching support. |
Stop | Stops document memory caching support. |
HasDocument | Checks whether an entry associated with the specified document exists. |
This example simulates a client loading a small and then a large document from a URI, and then shows the times taken. It produces results similar to the following:
Using memory cache is False
Initial load from leadtools.pdf took ~0 seconds
Is using memory cache is False
Multi-threaded load of all pages took ~0 seconds
Initial load from complex.xlsx took ~10 seconds
Is using memory cache is False
Multi-threaded load of all pages took ~13 seconds
Using memory cache is True
Initial load from leadtools.pdf took ~0 seconds
Is using memory cache is False
Multi-threaded load of all pages took ~0 seconds
Initial load from complex.xlsx took ~10 seconds
Is using memory cache is True
Multi-threaded load of all pages took ~0 seconds
Notice how the time it took to load all the pages of complex.xlsx in multi-threaded code decreased from 13 to almost 0 seconds when the memory cache is used.