A large portion of the Documents class library requires the use of a cache system.
Documents can contain large numbers of pages containing huge amounts of data. Storing all of this data in physical memory is not feasible in most situations.
Typically, systems do not have write access to the pages of a document stored at a remote URL, so modifications (annotations, image, or text) cannot be stored in the original document.
Caching increases the performance of tasks like getting page data such as images or text. Results can be parsed once from the physical file on disk, processed, and then stored in the cache. Subsequent calls to the same data will simply retrieve it from the cache without any extra processing.
The LEADTOOLS Documents Web Service requires the use of a cache. Web methods are session-less by nature and with cache support, the service can use SaveToCache and LoadFromCache to save/load the same document between calls without the need to maintain session states.
Each Document object contains an ID. The ID is a unique string value that can be generated automatically by the system (using a GUID generator) or provided by the user and stored in the Document.DocumentId field. The ID is all that is required to re-construct a document object from a cache using LoadFromCache.
LoadFromUri can be used to create a Document object that represents a document such as a PDF, TIFF or DOCX document stored in a remote URL. The Document is a data structure containing properties such as the mime type, number of pages, size of each page, and other metadata. It does not contain any actual image, SVG or text data of the pages. The original document data (the PDF, TIFF or DOCX image) is still stored in the remote URL. This data structure is all that is saved into the cache (by default) and therefore saving and then re-loading a document from the cache is a very fast operation that does not require a large amount of memory. When the user requests an image representation of a page, the document parses it from the original data. This data can also be cached, if required, as explained in the "Cache Workflow" section below.
To use caching, an object that implements ObjectCache is initialized once at the start of the application and then passed to the DocumentFactory and Document methods that require caching. Any cache system that can persist data between application re-starts can be used. Refer to the "Cache System Examples" section below for more information.
Generally, the cache is used in one of two ways, depending on the type of the application:
The application uses the cache to get data more quickly from existing documents or to create new documents. The data is not shared with other applications nor is it necessary for it to persist between sessions.
The application requires the document state to persist between sessions.
An example is the LEADTOOLS Documents Web Service, which keeps its cache configuration in the web.config file. After a document is loaded using LoadFromUri, it is saved to the cache using SaveToCache and an object containing the document properties (along with its ID) is returned to the JavaScript client. The client can then call the various other methods of the service to obtain page images, text, annotations or any other data using this ID. The web methods will call LoadFromCache to re-construct the document from the cache to obtain the desired data before viewing it in the LEADTOOLS Document Viewer.
These types of applications usually store a policy setting in the web.config as well. This controls how long items are stored in the cache before they expire.
The cache workflow below describes how the LEADTOOLS Documents Web Service uses the cache. The service ships with full source code and the process can be modified as needed. The project source code is located at:
.NET: [Your installation folder]\Examples\JS\Documents\DocumentViewer\Services\DocumentService
Java: [Your installation folder]\Examples\JS\Documents\DocumentViewer\Services\DocumentServiceJava
The JavaScript client demo is located at:
JavaScript: [Your installation folder]\Examples\JS\Documents\DocumentViewer\Apps\App1\site
TypeScript: [Your installation folder]\Examples\JS\Documents\DocumentViewer\Apps\App1\ts
The global cache object is created and stored in the static _cache variable (accessible to the rest of the service code through the ServiceHelper.Cache static property). In the sample implementation, this is a LEADTOOLS FileCache object that stores cache items in the file system (local or, as recommended, a remote UNC path). The cache eviction policy that determines how long items are kept in the cache is also set up here.
The cache is a persistence system, which means that when the system is restarted, only the cache object is re-created and any non-expired items stored in the cache from previous sessions will still be available.
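The persistence and expiration behavior described above can be illustrated with a minimal sketch. This is not the LEADTOOLS FileCache; the class `TinyFileCache`, its methods, and the JSON-on-disk layout are hypothetical and only model the idea that items outlive the cache object and disappear once their policy expires.

```python
import json
import os
import tempfile
import time


class TinyFileCache:
    """Hypothetical file-backed cache; it does not mirror the LEADTOOLS API."""

    def __init__(self, directory, ttl_seconds):
        self.directory = directory
        self.ttl_seconds = ttl_seconds  # stand-in for an expiration policy
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, key + ".json")

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        entry = {"expires": now + self.ttl_seconds, "value": value}
        with open(self._path(key), "w", encoding="utf-8") as f:
            json.dump(entry, f)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        try:
            with open(self._path(key), encoding="utf-8") as f:
                entry = json.load(f)
        except FileNotFoundError:
            return None
        if entry["expires"] <= now:
            os.remove(self._path(key))  # expired: evict and report a miss
            return None
        return entry["value"]


cache_dir = tempfile.mkdtemp()
first = TinyFileCache(cache_dir, ttl_seconds=60)
first.set("doc-1", {"pages": 3}, now=1000.0)

# Simulate a restart: only the cache object is re-created.
second = TinyFileCache(cache_dir, ttl_seconds=60)
still_there = second.get("doc-1", now=1030.0)  # within the 60 second policy
expired = second.get("doc-1", now=1061.0)      # past the policy
```

The `now` parameter exists only so the example is deterministic; a real policy would rely on the system clock.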
You can ignore the source code dealing with "Pre-Cache". This deals with special code to pre-cache the LEADTOOLS sample documents used in the demo.
The service can be modified to include:
Modifying ServiceHelper.CreateCache to initialize and set up a different cache system and set it in the _cache variable.
LoadFromUri is the entry point where the user loads a new document located at a remote URL. The document can be any file format supported by LEADTOOLS, such as PDF, TIFF, DOCX, PNG, XLSX and many others. It is invoked from the JavaScript Document Viewer client using the "Open URL" menu item.
As before, ignore the "Pre-Cache" source code, which deals with the sample LEADTOOLS documents used by the demo.
DocumentFactory.LoadFromUri is called, passing the cache object that we previously created and the URL requested by the user in LoadDocumentOptions. This method will quickly determine whether the data in the URL contains valid image or document data that is supported, parse the data to determine the number of pages and size of each page, and return a new Document object containing the information.
Each new Document object requires a unique ID. Therefore, if one was not passed in LoadDocumentOptions.DocumentId (the value is null), then a new one is created by a GUID generator. If the user wishes to use their own ID, the same value is used and it is up to the user to guarantee the uniqueness of this ID. This ID is stored in the Document.DocumentId property of the created object.
Document.CacheUri is checked and if null, is set to a value that can be used to obtain the original document data. Refer to the "Under the Hood" section below for more information.
The following properties of the document are set:
AutoDeleteFromCache is set to false: The document should not delete itself from the cache when the .NET or Java object is disposed.
AutoDisposeDocuments is set to true: This is useful when the document may contain other child documents in the future. Refer to Creating Documents with LEADTOOLS Documents Library for more information.
AutoSaveToCache is set to false: Not all subsequent operations require re-saving the document to the cache (for instance, obtaining the image data of a page). Setting this property to false prevents the .NET/Java Document object from calling SaveToCache when it is disposed, so the service controls manually when the document is saved.
Finally, SaveToCache is called to save the document to the cache. Here, a new cache item is created from the document ID and the data structure required to reconstruct this Document object is saved into the cache.
For now, think of this as a single operation and a single item although in reality multiple items are saved into the cache and the original document data (PDF, DOCX, TIFF) may be stored in the cache as well. This is explained in detail in the "Under the Hood" section below.
The JavaScript code will create an instance of the JS Document object from the JSON data and set it in the viewer. The viewer now has all the information needed to construct the skeleton required to view the document: page holders with the correct size in the view and thumbnails areas, the bookmarks tab (if supported), and annotation containers. All of these are created with empty data since the Document object does not contain any. The viewer is fully functional, and the user can scroll and click on items, which triggers calls to other methods in the service to obtain the required data.
Multi-user systems that will share the same document ID between different browsers can change the value of Document.CacheOptions from the default of DocumentCacheOptions.None to store page image, SVG and text data into cache and increase performance as explained in the next section.
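The load workflow above can be sketched in a language-neutral way. The function below is a hypothetical stand-in for the service method, not the LEADTOOLS API: the cache is a plain dictionary, the document is a dictionary of properties, and `page_sizes` replaces the real parsing of the remote file.

```python
import json
import uuid


def load_from_uri(cache, uri, document_id=None, page_sizes=((612, 792),)):
    """Hypothetical model of the load workflow; names are illustrative only."""
    document = {
        # Use the caller's ID when one is given; otherwise generate a GUID.
        "documentId": document_id if document_id is not None else uuid.uuid4().hex,
        "uri": uri,
        "pageSizes": [list(size) for size in page_sizes],
        # The three properties the service sets before saving.
        "autoDeleteFromCache": False,
        "autoDisposeDocuments": True,
        "autoSaveToCache": False,
    }
    # Explicit save: only the small descriptive structure is stored.
    cache[document["documentId"]] = json.dumps(document)
    # The client receives the same properties as JSON.
    return json.dumps(document)
```

Note that no page image, SVG or text data is stored or returned at this point, which is why the viewer starts with an empty skeleton.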
The service can be modified to include:
Verifying that the current user has access to the URL and can view the document
If the URL can contain just the name of a document, mapping this value to the URL where the actual image data resides in other parts of the system inaccessible from outside the service and passing this new URL to DocumentFactory.LoadFromUri
Setting LoadDocumentOptions.DocumentId to a value that matches the ID of the document in other parts of the system to automatically map with the LEADTOOLS document ID
Adding the document ID to a separate database to inform a system that supports sharing documents with multiple IDs that this document is ready for viewing. This feature can also be implemented directly here: first, try to load the document from the cache by calling DocumentFactory.LoadFromCache with the document ID. If this fails (the method returns null when the document is not in the cache), assume that this is the first call to load this document and continue with the original code. If LoadFromCache succeeds, the document is already in the cache, so simply return the Document object to JavaScript
Modifying the value of Document.CacheOptions prior to SaveToCache to increase performance of systems that share the cached document ID between multiple users
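The "try the cache first" modification mentioned in the list above might look like the following sketch. The function and parameter names are hypothetical; `create_document` stands in for the original load code and the cache is a plain dictionary.

```python
def load_or_create(cache, document_id, create_document):
    """Return the cached document if present; otherwise create and cache it."""
    existing = cache.get(document_id)
    if existing is not None:
        # Another session already loaded this ID: reuse it.
        return existing
    # First call for this ID: fall back to the original load path.
    document = create_document(document_id)
    cache[document_id] = document
    return document
```

With this shape, the uniqueness of the shared ID remains the responsibility of the caller, as noted earlier for user-provided IDs.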
The document is constructed and the first page outline is visible but without content. The system determines that the document supports SVG viewing and requests it by calling the PageController.GetSvg service method with the document ID and page number.
The service will first try to load the document from ServiceHelper.Cache using DocumentFactory.LoadFromCache with the document ID. This method will only request the small data structure required to re-create the .NET/Java Document object and is very fast. As mentioned earlier, this ID is the only value needed to reconstruct the .NET/Java object.
The DocumentPage.GetSvgUrl method is called with the specified options and the resulting SVG data is streamed back to the JavaScript code and the .NET/Java object is disposed.
When the Document object is constructed from the cache, it will use the same settings for AutoSaveToCache and AutoDeleteFromCache; therefore, the document will not save itself back into the cache upon disposal. PageController.GetSvg is considered a read-only method that does not modify the state of the cache object.
The DocumentViewer will generally call this method only once per page and rely on the browser's own caching if the page is requested again (since this is an HTTP GET operation). The method may be called again from the same session only when the browser cache is exhausted. This is performed automatically by the web browser and is outside the control of LEADTOOLS.
The value of Document.CacheOptions is set to DocumentCacheOptions.None, meaning that only the parts required to reconstruct the document are saved into the cache and page image, SVG and text data are not. This minimizes the cache size since, in almost all cases, the DocumentViewer will never call PageController.GetSvg for a page more than once and the resulting SVG data (which can be large) is never requested from the server again.
Multi-user systems that will share the same document ID between different browsers can change the value of Document.CacheOptions from the default value of DocumentCacheOptions.None to store page image, SVG and text data into the cache to increase performance. For instance, setting the value to All (which includes PageSvg) during FactoryController.LoadFromUri above, before SaveToCache, will instruct the library to store the page data into the cache upon request. The workflow for DocumentPage.GetSvgUrl is as follows:
1. Always: check whether the cache contains data for the key "documentID" + "value_of_pageNumber" + "svg". If found, return it. Naturally, the first time this method is called for this page, it will not find any data and will go to step 2.
2. Extract the SVG data for the page from the original document data (PDF, DOCX, etc.). This is almost always a more expensive operation than returning the data directly from the cache.
3. If Document.CacheOptions of the owner document contains PageSvg, then store the SVG data into the cache using the key above.
4. Return the SVG data.
Thereafter, subsequent calls from other user sessions (or browsers) to obtain the SVG data for the same document and same page will find the data in the cache at the first step and will never extract the data from the original document again.
The process can reset if the data is evicted from the cache manually or through automatic expiration. When the page SVG key is not found, steps 2-4 will repeat and the data is re-generated when it is requested the next time.
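The four steps above form a read-through cache gated by an option flag. The sketch below models that logic only; `CacheOptions`, `get_page_svg` and `extract_svg` are hypothetical names, the cache is a dictionary, and the key layout is a simplification of the conceptual key given in the first step.

```python
from enum import Flag


class CacheOptions(Flag):
    """Hypothetical subset of options; values are illustrative."""
    NONE = 0
    PAGE_SVG = 1
    PAGE_TEXT = 2
    ALL = 3


def get_page_svg(cache, options, document_id, page_number, extract_svg):
    key = f"{document_id}_{page_number}_svg"
    # Step 1: always look in the cache first.
    cached = cache.get(key)
    if cached is not None:
        return cached
    # Step 2: expensive extraction from the original document data.
    svg = extract_svg(page_number)
    # Step 3: store only when the owner document opted in.
    if options & CacheOptions.PAGE_SVG:
        cache[key] = svg
    # Step 4: return the data either way.
    return svg
```

With `CacheOptions.NONE` every call reaches the extraction step, while with `CacheOptions.ALL` only the first call for a page does. Deleting the key from the dictionary models eviction and restarts the process.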
The service can be modified to include:
The other methods of the Page and Document controllers work in a fashion similar to PageController.GetSvg. The document is loaded from the cache, the data is extracted using the .NET/Java Document object and returned to JavaScript.
The following methods will re-save the Document object into the cache because they modify the data:
FactoryController.Decrypt - sets the password required to read encrypted documents
PageController.SetAnnotations - saves annotations modified by the user (in preparation for converting the document to other formats)
FactoryController.SaveToCache - saves new or updated virtual documents created through JavaScript into the cache
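The difference between read-only methods and methods that re-save can be modeled with a small sketch. `CachedDocument` and the two functions are hypothetical; the context manager plays the role of disposal, and the `auto_save_to_cache` flag mirrors the setting the service turns off.

```python
import copy


class CachedDocument:
    """Hypothetical document wrapper; disposal is modeled by leaving `with`."""

    def __init__(self, cache, document_id, auto_save_to_cache=False):
        self.cache = cache
        self.document_id = document_id
        self.auto_save_to_cache = auto_save_to_cache
        # Work on a private copy so edits do not leak into the cache.
        self.state = copy.deepcopy(cache[document_id])

    def save_to_cache(self):
        self.cache[self.document_id] = copy.deepcopy(self.state)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.auto_save_to_cache:
            self.save_to_cache()
        return False


def get_page_count(cache, document_id):
    # Read-only: nothing is written back on disposal.
    with CachedDocument(cache, document_id) as doc:
        return doc.state["pageCount"]


def set_annotations(cache, document_id, page_number, annotations):
    # Modifying method: saves explicitly before disposal.
    with CachedDocument(cache, document_id) as doc:
        doc.state.setdefault("annotations", {})[page_number] = annotations
        doc.save_to_cache()
```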
When DocumentFactory.SaveToCache is called, the items below are stored in the cache. Calling DocumentFactory.LoadFromCache will succeed if all the values are found in the cache.
This is performed by calling ObjectCache.AddOrGetExisting with regionName equal to the document ID (Document.DocumentId) and key equal to the value described in the table below. These cache items must always be in the cache for a document to be re-constructed (DocumentFactory.LoadFromCache). If the cache system does not support regions (or groups), then it can simply concatenate the value of regionName (the document ID) + key to create a unique cache ID. See the "Cache System Examples" section below.
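For a cache system without regions, the concatenation of regionName and key, together with the rule that a document can be re-constructed only when its required items are present, can be sketched as follows. `FlatCache`, `make_cache_id`, `can_reconstruct` and the separator are hypothetical; the key names are the ones marked "Must exist" in the table later in this section.

```python
SEPARATOR = "|"  # assumption: a character that never appears in document IDs

REQUIRED_KEYS = (
    "DownloadedFile_CacheId",
    "Values_CacheId",
    "Pages_CacheId",
    "Bookmarks_CacheId",
    "RasterCodecsOptions_CacheId",
)


def make_cache_id(region_name, key):
    """Combine the region (document ID) and key into one flat cache ID."""
    return f"{region_name}{SEPARATOR}{key}"


class FlatCache:
    """Hypothetical store that has no notion of regions."""

    def __init__(self):
        self._items = {}

    def put(self, region_name, key, value):
        self._items[make_cache_id(region_name, key)] = value

    def get(self, region_name, key):
        return self._items.get(make_cache_id(region_name, key))


def can_reconstruct(cache, document_id):
    """True only if every required item of the document is in the cache."""
    return all(cache.get(document_id, key) is not None for key in REQUIRED_KEYS)
```

A separator is used here so that two different (region, key) pairs cannot collapse into the same string; plain concatenation works as well when document IDs have a fixed length, such as GUIDs.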
The original document data (PDF, DOCX, TIFF) is required to parse the document page data after it has been loaded from the cache. The data is stored in the "DownloadedFile_CacheId" key described below and the value depends on whether the cache system supports external resources.
If the cache supports external resources (ObjectCache.DefaultCacheCapabilities contains ExternalResources), such as the default LEADTOOLS FileCache, which has access to a file system, then the original document data is downloaded to the physical disk file acting as the store for the cache item. This is performed by calling ObjectCache.GetItemExternalResource and writing the data directly to the file. This may reduce the memory footprint and increase performance.
If the cache does not support external resources, such as the Memory and Ehcache implementations described below, then the original data is stored as a byte[] directly in the cache item.
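The two storage strategies can be contrasted in a short sketch. The function `store_original` and its parameters are hypothetical, and recording the file path in a dictionary is a simplification: in a file-backed cache the file itself is the cache item.

```python
import os


def store_original(cache_items, supports_external, document_id, data, directory):
    """Store the original document bytes using one of the two strategies."""
    key = "DownloadedFile_CacheId"
    if supports_external:
        # External resource: write straight to disk and keep only the path,
        # so the bytes do not have to stay in memory.
        path = os.path.join(directory, f"{document_id}_{key}.bin")
        with open(path, "wb") as f:
            f.write(data)
        cache_items[(document_id, key)] = path
    else:
        # No external resources: the bytes live inside the cache item.
        cache_items[(document_id, key)] = bytes(data)
    return cache_items[(document_id, key)]
```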
If client-side PDF rendering support is used with the Documents Service, then direct HTTP access to the original image data is required and must be set in Document.CacheUri of the JavaScript object. The PDF renderer will use this value to obtain the original data and render the PDF pages directly into the viewer surface, so DocumentPage.GetSvgUrl and DocumentPage.GetImageUrl are never called.
The .NET/Java DocumentFactory.LoadFromUri method will not set the value of Document.CacheUri and leaves it at the default value of null prior to returning it to JavaScript. The JavaScript DocumentFactory.LoadFromUri method will check whether the value is null and, if so, replace it with the HTTP GET URL required to call the service's CacheController.GetDocumentData web method. Refer to the source code in the service for more information.
This is the default implementation of the .NET/Java Documents Service for the following reasons:
Simplifies deployment: The cache can be stored in any location and on any machine. Direct virtual directory access is not required.
All calls made by JavaScript to obtain document data are routed through one place: The Documents service.
Alternatively, if using a cache system that stores the items in a virtual directory, such as the LEADTOOLS FileCache, then FileCache.CacheVirtualDirectory can be set to the full virtual directory path of the cache items and the .NET/Java DocumentFactory.LoadFromUri will set Document.CacheUri to the path of the document's original data. Finally, the JavaScript DocumentFactory.LoadFromUri method will check for this value and will not modify it since it is not null.
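The client-side decision described in this part reduces to a null check. In the sketch below, the function name, the route `/Cache/GetDocumentData` and the `documentId` query parameter are all assumptions made for illustration; the actual URL shape is defined by the service source code.

```python
from urllib.parse import urlencode


def resolve_cache_uri(cache_uri, service_base_url, document_id):
    """Keep a server-provided URI; otherwise route through the service."""
    if cache_uri is not None:
        # Virtual directory case: the server already set a direct path.
        return cache_uri
    # Default case: build a GET URL for the service method (assumed route).
    query = urlencode({"documentId": document_id})
    return f"{service_base_url}/Cache/GetDocumentData?{query}"
```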
All possible cache IDs for a document can be obtained through Document.GetCacheKeys.
When a document is deleted from the cache using DocumentFactory.DeleteFromCache, the cache checks whether the system contains DefaultCacheCapabilities.CacheRegions (support for grouping the items of a document under its regionName).
Key | Data | Source | Notes |
---|---|---|---|
"DownloadedFile_CacheId" | Either the original URL or byte[] containing the original PDF, DOCX, TIFF, etc data of the document being viewed | URI passed to DocumentFactory.LoadFromUri | Set during LoadFromUri. Get during LoadFromCache. Must exist. LoadFromCache will check for this and return null if it cannot be found in the cache. |
"AnnotationsFile_CacheId" | Either the original URL or byte[] containing the optional annotation data. | LoadDocumentOptions.AnnotationsUri passed to DocumentFactory.LoadFromUri | Set during LoadFromUri. Get during LoadFromCache. This is optional and can be null if no annotation file was served with the document. |
"Values_CacheId" | String containing internal data to re-create the Document object | Created internally during DocumentFactory.LoadFromUri | Set during LoadFromUri. Get during LoadFromCache. Must exist. |
"Pages_CacheId" | String containing internal data to re-create the Document page objects | Created internally during DocumentFactory.LoadFromUri | Set during LoadFromUri. Get during LoadFromCache. Must exist. |
"Bookmarks_CacheId" | String containing internal data to re-create the Document bookmark objects | Created internally during DocumentFactory.LoadFromUri | Set during LoadFromUri. Get during LoadFromCache. Must exist. |
"RasterCodecsOptions_CacheId" | String containing internal data to re-create the RasterCodecs used to load/save images and SVG | Created internally during DocumentFactory.LoadFromUri | Set during LoadFromUri. Get during LoadFromCache. Must exist. |
The following cache items are added to the cache depending on DocumentCacheOptions. The format is:
String key = pageId + "_" + itemName
where pageId is a GUID representing the page.
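The key format can be made concrete with a few helper functions. These helpers are hypothetical and only assemble strings following the format above and the item names listed in the table below; how the numeric suffixes are chosen from Document.Images.MaximumImagePixelSize or from the viewing/conversion mode is not modeled.

```python
ALLOWED_SIZE_NUMBERS = (0, 1, 2, 4)


def page_item_key(page_id, item_name):
    """pageId + "_" + itemName, as in the format above."""
    return f"{page_id}_{item_name}"


def image_item_name(size_number):
    if size_number not in ALLOWED_SIZE_NUMBERS:
        raise ValueError("size number must be 0, 1, 2 or 4")
    return f"image_{size_number}"


def svg_item_name(mode_number, size_number):
    # mode_number is 0 or 1; which value means viewing and which means
    # conversion is not stated in the text, so it is passed through as-is.
    if mode_number not in (0, 1):
        raise ValueError("mode number must be 0 or 1")
    if size_number not in ALLOWED_SIZE_NUMBERS:
        raise ValueError("size number must be 0, 1, 2 or 4")
    return f"svg_{mode_number}_{size_number}"
```

For example, `page_item_key(page_id, "text")` yields the key for the text item of a page, and `page_item_key(page_id, svg_item_name(0, 2))` yields one of the possible SVG keys.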
Key | Data | Used when | Notes |
---|---|---|---|
"thumbnailImage" | RasterImage containing the thumbnail of this page | DocumentCacheOptions.PageThumbnailImage is set | Try to get and set during DocumentPage.GetThumbnail |
"text" | DocumentPageText serializer | DocumentCacheOptions.PageText is set | Try to get and set during DocumentPage.GetText |
"annotations" | String containing the XML representation of the annotations container for the page | DocumentCacheOptions.PageAnnotations is set | Try to get and set during DocumentPage.GetAnnotations |
"image_[number]" number is 0,1,2 or 4 depending on the value of Document.Images.MaximumImagePixelSize | RasterImage containing the image of this page | DocumentCacheOptions.PageImage is set | Try to get and set during DocumentPage.GetImage |
"svgBackImage_[number]" number is 0,1,2 or 4 depending on the value of Document.Images.MaximumImagePixelSize | RasterImage containing the image of this page | DocumentCacheOptions.PageSvgBackImage is set | Try to get and set during DocumentPage.GetSvgBackImage |
"svg_[number1]_[number2]" number1 can be either 0 or 1 (depending on whether this SVG is to be used for viewing or conversion). number2 can be 0,1,2 or 4 similar to the images above | SvgDocument containing the SVG representation of this page | DocumentCacheOptions.PageSvg is set | Try to get and set during DocumentPage.GetSvg |
ObjectCache is an abstract class. Derived classes can be implemented to add support for caching using external systems. Below are sample implementations.
FileCache is the default implementation of ObjectCache. It supports regions, external resources and virtual directories.
This example shows a simple in-memory cache implementation that demonstrates the basics of custom caching. It should not be used in a production environment.
This example shows an implementation of Azure Redis Cache to be used with the LEADTOOLS Documents Library.
This example shows an implementation of Azure Redis Cache and Storage Blobs to be used with the LEADTOOLS Documents Library.
This example shows an implementation of the popular Java Ehcache system to be used with the LEADTOOLS Documents Library.
Loading Documents Using LEADTOOLS Documents Library
Creating Documents with LEADTOOLS Documents Library
Uploading Using the Documents Library
Documents Library Coordinate System
Loading Encrypted Files Using the Documents Library
Parsing Text with the Documents Library
Barcode processing with the Documents Library
Using jQuery Promises in the Documents Library