MIME type whitelisting support.
LEADTOOLS supports reading a large number of file formats. These include formats that are used frequently in document management systems such as PDF (application/pdf), TIFF (image/tiff) and DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document). They also include formats that are rarely used in these situations, such as GIF files (image/gif).
DocumentFactory contains the DocumentFactory.LoadFromUri, DocumentFactory.LoadFromFile and DocumentFactory.LoadFromStream methods, which load a document from a URI, file or stream respectively. If the data contains an image or document format that LEADTOOLS can load, a new LEADDocument object is created and returned to the user.
In certain situations, an application may need to explicitly allow or disallow certain MIME types to lessen the possibility of failure, for security reasons, or to improve the user experience. This technique is called MIME type whitelisting.
DocumentFactory contains a static instance of the DocumentMimeTypes class in the DocumentFactory.MimeTypes property. This instance contains entries for MIME types and a status indicating the behavior of each (unspecified, allow, deny). These entries are stored in the DocumentMimeTypes.Entries dictionary, with each entry containing a MIME type as the key and a DocumentMimeTypeStatus enumeration member as the value. The default status of each MIME type is stored in the DocumentMimeTypes.DefaultStatus property, which has a value of DocumentMimeTypeStatus.Unspecified (meaning, perform the default action).
When DocumentFactory checks a MIME type of a document in the process of loading it, it will make a call to DocumentMimeTypes.GetStatus. This method will first check the Entries dictionary, and if the mime type key is found, will return the DocumentMimeTypeStatus value. If no such entry is found, the value of DefaultStatus is returned.
By default, the Entries dictionary is empty and the value of DefaultStatus is Unspecified, so GetStatus returns Unspecified each time DocumentFactory loads a document. Therefore, by default, DocumentFactory will load any file format and MIME type supported by LEADTOOLS.
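The lookup described above (entry first, default status as the fallback) can be modelled in a few lines. This is only a self-contained sketch: MimeTypeStatus and MimeTypePolicy are hypothetical stand-ins written for illustration and are not the LEADTOOLS DocumentMimeTypeStatus and DocumentMimeTypes types, whose internals may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT the LEADTOOLS types.
public enum MimeTypeStatus { Unspecified, Allowed, Denied }

public sealed class MimeTypePolicy
{
    // Explicit per-MIME-type entries; empty until the application adds some.
    public Dictionary<string, MimeTypeStatus> Entries { get; } =
        new Dictionary<string, MimeTypeStatus>();

    // Returned for any MIME type that has no explicit entry.
    public MimeTypeStatus DefaultStatus { get; set; } = MimeTypeStatus.Unspecified;

    // Entry lookup first, default status as the fallback.
    public MimeTypeStatus GetStatus(string mimeType)
    {
        if (mimeType != null && Entries.TryGetValue(mimeType, out MimeTypeStatus status))
        {
            return status;
        }
        return DefaultStatus;
    }
}
```

With a fresh MimeTypePolicy, GetStatus returns Unspecified for every MIME type. After adding a Denied entry for image/gif, only that MIME type reports Denied and all others keep falling through to DefaultStatus.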
To disable loading GIF files using the Document toolkit, add an entry for its mime type as follows:
// Disallow GIF mime type
DocumentFactory.MimeTypes.Entries.Add("image/gif", DocumentMimeTypeStatus.Denied);
// Load a GIF file
LEADDocument gifDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.gif", loadOptions);
Debug.Assert(gifDocument == null);
This results in gifDocument being null. The application can check for this value and perform the next action, such as informing the user that the document has a MIME type that has been explicitly denied.
Note that all other MIME types not specified by the application will still work, since DefaultStatus is still Unspecified, which performs the default action.
To allow loading only PDF and TIFF files using the Document toolkit, add entries for the mime types as follows:
// Allow PDF and TIF
DocumentFactory.MimeTypes.Entries.Add("application/pdf", DocumentMimeTypeStatus.Allowed);
DocumentFactory.MimeTypes.Entries.Add("image/tiff", DocumentMimeTypeStatus.Allowed);
// Load a PDF document
LEADDocument pdfDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.pdf", loadOptions);
// PDF is allowed per our requirement
Debug.Assert(pdfDocument != null);
// Load a TIF document
LEADDocument tiffDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.tif", loadOptions);
// TIFF is allowed per our requirement
Debug.Assert(tiffDocument != null);
// Load a GIF file
LEADDocument gifDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.gif", loadOptions);
// GIF is disallowed per our requirement (only PDF and TIFF)
Debug.Assert(gifDocument == null);
For the PDF and TIFF documents, the factory will call GetStatus, and since an entry for each MIME type is found, the status (Allowed) is returned and the documents are loaded correctly.
For the GIF file, GetStatus will not find an entry for its MIME type and will return the value of DefaultStatus. Since this is Unspecified by default (perform the default action), the factory will still be able to load the GIF file. This is not what we wanted, and the assert will fail. Therefore, modify the example as follows:
// Allow PDF and TIF
DocumentFactory.MimeTypes.Entries.Add("application/pdf", DocumentMimeTypeStatus.Allowed);
DocumentFactory.MimeTypes.Entries.Add("image/tiff", DocumentMimeTypeStatus.Allowed);
// Deny GIF
DocumentFactory.MimeTypes.Entries.Add("image/gif", DocumentMimeTypeStatus.Denied);
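// Load a GIF file
LEADDocument gifDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.gif", loadOptions);
// GIF is disallowed per our requirement (only PDF and TIFF)
Debug.Assert(gifDocument == null);
Now gifDocument will be null and our requirement is met.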
What about loading a PNG file?
// Load a PNG file
LEADDocument pngDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.png", loadOptions);
// PNG is disallowed per our requirement (only PDF and TIFF)
Debug.Assert(pngDocument == null);
However, this does not work and the document is loaded because DefaultStatus is still Unspecified. We could add image/png to the list of denied MIME types but this will fail again for the next new MIME type we encounter. Instead, to meet our requirement of only allowing PDF and TIFF documents, modify the example like this:
// Allow PDF and TIF
DocumentFactory.MimeTypes.Entries.Add("application/pdf", DocumentMimeTypeStatus.Allowed);
DocumentFactory.MimeTypes.Entries.Add("image/tiff", DocumentMimeTypeStatus.Allowed);
// Disallow everything else instead of denying MIME types manually
DocumentFactory.MimeTypes.DefaultStatus = DocumentMimeTypeStatus.Denied;
// Load a PDF document
LEADDocument pdfDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.pdf", loadOptions);
// PDF is allowed per our requirement
Debug.Assert(pdfDocument != null);
// Load a TIF document
LEADDocument tiffDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.tif", loadOptions);
// TIFF is allowed per our requirement
Debug.Assert(tiffDocument != null);
// Load a GIF file
LEADDocument gifDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.gif", loadOptions);
// GIF is disallowed per our requirement (only PDF and TIFF)
Debug.Assert(gifDocument == null);
// Load a PNG file
LEADDocument pngDocument = DocumentFactory.LoadFromUri("http://example.org/images/file.png", loadOptions);
// PNG is disallowed per our requirement (only PDF and TIFF)
Debug.Assert(pngDocument == null);
Using Entries and DefaultStatus, the application can have any combination of explicitly allowing or denying any or all MIME types.
DocumentFactory will check for MIME types using GetStatus during LoadFromUri, LoadFromFile and LoadFromStream as follows:
First, the factory checks LoadDocumentOptions.MimeType. This member has a default value of null but can be set by the user application to the actual MIME type of the document being loaded. The factory will call GetStatus passing this value and fail loading the document if the status is Denied.
Next, for LoadFromUri, the factory can obtain the MIME type (media type) from the URL by reading the HTTP headers returned by the server hosting the document. It will also be checked and if denied, the load fails.
Next, for uploaded documents, a MIME type can also be set by the user application in UploadDocumentOptions.MimeType. If this value was set by the user then it will also be checked and if denied, the load will fail.
For the first two options, the MIME type might not be available or might be set to a wrong value. Therefore, the factory will finally obtain the real MIME type from the actual image data (using RasterCodecs). This value is checked as well, and if it is denied, the load fails.
Finally, the final MIME type obtained from all of the above is stored in the LEADDocument.MimeType property and the load succeeds. The value of DocumentCacheInfo.MimeTypeStatus for this document will be set to the status found during this load operation.
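The order of checks above can be pictured with a small self-contained model. This is a sketch under stated assumptions, not the factory's actual implementation: CheckStatus, CheckSource and LoadCheckSketch are hypothetical names, the candidates are assumed to be supplied in the order described (user, URL, cache, data), and the "final" MIME type is assumed to be the last one that was available.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT the LEADTOOLS types.
public enum CheckStatus { Unspecified, Allowed, Denied }
public enum CheckSource { User, Url, Cache, Data }

public static class LoadCheckSketch
{
    // Walks the candidate MIME types in the order given.
    // Returns null as soon as one candidate is denied (the load fails);
    // otherwise returns the last available MIME type (assumed to be the final one).
    public static string ResolveMimeType(
        IEnumerable<KeyValuePair<CheckSource, string>> candidates,
        Func<string, CheckStatus> getStatus)
    {
        string finalMimeType = null;
        foreach (KeyValuePair<CheckSource, string> candidate in candidates)
        {
            if (candidate.Value == null)
            {
                continue; // this source did not provide a MIME type
            }
            if (getStatus(candidate.Value) == CheckStatus.Denied)
            {
                return null; // a denied MIME type fails the load
            }
            finalMimeType = candidate.Value;
        }
        return finalMimeType;
    }
}
```

The model shows why the data check matters: if the user-supplied value is application/pdf but the value read from the data is image/gif, and image/gif is denied, ResolveMimeType returns null even though the first check passed.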
All the above can be logged and traced using the UserGetDocumentStatusHandler callback. The application can set a custom handler in UserGetDocumentStatus and the factory will invoke this callback for all the operations above with the following parameters:
Parameter | Description |
---|---|
uri | The URI to the document being loaded. |
options | The LoadDocumentOptions object passed by the user. |
source | The source of this callback invocation. |
mimeType | The MIME type being checked. |
source can be any of the following:
Member | Description |
---|---|
DocumentMimeTypeSource.User | The mime type is passed by the user. For instance, in LoadDocumentOptions.MimeType. |
DocumentMimeTypeSource.Cache | The mime type is stored in the cache, for example, from UploadDocumentOptions.MimeType. |
DocumentMimeTypeSource.Url | The mime type was obtained from the HTTP headers of a URL as set by the server containing the document. |
DocumentMimeTypeSource.Data | The mime type is read by LEADTOOLS RasterCodecs from the actual image data. |
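The callback flow above (the factory invokes the user handler for each check, and the handler may either decide on its own or defer to the configured action) can be modelled as follows. This is a self-contained sketch only: SketchStatus, SketchSource, SketchStatusHandler and SketchMimeTypes are hypothetical names, the options parameter is omitted for brevity, and the real UserGetDocumentStatusHandler signature should be taken from the LEADTOOLS reference.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT the LEADTOOLS types.
public enum SketchStatus { Unspecified, Allowed, Denied }
public enum SketchSource { User, Cache, Url, Data }

// Simplified callback shape: the options parameter is left out of this model.
public delegate SketchStatus SketchStatusHandler(Uri uri, SketchSource source, string mimeType);

public sealed class SketchMimeTypes
{
    public Dictionary<string, SketchStatus> Entries { get; } =
        new Dictionary<string, SketchStatus>();
    public SketchStatus DefaultStatus { get; set; } = SketchStatus.Unspecified;

    // Optional user callback; null means "use the configured action".
    public SketchStatusHandler UserHandler { get; set; }

    // Configured action: entry lookup first, then the default status.
    public SketchStatus GetConfiguredStatus(string mimeType)
    {
        if (mimeType != null && Entries.TryGetValue(mimeType, out SketchStatus status))
        {
            return status;
        }
        return DefaultStatus;
    }

    // What the loader in this model calls for every MIME type check.
    public SketchStatus Check(Uri uri, SketchSource source, string mimeType)
    {
        return UserHandler != null
            ? UserHandler(uri, source, mimeType)
            : GetConfiguredStatus(mimeType);
    }

    // Builds a handler that records each check and then defers to the configured action.
    public SketchStatusHandler CreateLoggingHandler(List<string> log)
    {
        return (uri, source, mimeType) =>
        {
            SketchStatus status = GetConfiguredStatus(mimeType);
            log.Add(source + ": " + mimeType + " -> " + status);
            return status;
        };
    }
}
```

Assigning the result of CreateLoggingHandler to UserHandler leaves the allow/deny outcome unchanged while recording one line per check, which mirrors the logging use of the callback described here. A handler could equally return a status of its own instead of deferring.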
The RasterCodecs utility methods GetExtensionMimeType, GetMimeType and GetMimeTypeExtension can be used to obtain a MIME type to/from an extension or from a LEADTOOLS RasterImageFormat enumeration member.
This example will allow only loading PDF and TIFF documents and deny everything else. The example also installs a callback to log all the MIME type verification operations. The user callback can return any value for the status of the MIME type or call GetDocumentStatus to continue with the configured action.