Resource Histories Technical Detail
In unblu, resources represent pieces of data that are (usually) not directly contained in a visual but referenced from it (using a URI). Typical resource types are binary resources such as images, documents (PDF, DOC, etc.), and multimedia content (video, audio, etc.). Textual resources, e.g., style sheet (CSS) files, and theoretically even HTML itself can also be classed as resources.
There are some exceptions to this rule. The following styles are typically contained directly in a visual and are then transformed into resources on the server side:
- Styles in HTML attributes
- Styles in <style> tags
Resource Storage and Access
An unblu resource consists of two parts:
- The resource itself.
- The data it contains, called a blob (binary large object).
Note: The word ‘blob’ is a ‘backronym’. The data type ‘blob’ described data too large and diverse to be managed on older systems, and the name was originally taken from the 1958 film of the same name. Computer systems can now handle such large files easily, so the blob was retrospectively designated the Binary Large Object once it became a viable technology.
The Resource Object
The resource object provides the following information:
- uuid representing the resource
- uri of the original resource
- mime type
- charset (if textual resource)
- state (can be PENDING, REQUESTING, REQUESTED, INVALID, MATERIALIZED, DELIVERABLE)
- origin (can be CSS_PROPERTY, CSS_IMPORT, STYLE_ATTRIBUTE, TAG, OTHER)
- reference to a blob containing the actual data of the resource
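The field list above can be pictured as a small record type. The following is a minimal Python sketch, not unblu's actual class: the names `Resource`, `ResourceState`, `ResourceOrigin`, and `blob_id`, as well as the choice of PENDING as the default state, are assumptions made for illustration. Only the fields and the enumeration values come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
from uuid import uuid4


class ResourceState(Enum):
    PENDING = "PENDING"
    REQUESTING = "REQUESTING"
    REQUESTED = "REQUESTED"
    INVALID = "INVALID"
    MATERIALIZED = "MATERIALIZED"
    DELIVERABLE = "DELIVERABLE"


class ResourceOrigin(Enum):
    CSS_PROPERTY = "CSS_PROPERTY"
    CSS_IMPORT = "CSS_IMPORT"
    STYLE_ATTRIBUTE = "STYLE_ATTRIBUTE"
    TAG = "TAG"
    OTHER = "OTHER"


@dataclass
class Resource:
    uri: str                          # URI of the original resource
    mime_type: str
    origin: ResourceOrigin
    charset: Optional[str] = None     # only set for textual resources
    state: ResourceState = ResourceState.PENDING  # assumed initial state
    blob_id: Optional[str] = None     # hypothetical reference to the blob
    uuid: str = field(default_factory=lambda: str(uuid4()))


# A style sheet discovered in a <link> tag, before its data has arrived.
css = Resource(
    uri="https://www.example.com/site.css",
    mime_type="text/css",
    origin=ResourceOrigin.TAG,
    charset="utf-8",
)
```

In this sketch a resource without data simply has no blob reference yet; how unblu represents that case internally is not specified here.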
The Blob Object
A ‘blob’ contains the actual data of the resource. There are two kinds of blobs:
- Basic blobs
- Typed blobs
In addition there exists a “dummy” blob, which is marked with a specific id:
Information provided by basic blobs:
- checksum (currently a CRC32 checksum)
- length (in bytes)
- creation date
- binary data
Information provided by typed blobs:
Same as basic blobs, plus:
- mime type
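The two blob field lists can be illustrated with a pair of small classes. This is a sketch under assumptions: the class names `Blob` and `TypedBlob` and the default MIME type are hypothetical, and the checksum is computed with Python's standard `zlib.crc32` only because the text names CRC32 as the current checksum.

```python
import zlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Blob:
    data: bytes  # the binary data
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def checksum(self) -> int:
        # CRC32 over the binary data
        return zlib.crc32(self.data)

    @property
    def length(self) -> int:
        # length in bytes
        return len(self.data)


@dataclass(frozen=True)
class TypedBlob(Blob):
    # Assumed default; a typed blob adds the MIME type to the basic fields.
    mime_type: str = "application/octet-stream"


style = TypedBlob(b"body { margin: 0 }", mime_type="text/css")
```

Deriving checksum and length from the data, rather than storing them, keeps the sketch consistent by construction; a real store may well persist them instead.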
‘Resource storage’ is split into three areas:
- Resource store
- Blob store
- Resource table
The first two hold their respective objects (see above). The resource table is the lookup table used to find a resource when its backend URI is known but its UUID is not. By default, all three stores are located in memory. The resource and resource table stores have session scope, that is, their content is ‘dropped’ when the session ends. The blob store, on the other hand, is global, so the same resource appearing in multiple sessions is stored only once in memory.
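The difference between session scope and global scope can be sketched as follows. This is an illustrative model only: the names `BLOB_STORE`, `store_blob`, and `Session` are hypothetical, and keying blobs by CRC32 checksum and length is an assumption made to keep the example short (a real store would need an identity that is safe against checksum collisions).

```python
import zlib
from typing import Dict, Optional, Tuple

BlobKey = Tuple[int, int]  # assumed key: (CRC32 checksum, length in bytes)

# Global blob store, shared by all sessions.
BLOB_STORE: Dict[BlobKey, bytes] = {}


def store_blob(data: bytes) -> BlobKey:
    """Store data once; identical content maps to the same key."""
    key = (zlib.crc32(data), len(data))
    BLOB_STORE.setdefault(key, data)
    return key


class Session:
    """Holds the two session-scoped stores."""

    def __init__(self) -> None:
        self.resources: Dict[str, BlobKey] = {}   # resource UUID -> blob key
        self.resource_table: Dict[str, str] = {}  # backend URI -> resource UUID

    def add(self, uuid: str, uri: str, data: bytes) -> None:
        self.resources[uuid] = store_blob(data)
        self.resource_table[uri] = uuid

    def find_by_uri(self, uri: str) -> Optional[BlobKey]:
        # Resolve a known backend URI without knowing the UUID.
        uuid = self.resource_table.get(uri)
        return self.resources.get(uuid) if uuid is not None else None

    def end(self) -> None:
        # Session-scoped content is dropped; the global blob store is untouched.
        self.resources.clear()
        self.resource_table.clear()


logo = b"fake image bytes"
first, second = Session(), Session()
first.add("uuid-1", "https://www.example.com/logo.png", logo)
second.add("uuid-2", "https://www.example.com/logo.png", logo)
blobs_after_two_sessions = len(BLOB_STORE)
first.end()
```

Both sessions reference the same blob key, so the image data is held in memory only once, and ending one session does not remove it.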
Resource Request URI
When the Resource History is turned off, URIs arriving at the agent browser point directly to the original image / element on the original backend web server. The agent browser requests such URIs directly.
On the other hand, when the Resource History is turned on, visuals arriving at the agent browser do not contain URIs pointing to the original backend web server. Instead, the URIs are converted to a specific format pointing to the collaboration server.
Resource URI with the Resource History turned on:
Note that the resource UUID is never sent to the collaboration server when the resource is requested from an agent browser, since the so-called ‘fragment’ is only used within the browser. The reason to have it in our URIs is that it is appended, e.g., to a URI contained in a CSS. Thus, when the CSS is parsed on the server, the resource UUID can be extracted and the relevant information (especially inbound / outbound references) retrieved and processed. As soon as the CSS arrives at the agent browser, the browser issues its request without the resource UUID, which is not required for the request.
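The fragment behavior described above is standard for URIs in general: the part after `#` stays in the browser and is not part of the request. The sketch below uses Python's standard `urllib.parse.urldefrag` to show the split. The URI is entirely hypothetical (host, path, and fragment value are made up) and does not reflect unblu's actual resource URI format.

```python
from urllib.parse import urldefrag

# Hypothetical URI as it might appear inside a CSS file; not unblu's real format.
uri_in_css = "https://collab.example.com/blobs/1234#resource-uuid-5678"

# What a browser would request, and what it keeps to itself.
request_url, fragment = urldefrag(uri_in_css)
```

Here `request_url` is the part sent over the network, while `fragment` is only available to code that sees the full URI, such as a server-side CSS parser.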
The fact that it is actually the blob being retrieved and not the surrounding resource is important. Since blobs are stored only once, the agent browser also only has to retrieve them once (and then have them locally cached). If there are many resources (e.g., many URIs) with the same content (blob) then caching is extremely effective. In extreme cases it is possible that the agent browser can load a web page faster than the visitor browser. This can happen if the original webpage contains hundreds of images with differing URIs but always the same data. In the agent browser, all of those resources would have the same blob, resulting in the same URI and thus would be retrieved only once.
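The caching effect can be made concrete with a small calculation. This sketch assumes a hypothetical `blob_uri` function that derives the URI purely from the blob content (here CRC32 and length); the real URI scheme is not shown in this document.

```python
import zlib


def blob_uri(data: bytes) -> str:
    # Assumed scheme: the URI depends only on the blob content.
    return f"/blob/{zlib.crc32(data):08x}-{len(data)}"


spacer = b"identical image bytes"

# The original page references the same image under 300 different URIs.
visitor_uris = [f"https://www.example.com/img/spacer_{i}.png" for i in range(300)]

# On the agent side, every one of them is rewritten to the URI of its blob.
agent_uris = {uri: blob_uri(spacer) for uri in visitor_uris}

visitor_requests = len(set(visitor_uris))        # distinct URIs on the visitor side
agent_requests = len(set(agent_uris.values()))   # distinct URIs on the agent side
```

With these assumptions the visitor browser sees 300 distinct URIs while the agent browser sees a single one, which it can fetch once and serve from its local cache afterwards.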
Activating Resource Histories
In unblu, resources are handled fundamentally differently depending on whether the Resource History is turned on or off. The purpose of the Resource History (when turned on) is to store all resources that are visible on the visitor browser.
How to Activate Resource History
In your configuration properties file, add the following line:
The following configuration is optional: If set to true, missing resources will be loaded by the server and thus require the collaboration server to be able to access the backend web server (which is not always desired or possible).
Behavior with the Resource History switched off
Without the Resource History enabled, resource URIs in visuals are left as-is and transferred unchanged from the visitor browser to the agent browser. Thus, the resources themselves are neither transferred to the server nor stored anywhere. Instead, the agent browser requests each resource directly from the original backend web server on demand.
Note: Even though the URIs are passed on as-is, they are checked on the collaboration server and only let through if they correctly point to a resource on the backend web server. Thus, it is not possible to ‘inject’ a link to an arbitrary resource on an unknown web server; at the very least, such a resource will not be requested by the agent’s browser.
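One simple way to picture such a check is a host allowlist. The sketch below is an assumption about how a check of this kind could look, not a description of unblu's implementation: the set `ALLOWED_BACKEND_HOSTS` and the function `is_allowed` are hypothetical names.

```python
from urllib.parse import urlsplit

# Hypothetical list of backend hosts that resource URIs may point to.
ALLOWED_BACKEND_HOSTS = {"www.example.com"}


def is_allowed(uri: str) -> bool:
    """Let a URI through only if it is an http(s) URI on a known backend host."""
    parts = urlsplit(uri)
    return (
        parts.scheme in ("http", "https")
        and parts.hostname in ALLOWED_BACKEND_HOSTS
    )


good = is_allowed("https://www.example.com/img/logo.png")
injected = is_allowed("https://unknown.example.net/tracker.png")
script = is_allowed("javascript:alert(1)")
```

A URI that fails the check would simply not be forwarded to the agent browser, so the agent browser never requests it.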
Note: This is the fastest way to access resources.
Behavior with the Resource History switched on
With the Resource History turned on, the agent browser only interacts with the collaboration server. Nothing is requested from anywhere else. All resources must therefore be uploaded to the server once they are discovered in a visual (e.g., as an image or stylesheet link tag).
This configuration provides the maximum level of security. If the visitor’s browser sends a link to some other file, the agent’s browser will not be able to resolve that resource - that is, it will request the resource from the collaboration server which has no such resource and returns a ‘404 not found’ message.
Note that browser caching is a problem for this approach. Typically, resources are transferred to unblu from, e.g., a reverse proxy through which all data to the end user passes. Usually, unblu only monitors traffic once a session has been started. An end user may therefore browse the web site, retrieve images and stylesheets, and have them stored in the browser cache before the session starts. Once the session starts, these resources no longer pass through the reverse proxy and thus do not get transferred to unblu. unblu detects this and sends the visitor browser commands to re-request such files. However, this takes some time, so performance is typically slower with the Resource History enabled than without.
Depending on the resource type, a resource is either ‘processed’ before being used or used in its raw / original form. A typical (and currently the only) processor is the CSS processor.
The main purpose of the CSS processor is to identify URI reference locations and convert them to unblu resource URIs. The collaboration server features two kinds of CSS processors:
- Full CSS parser-based
- ‘Simple’ regex-based
The ‘Full CSS parser-based’ processor is used when the Resource History is turned on. It simulates a CSS parser as present in the browser and filters the CSS. The full CSS parser thus not only scans for URIs but also drops unknown / unsafe / problematic CSS rules.
The ‘Simple regex-based’ CSS processor only scans for URIs and verifies / replaces them.
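The idea behind a regex-based processor can be sketched in a few lines. This is a simplified illustration, not unblu's processor: the pattern `CSS_URL`, the function `rewrite_css_urls`, and the `to_blob_uri` converter are hypothetical, and the sketch only covers `url(...)` references (it ignores, for example, `@import "file.css"` without `url()`).

```python
import re
from typing import Callable

# Matches url(...) with optional single or double quotes around the URI.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def rewrite_css_urls(css: str, convert: Callable[[str], str]) -> str:
    """Replace every URI found in url(...) with the result of convert(uri)."""

    def replace(match: re.Match) -> str:
        quote, uri = match.group(1), match.group(2)
        return f"url({quote}{convert(uri)}{quote})"

    return CSS_URL.sub(replace, css)


def to_blob_uri(uri: str) -> str:
    # Stand-in for the real verification and conversion step.
    return "/blob/" + uri.replace("/", "_")


css = "body { background: url('img/bg.png'); } .icon { background: url(icon.gif) }"
rewritten = rewrite_css_urls(css, to_blob_uri)
```

Unlike a full parser, this approach leaves every rule in place and touches only the URIs, which matches the description of the simple processor above.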
Resources can have incoming (inbound) or outgoing (outbound) references. A CSS resource, for example, can have outbound references to background images and inbound references to HTML files or other CSS resources.
These references have an impact once a resource changes. Resources can have the following states:
- PENDING: A resource with a certain URI has been seen, e.g., in a visual, but is not present in the resource store. It is expected to arrive on the server later on.
- REQUESTING: A resource is supposed to have arrived on the server but did not arrive (so far). It is requested from the client again.
- REQUESTED: A missing resource has been requested.
- INVALID: A missing resource is invalid - e.g., not available, 404 or similar.
- MATERIALIZED: A resource has arrived but may need processing.
- DELIVERABLE: A resource has arrived and has been processed. It is ready for delivery.
If a resource changes its state, e.g., from PENDING to DELIVERABLE, its data typically changes as well (that is, the contained blob has changed). If the blob changes, the URIs in visuals (and in any dependent resources, such as CSS files) need to be updated. A state update therefore always triggers a cascade of resource updates. In the simplest case, there are no dependencies and nothing needs to be done. In a more involved case, a changed picture can lead to reprocessing of the imported CSS, the default CSS, and the part of the visual where the default CSS was included in the HTML.
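The cascade can be modeled as a walk along inbound references. The following sketch assumes a hypothetical `inbound` mapping from each resource to the set of resources (or visual parts) that reference it; the function name `cascade` and the identifiers in the example are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def cascade(changed: str, inbound: Dict[str, Set[str]]) -> List[str]:
    """Return everything that needs an update, breadth-first from `changed`."""
    order: List[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        # Visit referrers in sorted order so the result is deterministic.
        for referrer in sorted(inbound.get(current, ())):
            if referrer not in seen:
                seen.add(referrer)
                order.append(referrer)
                queue.append(referrer)
    return order


# picture.png is used by imported.css, which is imported by default.css,
# which in turn is included in a part of the visual.
inbound = {
    "picture.png": {"imported.css"},
    "imported.css": {"default.css"},
    "default.css": {"visual-part"},
}

to_update = cascade("picture.png", inbound)
nothing_to_do = cascade("standalone.png", inbound)
```

The `seen` set guards against reference cycles (for example, two style sheets importing each other), so each dependent is reprocessed at most once per change.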