The Sydonie document model
An FRBR approach
Note: for a quick introduction to FRBR, see:
- [Wikipedia has a short article on FRBR that gives a quick view of its concepts]
- [An 8-page PDF that explains the main concepts of FRBR in more depth]
Document Model
Our approach uses some principles of the FRBR report: we propose a Document Model based on the Work, Expression and Manifestation entities as defined by that report. Our model considers the Document as the tree composed of a Work entity and its Expression and Manifestation descendants. Also inspired by FRBR, our Data and Metadata model associates each entity node with the appropriate information using RDF-like statements. Each piece of information about a document is therefore placed at the Work, Expression or Manifestation level, according to the FRBR guidelines.
Using the guidelines for group 1 entities from the FRBR report, we define a document model with the Work, Expression and Manifestation entity levels and a pointer to the resource itself. They represent intellectual or physical aspects of a document:
- The intellectual Work is represented through the Work entity. It contains information relevant to all the versions of the document, such as the author of the original Work, when it was first published, etc;
- Expression entities can represent a translation, an abstract or other versions of the same Work entity. Expressions contain metadata describing the variant of the Work, for example the translator’s name or the language used;
- Manifestation entities represent the different formats, sizes, codecs, etc. that materialize a given Expression. They may contain metadata such as the resolution of an image or the bit rate of an MP3 file. Each Manifestation includes a pointer to a content resource containing the specific version.
For a given intellectual Work, these different entities are represented in a tree structure, as shown in the figure below. Instead of considering each version as a separate document, as most systems do, our model considers a document to be the complete tree. This model allows for language negotiation and content negotiation when a document is requested. It also makes it easier to create a navigational map of the other Expressions or Manifestations of the same Work, thus providing the user with links to the other available versions.
Like FRBR, our model groups the versions, translations, formats, etc. of a document in a single tree structure using Work, Expression and Manifestation entities. The Work entity represents the intellectual creation and therefore contains only metadata: for example, the first publication date, information about the author, and additional metadata depending on the type of document.
Expression entities represent the various realizations of a Work. As defined in FRBR, Expressions are therefore intellectual entities containing metadata. A “reference Expression” is the original Expression (for example, the English Expression in the figure above). In the case of a translation, an Expression entity may contain the title of the translation of the Work, its language, information about the translator, and other specific information. Expression-to-Expression relationships state how a given Expression is derived from the reference Expression, with relationships such as “is a translation of” or “is an abridged version of”.
A Manifestation entity is an embodiment of an Expression of a Work. Manifestation entities can correspond to various formats, image sizes or media encodings (such as HTML, PDF, PNG, MP3, WebM, etc.). A “reference Manifestation” is the occurrence from which the other ones were created (for example, the HTML Manifestation in Figure 2). For image documents, the reference Manifestation would be the original image that was used to create smaller ones such as thumbnails. Manifestation entities can contain information such as content type and file size, but their main role is to carry the content of the document, using a pointer to a content resource.
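The tree described above can be sketched in a few lines of Python. This is only an illustration, not Sydonie's implementation: the class and field names (`metadata`, `resource`, `is_reference`, `derived_from_reference`), the sample values and the file paths are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Manifestation:
    content_type: str                  # e.g. "text/html" or "application/pdf"
    resource: str                      # pointer to the content resource (hypothetical path)
    is_reference: bool = False         # True for the Manifestation the others were made from


@dataclass
class Expression:
    language: str
    metadata: dict = field(default_factory=dict)
    is_reference: bool = False         # True for the original Expression
    derived_from_reference: Optional[str] = None   # e.g. "is a translation of"
    manifestations: list = field(default_factory=list)


@dataclass
class Work:
    metadata: dict = field(default_factory=dict)
    expressions: list = field(default_factory=list)

    def branches(self):
        """Yield every (Work, Expression, Manifestation) branch of the tree."""
        for expression in self.expressions:
            for manifestation in expression.manifestations:
                yield self, expression, manifestation


# A made-up article with an English reference Expression and a Spanish translation.
article = Work(
    metadata={"author": "A. Author", "first_published": 2010},
    expressions=[
        Expression("en", {"title": "An Example"}, is_reference=True, manifestations=[
            Manifestation("text/html", "content/en/article.html", is_reference=True),
            Manifestation("application/pdf", "content/en/article.pdf"),
        ]),
        Expression("es", {"title": "Un ejemplo", "translator": "T. Translator"},
                   derived_from_reference="is a translation of", manifestations=[
            Manifestation("text/html", "content/es/article.html", is_reference=True),
        ]),
    ],
)

# The document is the whole tree, so it can list every available version itself.
available = [(e.language, m.content_type) for _, e, m in article.branches()]
```

Walking the branches of a single Work is what makes a navigational map of the other versions straightforward to build.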
Document views are therefore computed on a specific branch of the document tree, using metadata from the Work entity down to the Manifestation entity. When a user requests a document (considered here as the whole tree), the branch used to create the view depends on which type of entity is requested. If the request occurs at the Manifestation level, for example a request for the PDF Manifestation of the Spanish Expression of a document, then the branch used is that of the requested Manifestation. If the entity level requested is Work or Expression, negotiation occurs. As outlined in Cool URIs for the Semantic Web (Sauermann, Cyganiak, Ayers & Völkel, 2008), a W3C note published in December 2008, HTTP language and content negotiation can be used to serve the content that best matches the client’s preferences. Keeping in mind that a client is always served a view on a branch, access to a document through the Work or Expression entity level follows this algorithm:
- At the Work level, language negotiation is used to decide which Expression of the document to use;
- At the Expression level, content negotiation is used to decide which Manifestation to serve. A typical use case would be a web browser accessing a resource and being served HTML content, whereas a robot would be served XML or RDF content;
- At the Manifestation level, the system can directly serve the content, i.e. the resource file.
Once the Manifestation to use is chosen, the system uses the data and metadata from each entity level to build the rendered view. Since a document is a tree with data and metadata attached to its nodes, the document itself knows which translations and formats are available.
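The three-level algorithm above can be sketched as follows. This is a simplified illustration under several assumptions: the document is reduced to a nested dictionary (language, then content type, then resource path), the header parser only handles plain values with optional `q` weights (no wildcards or language ranges), and falling back to the first entry as the "reference" when nothing matches is a choice made for this example, not a behaviour stated in the text. The function names are hypothetical.

```python
def parse_preferences(header):
    """Parse an Accept-style header into values ordered by descending q weight."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        value = parts[0]
        if not value:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if q > 0:                          # q=0 means "not acceptable"
            prefs.append((-q, position, value))
    return [value for _, _, value in sorted(prefs)]


def negotiate(available, header):
    """Return the preferred available key, else the first (reference) key."""
    for wanted in parse_preferences(header):
        if wanted in available:
            return wanted
    return next(iter(available))


def select_branch(document, accept_language="", accept="",
                  language=None, content_type=None):
    """Pick the branch to render.

    language=None     -> request at the Work level (language negotiation)
    content_type=None -> request at the Expression level (content negotiation)
    both given        -> request at the Manifestation level (served directly)
    """
    if language is None:
        language = negotiate(document, accept_language)
    manifestations = document[language]
    if content_type is None:
        content_type = negotiate(manifestations, accept)
    return language, content_type, manifestations[content_type]


# Hypothetical document: the first entry at each level plays the reference role.
document = {
    "en": {"text/html": "en/article.html", "application/rdf+xml": "en/article.rdf"},
    "es": {"text/html": "es/article.html", "application/pdf": "es/article.pdf"},
}

# A browser asking for the Work, preferring Spanish.
browser = select_branch(document, accept_language="es, en;q=0.8", accept="text/html")
# A robot asking for the English Expression, preferring RDF.
robot = select_branch(document, language="en",
                      accept="application/rdf+xml, text/html;q=0.5")
```

A production system would rely on a complete HTTP negotiation implementation instead of this minimal parser.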
Data and Metadata Model
In the document tree, each entity node carries the information it is associated with. The specified information may be metadata or content data, depending on the application model. Since Work and Expression entities represent intellectual information, the information they carry is metadata about the document (at the Work level) or about a specific version (at the Expression level). Having a model that contains data, metadata and a pointer to the content resource avoids information redundancies.
However, the proposed model must be applicable to any type of document, which means keeping the entity nodes as generic as possible. We need a way to model the various types of information each entity node may carry. To provide a generic way to manage the information attached to a node, our approach uses an RDF-like model. Each entity node has a set of predicates. The arc for each predicate points to an object modeling its value. As in RDF, data is then represented as triples (subject, predicate, object) where:
- Subject is an instance of a document entity node, i.e. a Work, Expression or Manifestation node.
- Predicate is the name of the relation, or, in terms of OO concepts, the name of the attribute.
- Object is the value associated with the predicate. In the implementation detailed later in this article, it is an object (in the OO sense). Its class models the information it represents.
The modeled data attached to an entity node may be of any kind. It may be scalar data such as text, or complex data such as a postal address. The figure below illustrates this model in the case of an article, using the document tree shown in Figure 1 and showing a detailed view of one branch.
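As a rough illustration of the triple model, the sketch below attaches typed objects to an entity node and enumerates the resulting (subject, predicate, object) triples. The classes `Text`, `PostalAddress` and `EntityNode`, the predicate names and the sample values are assumptions made for this example; they are not taken from Sydonie's code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Text:
    """Scalar information."""
    value: str


@dataclass(frozen=True)
class PostalAddress:
    """Complex information modeled by its own class."""
    street: str
    city: str
    country: str


@dataclass
class EntityNode:
    level: str                                       # "Work", "Expression" or "Manifestation"
    statements: dict = field(default_factory=dict)   # predicate -> list of objects

    def add(self, predicate, obj):
        """Attach one more object to the given predicate of this node."""
        self.statements.setdefault(predicate, []).append(obj)

    def triples(self):
        """Yield (subject, predicate, object) triples; the subject is the node itself."""
        for predicate, objects in self.statements.items():
            for obj in objects:
                yield (self, predicate, obj)


work = EntityNode("Work")
work.add("author", Text("A. Author"))
work.add("publisher_address", PostalAddress("1 Example Street", "Exampleville", "Nowhere"))

summary = [(s.level, p, type(o).__name__) for s, p, o in work.triples()]
```

Because the object is an ordinary class instance, adding a new kind of information only requires a new class, while the node itself stays generic.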
In order to use this model in various applications, we need to provide ways to define classes of documents managed by an application. A class of documents is the definition of the type of data each entity node may contain. To define such a class, one specifies, for each entity level, the list of accepted predicates and the type of information each predicate points to. Using this model, any kind of document may be defined.
To do so, our model defines the notion of attribute. An attribute is a predicate together with the type of its associated object, called the attribute type. A class of documents is thus defined by specifying the list of accepted attributes for each node. An XML formalism based on our model enables the definition of a class of documents.
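The idea of a class of documents can be illustrated with a small sketch. The text mentions an XML formalism, which is not shown here; as a stand-in, the example below uses a plain Python dictionary that lists, for each entity level, the accepted predicates and their attribute types. The `ARTICLE_CLASS` definition, its predicate names and the `check_statement` helper are hypothetical.

```python
# For each entity level: accepted predicate -> attribute type (stand-in for the XML definition).
ARTICLE_CLASS = {
    "Work": {"author": str, "first_published": int},
    "Expression": {"title": str, "language": str, "translator": str},
    "Manifestation": {"content_type": str, "file_size": int},
}


def check_statement(document_class, level, predicate, obj):
    """Check that (level, predicate, obj) is allowed by the class of documents."""
    accepted = document_class.get(level, {})
    if predicate not in accepted:
        raise KeyError(f"{predicate!r} is not accepted at the {level} level")
    expected = accepted[predicate]
    if not isinstance(obj, expected):
        raise TypeError(
            f"{predicate!r} expects {expected.__name__}, got {type(obj).__name__}"
        )
    return True


# Accepted: a translator is declared at the Expression level.
ok = check_statement(ARTICLE_CLASS, "Expression", "translator", "T. Translator")

# Rejected: the same predicate is not part of the Work level in this class.
try:
    check_statement(ARTICLE_CLASS, "Work", "translator", "T. Translator")
    rejected = False
except KeyError:
    rejected = True
```

Defining another class of documents, for instance for images, would then only mean writing another dictionary of this shape.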
Benefits
Since information is attached to the highest possible entity level (Work, Expression or Manifestation), information redundancy is avoided. For example, information specified at the Work level is used by all its Expressions. In the above example of a short story, when a translation is created, information about the author and the original publication is already present at the Work level and is therefore automatically used by the new Expression. Similarly, when a PDF version is added, i.e. when a new Manifestation is created, the data and metadata at the Work and Expression levels are already present for that document and are reused. This process ensures that all versions of a document carry the same information and avoids redundancies between versions. The consistency of the data and metadata of a document and its variant forms is therefore improved.
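The reuse described above can be made concrete with a minimal sketch in which each entity level is reduced to a dictionary and a view is built by merging the levels of one branch. The `branch_view` helper and all sample values are assumptions made for illustration.

```python
def branch_view(work, expression, manifestation):
    """Merge the information of one branch, from the Work down to the Manifestation."""
    view = {}
    for node in (work, expression, manifestation):
        view.update(node)
    return view


# Information stored once, at the highest relevant level.
work = {"author": "A. Author", "first_published": 2010}
english = {"language": "en", "title": "An Example"}
english_html = {"content_type": "text/html", "resource": "en/story.html"}

# Adding a translation and a PDF only adds what is specific to them.
spanish = {"language": "es", "title": "Un ejemplo", "translator": "T. Translator"}
spanish_pdf = {"content_type": "application/pdf", "resource": "es/story.pdf"}

original_view = branch_view(work, english, english_html)
translated_view = branch_view(work, spanish, spanish_pdf)
```

Both views report the same author and publication year without either value being copied into the Expression or Manifestation nodes.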