CellML.org - Meeting Minutes 24 April 2001

CellML Logo

Meeting Minutes 24 April 2001

Melanie's Metadata Madness

Get
the PDF!

Author:
          Melanie Nelson (Physiome Sciences Inc.)
Contributor:
          Warren Hedley (Bioengineering Research Group, University of Auckland)

1  Introduction

These meeting minutes were written to document some of my philosophy and methods for metadata in CellML. They are intended to serve as a resource to the poor soul who has to take over primary responsibility for metadata in CellML, and to answer the "Was she crazy?" sorts of questions that will no doubt pop up frequently as work progresses.

These minutes are split into four main sections:

  • Section 2 discusses some basic RDF issues.
  • Section 3 discusses the Dublin Core.
  • Section 4 discusses the CellML metadata data model.
  • Section 5 discusses some specific serialization issues.

2  RDF

Information about RDF can be found on the W3C's Semantic Web Activity Page.

2.1  What is RDF?

There is no doubt about it: RDF is nowhere near as intuitive as XML is. It is pretty clear what the purpose of XML is and how to go about using XML to solve problems. Not so for RDF. However, with a little patience, I think most people can understand what RDF is and what it hopes to accomplish. With a little more work, your average XML geek can figure out how to use and create RDF vocabularies, too.

So what is RDF? Well, the easy answer is that it is the W3C's recommendation for handling metadata on the web. But that doesn't tell you much. RDF stands for "Resource Description Framework", and that really does describe what RDF is. It is a framework that allows you to store descriptions (i.e., metadata) about resources. A resource can be literally anything. For the purposes of CellML, resources can be the model document, the model itself, or components in the model.

2.2  Why Bother With RDF?

RDF by itself does not allow people to store metadata. It merely provides a standard framework onto which various groups can hang their metadata vocabularies. Some benefits of having this standard framework are:

  • It provides a common attribute=value data model for the metadata. All metadata expressed in RDF can be presented as a series of attributes (i.e., properties of the resource) and their values. For instance, one attribute:value pair for a CellML model might be species=Mus musculus.
  • It provides an extensible method for storing metadata of increasing complexity. Some metadata properties will have simple values, such as the species property shown above. Other metadata properties will have complex values. In the latter case, the value of the metadata property is itself considered a resource, and additional metadata properties are stored about it. This is made more clear by an example. Consider the case of the model creator property. This could be given a simple value of the creator's name, such as John Doe. However, it is more powerful to consider the value of the creator property to be a new resource (the person identified by the name "John Doe"). This allows the person's name to be stored as metadata about the new resource. Why bother doing this? Because it also allows additional metadata to be stored about the person, such as the person's mailing address, phone number, etc. Most importantly, we don't have to know ahead of time what sorts of metadata processing software might want to store about the person. If a particular application wants to store the person's favorite color, it can do so. Other applications might not recognize the meaning of the particular element that stores the favorite color, but they will be able to understand that it is some sort of property about the resource (i.e., person) that is the creator of the model. This allows the application to handle the unknown metadata gracefully (I suspect most applications would at least be able to present the attribute=value pair to the user).
  • It makes it possible for applications that don't know anything about CellML to understand our metadata. OK, this one isn't a reality yet, but it is part of Tim Berners-Lee's vision of a semantic web. Eventually, search engine tools could become RDF capable. In that case, people would be able to perform much more powerful searches for information on the web. If someone wants to find all web resources created by John Doe, he/she could search explicitly for resources where creator=John Doe, instead of just searching for resources that contain the string "John Doe".
  • There are tools out there that use RDF. It is definitely true that RDF is still a fledgling technology. However, there are tools out there that parse RDF, and tools that actually use RDF to build databases, knowledge stores, and other such things. See the W3C's RDF project list for a list of tools and projects using RDF.
  • RDF provides some good basic data structures. The RDF team did a good job defining groupings for metadata values, for instance. This makes it easy to allow multiple values for a metadata property, and allows you to unambiguously specify whether the multiple values should be considered an ordered list, a list of alternative values, or just an unorder group. These data structures (called containers) can be used with any RDF element, with not additional definition by the RDF vocabulary author.

2.3  Multiple Methods for Expressing Metadata in RDF

The plethora of methods for expressing the same basic piece of information is probably Warren's biggest problem with RDF. (Well, maybe the verbosity is his biggest problem, which is saying something, given his penchant for verbose implementations....)

Why does RDF allow multiple methods for storing the same information? I think it has to do with the fundamental database design principle of generalize whenever possible, and overload your classes as much as possible. The reason this is good database design is that generalizing and overloading are both good techniques to create a flexible data model that will handle new types of information as they become necessary. If you exactly map each type of information to a column in a database table (or an element in an XML schema), new types of information will require new columns (or elements). If you have a generalized structure that handles an entire class of information, you are much more likely to be able to handle any type of information without changing your schema. I suspect that this line of reasoning explains why the basic RDF data model allows some types of information to be stored in more than one way.

Of course the downside is that someone new to RDF has incredible anxiety attacks trying to figure out which way is the "right way", when in fact, all that matters is that he/she pick a method and implement it consistently.

3  The Dublin Core

Information about the Dublin Core can be found on the Dublin Core's website.

3.1  What is the Dublin Core?

The Dublin Core is a group of metadata properties. These properties were identified as "common" across a large range of resources by a group of library science and knowledge management folks. These properties are things like creator, publisher, subject, and date. A full list, with definitions, can be found in the Dublin Core Metadata Element Set, Version 1.1: Reference Description.

The Dublin Core Metadata Initiative group has also provided a standard set of "qualifier" elements. These elements add information to the basic elements. Qualifier elements either provide type information or scheme information. Type information classifies the basic element. For instance, the date element can have a type of created, modified, valid, available, or issued. Scheme information indicates how the content of the element is encoded. For instance, the date element can have a scheme of W3C-DTF or DCMI Period. A full list of qualifiers and their allowed values can be found in the Dublin Core Qualifiers document.

It is important to note that Dublin Core does not have to be expressed in RDF. The Dublin Core elements are not elements in the XML sense. They are simply standard names and definitions for common types of metadata. However, the Dublin Core can be expressed in RDF, which is how I recommend the CellML project take advantage of this work.

3.2  Why Bother With the Dublin Core?

Not all of the Dublin Core elements are immediately applicable to CellML metadata requirements. So why should we bother with the Dublin Core? Because this set of elements is widely referenced. General purpose tools are more likely to understand the Dublin Core metadata vocabulary than any other vocabulary. Also, it makes it more obvious that certain CellML metadata properties (such as model creator) map directly to metadata properties that are found in other fields. While this may not be of great value to us right now, later on we will probably be glad we took the effort to make our metadata more generally understandable.

4  The CellML Metadata Data Model

The basic CellML metadata properties were defined in section 16 of the CellML requirements document.

In the interests of writing the serialization section, and because the data model for CellML metadata hasn't really changed since October of 1999, this section will be done last. Which probably means I won't have time to do it at all!

5  Specific Serialization Issues

This section assumes that the reader is reasonably familiar with basic RDF constructs such as containers and the rdf:resource and rdf:parseType attributes. It also assumes a certain familiarity with the Dublin Core element set and qualifiers set.

5.1  Use of <rdf:Description> elements vs. rdf:parseType attributes

Both the <rdf:Description> element and the rdf:parseType attribute (with a value of "Resource") can be used to create a new resource. When should each be used? Well, this is largely up to the discretion of the vocabulary author. I have used the rdf:parseType attribute wherever possible, because it is less verbose. However, the <rdf:Description> element must be used if you want to give the new resource an identifier, which would allow the resource to be referenced by other RDF constructs.

5.2  Use of the <rdf:value> Element

The <rdf:value> element is often used when a new resource is created inline, and some metadata is stored about it. The <rdf:value> element stores the actual value of the metadata, and additional elements can be used to store qualifying information. If a processing application does not recognize the qualifying elements, it can take the contents of the <rdf:value> element and transfer them to the containing element. For instance, in the example shown in Figure 1, an application that only understands the Dublin Core, but not the Dublin Core qualifiers, could reasonably be expected to assign the date metadata a value of 2001-04-24. An application that understands the Dublin Core qualifiers would also know that this is a creation date, encoded using the W3C-DTF format.


<rdf:RDF
    
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    
xmlns:dc="http://purl.org/dc/elements/1.0/"
    
xmlns:dcq="http://purl.org/dc/qualifiers/1.0/">
    
  
<rdf:Description about="#cellml_element_id">
    
<dc:date rdf:parseType="Resource">
      
<dcq:dateType>created</dcq:dateType>
      
<dcq:dateScheme>W3C-DTF</dcq:dateScheme>
      
<rdf:value>2001-04-24</rdf:value>
    
</dc:date>  
  
</rdf:Description>
</rdf:RDF>

Figure 1 Date metadata that uses the <rdf:value> element.


5.3  Use of Qualifiers

The Dublin Core qualifier scheme is such a good one that I copied it extensively in other areas of the CellML metadata. Any time I wanted to provide a classification for a piece of metadata, I used a type qualifier. Any time I wanted to provide information about how the metadata information is encoded, I used a scheme qualifier. These elements are named by taking the name of the containing element and appending "_type" or "_scheme".

5.4  Controlling the Value of a Metadata Property

It is often desirable to control the value of a metadata property. This is particularly true for qualifiers. For instance, knowing that the date is encoded using "W3C-DTF" doesn't do you much good if you don't know what "W3C-DTF" means. There are several ways to control the value of a metadata property in RDF. I've listed them below, with an explanation for when I think each method should be used:

  • Use the rdf:resource attribute. The rdf:resource attribute can be defined on an RDF element to indicate that the value of the element is a resource, which is identified by the value of the rdf:resource attribute. I recommend using this method to control the value of a metadata property when the value subtypes the containing resource, and the subtyping adds properties to the metadata content of the containing resource.

    "What did she just say?" I hear you cry. Let me try to explain. The containing resource is the RDF element that creates a resource to represent the thing about which we are trying to define metadata. The best example of this from the CellML metadata is the definition of a literature citation for a CellML model. The citation is represented by the <bqs:reference> element. This creates a resource to represent the journal article, or whatever it is we are trying to reference, and contains a lot of metadata about the article. One of the pieces of metadata is the <bqs:reference_type> element. This element identifies the subtype of reference: is it a journal article, book article, web resource, or what? Additional metadata properties are required for a reference based upon which subtype is identified. If it is a journal article, we need to identify the journal, the volume, and the first and last pages. If it is a book, we need to identify the publisher, edition, volume, etc. The data structure for each of these subtypes is represented in the RDF schema. That is, there is a resource (in the RDF schema) that defines the additional metadata content added to the reference resource by the subtyping. I think the best way to serialize this is to make the URI that identifies the resource (in the RDF schema) that defines the new metadata properties on the containing resource.

  • Define a controlled vocabulary and enforce it in the RDF schema. This method can be used whenever you think you know ahead of time all of the allowed values for the metadata. The Dublin Core qualifiers are good examples of this method. The Dublin Core folks have defined certain qualifiers for their elements, and defined certain allowed values for these qualifiers. An RDF schema implementing the Dublin Core could limit the values of the qualifier elements to the values defined in the DC qualifiers document. The value of the qualifier element adds information to the metadata about the containing resource, but does not add content to the metadata allowed on the containing resource.
  • Define a standard method for storing the information, and recommend some values. If you think it is likely that you don't know all the values that an element may receive, but would still like to recommend some limitations, you can define the meaning of certain values for the element in the textual documentation on your scheme, and/or in any technical specification you provide for your RDF vocabulary. A good example of this is the <cmeta:species> element. The CellML metadata specification will recommend that the value of this element be a scientific name for a species, but the accompanying RDF schema will not enforce this.

5.5  Use of the rdf:resource Attribute

An rdf:resource attribute is used to indicate that the value of the metadata property represented by the element is a resource, which is identified by the value of the rdf:resource attribute. I use this attribute in two ways:

  1. As a way to control the value of a qualifier element that is subtyping another element, as described in Section 5.4.
  2. When the value of an element is a resource that is defined elsewhere. For instance, in the BQS serialization, information can be defined about a journal, and then referenced via an rdf:resource attribute in a journal article citation.

Technically, the first case is just a special instance of the second case. The resource identified by the value of the rdf:resource attribute is in the RDF schema, but it is still a resource.

                                                                                

Valid HTML!Valid CSS!XML/XSL