CellML.org - Meeting Minutes 29 March 2001
Author:

1 Introduction

This document provides background information about metadata, its role in the CellML project, and the possible implementations of metadata in CellML.

1.1 Need for Metadata in CellML

Metadata is usually defined as "data about data". It is the supporting information that provides context to a resource. In CellML, the model (i.e., the structure and mathematics of the model) is the resource. Information that puts the model into the larger scientific context is metadata. Metadata in CellML includes information such as the literature reference that supports the model, the identity of the creator(s) of the model, and the species for which the model is relevant.

The CellML project needs metadata for two primary reasons:
Metadata in CellML can be used in many different ways, such as:
The metadata structure should be flexible and extensible, because it is almost certain that we have not thought of all possible uses of CellML metadata.

1.2 The Larger Metadata Picture

Metadata has become a bit of a buzzword lately. This is because people are starting to realise that we cannot get the maximum use out of the information stored on the web without metadata. It is currently not particularly easy to find a specific piece of information on the web, and once you have found the information, it is not easy to determine whether or not you should trust it. Metadata can address both of these problems. Therefore, there is a push to begin to incorporate metadata into web resources.

Tim Berners-Lee has been particularly active in pushing for a "semantic web", in which resources on the web would include the semantic information necessary to allow machines to understand (not just read) them. The W3C has set up a semantic web activity. Some software projects, such as Mozilla, have begun trying to take advantage of the metadata that is currently available about web resources.

The "semantic web" vision is one of the future, and not of today. Several projects are beginning to take tentative steps towards realising Tim Berners-Lee's dream, but success is by no means certain. The library science community is leading the way in implementing metadata. A consequence of this is that the tools being provided for handling metadata on the web (such as the Resource Description Framework, or RDF) have come from the knowledge management community. Like any academic discipline, that community has its own jargon, which can be a hindrance to the rest of us when we try to understand and use these tools. However, several projects are now using RDF, and a variety of tools have been created for it. These will be discussed in Section 3.2.

None of the problems faced by the nascent metadata community are insurmountable. It seems very likely that something resembling the "semantic web" will come into existence, if for no reason other than the importance of the problem it is attempting to address. Therefore, we should at least consider how we can make metadata in CellML compatible with the semantic web activity.

2 Metadata in CellML

The initial step in incorporating metadata into CellML was to determine what sorts of information modellers might want to store about their models, and what sort of information software developers might find useful to be able to store. This was done as part of the requirements gathering for version 1.0 of CellML. A list of the metadata requirements for CellML is included in the requirements document. The necessary metadata can be split into three broad categories:
3 Existing Standards for Metadata

There are two existing metadata standards that warrant our attention: the Dublin Core element set (which defines standard types of metadata) and the Resource Description Framework (which defines a means to specify metadata).

3.1 The Dublin Core Metadata

The Dublin Core Metadata Initiative came out of the library science community. It is an attempt to identify metadata that is common across a variety of types of resources, and provide a standard way to refer to this metadata. The Dublin Core group is primarily concerned with identifying the common metadata objects (such as "creator" or "copyright"), rather than specifying how the metadata should be stored. However, they have released a document entitled Using Dublin Core that gives guidelines for storing Dublin Core metadata in HTML and XML/RDF. The Dublin Core documents most relevant for the CellML project are:
Why should we care about the Dublin Core? The Dublin Core metadata element set is probably the most widely used set of metadata elements. Most projects listed on the RDF Project list use the Dublin Core metadata elements in some way or another. Using a standard vocabulary wherever possible increases the chances that our metadata will be accessible to general purpose metadata tools. For instance, someone could use the RDF Crawler to discover some basic information about CellML resources. They are more likely to be able to interpret metadata stored in a common vocabulary such as the Dublin Core element set.

There are also a growing number of metadata tools designed to work with the Dublin Core element set. For instance, the DC-Assist tool provides a browsable set of descriptions and examples for the Dublin Core elements and qualifiers. See the Dublin Core's tool section for a list of other tools.

3.2 The Resource Description Framework

The Resource Description Framework (RDF) is a W3C recommendation for storing and exchanging metadata. It specifies a general data model for metadata and provides an XML syntax for storing metadata in this data model. The companion RDF schema recommendation specifies a syntax for defining the detailed data model for a specific set of metadata. It is expected and encouraged that people will draw from a variety of RDF schema when marking up the metadata about their documents. Because all metadata stored in RDF uses the same basic data model (described below), the various vocabularies and schema that people develop are all interoperable. In fact, an RDF schema can be used to define a new metadata element that is a subtype of an element defined in a different schema.

RDF came from the knowledge representation community, and therefore has a frame of reference that is quite different from more computer science driven standards such as XML itself.
The basic data model of RDF is a directed labelled graph, which can equivalently be expressed as a "three-tuple" (or triple). Readers wanting a formal definition of this model should refer to Section 5 of the RDF Model and Syntax recommendation. What follows here is an informal explanation.

The thing about which you want to store metadata is the resource. The type of metadata you are storing is the property. The value of the metadata is either another resource (which can itself have metadata) or a literal. An RDF three-tuple contains a predicate, subject, and object. The subject is a resource, the predicate is a property, and the object is the literal or resource that is the value of the metadata.

RDF also provides grouping mechanisms that allow you to unambiguously specify the correct interpretation of multivalued objects. There are three types of grouping containers: a bag (an unordered group of objects), sequence (an ordered group of objects), and alternative (a group of objects that specify alternative values for a single-valued object).

Unlike XML (which restricts the allowed syntax of the data in a resource), RDF attempts to encode the semantics, or meaning, of the data in a resource in a machine-understandable manner. The RDF elements are all concerned with identifying the subject, predicate, and object for each metadata statement. With this information, RDF parsers can construct directed graphs and 3-tuples of the metadata, which is itself useful. The 3-tuples can be thought of as attribute-value pairs about the subject, and simply presenting these attribute-value pairs to the end user is a useful thing to do with metadata.

The information in an RDF schema allows software to do more meaningful things with the metadata. For instance, an RDF schema could define a type of metadata called "author", and declare it to be a subclass of the Dublin Core type "creator". RDF-savvy software could then infer that an author has all of the properties of a creator.
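The triple model and the "author"/"creator" inference described above can be sketched in a few lines of Python. This is only an illustration, not how any particular RDF tool works: the `Triple` type, the `ex:` identifiers, and the `infer_superproperties` helper are hypothetical, and the sketch assumes the "author is a kind of creator" relationship is expressed with RDF Schema's `rdfs:subPropertyOf`.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property (the type of metadata)
    obj: str        # a literal value, or another resource


# Hypothetical identifiers, used for illustration only.
SUB_PROPERTY_OF = "rdfs:subPropertyOf"

graph = {
    # "The book's author is John Doe."
    Triple("ex:book", "ex:author", "John Doe"),
    # Schema statement: every author is also a creator.
    Triple("ex:author", SUB_PROPERTY_OF, "dc:creator"),
}


def infer_superproperties(triples: Set[Triple]) -> Set[Triple]:
    """Return the triples plus every statement implied by sub-property links."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        links = {(t.subject, t.obj) for t in inferred
                 if t.predicate == SUB_PROPERTY_OF}
        for t in list(inferred):
            for sub, sup in links:
                if t.predicate == sub:
                    new = Triple(t.subject, sup, t.obj)
                    if new not in inferred:
                        inferred.add(new)
                        changed = True
    return inferred


closure = infer_superproperties(graph)
```

After the call, `closure` also contains the statement that the book's `dc:creator` is John Doe, which a tool that only understands the Dublin Core vocabulary could then use.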
An RDF schema can also limit the types of resources to which a particular property can be applied. For instance, a "hair_color" property would probably not be applicable to a resource of type "car". Note that the RDF Schema recommendation only has "candidate" status, and there don't seem to be many implementations of it yet.

Why should we care about RDF? The simple answer is that it is a W3C recommendation. It also provides a robust and flexible framework in which metadata can be stored, and allows people working in diverse subject areas to use each other's metadata in an interoperable way. However, it has a steep learning curve, due in large part to a difficult specification. Furthermore, because it is encoding the "subject-predicate-object" information for each piece of metadata, RDF is a bit verbose (see Section 4 for a defense of this verbosity).

Perhaps the most powerful argument for paying attention to RDF is that people are beginning to use it. The RDF home page at the W3C has a list of projects and tools using RDF. Dave Beckett also has a list of RDF resources, which includes many tools and projects. The following is a list of some of the most interesting projects from these lists:
There are at least two XSLT RDF parsers, as well as tools using Java, C, Perl, Tcl, and Prolog. Anyone implementing CellML metadata support should look through these lists and see if there is anything useful there.

4 Verbosity of RDF

A common criticism of RDF is that it is much more verbose than an "XML-only" language to store the same information. The content/markup ratio seems quite low. However, this misses the fact that the information content of an RDF statement is more than just the content of the RDF elements. It is the set of 3-tuples that these elements encode, as well as additional structure for the metadata content. These 3-tuples provide some very basic machine-understandable semantics of the information, whereas equivalent XML would only provide machine-understandable syntax.

The information that supports the semantic interpretation of metadata must be stored somewhere if processing software is going to do anything reasonable with the metadata. In XML, this semantic information is stored in the specification of the XML vocabulary, and therefore must also be coded into the processing applications' logic if the metadata is going to be machine-understandable as opposed to simply machine-readable. In RDF, more of the semantic information resides in the document itself and in the (machine-understandable) RDF schema, making the metadata machine-understandable to any RDF-enabled processor.
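As a rough illustration of how 3-tuples can be recovered from RDF markup without any vocabulary-specific logic, the following Python sketch reads a small RDF/XML fragment using only the standard library. The model URI, the creator name, and the `simple_triples` helper are hypothetical and chosen for illustration. The sketch handles only `rdf:Description` elements with literal-valued child properties, which is a small subset of the RDF syntax.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical metadata about a hypothetical model resource.
RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/models/example_model">
    <dc:creator>A. Modeller</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""


def simple_triples(rdf_xml):
    """Extract (subject, predicate, object) tuples from simple RDF/XML.

    The same loop works for any vocabulary: the subject comes from
    rdf:about, the predicate from the child element's qualified name,
    and the object from the child element's text.
    """
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; joining the
            # two parts gives the property's full URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


triples = simple_triples(RDF_XML)
```

Nothing in `simple_triples` refers to Dublin Core or to CellML; the subject, predicate, and object roles come from the RDF structure itself, which is the point the verbosity argument rests on.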
For instance, think about the simple case where we want to store the following information: "the book's author is John Doe". We could define an XML syntax for this, as shown in Figure 1. We look at this and know that it is representing the metadata about the author of a book. However, a computer could not automatically understand that. The fact that the

If we store the same information in RDF (see Figure 2), the fact that the author information is a property of the book object is explicit, because this is defined by the RDF data model. Additional semantics about the relationship between an author and a book could be provided in the RDF schema for the vocabulary indicated by the "s" namespace. In fact, we can go one step further, and use a standard RDF vocabulary, such as the Dublin Core, to maximize the utility of our metadata for other applications. This is shown in Figure 3.

The basic RDF data model also provides useful grouping semantics. Consider the example of a resource that has three creators. There are three interpretations of the meaning of this:
We could define an XML vocabulary that allows us to differentiate between these three options. However, the meaning of the elements and attributes that allow this differentiation would have to be coded into processing applications' logic. It would not be enough to simply parse the XML. If we had another type of metadata that needed the same set of options (for instance, editors), we would either have to store the differentiation twice, or generalize the XML elements that represent the two types of metadata.

We can also differentiate between these three options in RDF. The first option is stored in a construct called a bag. The second option is stored in a construct called a sequence. The third option is the default assumption, and is represented by simply repeating the RDF element that stores the creator metadata. The meaning of these constructs is specified by RDF, and an application could understand which of the three possibilities is correct for a particular instance of metadata simply by parsing the RDF. If we had another type of metadata that needed the same set of options, we could use the same three constructs to represent the different options.

5 Storing Metadata in CellML

We have three main options for storing metadata in CellML:
5.1 The CellML Philosophy

The decision about how to implement metadata in CellML needs to take into account:
5.2 Comparison of Possible Implementations

5.2.1 RDF Using All Possible Recommendations

Figure 4 shows an example of CellML creator, creation date, annotation, and biological entity metadata for a fictitious model component stored in RDF, using the Dublin Core draft recommendation for storing qualified Dublin Core metadata in RDF. Note that we have not worked out how to store information about people, so only the creators' names are provided. A more structured set of information about the creators could be included. The encoded metadata is:
The advantages of this approach are:
The disadvantages of this approach are:
5.2.2 RDF Using an Entirely New Schema

Figure 5 shows an example of the same CellML creator, creation date, annotation, and biological entity metadata stored in RDF, but using only a CellML-specific schema. We can still use the Dublin Core elements and qualifiers, but have devised our own system for encoding them in RDF. Note that we have not worked out how to store information about people, so only the creators' names are provided. A more structured set of information about the creators could be included. The encoded metadata is the same as for Figure 4.

The advantages of this approach are:
The disadvantages of this approach are:
5.2.3 A New XML Application
Figure 6 shows an example of the same CellML creator, creation date, annotation, and biological entity metadata stored in a new XML application, which is given the

The advantages of this approach are:
The disadvantages of this approach are: