Modelling in CellML
Creating a CellML Model
by Catherine Lloyd and James Lawson, 20th November, 2007.
This document offers some suggestions for translating mathematical models into CellML.
Currently (November 2007) there are two main tools available for writing CellML models: Cellular Open Resource (COR) and Physiome CellML Environment (PCEnv). Both have their own tutorials and instructions for building a model; these can be found, along with other model authoring, validation and simulation tools, on the CellML Tools page of the CellML website. Alternatively, models can be written by hand in a text editor, such as Notepad++ on Windows or Kate on Linux. Because the tools document their own workflows, this document does not give much specific information on how to implement the suggestions it makes.
Outlined below is an introduction to translating a model into CellML. Although best-practice actions will be highlighted, it should be emphasised that CellML is a flexible language, designed to allow users to express models in the manner that most suits them. Because CellML is a declarative language, as opposed to procedural languages such as C or MATLAB, the order in which elements are defined does not affect how the model is processed.
Table of Contents:
- What can CellML describe?
- Units in CellML
- Components and variables
- Global variables
- Math elements and equations
- The reaction element
- Grouping
- Connections and interfaces
- CellML 1.1 and imports
- Model Validation
- Metadata
What Can CellML Describe?
The CellML language can be used to describe models of diverse processes, including but not limited to biological systems; indeed, multiscale modelling is a forte of CellML. The cellml.org model repository contains examples of single models that describe more than one process - embedding, for example, metabolic pathways within an electrophysiological system. At present, CellML is able to describe systems of linear algebraic equations and ordinary differential equations over the real numbers.
Units
CellML requires that all variables and numbers in a model are associated with a defined unit. A set of standard units is provided: the majority are taken from the International System of Units (SI), although some non-SI units that are particularly common in biological systems are also included. Any additional units must be declared in units elements, where they are defined as combinations and variations of existing units, as in the fragment below, which defines seconds, nanomolar, and a flux in nanomolar per second.
    <units name="s">
      <unit units="second" />
    </units>
    <units name="nM">
      <unit prefix="nano" units="mole" />
      <unit units="litre" exponent="-1" />
    </units>
    <units name="flux">
      <unit units="nM" />
      <unit units="s" exponent="-1" />
    </units>
Although this method of defining a unit may appear rather verbose, it is precise and avoids the possibility of ambiguity. A model authoring tool would not normally expose the modeller to the raw CellML code, so much of this complexity and verbosity is effectively hidden.
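The same pattern extends to other derived units. The fragment below is a minimal sketch, assuming a CellML 1.0 model; the names units_example, mV, ms and mV_per_ms are illustrative choices and are not taken from any particular model, while volt and second are standard units provided by CellML.

```xml
<model name="units_example" xmlns="http://www.cellml.org/cellml/1.0#">
  <!-- millivolt: a prefixed version of the standard unit "volt" -->
  <units name="mV">
    <unit prefix="milli" units="volt" />
  </units>
  <!-- millisecond: a prefixed version of the standard unit "second" -->
  <units name="ms">
    <unit prefix="milli" units="second" />
  </units>
  <!-- mV/ms: a compound unit built from the two user-defined units above -->
  <units name="mV_per_ms">
    <unit units="mV" />
    <unit units="ms" exponent="-1" />
  </units>
</model>
```

Because CellML is declarative, the three units elements could appear in any order; mV_per_ms would resolve correctly even if it were listed first.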
Components, Variables and Connections
A component in a CellML model is a functional unit that may correspond to a physical compartment, event, or species, or it may be just a convenient modelling abstraction. A component contains variables and mathematical relationships that manipulate those variables. The following CellML fragment defines the environment component and the (global) variable time (please see Global variables below for a more detailed explanation of these terms):
    <component name="environment">
      <variable name="time" public_interface="out" units="second" />
    </component>
Global Variables
Almost every model begins with a component which is called environment. This component contains all the global variables which apply to all the components in the model - usually, this is just time. Global variables must be defined only once in a model.
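As a minimal sketch of how a global variable is shared, the fragment below defines time once in environment and makes it available to a second component. The component name membrane, the variable V and its initial value are hypothetical, chosen only for illustration; the connection element that links the two components is described under Connections and Interfaces.

```xml
<model name="global_time_example" xmlns="http://www.cellml.org/cellml/1.0#">
  <!-- time is defined exactly once, here, and exported -->
  <component name="environment">
    <variable name="time" public_interface="out" units="second" />
  </component>

  <!-- membrane (hypothetical) receives time rather than redefining it -->
  <component name="membrane">
    <variable name="time" public_interface="in" units="second" />
    <variable name="V" units="volt" initial_value="-0.075" />
  </component>

  <!-- the connection maps environment's time onto membrane's time -->
  <connection>
    <map_components component_1="environment" component_2="membrane" />
    <map_variables variable_1="time" variable_2="time" />
  </connection>
</model>
```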
Math Elements and Equations
Mathematical equations are expressed using MathML 2.0, an XML-based language which is embedded within the CellML framework. All mathematical expressions defined using MathML must be placed inside a <mathml:math> element, and any variables used in an equation must be named within a <mathml:ci> element. Similarly, all numbers used in an equation must be placed within a <mathml:cn> element and must have units associated with them. These features are illustrated in the equation below.
Often it is useful to be able to identify a particular mathematical equation with a reference ID. For example, this ID could be the original equation number taken from the published paper to designate the same equation in the code. Alternatively, an equation ID could also be helpful during the process of model validation, allowing the quick identification of a possible error in the code. IDs are assigned to math elements rather than to the equations themselves (as shown in the fragment of CellML code below); it can therefore be useful to allocate one math element per equation to allow equations to be identified by their ID.
    <math id="1" xmlns="http://www.w3.org/1998/Math/MathML">
      <apply><eq />
        <ci> C </ci>
        <apply><plus />
          <ci> A </ci>
          <ci> B </ci>
          <cn cellml:units="second"> 20.0 </cn>
        </apply>
      </apply>
    </math>
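Ordinary differential equations follow the same conventions, using the MathML diff and bvar elements. The fragment below is a sketch of dx/dt = -0.5 x inside a component; the component name decay, the variable x, the rate constant and the unit per_second are all hypothetical values chosen for illustration.

```xml
<component name="decay"
           xmlns="http://www.cellml.org/cellml/1.0#"
           xmlns:cellml="http://www.cellml.org/cellml/1.0#">
  <!-- unit for the rate constant, defined locally to this component -->
  <units name="per_second">
    <unit units="second" exponent="-1" />
  </units>

  <variable name="time" public_interface="in" units="second" />
  <variable name="x" units="dimensionless" initial_value="1.0" />

  <!-- one math element per equation, so the id identifies the equation -->
  <math id="decay_rate" xmlns="http://www.w3.org/1998/Math/MathML">
    <apply><eq />
      <!-- left-hand side: derivative of x with respect to time -->
      <apply><diff />
        <bvar><ci> time </ci></bvar>
        <ci> x </ci>
      </apply>
      <!-- right-hand side: -0.5 per_second multiplied by x -->
      <apply><times />
        <cn cellml:units="per_second"> -0.5 </cn>
        <ci> x </ci>
      </apply>
    </apply>
  </math>
</component>
```

Both sides of the equation have the same units here (dimensionless per second), which is the kind of consistency that attaching units to every number makes checkable.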
The Reaction Element
Originally CellML contained a "reaction" element which was used to describe individual reaction steps in a pathway. This included a description of the reaction kinetics, the reactants, products and any enzyme catalysts or inhibitors. However, in practice we found that implementing the reaction element in a CellML model often required the equations to be rewritten, such that they no longer reflected those in the original publication. At best this created extra work and some confusion; at worst it broke the model. Consequently, use of the reaction element is currently discouraged, and it has been proposed that it be removed from the next version of the CellML specification, i.e. CellML 1.2.
Grouping
There are two predefined types of grouping in CellML: encapsulation and containment. Encapsulation is an abstraction, a modelling convenience. Containment is used to describe the physical or geometric organisation of a model, such as biological structure. This type of grouping specifies that components are physically nested within their parent component; for example, an ion channel may be physically embedded within a membrane.
Encapsulation should be avoided where it creates unnecessary complexity. In practice, most of the models that describe signalling pathways or biochemical processes do not require encapsulation. By contrast, electrophysiological models often have activation and inactivation gates encapsulated within ion channels. It is useful to use encapsulation in this instance because gate properties are specific to individual channels; therefore they can be hidden from the rest of the model.
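The two kinds of grouping can be sketched as follows. This is an illustrative fragment only: the component names are hypothetical, and the components are left empty so that the group elements stand out.

```xml
<model name="grouping_example" xmlns="http://www.cellml.org/cellml/1.0#">
  <component name="membrane" />
  <component name="potassium_channel" />
  <component name="potassium_channel_n_gate" />

  <!-- Containment: the channel is physically embedded in the membrane -->
  <group>
    <relationship_ref relationship="containment" />
    <component_ref component="membrane">
      <component_ref component="potassium_channel" />
    </component_ref>
  </group>

  <!-- Encapsulation: the gate is hidden inside the channel -->
  <group>
    <relationship_ref relationship="encapsulation" />
    <component_ref component="potassium_channel">
      <component_ref component="potassium_channel_n_gate" />
    </component_ref>
  </group>
</model>
```

The nesting of component_ref elements expresses the parent and child relationship; the component definitions themselves stay flat at the top level of the model.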
Connections and Interfaces
The mapping of shared variables between components occurs via connections, with the directionality of each mapping defined by the interface attributes of the variables involved. A public interface exposes the variable to the component's parent and to its sibling components (that is, components at the same level of the encapsulation hierarchy); a private interface exposes the variable only to the components encapsulated beneath it, as set up by the encapsulation groups within the model. The CellML 1.1 specification allows just one connection between any two components, so all the variable mappings for a given pair of components belong in that single connection.
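A sketch of public and private interfaces working together is given below. The names sodium_channel, sodium_channel_m_gate, V and m are hypothetical, and the equations that would compute m are omitted; the point is only the direction of each interface and the single connection carrying both mappings.

```xml
<model name="interface_example" xmlns="http://www.cellml.org/cellml/1.0#">
  <component name="sodium_channel">
    <!-- V comes in from outside the channel and is passed down to the gate -->
    <variable name="V" public_interface="in" private_interface="out" units="volt" />
    <!-- m is received from the encapsulated gate -->
    <variable name="m" private_interface="in" units="dimensionless" />
  </component>

  <component name="sodium_channel_m_gate">
    <!-- the gate's public interface faces its encapsulating parent -->
    <variable name="V" public_interface="in" units="volt" />
    <variable name="m" public_interface="out" units="dimensionless" />
  </component>

  <!-- the gate is encapsulated by the channel -->
  <group>
    <relationship_ref relationship="encapsulation" />
    <component_ref component="sodium_channel">
      <component_ref component="sodium_channel_m_gate" />
    </component_ref>
  </group>

  <!-- one connection for the pair, holding every mapping between them -->
  <connection>
    <map_components component_1="sodium_channel" component_2="sodium_channel_m_gate" />
    <map_variables variable_1="V" variable_2="V" />
    <map_variables variable_1="m" variable_2="m" />
  </connection>
</model>
```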
CellML 1.1 and Imports
The primary difference between CellML 1.0 and 1.1 is the addition of the ability to import components from separate files. This feature promotes reusability of models and components and allows CellML files to be incorporated into hierarchical frameworks. For example, a complex model of a cardiac myocyte may be constructed by importing many individual models, each describing a particular process: metabolism, electrophysiology, the contractile apparatus, adrenergic signalling, etc. The use of imports eliminates the requirement for vast, monolithic models constructed by assimilating multiple models, and allows duplication of imported modules and multi-tiered hierarchies. Currently (November 2007) the CellML model repository does not support CellML 1.1 models, but a rewrite of the repository software is underway, and the new software will cater for 1.1 models.
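An import can be sketched as below, assuming CellML 1.1. The file names and the component and unit names are hypothetical; in each imported item, name is the name used in the importing model and the _ref attribute is the name used in the imported file.

```xml
<model name="cardiac_myocyte"
       xmlns="http://www.cellml.org/cellml/1.1#"
       xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- hypothetical file describing the electrophysiology -->
  <import xlink:href="electrophysiology.cellml">
    <component name="electrophysiology" component_ref="membrane" />
  </import>

  <!-- hypothetical file describing metabolism; units can be imported too -->
  <import xlink:href="metabolism.cellml">
    <component name="metabolism" component_ref="energy_metabolism" />
    <units name="mM" units_ref="millimolar" />
  </import>
</model>
```

Once imported, the components are connected to one another with ordinary connection elements, exactly as if they had been defined locally.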
Model Validation
Validation of the CellML models currently in the repository is an ongoing process. Newly coded models are run and checked in both PCEnv and COR and, where feasible, the code is checked over by a second party before the models are uploaded to the repository. Where possible, the model author may be contacted and invited to help build and curate the model, including providing the original source code of the model. This process aims to resolve the problem of potential typographical errors in the published paper, or incomplete or incorrect parameter sets. The IUPS Physiome Project aims to make models available in CellML as the work is published, eliminating human error in the translation process. This has already been achieved in the case of several model authors within the Auckland Bioengineering Institute, and it greatly improves the quality of models that we are able to provide in the repository.
Metadata
Metadata provides a context for the document and can include: the name of the model author, the date the model was created, key words related to its content (which later facilitate the process of searching for models in the repository), and information about the published paper the model was taken from (the citation). Although metadata is not required for a CellML document to be valid, we strongly recommend including all of the above. The process of adding metadata to a model has been made simple through use of a metadata editor which can be viewed when loading a model into the repository. Currently this editor is the only tool available for adding metadata to a model, unless the modeller uses the XML viewer in PCEnv to add metadata to the code itself.
Metadata can also be added to models to describe entities and components within the model, as described in the CellML Metadata Specification. This kind of metadata is used for annotation to provide context and information about the processes and entities being described by the model. For example, a variable called "a" may in fact represent a protein kinase enzyme; this information can be represented in the CellML file by annotating the variable with metadata. Alternatively, metadata can be used to add information to the model that may be useful for simulation, such as optimal integration parameters.
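The annotation of a variable might look something like the sketch below, which attaches RDF descriptions to elements through cmeta:id attributes. The model, component and id names are hypothetical, and the choice of Dublin Core properties (dc:title, dc:description) is an assumption made for illustration; the CellML Metadata Specification should be consulted for the recommended vocabulary.

```xml
<model name="annotated_example" cmeta:id="annotated_example"
       xmlns="http://www.cellml.org/cellml/1.0#"
       xmlns:cmeta="http://www.cellml.org/metadata/1.0#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <component name="signalling">
    <!-- cmeta:id gives the variable an identifier that metadata can point to -->
    <variable name="a" cmeta:id="signalling_a" units="dimensionless" />
  </component>

  <rdf:RDF>
    <!-- metadata about the model as a whole -->
    <rdf:Description rdf:about="#annotated_example">
      <dc:title>Hypothetical model title</dc:title>
    </rdf:Description>
    <!-- metadata about the variable "a" (property choice is an assumption) -->
    <rdf:Description rdf:about="#signalling_a">
      <dc:description>Protein kinase enzyme</dc:description>
    </rdf:Description>
  </rdf:RDF>
</model>
```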
For more information and examples of the specifics of creating a valid CellML document, please refer to the CellML specifications.
If you have any questions or comments regarding translating models into CellML, or this document in general, please do not hesitate to contact the CellML community through the general discussion mailing list.