Proposal: Best current practice for including external code in CellML models

Note: This document is a proposal. Its intended status is: a document describing the best current practices for including external code in CellML models.

Overview

[CellML] models are very good at describing complete mathematical models in a format which can be exchanged between model authors and users. This adds significant value to a model representation, because third parties can take the model, and use it in their preferred software packages to reproduce any results the author published.
Parts of CellML models can also be interchanged, as models are broken into components, and CellML components can be imported into other CellML models. However, these mechanisms are currently only useful when the importing model is also written in CellML.

Unfortunately, not all types of model can be adequately represented using only CellML. However, the parts that cannot be represented can often be isolated, allowing the remainder of the model to be expressed in CellML. Having part of a model expressed in CellML and other parts expressed in a more general-purpose language is still useful, because the common part of the model can be re-used more easily, either by supplying external code of a different kind or, where possible, by replacing the external code with MathML.

To avoid doubt, it should be noted that as much of a model as possible should be represented in CellML. Only those parts of a model which, by their nature, cannot be represented in CellML, should be represented using external code.

It is also hoped that this specification will encourage model developers to build up libraries of CellML-accessible external code, which can be re-used in a range of CellML models, thereby increasing the range of modelling techniques available to CellML model authors.

This document does not attempt to define the actual mechanism by which external code is called (this will depend on the type of CellML processing software being used). Instead, it allows the interface between internal and external code to be clearly defined, and leaves the specifics of how external code is embedded into the model to the CellML processing software.

Marking up calls into external code

External code calls are allowed wherever other MathML operators (such as the MathML plus operator) are allowed. Specifically, calls into external code are allowed within MathML in both components and reactions.
External code calls are expressed using the MathML csymbol element as the operator of an apply element. For example, an expression declaring that external code can compute the variable y from x1 and x2 would be written something like this:
```
<apply><eq/>
  <ci>y</ci>
  <apply>
    <csymbol definitionURL="http://www.example.org/external-code/some-noncellml-code"/>
    <ci>x1</ci>
    <ci>x2</ci>
  </apply>
</apply>
```
External code may compute more than one output simultaneously. However, a function in the mathematical sense yields exactly one output for each input. The usual way around this in mathematics is to make that one output an object, such as a vector or matrix, which contains more than one entry. Although variables in CellML 1.0 and 1.1 must have scalar real values, this requirement is not violated by allowing external code to return a vector (as a special case) and equating it to a vector of variables:
```
<apply><eq/>
  <vector>
    <ci>y1</ci>
    <ci>y2</ci>
  </vector>
  <apply>
    <csymbol definitionURL="http://www.example.org/external-code/some-noncellml-code"/>
    <ci>x1</ci>
    <ci>x2</ci>
  </apply>
</apply>
```
To simplify the task of CellML processing tools, the vector syntax must not be used when there is only one output.
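As an illustration of the markup rules above, the following is a minimal Python sketch of how a processing tool might recognise an external code call and list its inputs and outputs. It is not part of the proposal: the function names (`q`, `ci_name`, `parse_external_call`) are hypothetical, the sample equation carries an explicit MathML namespace declaration, and the sketch assumes that every argument and every output is a plain `ci` variable reference, a simplification the text above does not itself require.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def q(name):
    """Return the namespace-qualified tag name that ElementTree reports."""
    return "{%s}%s" % (MATHML_NS, name)


def ci_name(element):
    """Return the variable name held by a <ci> element."""
    if element.tag != q("ci"):
        raise ValueError("expected <ci>, found %s" % element.tag)
    return (element.text or "").strip()


def parse_external_call(equation_xml):
    """Return (definition_url, inputs, outputs) for one external-code equation."""
    root = ET.fromstring(equation_xml)
    parts = list(root)
    if root.tag != q("apply") or len(parts) != 3 or parts[0].tag != q("eq"):
        raise ValueError("expected <apply><eq/> lhs rhs </apply>")
    lhs, rhs = parts[1], parts[2]

    call = list(rhs)
    if rhs.tag != q("apply") or not call or call[0].tag != q("csymbol"):
        raise ValueError("right-hand side is not a csymbol application")
    url = call[0].get("definitionURL")
    if not url:
        raise ValueError("csymbol has no definitionURL")
    # Assumption of this sketch: every argument is a plain variable reference.
    inputs = [ci_name(arg) for arg in call[1:]]

    if lhs.tag == q("ci"):
        outputs = [ci_name(lhs)]
    elif lhs.tag == q("vector"):
        outputs = [ci_name(entry) for entry in lhs]
        # The vector syntax is reserved for two or more outputs.
        if len(outputs) < 2:
            raise ValueError("vector syntax must not be used for a single output")
    else:
        raise ValueError("left-hand side must be <ci> or <vector>")
    return url, inputs, outputs


EXAMPLE = """<apply xmlns="http://www.w3.org/1998/Math/MathML"><eq/>
  <vector><ci>y1</ci><ci>y2</ci></vector>
  <apply>
    <csymbol definitionURL="http://www.example.org/external-code/some-noncellml-code"/>
    <ci>x1</ci>
    <ci>x2</ci>
  </apply>
</apply>"""

url, inputs, outputs = parse_external_call(EXAMPLE)
print(url, inputs, outputs)
```

Because a one-element vector is rejected outright, a tool written this way only ever has to handle two output shapes: a single `ci`, or a vector of two or more.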

Best practice guidelines for CellML document authors

External code should be used only where a part of a model cannot be adequately expressed in CellML. External code is often non-portable, and using it reduces the re-usability of your model, and so it should only be used when needed.
External code should only perform the calculations that CellML is unable to perform, with the rest of the calculations expressed as MathML in the CellML model. This is important because it increases the fraction of your model that can be easily re-used by other modellers. It also means that CellML editing and visualisation software will be able to edit and visualise more of your model.
Modellers should, where feasible, separate external code into as many different sub-functions as possible. For example, if you have external code to compute y1 from x1 and x2, and y2 from x1 and x2, you should write this as two separate external function applications, unless there is a compelling reason to do otherwise (for example, when it is much more efficient to compute them together). Doing this makes it easier to modify the CellML model in the future, and allows the CellML processing software to determine the order in which expressions are evaluated, making your model more flexible.
External code should, by itself, meet [MIRIAM] requirements 1 and 2. This means that the external code should be encoded in a public, machine-readable format, and it should be valid and compilable.
The external code should be treated as part of the model. When a model represented in CellML is published, the external code should be published alongside it, unless it is part of a generally available library of external code.
The definitionURL used on csymbol elements should be a URL under the control of the author. It is not necessary for there to actually be a document accessible at the URL, as it is merely intended as a unique identifier.

Best practice guidelines for CellML processing tools

Note: CellML processing tools are not required to provide an interface for functions defined in external code. These guidelines only apply to CellML processing software which elects to provide a means to call external code.

CellML processing tools should clearly define the interface between external code and the CellML integrator.
CellML processing tools should generate an error, but not crash or perform other undefined operations, if they encounter CellML which calls external code with an incorrect number of inputs, or which de-vectors its outputs incorrectly.
CellML processing tools should allow any valid URL as the definitionURL, and should clearly define the interface used to associate external code with a particular URL.
CellML processing tools may also choose to make use of the encoding attribute of csymbol, in order to describe the format in which external code is represented.
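
This document deliberately leaves the calling mechanism to each tool, so the following Python sketch shows only one possible shape for the guidelines above: external code is associated with a definitionURL through an explicit registry, and a mismatch in the number of inputs or outputs is reported as an ordinary error. All names here (`ExternalCodeError`, `ExternalCodeRegistry`, `register`, `call`) are hypothetical and do not come from any existing CellML tool, and the registered function is a stand-in for real external code.

```python
class ExternalCodeError(Exception):
    """Raised when a model's external code call cannot be honoured."""


class ExternalCodeRegistry:
    """Associates definitionURL strings with callables and their arity."""

    def __init__(self):
        self._entries = {}

    def register(self, definition_url, func, n_inputs, n_outputs):
        # The URL is treated as an opaque identifier; it is never fetched.
        self._entries[definition_url] = (func, n_inputs, n_outputs)

    def call(self, definition_url, inputs, n_outputs):
        """Run the external code and return its outputs as a tuple."""
        if definition_url not in self._entries:
            raise ExternalCodeError(
                "no external code registered for %s" % definition_url)
        func, want_in, want_out = self._entries[definition_url]
        if len(inputs) != want_in:
            raise ExternalCodeError(
                "%s takes %d inputs, model supplies %d"
                % (definition_url, want_in, len(inputs)))
        if n_outputs != want_out:
            raise ExternalCodeError(
                "%s yields %d outputs, model expects %d"
                % (definition_url, want_out, n_outputs))
        result = func(*inputs)
        # A single output is returned bare by the external code; wrap it.
        outputs = (result,) if want_out == 1 else tuple(result)
        if len(outputs) != want_out:
            raise ExternalCodeError(
                "%s returned %d values, declared %d"
                % (definition_url, len(outputs), want_out))
        return outputs


URL = "http://www.example.org/external-code/some-noncellml-code"

registry = ExternalCodeRegistry()
# Stand-in for real external code: computes two outputs from two inputs.
registry.register(URL, lambda x1, x2: (x1 + x2, x1 * x2),
                  n_inputs=2, n_outputs=2)

y1, y2 = registry.call(URL, [2.0, 3.0], n_outputs=2)

try:
    registry.call(URL, [2.0], n_outputs=2)  # wrong number of inputs
except ExternalCodeError as error:
    print("rejected:", error)
```

Checking arity before the external code runs means a malformed model is rejected with a diagnosable message, rather than failing inside code that the tool does not control.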

References

[CellML]: CellML: its future, present and past. Prog. Biophys. Mol. Biol. 85(2-3), 433 - 450 (2004).
[MIRIAM]: Minimum information requested in the annotation of biochemical models (MIRIAM). Nature Biotechnology 23, 1509 - 1515 (2005).