A Possible Model Repository Design
RDBMS (Relational Database Management System) for metadata management
Used to find relationships between models more easily.
Has unfettered access to the model store (so it can see all models, or at least the models that their creators want shown to the public).
SVN (Subversion) for model storage
So models are versioned.
Abstraction layer that binds the above two components together
Different front ends could be built by calling the API provided by this layer.
Zope/Plone for front end presentation
Enables website users to use the repository.
Most actions will start from the abstraction layer, which provides an interface to both the database and the model storage (Subversion, or even a plain file system if one does not care about revision history). An RDBMS schema based on the CellML metadata specification (hopefully finalized on a standardized technology with a proper RDF schema) will be created at the same time. Metadata would be extracted from submitted models and inserted into the RDBMS. Changes to a model can be made via upload or Subversion check-ins. Metadata could be imported and exported in both directions, since it can be updated in either place, although one of the locations has to be authoritative (I vote for the RDF graph stored with the model in Subversion).
The naming convention should not have much impact. One possible method is to give each user their own working directory and let them arrange the directory structure however they like (according to some basic guidelines). This should not affect how the models are presented by the abstraction layer, which is where the presentation rules live. The examples below illustrate this.
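To make the extract-and-insert step a bit more concrete, here is a minimal sketch. It assumes (and this is only an assumption) that keywords are stored as plain dc:subject literals in the embedded RDF; the table layout and the function names extract_keywords, create_schema and sync_metadata are made up for illustration. It treats the file in Subversion as authoritative and simply rebuilds the database rows from it.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"


def extract_keywords(cellml_text):
    """Collect keyword literals from the RDF embedded in a model file.

    Assumption: keywords are plain dc:subject literals. The real CellML
    metadata specification may structure them differently.
    """
    root = ET.fromstring(cellml_text)
    keywords = []
    for description in root.iter(RDF + "Description"):
        for subject in description.findall(DC + "subject"):
            if subject.text and subject.text.strip():
                keywords.append(subject.text.strip())
    return keywords


def create_schema(conn):
    # Hypothetical one-table schema: one row per (file, keyword) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keyword ("
        "svn_path TEXT NOT NULL, keyword TEXT NOT NULL)"
    )


def sync_metadata(conn, svn_path, cellml_text):
    """Replace the database rows for one file with what the file itself says."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM keyword WHERE svn_path = ?", (svn_path,))
        conn.executemany(
            "INSERT INTO keyword (svn_path, keyword) VALUES (?, ?)",
            [(svn_path, kw) for kw in extract_keywords(cellml_text)],
        )


SAMPLE = """<model xmlns="http://www.cellml.org/cellml/1.1#" name="stimulus1">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:subject>cardiac</dc:subject>
      <dc:subject>stimulus</dc:subject>
    </rdf:Description>
  </rdf:RDF>
</model>"""

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    sync_metadata(conn, "svn://john/model_1/stimulus1.cellml", SAMPLE)
    print(conn.execute("SELECT * FROM keyword").fetchall())
```

Because the rows are deleted and re-inserted on every sync, the database never holds anything that is not also in the file, which is one simple way of keeping Subversion authoritative.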
Usage Example 1:
User John has a working directory in Subversion at svn://john/. He creates a subdirectory named 'model_1' (svn://john/model_1). He builds a model in CellML 1.1, and the main file main.cellml imports stimulus1.cellml and stimulus2.cellml. This is the file and directory structure so far:
svn://john
svn://john/model_1
svn://john/model_1/main.cellml
svn://john/model_1/stimulus1.cellml
svn://john/model_1/stimulus2.cellml
So far, this is considered a private model because it resides only in Subversion. This model happens to be based on a paper authored by Sally Jane and Jun Tanaka. Proper citation metadata was added to the model (via the abstraction layer, the front end, or even manually) and the generated RDF graphs were added to the files. The keywords 'cardiac' and 'physiological' were also given to the models and added to the RDF graphs residing in those files. An extra keyword, 'stimulus', was added to the stimulus files.
At this point, John could open up the model to the rest of the world via the web front end (which talks to the abstraction layer, which in turn marks the appropriate fields in the RDBMS to indicate so; alternatively the front end manages that itself. The implementation is to be discussed later). A URI following the naming convention could then be generated:
http://cellml.example.org/models/citation/jane_tanaka/
(The /citation/ segment can be named something else, and I omitted the year/month for simplicity in the examples.)
As all the models are marked with the proper metadata, each of them can be represented as a file under the above HTTP URI.
The citation index page would be quite simple at the moment, showing a brief listing of the metadata that was added to the models themselves, with links to the files. This is the basic page.
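As a rough illustration of how the citation URI in this example might be derived from the citation metadata, here is a small sketch. The rule used (take the last word of each author's name, lowercase it, join with underscores) is only my guess at a convention that reproduces 'jane_tanaka'; the names BASE, citation_slug and citation_uri are hypothetical.

```python
BASE = "http://cellml.example.org/models"


def citation_slug(authors):
    """Build the citation path segment from a list of author names.

    Assumption: the family name is the last whitespace-separated word,
    which is naive and will not hold for every naming tradition.
    """
    return "_".join(name.split()[-1].lower() for name in authors)


def citation_uri(authors, filename=""):
    """Return the citation index URI, or a file URI below it."""
    return f"{BASE}/citation/{citation_slug(authors)}/{filename}"


if __name__ == "__main__":
    authors = ["Sally Jane", "Jun Tanaka"]
    print(citation_uri(authors))
    print(citation_uri(authors, "main.cellml"))
```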
Usage Example 2:
John can also write documentation in an HTML file, complete with images, and reference it by a URI stored as the value of a dc:identifier predicate in the RDF graph of the model(s) (i.e. a pointer to the documentation that both humans and machines can read). If the path to the HTML file is relative, it is assumed to be relative to the directory in which the model resides in Subversion (i.e. svn://john/model_1/doc.html will be retrieved and made accessible at http://cellml.example.org/models/citation/jane_tanaka/doc.html, assuming permissions are given). Session files and diagrams could also be added by the front end in a similar fashion. The name resolution is done once and recorded in the database, to avoid file name collisions.
At this point, the URIs presented so far are:
http://cellml.example.org/models/citation/jane_tanaka/
http://cellml.example.org/models/citation/jane_tanaka/main.cellml
http://cellml.example.org/models/citation/jane_tanaka/stimulus1.cellml
http://cellml.example.org/models/citation/jane_tanaka/stimulus2.cellml
http://cellml.example.org/models/citation/jane_tanaka/doc.html
http://cellml.example.org/models/citation/jane_tanaka/diagram.png
http://cellml.example.org/models/citation/jane_tanaka/main.cellml.pcenv
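The resolution of a relative documentation path described in this example could look roughly like the sketch below. A plain dict stands in for the database table that records the mapping, and the names resolve_reference and publish_resource are hypothetical.

```python
import posixpath


def resolve_reference(model_svn_url, reference):
    """Resolve a dc:identifier value against the model's Subversion directory."""
    if "://" in reference:
        return reference  # already absolute, leave it alone
    directory = model_svn_url.rsplit("/", 1)[0]
    return directory + "/" + reference


def publish_resource(registry, citation_root, model_svn_url, reference):
    """Map a referenced file to a public URI and record the mapping once.

    registry is a stand-in for a database table keyed by public URI.
    """
    svn_url = resolve_reference(model_svn_url, reference)
    public_uri = citation_root + posixpath.basename(svn_url)
    # setdefault keeps the first recorded mapping and never overwrites it.
    registry.setdefault(public_uri, svn_url)
    return public_uri


if __name__ == "__main__":
    registry = {}
    root = "http://cellml.example.org/models/citation/jane_tanaka/"
    print(publish_resource(registry, root, "svn://john/model_1/main.cellml", "doc.html"))
    print(registry)
```

Recording the mapping once means a later file with the same base name cannot silently take over an already published URI.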
Usage Example 3:
Mary is another user of the system, and her workspace in SVN is 'svn://mary'. She created a directory 'a_model' and worked on a model based on the same Jane-Tanaka paper that John was working with. She created a CellML 1.0 model, simply named it 'model.cellml', and then published it as a public model. Now the index page at http://cellml.example.org/models/citation/jane_tanaka/ will also show model.cellml as another file. She could also have a documentation file pointed to by the RDF metadata, which website users can open via a link.
Usage Example 4:
Mary decided to create another model based on the same paper, and she named it 'main.cellml'. She then tries to publish the model but hits a snag: there is already a model with that filename, and the abstraction layer detects that. In order to publish that model, she either has to rename the file or treat her model as a branch (or fork, or variant) of the existing model named 'main.cellml'. Renaming is probably the simpler approach in this case, since she only has one file and it is doubtful that anyone is using it yet.
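The collision check that the abstraction layer performs here could be sketched as follows. This is only an illustration: the in-memory dict stands in for the RDBMS, and NameCollision, publish, the directory 'another_model', the new file name and the branch name 'mary_variant' are all made up.

```python
class NameCollision(Exception):
    """Raised when a published name is already taken by a different file."""


def publish(published, citation, filename, svn_url, branch=None):
    """Register a file under a citation, optionally inside a named branch.

    published maps (citation, branch, filename) to the Subversion URL.
    Returns the path relative to /models/citation/.
    """
    key = (citation, branch, filename)
    if key in published and published[key] != svn_url:
        raise NameCollision(f"{filename} is already published under {citation}")
    published[key] = svn_url
    parts = [citation] + ([branch] if branch else []) + [filename]
    return "/".join(parts)


if __name__ == "__main__":
    published = {}
    publish(published, "jane_tanaka", "main.cellml", "svn://john/model_1/main.cellml")
    try:
        publish(published, "jane_tanaka", "main.cellml",
                "svn://mary/another_model/main.cellml")
    except NameCollision as err:
        print("rejected:", err)
    # Option 1: rename the file. Option 2: publish it as a branch.
    print(publish(published, "jane_tanaka", "mary_main.cellml",
                  "svn://mary/another_model/mary_main.cellml"))
    print(publish(published, "jane_tanaka", "main.cellml",
                  "svn://mary/another_model/main.cellml", branch="mary_variant"))
```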
Usage Example 5:
John works on his version of main.cellml again, but he needs to make drastic changes to the model, so he creates a branch in Subversion in his working directory (in svn://john/model_1/branch). He also thinks that reviewers should review the model before his changes are merged back into the original file, so he exposes the new model as a branch through the website/abstraction layer by naming it 'john_branch'. The branched 'main.cellml' would then be accessible at http://cellml.example.org/models/citation/jane_tanaka/john_branch/main.cellml, and only by model reviewers. As for the stimulus files that main.cellml imports, John could either copy them into the branch or update the references in main.cellml to use the stimulus files that reside in the parent directory. This may or may not work as intended.
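To see what "use the stimulus files in the parent directory" would mean in practice, one can check how a relative import reference resolves against both the Subversion location and the published location of the branched file. The sketch below uses Python's standard urljoin; the relative reference '../stimulus1.cellml' is my assumption of what John would write.

```python
from urllib.parse import urljoin

# Locations of the branched main.cellml, taken from the example above.
SVN_BRANCH = "svn://john/model_1/branch/main.cellml"
WEB_BRANCH = ("http://cellml.example.org/models/citation/"
              "jane_tanaka/john_branch/main.cellml")

# Assumed relative import reference pointing at the parent directory.
REFERENCE = "../stimulus1.cellml"

if __name__ == "__main__":
    print(urljoin(SVN_BRANCH, REFERENCE))
    print(urljoin(WEB_BRANCH, REFERENCE))
```

In this particular layout both views resolve to the original stimulus1.cellml, because the branch sits exactly one level below the original on both sides. If the published layout and the Subversion layout ever differ in depth, the same relative reference would resolve to different places, which is one way this could fail to work as intended.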
Usage Example 6:
Website user Ming (who also writes CellML models) decides to browse models by keyword. He views http://cellml.example.org/models/keyword/stimulus/ and sees the file stimulus1.cellml. He decides that the file suits his needs, so his CellML 1.1 model can import http://cellml.example.org/models/keyword/stimulus/stimulus1.cellml. A problem, however, is that Mary decided to convert one of her models from the same paper (jane_tanaka), named one of her stimulus files stimulus1.cellml, and gave it the stimulus keyword! Now the URI http://cellml.example.org/models/keyword/stimulus/stimulus1.cellml could point to two different files. One way to distinguish between them is to assign a numeric id to each file, so that the two files have unique URIs such as:
http://cellml.example.org/models/keyword/stimulus/2/stimulus1.cellml
http://cellml.example.org/models/keyword/stimulus/9/stimulus1.cellml
These are the URIs that the keyword model index page should probably link to. However (this is up for debate), if Ming did make the mistake of using the original URI, the model with id #2 would be retrieved.
Using a URI based on the internal identifier of the CellML file could also have added benefits. The URI http://cellml.example.org/models/keyword/stimulus/2/ could show the info page about the model, with a link to the actual CellML file.
Ming could also browse the models by id, such that the URI
http://cellml.example.org/models/id/2/ will also show the informational page on the stimulus1.cellml that John created.
Searching for model files will return the id-based URI, and if a citation is desired, the citation root URI can be returned.
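The lookup rules in this example (id-based URIs are unique, a bare id shows the info page, and an ambiguous filename falls back to the lowest id) could be sketched like this. The in-memory MODELS table, the owner field and the function name resolve_keyword_path are hypothetical; only the ids 2 and 9 come from the example.

```python
MODELS = {
    2: {"filename": "stimulus1.cellml", "owner": "john",
        "keywords": {"cardiac", "physiological", "stimulus"}},
    9: {"filename": "stimulus1.cellml", "owner": "mary",
        "keywords": {"stimulus"}},
}


def resolve_keyword_path(models, keyword, path):
    """Resolve the part of the URI after /models/keyword/<keyword>/.

    Returns ("info", id), ("file", id), or None if nothing matches.
    """
    parts = [p for p in path.split("/") if p]
    candidates = sorted(i for i, m in models.items() if keyword in m["keywords"])
    if parts and parts[0].isdigit():
        model_id = int(parts[0])
        if model_id not in candidates:
            return None
        if len(parts) == 1:
            return ("info", model_id)  # e.g. .../stimulus/2/
        if models[model_id]["filename"] == parts[1]:
            return ("file", model_id)  # e.g. .../stimulus/2/stimulus1.cellml
        return None
    if len(parts) == 1:
        # No id given: fall back to the lowest id with that filename.
        for model_id in candidates:
            if models[model_id]["filename"] == parts[0]:
                return ("file", model_id)
    return None


if __name__ == "__main__":
    for path in ("2/stimulus1.cellml", "9/stimulus1.cellml",
                 "stimulus1.cellml", "2/"):
        print(path, "->", resolve_keyword_path(MODELS, "stimulus", path))
```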
Usage Example 7:
While this has not been defined yet, Ming should be able to access previous versions of the models via a URI. The URI http://cellml.example.org/models/id/2/stimulus1.xml?rev=3 is one possible candidate format.
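If the ?rev= form were adopted, the abstraction layer would need to pull the id, the file name and the revision out of the URI. A minimal sketch using only the standard library is below; the function name parse_model_uri is hypothetical, and the format itself is still just a candidate.

```python
from urllib.parse import parse_qs, urlsplit


def parse_model_uri(uri):
    """Split an id-based model URI into (model_id, filename, revision).

    revision is None when no rev parameter is given, which would mean
    the latest version.
    """
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    # Expected path layout: /models/id/<id>/<filename>
    if len(segments) != 4 or segments[:2] != ["models", "id"]:
        raise ValueError(f"not an id-based model URI: {uri}")
    rev_values = parse_qs(parts.query).get("rev")
    revision = int(rev_values[0]) if rev_values else None
    return int(segments[2]), segments[3], revision


if __name__ == "__main__":
    print(parse_model_uri(
        "http://cellml.example.org/models/id/2/stimulus1.xml?rev=3"))
```

The revision number could then be handed to Subversion, for example through something like svn cat -r 3 on the file's repository URL.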
Arguments on RDBMS
Pro:
Relational databases have been established for a long time.
It can show relationships between models much more easily than an object database like Zope DB.
Queries can be quite straightforward and are a lot easier to write.
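As an example of the kind of relationship query meant here, the sketch below finds all models that share at least one keyword with a given model. The two-table schema, the id 1 for main.cellml and the function name related_models are assumptions for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model (id INTEGER PRIMARY KEY, owner TEXT, filename TEXT);
CREATE TABLE model_keyword (
    model_id INTEGER REFERENCES model(id),
    keyword  TEXT
);
""")
conn.executemany("INSERT INTO model VALUES (?, ?, ?)", [
    (1, "john", "main.cellml"),
    (2, "john", "stimulus1.cellml"),
    (9, "mary", "stimulus1.cellml"),
])
conn.executemany("INSERT INTO model_keyword VALUES (?, ?)", [
    (1, "cardiac"), (1, "physiological"),
    (2, "cardiac"), (2, "physiological"), (2, "stimulus"),
    (9, "stimulus"),
])


def related_models(conn, model_id):
    """Return models sharing at least one keyword with the given model."""
    return conn.execute("""
        SELECT DISTINCT m.id, m.owner, m.filename
        FROM model_keyword AS mine
        JOIN model_keyword AS theirs ON theirs.keyword = mine.keyword
        JOIN model AS m ON m.id = theirs.model_id
        WHERE mine.model_id = ? AND theirs.model_id != ?
        ORDER BY m.id
    """, (model_id, model_id)).fetchall()


if __name__ == "__main__":
    print(related_models(conn, 9))
    print(related_models(conn, 2))
```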
Con:
SQL looks ugly.
It is separate from the model storage, which could cause inconsistency with the metadata residing in the models.
Data is not necessarily versioned.
Counterpoint: citations should be immutable anyway. Spelling mistakes in an author's name should not be versioned and really should be corrected asap.
The metadata stored in the RDBMS could be thought of as a snapshot.
It has the advantage that spelling mistakes can be corrected, but this means all the models in the repository will have to be synchronized with the correct spelling, and that can have an adverse effect on performance.
Arguments on SVN
Pro:
Revision/version capabilities built on an established foundation.
If the website dies, the data can in theory still be accessed.
Con:
Does not address the specific needs of CellML.
Any given CellML model with more than one component can have more than one serialization, rendering svn diff useless (there is no easy way to find the differences between revisions).
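The serialization problem can be shown with a small sketch: two files that differ as text (so svn diff reports changes) but describe the same components. The comparison below ignores the order of the top-level children of the model element and nothing else; it is a toy, not a proper CellML-aware diff, and the names freeze and model_fingerprint are made up.

```python
import xml.etree.ElementTree as ET


def freeze(elem):
    """Turn an element into a nested tuple, preserving child order."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(freeze(child) for child in elem),
    )


def model_fingerprint(cellml_text):
    """Fingerprint that ignores the order of top-level children only.

    Child order below the top level is kept, since it can be significant
    (for example inside mathematical expressions).
    """
    root = ET.fromstring(cellml_text)
    return (
        root.tag,
        tuple(sorted(root.attrib.items())),
        sorted(freeze(child) for child in root),
    )


REV_A = ('<model xmlns="http://www.cellml.org/cellml/1.1#" name="m">'
         '<component name="a"/><component name="b"/></model>')
REV_B = """<model xmlns="http://www.cellml.org/cellml/1.1#" name="m">
  <component name="b"/>
  <component name="a"/>
</model>"""

if __name__ == "__main__":
    print("text identical:", REV_A == REV_B)
    print("same components:", model_fingerprint(REV_A) == model_fingerprint(REV_B))
```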
Arguments on the abstraction layer
Pro:
It makes writing front ends much easier and gives flexibility.
Con:
Could be complicated by having to tie SVN together with an RDBMS.
Arguments on Zope/Plone
Not very relevant, I believe, as it is just a front end to the abstraction layer. It could conceivably be written in CGI, but I doubt that is desirable.