In my previous post I outlined some problems that are encountered when performing experiments. I then introduced an experimental design methodology that attempts to minimise these problems. The focus was on the following reasons for performing thorough documentation in the requirements analysis phase:
- it uncovers requirements that have been misinterpreted or overlooked;
- it confirms that your logic and experimental methodology are correct; and
- it lessens the workload when you are writing your report.
Now, in this post, I shall look at a project the SINAD group at the MIH Media Lab is working on, which uses these ideas to formalise our data management so that data generated by members can be combined into a digital library.
A digital library can be defined as: a potentially virtual organisation that comprehensively collects, manages and preserves rich digital content, and offers to its targeted user communities specialised functionality on that content, based on codified policies. More information can be found here and here.
The research done on digital libraries and scientific data management has focused on large-scale systems, such as those described in this article on scientific data management. The data they are interested in usually has well-structured metadata schemas, such as those found in the Metadata Encoding and Transmission Standard, and is often also of the same type, for example, the DNA databases of the NCBI. However, we’re interested in running many different kinds of experiments on projects spanning a wide range of fields in natural language processing (NLP). Therefore, it stands to reason that the data that are recorded for different experiments can vary considerably. Also, we are not aiming to create a virtual organisation as the definition implies, but only to create a simple set of tools to facilitate research in the lab (and hopefully for others outside the lab eventually).
After some initial requirements analysis we started to look for an appropriate database system. The following are some of our database requirements:
- the database must store large volumes of data;
- it must handle large traffic volumes;
- you must be able to easily publish data from a local private database to a global shared database in order to share data with colleagues;
- it must be convenient to store metadata describing the data for documentation purposes, since it is easy to forget important information about why and how the data was generated;
- it must be easy to insert new fields or remove existing fields from an existing schema; and
- it must be easy to remove unwanted or erroneous data.
It became apparent that MongoDB covers our requirements quite well, and it was therefore chosen as our database of choice over relational databases and good old text files; the latter, worryingly, are still the go-to method for many experimenters.
MongoDB’s collections are in some ways comparable to SQL tables, and will be used to group data from the same experiment together. The data generated by a subsequent run of this experiment can then also be stored in this collection, thus forming a larger data set. Each data point is stored as a BSON object. BSON is a binary-encoded serialisation of a JSON document, the structure (or schema) of which is easily changed, and existing objects updated, if alterations need to be made at some later stage. This makes changing the database easy compared to relational databases and text files, even if data of the type we are changing is already present in the database. As an added bonus, APIs that make it easy to create and manipulate JSON objects in your code are available for almost every language you can shake a stick at.
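To illustrate that flexibility, here is a minimal sketch in Python using only the standard library. It treats a data point as a plain dict (which is how MongoDB drivers represent BSON documents); the field names are taken from the coordinate example below, and the `z-axis` field is a hypothetical addition.

```python
import json

# A data point as a Python dict, mirroring a BSON document stored in MongoDB.
point = {"x-axis": "4", "y-axis": "2"}

# Adding a field to the schema is a single assignment...
point["z-axis"] = "7"

# ...and removing one is a single delete; no ALTER TABLE required.
del point["y-axis"]

print(json.dumps(point, sort_keys=True))
```

Existing documents in a collection are unaffected by such changes, which is what lets the schema evolve after data has already been recorded.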
Let us look at what a data point represented in a JSON object might look like. Take a simple example of a coordinate in Cartesian space, where the object only stores its own unique identifier, and the x and y values of the coordinate. Here we have the coordinate (4, 2):
"_id" : ObjectId(8fcc3ebc1277000304506170""),
"x-axis" : "4",
"y-axis" : "2"
Many such objects can be recorded in an experiment and stored in the collection created for that experiment.
We create a special object for each collection to distinguish and describe the collection, which we shall call the experiment description object. This object is given a specific schema:
"_id" : ObjectId("4dcd3ebc9278000000005158"),
"author" : "Peter Hayward",
"overview" : "Data recorded for ExperimentTypeA",
"motivation" : "The reasons of why you run the experiment(s) that generated the data in this collection.",
"program" : "pipeline1",
"parameters" : "-a 10 -b 12",
"repo_name" : "firstname.lastname@example.org:MIH-Media-Lab/pjhayward.git/projectx",
"git_commit_hash" : "9969247ce1f9693d9573b797bdb516f2c35acd32",
"object_vars": "x-axis, y-axis"
The aim of this schema is to capture all the important metadata of the experiment, such as the experimenter’s/author’s name, an overview, the motivation, the information needed to replicate the experiment, and the attributes that the other objects in this collection must have (in our example, the x- and y-axis values).
We can then use the experiment description object to validate the data that the author (or at a later stage some other user) wants to upload to the collection. For example, if I run an experiment with exactly the same parameters again, and want to add the coordinates to the collection, then the coordinate objects must have the x and y values as specified in the experiment description object.
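A sketch of that validation step, again on plain dicts: the `validate` helper is hypothetical, but the field names follow the experiment description object above. It checks that an incoming data point carries exactly the attributes listed in `object_vars` (ignoring `_id`, which the database assigns).

```python
def validate(data_point, description):
    """Return True if the data point has exactly the attributes named in the
    experiment description object's object_vars field."""
    required = {v.strip() for v in description["object_vars"].split(",")}
    present = set(data_point) - {"_id"}  # _id is assigned by the database
    return present == required

description = {"object_vars": "x-axis, y-axis"}

print(validate({"x-axis": "4", "y-axis": "2"}, description))  # a well-formed point
print(validate({"x-axis": "4"}, description))                 # missing y-axis
```

A real deployment might enforce this check in the upload tool rather than in the database itself, since MongoDB does not impose a schema on its own.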
Arguably, the most important reason for collecting the metadata stored in the experiment description object is to be able to search the digital library for useful datasets. We aim to develop not only search tools, but also a web interface that makes it easy to browse datasets and visualise them as well. For example, you might be able to pull up a plot showing all or a sample of the coordinates we’ve been talking about thus far.
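As a rough illustration of the kind of search this enables, here is a naive keyword filter over in-memory experiment description objects. The function name and the second library entry are hypothetical; a real implementation would use MongoDB queries or text indexes rather than a Python loop.

```python
def search_descriptions(descriptions, keyword):
    """Return the description objects whose overview or motivation mentions
    the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [d for d in descriptions
            if keyword in d.get("overview", "").lower()
            or keyword in d.get("motivation", "").lower()]

library = [
    {"author": "Peter Hayward", "overview": "Data recorded for ExperimentTypeA"},
    {"author": "A. N. Other", "overview": "Tokeniser benchmark results"},
]

for d in search_descriptions(library, "experimenttypea"):
    print(d["author"])
```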
If you want to run the same type of experiment with different parameters, you can simply use sub-collections in the collection designated for the experiment type. For example, say we have an experiment type called ExperimentTypeA (as described in the JSON given above), and we want to run the experiment with two sets of parameter values. First, a=10 and b=12, and then a=20 and b=24. Logically, providing different parameter values would influence the results, so it would be wrong to clump the results of these two experiments together. Therefore, we create two sub-collections and store the data points of each experiment separately. Each of these sub-collections will also have its own experiment description object explaining exactly the difference in the parameter values.
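One way to realise this is MongoDB's dotted (namespaced) collection names, e.g. `ExperimentTypeA.a10_b12`. The naming scheme below is an assumption on our part, not a MongoDB convention; the sketch just derives a deterministic sub-collection name from the parameter values so that runs with different parameters never share a collection.

```python
def sub_collection_name(experiment, parameters):
    """Build a namespaced collection name from an experiment type and its
    parameter values, e.g. 'ExperimentTypeA.a10_b12' (hypothetical scheme)."""
    suffix = "_".join(f"{k}{v}" for k, v in sorted(parameters.items()))
    return f"{experiment}.{suffix}"

print(sub_collection_name("ExperimentTypeA", {"a": 10, "b": 12}))
print(sub_collection_name("ExperimentTypeA", {"a": 20, "b": 24}))
```

Sorting the parameters makes the name independent of the order in which they were supplied, so the same parameter set always maps to the same sub-collection.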
We tick all our requirement boxes with a simple system like the one I just described. Such a system makes exploring large amounts of data more tractable, and, hopefully, makes working with experimental data less of a pain. My hope is that it will help our research group achieve its goal of working on more intergroup projects, and that other readers might get some ideas for tackling the data management problems they face.