Metadata Standards

“Soil Sample #3, July 7th”…  “Sears Tower A/C unit water sample”…  “Cancer patient lower colon sample”…

To make the most out of any microbial study, it is critical to collect detailed information about each sample in addition to trying to characterize the microbes in the sample.  This detailed information about a sample is known as “metadata.”  For example, in studies of microbes in ocean water, one might collect metadata on GPS coordinates, time of collection, water composition, how samples were collected, depth, etc.  In studies of microbes in a building, one might collect information on humidity, temperature, lighting, air circulation, etc.

Though any metadata can be useful, studies of microbes benefit in many ways when the collection, recording and sharing of metadata are standardized.  This is true for any type of microbial study but has taken on particular importance recently in DNA sequencing-based studies, largely due to the rapid acceleration in the production of sequence data.  To make the most out of the exponentially growing sequence data, having diverse and easily analyzable metadata is critical.

Significant efforts have been undertaken recently to develop metadata standards for microbial sequencing studies, many of them from a group known as the Genomic Standards Consortium (GSC).  Among the GSC efforts are:

  • “Minimum Information about a (Meta)Genome Sequence”: MIGS/MIMS.
  • “Minimum Information about a MARKer gene Sequence”: MIMARKS, which builds on the MIGS/MIMS standards.

These efforts have been broadly focused to cover all sorts of environments and kinds of microbes.  For studies of microbes in the built environment, there is still a significant need for development in terms of deciding what kinds of metadata should be collected.  We note that the MoBeDAC project will be working to create a standard reporting format for metadata on the built environment.  But to do this, of course, everyone has to figure out what metadata is potentially useful.

So we are calling for help from the community to start figuring out which variables are important to record about built environment samples.  Metadata needs will clearly evolve over time, but the more the potentially important variables are discussed before people start collecting samples for a study, the better.  To get this discussion going, we have come up with a list of variables that seem potentially important and have posted it below.

  1. Built environment features
    1. Major type (e.g., plane, train, automobile, house, pool)
    2. Sample location (e.g., bathroom, cockpit)
    3. Age
    4. Equipment presence (HVAC, humidifier, etc.)
    5. Ventilation (includes air turnover rate, and type of filters)
    6. Materials
    7. Human impacts (occupancy rates, turnover rates, and types of activities)
  2. External environment
    1. Geographic location
    2. Climate
    3. Surroundings (plants, etc.)
    4. Season
  3. Sample physicochemistry
    1. Temperature
    2. pH
    3. Humidity
    4. Light
    5. Presence of known chemicals (e.g., cleaning solutions)
  4. Sampling protocol
    1. Collection
    2. Storage

Equally important is that these data are recorded in a consistent manner between experiments, and that they are available along with the resulting sequence data.
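
To make "recorded in a consistent manner" a little more concrete, here is a minimal Python sketch of one way the variables listed above could be captured as a fixed record and written out as a tab-separated table keyed by sample ID.  Everything in it is an assumption made for illustration: the class name, the field names, the units and the use of an empty cell for a missing value are all hypothetical, and none of them are taken from MIGS/MIMS, MIMARKS or the MoBeDAC reporting format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass
class BuiltEnvironmentSample:
    """One sample's metadata. All field names and units are hypothetical."""
    sample_id: str
    # 1. Built environment features
    major_type: str                       # e.g., "plane", "house", "pool"
    sample_location: str                  # e.g., "bathroom", "cockpit"
    building_age_years: Optional[float] = None
    equipment: Optional[str] = None       # e.g., "HVAC; humidifier"
    ventilation: Optional[str] = None
    materials: Optional[str] = None
    human_impacts: Optional[str] = None
    # 2. External environment
    geographic_location: Optional[str] = None
    climate: Optional[str] = None
    surroundings: Optional[str] = None
    season: Optional[str] = None
    # 3. Sample physicochemistry
    temperature_c: Optional[float] = None
    ph: Optional[float] = None
    relative_humidity_pct: Optional[float] = None
    light: Optional[str] = None
    known_chemicals: Optional[str] = None
    # 4. Sampling protocol
    collection_method: Optional[str] = None
    storage_method: Optional[str] = None


# Every experiment shares the same columns, in the same order.
COLUMNS = [f.name for f in fields(BuiltEnvironmentSample)]


def to_tsv(samples: List[BuiltEnvironmentSample]) -> str:
    """Serialize samples to TSV; unrecorded values become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for sample in samples:
        row = {k: ("" if v is None else v) for k, v in asdict(sample).items()}
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    example = [
        BuiltEnvironmentSample("s1", "house", "bathroom",
                               temperature_c=21.5, ph=7.0),
        BuiltEnvironmentSample("s2", "plane", "cockpit", season="summer"),
    ]
    print(to_tsv(example))
```

Because every record is written with the same column names in the same order, tables from different experiments can be stacked or compared directly, and the `sample_id` column gives a place to link each row to its sequence data.  Whether a flat table like this is the right shape at all is exactly the kind of question the discussion above is meant to settle.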