NIH CFDE selected terminologies
authors Philippe Rocca-Serra
maintainers Philippe Rocca-Serra
version: initial draft
license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication
Objectives:
The main objective of this section is to draw attention to the importance of semantics in interoperability and reusability as implemented in the C2M2 model and in the associated task of ingesting datasets from an array of resources, each tackling a specific research area. A secondary objective is to prepare for the extract/transform/load (ETL) processes by being aware of the model's requirements.
Overview
For a number of attributes, the C2M2 model delegates the definition of the associated value sets to controlled terminologies and ontologies. This keeps the specification stable while retaining flexibility, since maintenance of the value sets, typically the addition of new values, is outsourced to the terminology providers.
While the C2M2 model specification clearly identifies the elements that require values selected from a controlled terminology, the following table offers a full overview.
The table also highlights the planned implementation, with phased releases from compliance level 0 through to compliance level 2, which will see the inclusion of new types, thereby extending the interoperability aspect of the FAIR potential of datasets made available by the DCCs through the Deriva-based system, following the extract/transform/load process from the source repositories.
C2M2 vetted Vocabularies
Domain | Resource Name | License | C2M2 Level 0 | C2M2 Level 1 | C2M2 Level 2 |
---|---|---|---|---|---|
id_namespace_string | CFDE internal CV | NA | | | |
subject_role | CFDE internal CV | NA | | | |
subject_granularity | CFDE internal CV | NA | | | |
protocol | CFDE internal CV | NA | | | |
taxonomy | NCBITax | CC0 1.0 (public domain) | | | |
anatomy | UBERON | CC-BY | | | |
sample_type | OBI | CC-BY | | | |
assay_type | OBI | CC-BY | | | |
file_format | EDAM | CC BY-SA 4.0 | | | |
data_type | EDAM | CC BY-SA 4.0 | | | |
disease | MONDO | CC-BY | | | |
disease | DOID | CC-BY | | | |
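To make the idea of delegated value sets concrete, the following is a minimal sketch of how an ETL step might check that a value supplied for a C2M2 attribute comes from one of the vetted resources listed above. The function name `is_vetted`, the dictionary `VETTED_PREFIXES`, and the CURIE prefixes themselves (for example `NCBITaxon`, or `format` and `data` for the two EDAM branches) are assumptions made for illustration; they are not taken from the C2M2 specification.

```python
import re

# Hypothetical mapping from C2M2 attribute to accepted CURIE prefixes.
# The prefixes are illustrative assumptions, not part of the C2M2 specification.
VETTED_PREFIXES = {
    "taxonomy": {"NCBITaxon"},
    "anatomy": {"UBERON"},
    "sample_type": {"OBI"},
    "assay_type": {"OBI"},
    "file_format": {"format"},  # assumed prefix for the EDAM format branch
    "data_type": {"data"},      # assumed prefix for the EDAM data branch
    "disease": {"MONDO", "DOID"},
}

# A compact identifier of the form PREFIX:LOCAL_ID.
CURIE_PATTERN = re.compile(
    r"^(?P<prefix>[A-Za-z][A-Za-z0-9_]*):(?P<local_id>[A-Za-z0-9_]+)$"
)


def is_vetted(attribute: str, value: str) -> bool:
    """Return True if `value` is a CURIE whose prefix is accepted for `attribute`."""
    allowed = VETTED_PREFIXES.get(attribute)
    if allowed is None:
        return False  # attribute is not governed by an external terminology here
    match = CURIE_PATTERN.match(value)
    return bool(match) and match.group("prefix") in allowed


if __name__ == "__main__":
    for attribute, value in [("anatomy", "UBERON:0002107"), ("anatomy", "liver")]:
        print(attribute, value, is_vetted(attribute, value))
```

A check of this kind only confirms that a value has the right shape and comes from the right resource; it does not confirm that the identifier actually exists in that resource.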
Which terminologies do DCCs currently use?
For each of the potential data sources and for a set of core search facets, a survey of the semantic resources used by representative DCCs is summarized in the table below.
It is worth noting that the table includes a subsection (the last three rows: chemical compound, gene product and protein) which covers identification schemes used for molecular entities. These are distinct from concept annotation with ontology terms; however, since they allow interoperability between resources, they have been included.
Domain | MW | LINCS | HMP | GTEx | 4D Nucleome | KidsFirst |
---|---|---|---|---|---|---|
taxonomy | free text | free text | free text | free text | free text | free text |
anatomy | free text | free text | free text | UBERON | free text | NCIT |
sample type | free text | free text | free text | UBERON | free text | NCIT |
disease | free text | free text | free text | free text | free text | HPO NCIT MONDO |
assay type | internal cv/free text | BAO | internal cv/free text | internal cv/free text | internal cv/free text | internal cv/free text |
data type | _ | free text | free text | _ | internal cv/free text | internal cv/free text |
chemical compound | PubChem CID, InChI | PubChem CID, InChI | _ | _ | _ | _ |
gene product | RefSeq | _ | _ | _ | _ | _ |
protein | UniProt | _ | _ | _ | _ | _ |
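The survey can also be read programmatically to see where curation effort is most needed. The sketch below transcribes the six core search facets from the table above into a dictionary and lists, for a given domain, the DCCs whose entry mentions free text. The names `DCCS`, `SURVEY` and `needs_curation` are hypothetical and chosen for illustration only.

```python
# Column order of the survey table.
DCCS = ["MW", "LINCS", "HMP", "GTEx", "4D Nucleome", "KidsFirst"]

# Core search facets transcribed from the survey table ("_" means not reported).
SURVEY = {
    "taxonomy": ["free text"] * 6,
    "anatomy": ["free text", "free text", "free text", "UBERON", "free text", "NCIT"],
    "sample type": ["free text", "free text", "free text", "UBERON", "free text", "NCIT"],
    "disease": ["free text"] * 5 + ["HPO NCIT MONDO"],
    "assay type": ["internal cv/free text", "BAO"] + ["internal cv/free text"] * 4,
    "data type": ["_", "free text", "free text", "_",
                  "internal cv/free text", "internal cv/free text"],
}


def needs_curation(domain: str) -> list:
    """Return the DCCs whose entry for `domain` relies, at least partly, on free text."""
    return [dcc for dcc, entry in zip(DCCS, SURVEY[domain]) if "free text" in entry]


if __name__ == "__main__":
    for domain in SURVEY:
        flagged = needs_curation(domain)
        print(f"{domain}: {len(flagged)} of {len(DCCS)} DCCs use free text")
```

Treating "internal cv/free text" as free text is a deliberately conservative choice in this sketch; a real assessment would inspect each internal vocabulary separately.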
Conclusions:
- By explicitly identifying a number of semantic artefacts for describing key attributes, the C2M2 defines a curation framework that aims to anchor free-text descriptors to controlled terms, which can then be exploited for query expansion or resource linking.
- The resource survey that has been carried out is an important step in the FAIRification process, as it identifies potential areas of intervention, that is, places where semantic markup of free-text descriptions can deliver gains in interoperability and reusability.
- Taking taxonomical descriptors as an example, harmonization across the various sources can be easily achieved by relying on a resource such as NCBITaxonomy, and the curation action is simplified by the limited diversity of species found in the different databases.
- On the other hand, the harmonization tasks for domains such as sample type and assay type can be more involved, not to mention the case of phenotypic descriptions or disease, even though level 0 and level 1 compliance do not expect such a degree of integration.
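As an illustration of the taxonomy case, anchoring free-text species labels to NCBI Taxonomy identifiers can be sketched with a small lookup table. The function name `normalize_species`, the lookup table, and the `NCBITaxon:` CURIE form are assumptions made for this example; a production ETL would draw on the full NCBI Taxonomy rather than a hand-written dictionary.

```python
from typing import Optional

# Hand-written lookup from common free-text labels to NCBI Taxonomy identifiers.
# The "NCBITaxon:" prefix form is an assumption chosen for illustration.
SPECIES_TO_NCBITAXON = {
    "homo sapiens": "NCBITaxon:9606",
    "human": "NCBITaxon:9606",
    "mus musculus": "NCBITaxon:10090",
    "mouse": "NCBITaxon:10090",
}


def normalize_species(label: str) -> Optional[str]:
    """Map a free-text species label to a taxonomy identifier, or None if unknown."""
    # Lower-case and collapse runs of whitespace before the lookup.
    key = " ".join(label.lower().split())
    return SPECIES_TO_NCBITAXON.get(key)


if __name__ == "__main__":
    for raw in ["  Homo  sapiens ", "Mouse", "unicorn"]:
        print(repr(raw), "->", normalize_species(raw))
```

Labels that return `None` are the ones that would need manual curation. For domains such as assay type or disease, the equivalent lookup would be far larger and more ambiguous, which is why those harmonization tasks are described as more involved.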
What to read next?