NIH CFDE selected terminologies

version: initial draft

license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

Objectives:

The main objective of this section is to draw attention to the importance of semantics for interoperability and reusability, as implemented in the C2M2 model, and to the associated task of ingesting datasets from an array of resources, each addressing a specific research area. A secondary objective is to prepare for the extract/transform/load (ETL) processes by building awareness of the model's requirements.

Overview

For a number of attributes, the C2M2 model delegates the definition of the associated value sets to controlled terminologies and ontologies. This keeps the specification stable while remaining flexible: maintenance of the value sets, typically the addition of new values, is outsourced to the terminology providers. While the C2M2 model specification clearly identifies the elements that require values selected from a controlled terminology, the following table offers a full overview. The table also reflects the planned implementation, with phased releases from compliance level 0 through compliance level 2 that will introduce new types. This extends the interoperability aspect of the FAIR potential of the datasets that the DCCs make available through the Deriva-based system, following the extraction, transformation, and load process from the source repositories.

C2M2 vetted vocabularies

Domain	Resource Name	License
id_namespace_string	CFDE internal CV	NA
subject_role	CFDE internal CV	NA
subject_granularity	CFDE internal CV	NA
protocol	CFDE internal CV	NA
taxonomy	NCBITax	CC0 1.0 (public domain)
anatomy	UBERON	CC-BY
sample_type	OBI	CC-BY
assay_type	OBI	CC-BY
file_format	EDAM	CC BY-SA 4.0
data_type	EDAM	CC BY-SA 4.0
disease	MONDO	CC-BY
disease	DOID	CC-BY
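As a minimal sketch of how an ETL step might use this table, the snippet below checks whether a term identifier belongs to a vocabulary vetted for a given domain. The dictionary is transcribed from the table; the function name `is_vetted`, the `PREFIX:local_id` identifier layout, and the exact prefix spellings (for example `NCBITaxon` and `EDAM`) are assumptions made for illustration and are not taken from the C2M2 specification. The check looks only at the prefix and does not confirm that the term exists in the ontology.

```python
# Vetted vocabularies per C2M2 domain, transcribed from the table above.
# Prefix spellings are assumed for illustration; check the C2M2
# specification for the identifier forms it actually expects.
VETTED_PREFIXES = {
    "taxonomy": {"NCBITaxon"},
    "anatomy": {"UBERON"},
    "sample_type": {"OBI"},
    "assay_type": {"OBI"},
    "file_format": {"EDAM"},
    "data_type": {"EDAM"},
    "disease": {"MONDO", "DOID"},
}


def is_vetted(domain: str, term_id: str) -> bool:
    """Return True if term_id looks like PREFIX:local_id and PREFIX
    is one of the vocabularies vetted for the given domain."""
    prefix, sep, local_id = term_id.partition(":")
    if not sep or not local_id:
        # Free text or a malformed identifier: no usable prefix.
        return False
    return prefix in VETTED_PREFIXES.get(domain, set())


if __name__ == "__main__":
    print(is_vetted("anatomy", "UBERON:0002107"))  # vetted prefix
    print(is_vetted("anatomy", "liver"))           # free text, rejected
```

Domains backed by a CFDE internal CV are left out of the mapping because their values are defined inside the model rather than in an external ontology.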

Which terminologies do DCCs currently use?

For each of the potential data sources and for a set of core search facets, the semantic resources used by representative DCCs were surveyed and are summarized in the table below.

Note that the table includes a subsection (the last three rows: chemical compound, gene product, and protein) covering identification schemes used for molecular entities. These are distinct from concept annotation with ontology terms; however, since they allow interoperability between resources, they have been included.

Domain	MW	LINCS	HMP	GTEx	4D Nucleome	KidsFirst
taxonomy	free text	free text	free text	free text	free text	free text
anatomy	free text	free text	free text	UBERON	free text	NCIT
sample type	free text	free text	free text	UBERON	free text	NCIT
disease	free text	free text	free text	free text	free text	HPO NCIT MONDO
assay type	internal cv/free text	BAO	internal cv/free text	internal cv/free text	internal cv/free text	internal cv/free text
data type	_	free text	free text	_	internal cv/free text	internal cv/free text
chemical compound	PubChem CID, InChI	PubChem CID, InChI	_	_	_	_
gene product	RefSeq	_	_	_	_	_
protein	UniProt	_	_	_	_	_
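The survey table above can be turned into a small data structure to list, for each annotation domain, which DCCs still rely on free text and would therefore need semantic markup. This is only a sketch: the values are transcribed from the table, while the names `SURVEY` and `dccs_needing_markup` and the rule "any entry containing the words free text needs markup" are assumptions made for illustration.

```python
# Column order of the survey table above.
DCCS = ["MW", "LINCS", "HMP", "GTEx", "4D Nucleome", "KidsFirst"]

FT = "free text"
CV = "internal cv/free text"

# Annotation rows of the survey table, one value per DCC ("_" = no entry).
SURVEY = {
    "taxonomy": [FT, FT, FT, FT, FT, FT],
    "anatomy": [FT, FT, FT, "UBERON", FT, "NCIT"],
    "sample type": [FT, FT, FT, "UBERON", FT, "NCIT"],
    "disease": [FT, FT, FT, FT, FT, "HPO NCIT MONDO"],
    "assay type": [CV, "BAO", CV, CV, CV, CV],
    "data type": ["_", FT, FT, "_", CV, CV],
}


def dccs_needing_markup(domain: str) -> list:
    """List the DCCs whose entry for this domain involves free text."""
    return [dcc for dcc, value in zip(DCCS, SURVEY[domain]) if FT in value]


if __name__ == "__main__":
    for domain in SURVEY:
        print(domain, "->", ", ".join(dccs_needing_markup(domain)))
```

Treating "internal cv/free text" as needing markup is a conservative choice; a curation team could equally decide to handle internal CVs separately.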

Conclusions:

By explicitly identifying a number of semantic artefacts for describing key attributes, the C2M2 defines a curation framework that aims to anchor free-text descriptors to controlled terms, which can then be exploited for query expansion or resource linking.
The resource survey is an important step in the FAIRification process because it identifies potential areas of intervention, that is, areas where semantic markup of free-text descriptions can deliver gains in interoperability and reusability.
Taking taxonomic descriptors as an example, harmonization across the various sources can be achieved readily by relying on a resource such as the NCBI Taxonomy, and the curation effort is simplified by the limited diversity of species found in the different databases.
On the other hand, the harmonization tasks for domains such as sample type and assay type can be more involved, not to mention phenotypic or disease descriptions, even though level 0 and level 1 compliance do not require such a degree of integration.
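The taxonomy case mentioned above can be sketched as a simple lookup from free-text species names to NCBI Taxonomy identifiers. The lookup table here is a hand-written, hypothetical example covering only two species, and the function name `harmonize_species` and the `NCBITaxon:` prefix are assumptions; a real curation step would draw its mappings from the NCBI Taxonomy itself.

```python
from typing import Optional

# Hypothetical hand-curated lookup: normalized free text -> NCBI Taxonomy ID.
# Only two species are listed, purely for illustration.
SPECIES_TO_NCBITAXON = {
    "homo sapiens": "NCBITaxon:9606",
    "human": "NCBITaxon:9606",
    "mus musculus": "NCBITaxon:10090",
    "mouse": "NCBITaxon:10090",
}


def harmonize_species(free_text: str) -> Optional[str]:
    """Map a free-text species label to an NCBI Taxonomy identifier.

    Returns None when the label is not in the lookup table, so that
    unmapped values can be routed to manual curation.
    """
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(free_text.lower().split())
    return SPECIES_TO_NCBITAXON.get(key)


if __name__ == "__main__":
    for label in ["Homo sapiens", "  MOUSE ", "zebrafish"]:
        print(repr(label), "->", harmonize_species(label))
```

Because the number of distinct species across the surveyed sources is small, a table of this kind stays manageable; the same approach would scale poorly for sample type or assay type, where free-text variation is much larger.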

What to read next?

CFDE C2M2 model

ETL to CFDE C2M2 model