
NIH CFDE selected terminologies

authors: Philippe Rocca-Serra

maintainers: Philippe Rocca-Serra

version: initial draft

license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication


Objectives:

The main objective of this section is to draw attention to the importance of semantics for interoperability and reusability, as implemented in the C2M2 model, and to the associated task of ingesting datasets from an array of resources, each addressing a specific research area. A secondary objective is to prepare for the extract/transform/load (ETL) processes by making readers aware of the model's requirements.

Overview

For a number of attributes, the C2M2 model delegates the definition of the associated value sets to controlled terminologies and ontologies. This keeps the specification stable while retaining flexibility, since the maintenance of a value set, typically the addition of new values, is outsourced to the terminology provider. While the C2M2 model specification clearly identifies the elements requiring values selected from a controlled terminology, the following table offers a full overview. The table also shows the planned implementation, with phased releases from compliance level 0 through to compliance level 2. Each level brings in new types, thereby extending the interoperability aspect of the FAIR potential of the datasets that the DCCs make available through the Deriva-based system once these have been extracted, transformed and loaded from the source repositories.

C2M2 vetted Vocabularies

| Domain | Resource Name | License | C2M2 level 0 | C2M2 level 1 | C2M2 level 2 |
|---|---|---|---|---|---|
| id_namespace_string | CFDE internal CV | NA | :heavy_plus_sign: | :heavy_plus_sign: | :heavy_plus_sign: |
| subject_role | CFDE internal CV | NA | | :heavy_plus_sign: | :heavy_plus_sign: |
| subject_granularity | CFDE internal CV | NA | | :heavy_plus_sign: | :heavy_plus_sign: |
| protocol | CFDE internal CV | NA | | :heavy_plus_sign: | :heavy_plus_sign: |
| taxonomy | NCBITax | CC0 1.0 (public domain) | | :heavy_plus_sign: | :heavy_plus_sign: |
| anatomy | UBERON | CC-BY | | :heavy_plus_sign: | :heavy_plus_sign: |
| sample_type | OBI | CC-BY | | :heavy_plus_sign: | :heavy_plus_sign: |
| assay_type | OBI | CC-BY | | :heavy_plus_sign: | :heavy_plus_sign: |
| file_format | EDAM | CC BY-SA 4.0 | | :heavy_plus_sign: | :heavy_plus_sign: |
| data_type | EDAM | CC BY-SA 4.0 | | :heavy_plus_sign: | :heavy_plus_sign: |
| disease | MONDO | CC-BY | | | :heavy_plus_sign: |
| disease | DOID | CC-BY | | | :heavy_plus_sign: |
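
The phased release plan in the table can be captured as a small data structure, which may be handy when preparing an ETL run for a given compliance level. The following is a minimal sketch only: the names `VettedVocabulary`, `VETTED` and `resources_for_level` are hypothetical and not part of the C2M2 specification, and the entries simply transcribe the table above.

```python
from typing import NamedTuple


class VettedVocabulary(NamedTuple):
    domain: str       # C2M2 attribute, as named in the table
    resource: str     # terminology providing the value set
    first_level: int  # lowest compliance level at which the table marks it


# Transcription of the table above (hypothetical encoding, for illustration).
VETTED = [
    VettedVocabulary("id_namespace_string", "CFDE internal CV", 0),
    VettedVocabulary("subject_role", "CFDE internal CV", 1),
    VettedVocabulary("subject_granularity", "CFDE internal CV", 1),
    VettedVocabulary("protocol", "CFDE internal CV", 1),
    VettedVocabulary("taxonomy", "NCBITax", 1),
    VettedVocabulary("anatomy", "UBERON", 1),
    VettedVocabulary("sample_type", "OBI", 1),
    VettedVocabulary("assay_type", "OBI", 1),
    VettedVocabulary("file_format", "EDAM", 1),
    VettedVocabulary("data_type", "EDAM", 1),
    VettedVocabulary("disease", "MONDO", 2),
    VettedVocabulary("disease", "DOID", 2),
]


def resources_for_level(level: int) -> dict[str, list[str]]:
    """Map each domain in scope at `level` to its vetted resources."""
    in_scope: dict[str, list[str]] = {}
    for vocab in VETTED:
        if vocab.first_level <= level:
            in_scope.setdefault(vocab.domain, []).append(vocab.resource)
    return in_scope


if __name__ == "__main__":
    for domain, resources in resources_for_level(1).items():
        print(domain, "->", ", ".join(resources))
```

Storing only the first level at which a domain appears relies on the table being cumulative, that is, a domain marked at one level is also marked at every higher level.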

Which terminologies do DCCs currently use?

For each potential data source and for a set of core search facets, the semantic resources used by representative DCCs were surveyed; the results are summarized in the table below.

:warning: Note that the table includes a subsection (indicated in italic) covering identification schemes used for molecular entities. These are distinct from concept annotation with ontology terms; however, since they allow interoperability between resources, they have been included.

| Domain | MW | LINCS | HMP | GTEx | 4D Nucleome | KidsFirst |
|---|---|---|---|---|---|---|
| taxonomy | free text | free text | free text | free text | free text | free text |
| anatomy | free text | free text | free text | UBERON | free text | NCIT |
| sample type | free text | free text | free text | UBERON | free text | NCIT |
| disease | free text | free text | free text | free text | free text | HPO NCIT MONDO |
| assay type | internal cv/free text | BAO | internal cv/free text | internal cv/free text | internal cv/free text | internal cv/free text |
| data type | _ | free text | free text | _ | internal cv/free text | internal cv/free text |
| *chemical compound* | pubchem CID,InChi | pubchem CID,InChi | _ | _ | _ | _ |
| *gene product* | refseq | _ | _ | _ | _ | _ |
| *protein* | uniprot | _ | _ | _ | _ | _ |
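
To make the survey easier to act on, the core facets of the table can be encoded and queried for the DCCs that still rely on free text in a given domain. This is an illustrative sketch: `DCCS`, `SURVEY` and `free_text_dccs` are hypothetical names, the values are copied from the table, and treating any cell that mentions "free text" as a curation candidate is an assumption made here for simplicity.

```python
# Column order of the survey table.
DCCS = ["MW", "LINCS", "HMP", "GTEx", "4D Nucleome", "KidsFirst"]

# One list per domain, aligned with DCCS ("_" means no entry in the table).
SURVEY = {
    "taxonomy": ["free text"] * 6,
    "anatomy": ["free text", "free text", "free text", "UBERON", "free text", "NCIT"],
    "sample type": ["free text", "free text", "free text", "UBERON", "free text", "NCIT"],
    "disease": ["free text"] * 5 + ["HPO NCIT MONDO"],
    "assay type": ["internal cv/free text", "BAO"] + ["internal cv/free text"] * 4,
    "data type": ["_", "free text", "free text", "_",
                  "internal cv/free text", "internal cv/free text"],
}


def free_text_dccs(domain: str) -> list[str]:
    """Return the DCCs whose survey entry for `domain` mentions free text."""
    return [dcc for dcc, value in zip(DCCS, SURVEY[domain]) if "free text" in value]


if __name__ == "__main__":
    for domain in SURVEY:
        print(f"{domain}: {', '.join(free_text_dccs(domain))}")
```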

Conclusions:

  • By explicitly identifying a number of semantic artefacts for describing key attributes, the C2M2 defines a curation framework that aims to anchor free text descriptors to controlled terms, which can then be exploited for query expansion or resource linking.
  • The resource survey is an important step in the FAIRification process, as it identifies potential areas of intervention, that is, areas where semantic markup of free text descriptions can deliver gains in interoperability and reusability.
  • Taking taxonomic descriptors as an example, harmonization across the various sources can be achieved easily by relying on a resource such as NCBITaxonomy, and the curation task is simplified by the limited diversity of species found in the different databases.
  • On the other hand, harmonization for domains such as sample type and assay type can be more involved, not to mention phenotypic descriptions or disease, even though level 0 and level 1 compliance do not require that degree of integration.
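
The taxonomy case mentioned in the conclusions can be sketched as a simple lookup that anchors free text species names to NCBI Taxonomy identifiers. This is a minimal illustration under stated assumptions: the function name `anchor_taxon`, the hand-curated `TAXON_LOOKUP` table and the `NCBITaxon:` CURIE prefix are choices made for this example and are not prescribed by the text above; a real curation pass would draw its synonyms from the NCBI Taxonomy itself.

```python
import re
from typing import Optional

# Hypothetical, hand-curated synonym table; a production version would be
# derived from the NCBI Taxonomy names file rather than typed in.
TAXON_LOOKUP = {
    "homo sapiens": "NCBITaxon:9606",
    "human": "NCBITaxon:9606",
    "mus musculus": "NCBITaxon:10090",
    "mouse": "NCBITaxon:10090",
}


def anchor_taxon(free_text: str) -> Optional[str]:
    """Return a taxon CURIE for a free text species label, or None.

    The label is normalized (whitespace collapsed, case folded) before the
    lookup. None signals that the value needs manual curation.
    """
    key = re.sub(r"\s+", " ", free_text).strip().casefold()
    return TAXON_LOOKUP.get(key)


if __name__ == "__main__":
    for label in ["Homo sapiens", "mouse", "unknown organism"]:
        print(repr(label), "->", anchor_taxon(label))
```

Returning `None` rather than guessing keeps unmatched values visible to curators, which matters more for domains such as sample type or disease where free text is far more varied than species names.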