Pilot Projects Year 2

Pilot Project 4.1 – Metadata Discovery and Integration to Support Repurposing of Heterogeneous Data using the OpenFurther Platform

Leaders: Ram Gouripeddi, MBBS, MS, University of Utah and Julio Facelli, PhD, University of Utah

Modern biomedical research often requires reusing and combining (through federation and/or integration) data from multiple disparate sources, such as clinical and electronic health records (phenotypes), public and private genomic annotations (genotypes), proteomics, metabolomics, biospecimen collections and environmental data. Each data source embeds, explicitly or implicitly, its own semantic (meaning) and syntactic (structural) descriptions of the data. Repurposing data requires discovering and understanding these metadata to facilitate data harmonization. It also requires understanding the terminology within the data so that concepts can be mapped across data sources. The current state of the art requires a great deal of manual curation, which makes these procedures non-scalable and consequently of limited practical value in the emerging big data biomedical science paradigm.

To overcome these limitations, our project prototypes a computational infrastructure that supports automated and semi-automated mapping of metadata artifacts and terminologies. Depending on the confidence threshold chosen for a specific implementation, the framework can work in an automated or semi-automated mode. The framework will be agnostic to specific mapping algorithms and tools, since many of these are domain-specific and data-dependent; instead, it will choose the best available solution based on mapping performance, making it scalable and suitable for emerging big data applications. The proposed infrastructure will leverage existing components of the OpenFurther framework to manage, integrate and share metadata in structured, non-proprietary formats. This will also allow proper reuse, federation and integration of the metadata-enriched data.
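The threshold-driven behaviour described above can be illustrated with a minimal Python sketch. It is not the OpenFurther implementation: the mapper interface, the two toy mappers, the `MappingDecision` record and the `map_field` function are all hypothetical names introduced here, and picking the highest-confidence mapper per field is a simplification of choosing a solution by mapping performance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical mapper interface: given a source field name and candidate
# target names, return (best_target, confidence in [0, 1]).
Mapper = Callable[[str, List[str]], Tuple[Optional[str], float]]


def exact_match_mapper(source: str, targets: List[str]) -> Tuple[Optional[str], float]:
    """Toy mapper: case-insensitive exact match on the field name."""
    for target in targets:
        if target.lower() == source.lower():
            return target, 1.0
    return None, 0.0


def token_overlap_mapper(source: str, targets: List[str]) -> Tuple[Optional[str], float]:
    """Toy mapper: Jaccard overlap of word tokens in the field names."""
    src = set(source.lower().replace("_", " ").split())
    best, best_score = None, 0.0
    for target in targets:
        tgt = set(target.lower().replace("_", " ").split())
        union = src | tgt
        score = len(src & tgt) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = target, score
    return best, best_score


@dataclass
class MappingDecision:
    source: str
    target: Optional[str]
    confidence: float
    mapper: str
    needs_review: bool  # True means a human curator should confirm the mapping


def map_field(source: str, targets: List[str],
              mappers: Dict[str, Mapper], threshold: float) -> MappingDecision:
    """Run every registered mapper and keep the most confident proposal.

    Proposals at or above the threshold are accepted automatically;
    anything below it (or no proposal at all) is flagged for review.
    """
    best = MappingDecision(source, None, 0.0, "none", True)
    for name, mapper in mappers.items():
        target, confidence = mapper(source, targets)
        if target is not None and confidence > best.confidence:
            best = MappingDecision(source, target, confidence, name,
                                   confidence < threshold)
    return best


MAPPERS: Dict[str, Mapper] = {
    "exact": exact_match_mapper,
    "token_overlap": token_overlap_mapper,
}
```

With a threshold of 0.8, `map_field("patient_id", ["Patient_ID", "birth_date"], MAPPERS, 0.8)` is accepted automatically, while `map_field("birth date", ["date_of_birth", "patient_id"], MAPPERS, 0.8)` proposes `date_of_birth` but flags it for review. Because the mappers are plain callables in a registry, new domain-specific algorithms could be added without changing the selection logic.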


Pilot Project 4.2 – Feasibility Study of Indexing Clinical Research Data Using HL7 FHIR

Leader: Guoqian Jiang, MD, PhD, Mayo Clinic College of Medicine

The overall goal of this pilot project is to design, develop and evaluate a prototype clinical research data discovery index (crDDI) that leverages both standards-based representation and scalable Semantic Web technologies. The ultimate goal is to advance clinical research data discovery and analytic capabilities for clinical and translational centers and investigators. HL7 Fast Healthcare Interoperability Resources (FHIR) is an emerging HL7 standard that provides a consistent, easy-to-implement and rigorous mechanism for exchanging data between healthcare applications. In this pilot study, we will demonstrate the feasibility of indexing clinical research datasets by transforming their data dictionaries into a common format using an indexing schema tailored from HL7 FHIR. Specifically, we will combine existing technologies with tools developed in our previous and ongoing projects to create methods and tools for 1) indexing clinical research data using standard HL7 FHIR resources and 2) exposing and validating FHIR-based metadata in clinical research datasets. We will test our indexing schema using the clinical research datasets available from existing NIH pilot data commons: dbGaP and TCGA.


Pilot Project 4.3 – Distributed data discovery using GYM: GitHub, YAML and Markdown

Leader: Chris Mungall, PhD, Lawrence Berkeley National Laboratory

This project will explore the use of social coding sites such as GitHub for publishing and sharing descriptions of datasets. We will develop a format aligned with the bioCADDIE WG3 Metadata specification that can be easily embedded in a project repository, and foster a lightweight tool ecosystem around it. This includes dynamic publishing of dataset descriptions and automatic validation through the Travis continuous integration system. We will pilot this project by taking existing datasets and describing them retrospectively, and by working with an existing project and describing its data prospectively. The ultimate goal is to have these descriptions indexed within the bioCADDIE system.
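As a rough illustration of the kind of automatic validation a continuous integration job could run, the Python sketch below checks a dataset description that has already been parsed from YAML into a dictionary. The field names in `REQUIRED_FIELDS` and the `validate_description` function are assumptions made for this example only; they are not taken from the bioCADDIE WG3 Metadata specification or from the project's own tooling.

```python
from typing import Any, Dict, List

# Hypothetical required fields and their expected types. The actual
# specification defines its own schema, which is not reproduced here.
REQUIRED_FIELDS: Dict[str, type] = {
    "title": str,
    "description": str,
    "creators": list,
    "license": str,
}


def validate_description(desc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in desc:
            errors.append(f"missing required field: {field}")
        elif not isinstance(desc[field], expected):
            errors.append(f"field '{field}' should be of type {expected.__name__}")
        elif not desc[field]:
            errors.append(f"field '{field}' must not be empty")
    return errors


# Example descriptions, as they might look after parsing a YAML file.
good = {
    "title": "Example dataset",
    "description": "A placeholder description used for illustration.",
    "creators": ["A. Researcher"],
    "license": "CC0-1.0",
}
bad = {
    "title": "Example dataset",
    "description": "Creators given as a string and no license.",
    "creators": "A. Researcher",
}
```

A CI script could call `validate_description` on each description file in the repository and exit with a non-zero status whenever the returned list is non-empty, so that an invalid description fails the build.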

Progress can be monitored on the project's GitHub repository: www.github.com/cmungall/bioCADDIE-GYM.