Pilot Projects Year 1

Pilot Project 1.1 - Linking publications and underlying data sets using natural language processing.

Leader: Hua Xu, Ph.D. School of Biomedical Informatics UTHealth

Collaborators: Timothy Clark, Ph.D. and Guergana Savova, Ph.D.

This project aims to develop a scalable approach that automatically scans millions of scientific publications and identifies their underlying data sets. It will create a better index of data sets, improving data discoverability, and provide a better understanding of data citation patterns, helping to develop better data citation metrics. We will develop NLP methods to recognize and normalize data sets and their attributes as mentioned in the biomedical literature, and thereby build links between publications and data sets. We will investigate various named entity recognition and relation detection approaches to identify data sets and their attributes (e.g., sample size, data collection methods) and to determine the relation between a data set and a published article (e.g., a data set was "created" by the article vs. "used" in the article). We will evaluate the proposed methods in a narrow domain (e.g., GWAS data sets), using existing data-publication linkage information (e.g., dbGaP).
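As a point of reference for the linkage task, the simplest possible baseline is a pattern match on accession numbers. The sketch below is not the project's NLP method; it only assumes that dbGaP study accessions follow the form "phs" plus six digits with an optional version suffix. The function names and the sample sentence are hypothetical.

```python
import re

# Assumed accession shape: "phs" + six digits, optionally ".v<N>" and ".p<N>".
DBGAP_ACCESSION = re.compile(r"\bphs\d{6}(?:\.v\d+(?:\.p\d+)?)?\b")


def find_dataset_mentions(text):
    """Return the sorted, de-duplicated base accessions mentioned in text."""
    return sorted({m.group(0).split(".")[0] for m in DBGAP_ACCESSION.finditer(text)})


def link_articles(articles):
    """Map each article id to the accessions found in its text."""
    return {article_id: find_dataset_mentions(text) for article_id, text in articles.items()}


# Hypothetical article text, for illustration only.
sample = {
    "article-1": "Genotypes are available in dbGaP under phs000123.v1.p1 and phs000456.",
    "article-2": "No data set is cited in this abstract.",
}
links = link_articles(sample)
```

A rule like this finds only explicit identifiers. Mentions without an accession, and the "created" vs. "used" distinction, are where named entity recognition and relation detection are needed.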

Pilot Google Drive Folder


Pilot Project 1.2 - Rating Relationships in Imaging Reports: Machine versus Crowdsourcing Approaches

Leader: Ricky K. Taira, Ph.D. UCLA Medical Imaging Informatics Group

Collaborator: Dmitriy Dligach, Ph.D.

Natural language processing (NLP) of clinical reports is an important area of research in medical informatics. It is considered a key enabling technology for transforming unstructured Big Data from clinical repositories into a computer-understandable representation from which phenotypic observations can be compiled. However, despite the long history of medical NLP as a focused area of research, deep understanding of clinical notes by computers remains elusive and far from the abilities of human cognition.

In this pilot project, we explore a hierarchical semantic compositional network as a framework for deep understanding of clinical text reports. This framework allows the mapping of a free-text input sentence to be factored into a number of semantic sub-interpretations, which helps to deal with the high dimensionality that makes language understanding difficult. The proposed semantic layers include surface words, functional words, semantic equivalence word classes, ontologic concepts, ontologic propositions, and ontologic frames. This 1-year project will concentrate on creating such a deep language understanding system for a narrow set of topics (e.g., tumoral masses, edema) commonly reported in radiology brain tumor cases. Although the breadth of topics to be explored is narrow, we concentrate on designing general tools to curate knowledge sources and language models for any medical finding and/or disease condition. We plan to evaluate the architecture on the task of relation extraction and to quantify its performance using recall and precision metrics. A crowdsourcing approach will be used to assist in creating test sets for evaluation.
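The recall and precision metrics mentioned above can be computed as in the minimal sketch below. It assumes each relation is represented as a (subject, relation, object) tuple and that the gold set comes from the crowdsourced test set; the tuple format and the example relations are illustrative, not taken from the project.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted relations against gold relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative relations only.
gold = {
    ("mass", "located_in", "left frontal lobe"),
    ("edema", "surrounds", "mass"),
    ("mass", "has_size", "3 cm"),
}
predicted = {
    ("mass", "located_in", "left frontal lobe"),
    ("edema", "surrounds", "mass"),
    ("edema", "located_in", "left frontal lobe"),
    ("mass", "has_size", "2 cm"),
}
precision, recall = precision_recall(predicted, gold)
```

Here two of the four predicted relations are correct (precision 0.5) and two of the three gold relations are recovered (recall 2/3).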

Pilot Google Drive Folder

Pilot Project 2.1 - Data Recommendation using Machine Learning and Crowdsourcing

Leader: Xiaoqian Jiang, Ph.D. Department of Biomedical Informatics UCSD

Collaborators: Zhaohui Qin, Ph.D., Jaideep Vaidya, Ph.D., Aditya Menon, Ph.D. (non funded), and Hwanjo Yu, Ph.D. (non funded)

Recommendation systems have been successful in movie suggestion, online shopping, content search, and similar applications. Advanced data recommendation can promote scientific discovery and improve the quality of healthcare.

This project aims to make data recommendations based on content similarity, user background, and context. Specifically, we plan to build a hierarchical latent topic model over millions of PMC articles and propagate topic similarity to the cited data to construct a graph of data. We will record query patterns (as well as user information) and make recommendations to users based on the topological structure of the sub-graph they have visited.
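One way to picture the graph-based recommendation step is sketched below. It assumes each data set already has a topic vector (here made-up numbers standing in for topic-model output), connects data sets whose cosine similarity exceeds a threshold, and recommends unvisited neighbors of the visited nodes. The threshold, scoring rule, and identifiers are all assumptions for illustration, not the project's design.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(topics, threshold=0.8):
    """Connect data sets whose topic vectors are at least `threshold` similar."""
    graph = defaultdict(dict)
    ids = sorted(topics)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            similarity = cosine(topics[a], topics[b])
            if similarity >= threshold:
                graph[a][b] = similarity
                graph[b][a] = similarity
    return graph


def recommend(graph, visited, k=3):
    """Rank unvisited neighbors by summed similarity to the visited nodes."""
    scores = defaultdict(float)
    for node in visited:
        for neighbor, similarity in graph.get(node, {}).items():
            if neighbor not in visited:
                scores[neighbor] += similarity
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


# Made-up topic vectors for four hypothetical data sets.
topics = {
    "ds_a": [0.9, 0.1, 0.0],
    "ds_b": [0.8, 0.2, 0.0],
    "ds_c": [0.0, 0.1, 0.9],
    "ds_d": [0.7, 0.3, 0.0],
}
graph = build_graph(topics)
suggestions = recommend(graph, visited={"ds_a"})
```

A user who has visited only ds_a is pointed to ds_b and then ds_d, while ds_c, which shares almost no topic mass with ds_a, is never suggested.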

Pilot Google Drive Folder

Pilot Project 2.2 - Intelligent Search expansion and Visualization of Datasets (iSee-DELVE)

Leaders: Todd Johnson, Ph.D. School of Biomedical Informatics UTHealth and Hyeoneui Kim, RN, Ph.D. Department of Biomedical Informatics UCSD

Collaborators: Jeeyae Choi, Ph.D. and Jina Huh, Ph.D.

iSee-DELVE stands for intelligent Search expansion (and) Document ExpLoration and Visualization Engine. It is a collaborative project of the Department of Biomedical Informatics at the University of California, San Diego and the School of Biomedical Informatics at the University of Texas, Houston.

iSee-DELVE addresses two classic challenges in data search: (1) searching large databases with a sufficient level of sophistication and accuracy and (2) promoting efficient review and filtering of the returned data, including multiple and individual datasets.

To help data users handle these challenges effectively, iSee-DELVE is developing new approaches to data search and review based on the strategies listed below:

  • Improving search quality via concept-based search.

  • Improving the efficiency of data review by ranking the returned records by relevance and providing graphic summaries of their content (e.g., highlighting search terms in a returned record, abstracting the content of a returned record as a word-cloud-based keyword presentation). Summaries are provided for multiple and individual datasets.

  • Enabling a more focused pursuit of the data of interest through faceted browsing of search results and by allowing users to run new searches that reuse the metadata of relevant returned records as search criteria.
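The three strategies above can be illustrated with a toy sketch: a query is expanded to the terms of its concept, records are ranked by how often those terms occur, and the results are summarized by a facet. The synonym table, record fields, and scoring rule are hypothetical placeholders, not iSee-DELVE's actual implementation.

```python
from collections import Counter

# Hypothetical concept table: one query phrase mapped to equivalent terms.
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
}


def expand(query):
    """Return the query plus any terms that share its concept."""
    normalized = query.lower()
    return [normalized] + SYNONYMS.get(normalized, [])


def search(records, query):
    """Return ids of matching records, most term occurrences first."""
    terms = expand(query)
    scored = []
    for record in records:
        text = record["description"].lower()
        score = sum(text.count(term) for term in terms)
        if score:
            scored.append((score, record["id"]))
    return [rid for score, rid in sorted(scored, key=lambda p: (-p[0], p[1]))]


def facet_counts(records, ids, field):
    """Count the values of one metadata field among the selected records."""
    wanted = set(ids)
    return Counter(r[field] for r in records if r["id"] in wanted)


# Made-up records for illustration.
records = [
    {"id": "d1", "type": "registry",
     "description": "Myocardial infarction outcomes; myocardial infarction registry"},
    {"id": "d2", "type": "claims", "description": "Heart attack admissions"},
    {"id": "d3", "type": "cohort", "description": "Asthma cohort"},
]
hits = search(records, "heart attack")
facets = facet_counts(records, hits, "type")
```

Without expansion, the literal query would match only d2. With it, d1 ranks first because the concept occurs twice in its description.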

Pilot Google Drive Folder

Pilot Project 3.2 - Development of Citation and Data Access Metrics applied to RCSB Protein Data Bank and related Resources

Leaders: Peter Rose, Ph.D. Protein Data Bank UCSD and Chun-Nan Hsu, Ph.D. Department of Biomedical Informatics UCSD

Collaborators: Cathy Wu, Ph.D. and Cecilia Arighi, Ph.D. (non funded)

The Protein Data Bank (PDB) is the worldwide repository of experimentally determined structures of proteins, nucleic acids, and complex assemblies, including drug-target complexes. The PDB annotates structures according to standards set by the wwPDB and provides unique identifiers and DOIs for its datasets. All journals require prior submission of structures to the PDB as part of the publication process. This mature process can serve as a model for other data initiatives. The PDB’s large corpus of data (~100,000 3D structures) and related citations provides an extensive test set for developing citation and data access metrics. An important aspect is the interplay of literature and data citations, and the relative importance of these two mechanisms in making data discoverable. The pilot project will apply various analysis methods to literature and data citations. The aim is to correlate various metrics of citation networks with tangible impact indicators to determine empirically which metrics are more informative. Analysis of the citation and data cascades in these networks will highlight putative pathways by which data and concepts led to high-impact scientific discoveries. Based on the results of these analyses, we will recommend data citation and provenance practices, approaches to data citation discovery, ways of linking citations and data, and data access metrics for the Data Discovery Index.
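Correlating a citation metric with an impact indicator could look like the sketch below. It assumes the simplest metric (the number of distinct articles citing each entry) and uses download counts as a stand-in impact indicator. The entry identifiers and all numbers are placeholders invented for illustration, not PDB data.

```python
import math
from collections import Counter


def citation_counts(citations):
    """Count distinct citing articles per data set from (article, data set) pairs."""
    return Counter(dataset for _, dataset in set(citations))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y) if spread_x and spread_y else 0.0


# Placeholder citation pairs; the duplicate pair is counted once.
citations = [
    ("art1", "1ABC"), ("art2", "1ABC"), ("art3", "1ABC"),
    ("art1", "2DEF"), ("art2", "2DEF"),
    ("art3", "3GHI"), ("art3", "3GHI"),
]
downloads = {"1ABC": 300, "2DEF": 200, "3GHI": 100}  # placeholder impact indicator

counts = citation_counts(citations)
entries = sorted(counts)
r = pearson([counts[e] for e in entries], [downloads[e] for e in entries])
```

Repeating the same calculation for several candidate metrics would show which one tracks the chosen indicator most closely.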

Pilot Google Drive Folder