Pilot Project 1.1 - Linking publications and underlying data sets using natural language processing.

Leader: Hua Xu, Ph.D., School of Biomedical Informatics, UTHealth

Collaborators: Timothy Clark, Ph.D. and Guergana Savova, Ph.D.

This project aims to develop a scalable approach that automatically scans millions of scientific publications and identifies the data sets underlying them. Such an approach would create a better index of data sets, improving data discoverability, and would also provide a better understanding of data citation patterns, helping to develop better data citation metrics. We will develop NLP methods to recognize and normalize data sets and their attributes as mentioned in the biomedical literature, and thereby build links between publications and data sets. We will investigate various named entity recognition and relation detection approaches to identify data sets and their attributes (e.g., sample size and data collection methods) and to determine the relation between a data set and a published article (e.g., a data set was "created" by the article vs. "used" in the article). We will evaluate the proposed methods in a narrow domain (e.g., GWAS data sets), using existing data-publication linkage information (e.g., from dbGaP).
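
As a rough illustration of the two tasks described above (recognizing and normalizing a data set mention, then labeling its relation to the article), here is a minimal rule-based sketch. It is not the project's method, which is to investigate named entity recognition and relation detection approaches; it is only a naive baseline under stated assumptions. The cue-word lists and the function name link_datasets are hypothetical, the accession numbers in the usage example are made up, and the pattern assumes dbGaP study accessions of the form "phs" plus six digits with an optional ".v#.p#" suffix.

```python
import re

# Assumed accession shape: "phs" + six digits, optionally followed by
# version and participant-set suffixes such as ".v1.p1".
ACCESSION_RE = re.compile(r"\bphs\d{6}(?:\.v\d+(?:\.p\d+)?)?\b")

# Hypothetical cue phrases; a real system would learn these from
# annotated text rather than rely on a hand-written list.
CREATED_CUES = ("deposited", "submitted", "made available", "released")
USED_CUES = ("downloaded", "retrieved", "accessed", "obtained")


def link_datasets(sentences):
    """Return (accession, relation) pairs found in a list of sentences."""
    links = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Naive relation detection: first matching cue list wins.
        if any(cue in lowered for cue in CREATED_CUES):
            relation = "created"
        elif any(cue in lowered for cue in USED_CUES):
            relation = "used"
        else:
            relation = "unknown"
        for match in ACCESSION_RE.finditer(sentence):
            # Normalization: drop the version suffix so that all
            # versions of a study map to one identifier.
            accession = match.group(0).split(".")[0]
            links.append((accession, relation))
    return links


# Made-up sentences and accession numbers, for illustration only.
example_sentences = [
    "Genotype data were deposited in dbGaP under accession phs000123.v1.p1.",
    "GWAS data were downloaded from dbGaP (phs000456).",
    "No data set is mentioned in this sentence.",
]
example_links = link_datasets(example_sentences)
```

A baseline of this kind only catches explicit accession numbers; mentions that describe a data set in free text, and attributes such as sample size, are the cases that would call for the learned approaches described above.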

Pilot Google Drive Folder