Working Group 13: Evaluation of Harvesting and NLP Pilot Projects

The primary objective of the NLP challenge was to create innovative ways for biomedical researchers to search and discover biomedical research data. Biomedical research's increasing dependence on digital data has led to a significant increase in the number of datasets available to researchers. Finding relevant datasets amid this massive quantity requires new methods of information retrieval. Dataset searches can involve specific and complex queries that are not typically answered by the metadata associated with these datasets (such as organism and assay type). For example, a user may wish to find datasets containing genome data on IDH1 and IDH2 in human glioma. Answering such a query requires innovative search strategies that incorporate structured metadata and unstructured information, as well as other linked evidence such as related biomedical articles. The 2016 bioCADDIE Dataset Retrieval Challenge aimed to accelerate development of search strategies for published biomedical datasets (such as gene expression data, protein sequence data, or the results of bioassays) beyond utilization of the metadata submitted by the dataset providers.
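To make the idea concrete, the IDH1/IDH2 glioma query above can be sketched as a search that combines exact matching on structured metadata fields (organism, assay type) with keyword matching over unstructured descriptive text. This is a minimal illustration, not the challenge's or DataMed's actual indexing pipeline; the field names and example records are hypothetical.

```python
# Hypothetical sketch: answering a complex dataset query by combining
# structured metadata filters with unstructured free-text matching.
datasets = [
    {"id": "ds1", "organism": "Homo sapiens", "assay_type": "genome sequencing",
     "description": "Whole-exome sequencing of IDH1/IDH2-mutant glioma samples."},
    {"id": "ds2", "organism": "Mus musculus", "assay_type": "genome sequencing",
     "description": "Mouse model of glioma progression."},
]

def search(collection, organism, assay_type, keywords):
    """Return ids of datasets whose structured fields match exactly and
    whose free-text description contains every keyword (case-insensitive)."""
    hits = []
    for ds in collection:
        if ds["organism"] != organism or ds["assay_type"] != assay_type:
            continue
        text = ds["description"].lower()
        if all(k.lower() in text for k in keywords):
            hits.append(ds["id"])
    return hits

print(search(datasets, "Homo sapiens", "genome sequencing",
             ["IDH1", "IDH2", "glioma"]))
```

A production system would replace the keyword test with a ranked full-text index and fold in linked evidence (e.g. citing articles), but the principle of merging structured and unstructured signals is the same.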

The proximal goal of the challenge was to improve the indexing and searching strategies of DataMed, the Data Discovery Index (DDI) prototype developed under the auspices of the bioCADDIE project, funded by the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program. DataMed facilitates searching for datasets and provides information about data objects available in major data repositories or aggregators, as well as their associations and access conditions. The existing prototype is available at datamed.org for reference. The broader goal of the challenge was to develop innovative methods and/or tools to retrieve datasets from a collection that are relevant to the needs of biomedical researchers, in order to facilitate the reuse of collected data and enable the replication of published results.

This challenge was conducted using a collection of metadata (structured and unstructured) from biomedical datasets drawn from 20 individual repositories. A set of representative example queries for biomedical data, which domain experts determined could be addressed using this collection, was provided for system development. The evaluation was conducted using a manually annotated benchmark dataset, consisting of a held-out set of queries with relevance judgments for datasets in the provided collection. Each dataset was annotated as "relevant", "partially relevant", or "not relevant" to the query. Evaluation followed standard TREC procedures.
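Standard TREC procedures score ranked results against such graded judgments with measures like normalized discounted cumulative gain (NDCG). The sketch below illustrates the idea; the numeric grade mapping (relevant=2, partially relevant=1, not relevant=0) is an assumption for illustration, not the challenge's official scoring formula.

```python
import math

# Assumed grade mapping for illustration only; the official challenge
# scoring may weight grades differently.
GRADES = {"relevant": 2, "partially relevant": 1, "not relevant": 0}

def dcg(gains):
    """Discounted cumulative gain: gains discounted by log2 of rank + 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked_judgments, k=None):
    """NDCG over judgment labels listed in the system's ranked order,
    normalized by the ideal (best possible) ordering of those judgments."""
    gains = [GRADES[j] for j in ranked_judgments][:k]
    ideal = sorted((GRADES[j] for j in ranked_judgments), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0

# A run that ranks a partially relevant dataset above a relevant one
# is penalized relative to the ideal ordering.
run = ["partially relevant", "relevant", "not relevant"]
print(ndcg(run))
```

Ranking the "relevant" dataset first would yield an NDCG of 1.0; any deviation from the ideal ordering lowers the score.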

Corpus: Collection of standardized metadata of biomedical datasets (in DATS format, as used by DataMed)

Sample Queries: a set of 36 representative queries, several with annotated relevance judgments (example queries, annotated example queries, and their annotations)

Test Queries: 15 queries for evaluation of the search engine

Participants were expected to focus on retrieving relevant datasets that answer the specific queries. A retrieved dataset was judged relevant if it provided information pertinent to all aspects of the question posed. Participants were invited to describe their methods in a published journal paper (Database).

The complete details about the Challenge are available on the biocaddie-2016-dataset-retrieval-challenge page.

Group Members