bioCADDIE 2016 Dataset Retrieval Challenge


  • Registration: Begins September 9, 2016
  • Datasets and Sample Queries Release: September 16, 2016
  • Test Queries Release: November 14, 2016
  • System Outputs Due: November 28, 2016
  • Workshop: TBD

The primary objective of this challenge is to create innovative ways for biomedical researchers to search and discover biomedical research data. Biomedical research’s increasing dependence on digital data has led to a significant increase in the number of datasets available to researchers.  Finding relevant datasets amid the massive quantity available requires new methods of information retrieval.  Dataset searches can involve specific and complex queries that are not typically answered by the metadata associated with these datasets (such as organism and assay type). An example is a user who wishes to know what datasets are available that have genome data about IDH1 and IDH2 in humans for glioma. Answering such a query requires the use of innovative search strategies that incorporate structured metadata and unstructured information, as well as other linked evidence such as related biomedical articles. The 2016 bioCADDIE Dataset Retrieval Challenge aims to accelerate development of search strategies for published biomedical datasets (such as gene expression data, protein sequence data, or the results of bioassays) beyond utilization of the metadata submitted by the dataset providers.

The proximal goal of the challenge is to improve the indexing and searching strategies of DataMed, the Data Discovery Index (DDI) prototype developed under the auspices of the bioCADDIE project, funded by the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program. DataMed facilitates searching datasets and provides information about data objects available in major data repositories or aggregators, as well as their associations and access conditions. The existing prototype is available at for reference. The broader goal of the challenge is to develop innovative methods and/or tools to retrieve datasets from a collection that are relevant to the needs of biomedical researchers, in order to facilitate the reuse of collected data and enable the replication of published results.

This challenge will be conducted using a collection of metadata (structured and unstructured) from biomedical datasets generated from a set of 20 individual repositories. A set of representative example queries for biomedical data, which domain experts have determined can be addressed using this collection, will also be provided for system development. The evaluation will be conducted using a manually annotated benchmark dataset, consisting of a held-out set of queries with relevance judgments for datasets in the provided collection. Each dataset is annotated as “relevant”, “partially relevant”, or “not relevant” to the query. Evaluation will follow standard TREC procedures.
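As a rough illustration of what a TREC-style evaluation over three-level judgments can look like, the sketch below formats a ranking in the conventional six-column TREC run layout (query ID, the literal `Q0`, document ID, rank, score, run tag) and scores it with precision at k and NDCG at k. This is a minimal sketch under stated assumptions, not the challenge's official scoring: the gain values (2, 1, 0), the choice of metrics, the run tag, and all query and dataset identifiers are hypothetical and are not taken from the challenge guidelines.

```python
import math

# Hypothetical gain values for the three judgment labels (an assumption,
# not taken from the challenge guidelines).
GAIN = {"relevant": 2, "partially relevant": 1, "not relevant": 0}


def format_run(query_id, ranked, tag="example_run"):
    """Render (doc_id, score) pairs as lines in the six-column TREC run format."""
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


def precision_at_k(ranked_ids, judgments, k):
    """Fraction of the top k results judged at least partially relevant.

    Unjudged datasets are treated as not relevant.
    """
    hits = sum(
        1 for d in ranked_ids[:k] if GAIN[judgments.get(d, "not relevant")] > 0
    )
    return hits / k


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, judgments, k):
    """DCG of the top k divided by the DCG of the best possible top k."""
    gains = [GAIN[judgments.get(d, "not relevant")] for d in ranked_ids[:k]]
    ideal = sorted((GAIN[label] for label in judgments.values()), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and system output for a single query "T1".
judgments = {
    "d1": "relevant",
    "d2": "partially relevant",
    "d3": "not relevant",
    "d4": "relevant",
}
ranked = [("d1", 9.1), ("d3", 7.4), ("d2", 6.0), ("d5", 2.2)]
ranked_ids = [doc_id for doc_id, _ in ranked]

run_lines = format_run("T1", ranked)
p_at_3 = precision_at_k(ranked_ids, judgments, 3)
ndcg_at_3 = ndcg_at_k(ranked_ids, judgments, 3)

print("\n".join(run_lines))
print(f"P@3 = {p_at_3:.3f}, NDCG@3 = {ndcg_at_3:.3f}")
```

Treating “partially relevant” as a hit for precision is one possible convention; a stricter evaluation might count only “relevant” datasets, which would lower the score for the same ranking.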

Corpus: Collection of standardized metadata of biomedical datasets (in DATS format, as used by DataMed)
Sample Queries: a set of 36 representative queries including several with annotated judgments
Test Queries: 15 queries for evaluation of the search engine
Mailing list:

Participants are expected to focus on retrieving relevant datasets that answer the specific queries. Retrieved datasets will be judged relevant if they provide information pertinent to all aspects of the question posed. Participants will describe their methods in a published workshop or journal paper (venue TBD). Selected participant teams with innovative and effective approaches will be awarded a sub-contract ($50,000 direct costs) by the University of California, San Diego to incorporate their systems into the existing DDI prototype.

Organizing Committee
Hua Xu
Trevor Cohen
Kirk Roberts
Dina Demner-Fushman
Bill Hersh

Anupama Gururaj