Benchmark Data


Anchor Datasets

The target dataset collection for the challenge is derived from a set of 20 repositories indexed by DataMed as of 3/24/2016. DataMed is the Data Discovery Index (DDI) prototype developed under the auspices of the bioCADDIE project, funded by the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program. DataMed facilitates searching for datasets and provides information about data objects available in major data repositories or aggregators, as well as their associations and access conditions. An existing prototype search engine is available at datamed.org for reference.

Dataset Identifiers

Each dataset in the collection is identified by a unique number (DOCID) that will be used for run submissions. The DOCID is specified by the <DOCNO> element within each XML file. Note that the files also contain the dataset identifier obtained from the source repository under the <TEXT> element; for the purpose of the challenge, however, the primary mode of dataset identification is the DOCID.

For example, the DOCID of dataset 100002 is specified in the dataset's XML file as follows.

<DOCNO>100002</DOCNO>

To make processing of the datasets easier, we have also renamed each dataset's XML file according to its DOCID. For example, the metadata file for dataset 100002 is named 100002.xml.
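
As an illustration, the DOCID can be read back from a metadata file with a few lines of Python. This is a minimal sketch, assuming well-formed XML and the file layout described above; the search path does not assume a particular root element, and will need adjusting if <DOCNO> is itself the root.

import xml.etree.ElementTree as ET

# Read the DOCID from a dataset metadata file. The file name follows the
# DOCID naming convention described above; the .// path looks for the
# <DOCNO> element anywhere below the root rather than assuming a root tag.
root = ET.parse("100002.xml").getroot()
docid = root.findtext(".//DOCNO")
print(docid)  # expected: 100002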

Obtaining the Collection

The March 24 snapshot can be downloaded from bioCADDIE here:

Download update_json_folder ZIP File here

Download update_xml_folder ZIP File here

These two compressed archives contain the JSON and XML versions of the dataset collection, respectively.
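
Once downloaded, the archives can be unpacked with any standard ZIP tool. A minimal Python sketch follows; the archive file name update_xml_folder.zip is an assumption based on the folder name above, so adjust it to the actual downloaded file.

import zipfile
import glob

# Unpack the XML version of the collection. The archive name below is an
# assumption based on the folder name above, not a confirmed file name.
with zipfile.ZipFile("update_xml_folder.zip") as zf:
    zf.extractall("update_xml_folder")

# Count the extracted per-dataset metadata files (one XML file per DOCID).
print(len(glob.glob("update_xml_folder/**/*.xml", recursive=True)))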

Queries

The example and test queries in the challenge are derived from instantiations of competency questions from three use cases, collected from various sources as part of the bioCADDIE project. We provide 30 example queries (without relevance judgments) that are similar to the test queries. The example queries are numbered E1 through E30.

Example queries

In addition to the 30 example queries above, we provide 6 queries with retrieved results for which relevance judgments have been annotated. The annotation guideline is available here. The annotated example queries are numbered EA1 through EA6. The relevance judgment files are provided in qrel and Excel formats; a sketch for loading the qrel format follows the links below.

Link to example queries with annotations

Link to annotations for example queries
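
The qrel files can be loaded as whitespace-separated columns. This is a minimal sketch, assuming the standard TREC qrel layout (query ID, an unused iteration field, DOCID, relevance grade); the file name example.qrel is a placeholder for one of the provided judgment files.

from collections import defaultdict

# Load relevance judgments into {query_id: {docid: grade}}.
# Assumes the standard TREC qrel layout: qid  iteration  docid  relevance.
qrels = defaultdict(dict)
with open("example.qrel") as f:  # placeholder file name
    for line in f:
        qid, _, docid, rel = line.split()
        qrels[qid][docid] = int(rel)

print(sum(len(d) for d in qrels.values()), "judged (query, dataset) pairs")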

The 15 test queries are numbered T1 through T15 and are available at the link below:

Link to test queries

The complete annotations for the 15 test queries, based on the results from the 10 teams that completed the final submission, are available here:

Link to annotations for test queries