bioCADDIE 2016 Dataset Retrieval Challenge

The 2016 bioCADDIE Challenge focuses on retrieving, from a collection, datasets that are relevant to the needs of biomedical researchers, in order to facilitate the reuse of collected data and enable the replication of published results.

We used a collection of structured and unstructured metadata from biomedical datasets drawn from 20 individual repositories. A total of 794,992 dataset records were made available from the set of DataMed backend indices frozen on 3/24/2016. The records are provided for the challenge as both XML and JSON files and can be downloaded at https://biocaddie.org/benchmark-data
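For teams starting from the JSON serialization, a minimal loading sketch like the one below may be a useful starting point. The directory layout and field names used here (docno, title, description) are hypothetical and for illustration only; check the actual record schema in the downloaded benchmark data.

# Minimal sketch: walk a directory of benchmark JSON records and collect text
# fields for indexing. The path and field names (docno, title, description)
# are hypothetical; adjust them to the real record schema.
import json
from pathlib import Path

def iter_records(root):
    for path in Path(root).rglob("*.json"):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        yield {
            "docno": record.get("docno", path.stem),
            "text": " ".join(str(record.get(field, "")) for field in ("title", "description")),
        }

if __name__ == "__main__":
    records = list(iter_records("benchmark-json"))
    print(f"loaded {len(records)} records")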

Participants of the track were challenged with retrieving, from the dataset corpus, datasets that answer each instantiated query. Retrieved datasets were judged "relevant" if they met all the constraints specified in the question, or "partially relevant" if they met only some of the constraints.

The complete challenge results are available here: Challenge Results

Group        Submission                       infAP    infNDCG   NDCG@10   P@10 (+partial)   P@10 (-partial)
UCSD         submit_armyofucsdgrads-3.txt     0.1468   0.5132    0.5303    0.7133            0.2400
UIUC GSIS    sdm-0.75-0.1-0.15.krovetz.txt    0.3228   0.4502    0.5569    0.7133            0.2867
OHSU         OHSU-1k-4.txt                    0.2862   0.4454    0.6122    0.7600            0.3333
Elsevier     elsevier4.txt                    0.3049   0.4368    0.6861    0.8267            0.4267
SIBTex       sibtex-4_0.txt                   0.3458   0.4258    0.5237    0.6600            0.3267
Emory        emory-3.txt                      0.2471   0.4241    0.5296    0.6933            0.2200
BioMelb      biomelb_4_0.txt                  0.2568   0.4017    0.5366    0.7000            0.2733
Mayo         mayorun5_0.txt                   0.1628   0.3933    0.5243    0.6667            0.2600
HiTSZ-ICRC   hitsz-des.txt                    0.2576   0.3850    0.5472    0.7000            0.2800
IAII_PUT     biocaddiedphresults.txt          0.0876   0.3580    0.4265    0.5333            0.1600

Schedule

Date                 Note
September 9, 2016    Registration began
September 16, 2016   Datasets and Sample Queries available for download
November 14, 2016    Release of Test Queries
December 2, 2016     Submission deadline
December 15, 2016    Relevance judgments and individual evaluation scores released

Queries

The example and test queries in the challenge are derived from instantiations of competency questions from three use cases collected from various sources as part of the bioCADDIE project. We will provide at least 30 example queries (without judgments), similar in form to the test queries, along with the corpus.

In addition to the above 30 example queries, we will provide 6 queries with retrieved results for which relevance judgments have been annotated. The annotation guideline is available here. The primary criteria for judging a retrieved dataset as relevant are provided below:

  1. A dataset is relevant if it captures all required concepts in the question AND it answers the question or there is a relationship between terms or key concepts.
  2. If all key terms exist, but there is no relationship between terms, the dataset is marked as partially relevant. In other words, a result is partially relevant if it contains all of the concepts but has missing elements or doesn’t answer the question. Additionally, a result is partially relevant if it contains the majority of concepts. For example, for the question “Find datasets describing one or more mutations of the RET gene in thyroid cancer”, the following dataset would be partially relevant:

Title: Gene expression signature associated with BRAFV600E mutation in human papillary thyroid carcinoma based on transgenic mouse model (human)

Description: BRAFV600E mutation is the most frequent molecular event in papillary thyroid carcinoma. The relation of this genetic alteration with the factors od poor prognosis has been reported as well as its influence on PTC gene signature. However human material disables distinction of cancer causes from its effect. We used our transgenic mouse model of papillary thyroid carcinoma induced by the BRAFV600E mutation to select BRAFV600E-specifi gene signature which was than compared to human material The human microarray data were obtained for: 27 papillary thyroid carcinomas including 18 BRAF(+), 8 RET(+), 1 RAS(+); 18 apparently healthy thyroids. In order to find BRAFV600E-specific gene signature data obtained from our transgenic mouse model of PTC were compared to human material. The analyses were performed taking into account not only the BRAF mutation but also other PTC initiating events including RET rearrangements and RAS poin mutations. Morover the microarray data were validated with the QPCR reaction.

Rationale: While all the concepts are present, the dataset describes a BRAF mutation rather than a RET mutation. The relationship required to answer the question is therefore missing and not directly evident.

  3. If no related concept exists, or the majority of the concepts are missing, the dataset is marked as not relevant. For example, if only 1 term out of 3 is found, the dataset is not relevant even though it mentions one key concept. If the question has 2 key terms and only 1 is found, the dataset is again marked as not relevant, because the majority in this case means finding both terms.

Participants are, of course, free to submit multiple runs so that they can experiment with different features in their search engines. Each participant will be allowed up to 5 runs; any runs beyond the fifth will be ignored by the bioCADDIE assessors.

Obtaining the Queries

The example queries are available at the provided link:

The example queries are numbered E1 through E30.

The example queries with annotations are numbered EA1 through EA6. The relevance judgment files are provided in the qrel and Excel formats.
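For reference, each line of a standard TREC-style qrel file records one judgment in the form QUERY_NO 0 DOCID RELEVANCE, where the second field is an unused iteration constant. For example (using an illustrative document ID, and assuming the grade mapping not relevant = 0, partially relevant = 1, relevant = 2):

EA1 0 251113 2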

Test queries (15 queries) will be provided later in the challenge (November 14, 2016) and will be available at the link provided below:

The test queries will be numbered T1 through T15.

Evaluation

The evaluation will follow standard TREC evaluation procedures for ad hoc retrieval tasks, using post-hoc assessment rather than pooling. Judgments have already been pre-determined using a high-recall baseline system. Inferred measures will be used to account for unjudged datasets in the submissions.

Participants may submit a maximum of five automatic or manual runs, each consisting of a ranked list of up to one thousand DOCIDs. The ranked list will be evaluated against the benchmark dataset that has been already generated by the bioCADDIE team. Because we plan to use a graded relevance scale, the performance of the retrieval submissions will be measured using inferred normalized discounted cumulative gain (infNDCG). See Voorhees (SIGIR 2014) for more information on infNDCG.
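As a rough illustration of how graded judgments feed into the NDCG@10 figures reported above, the sketch below computes plain (non-inferred) NDCG@10 in Python. The gain mapping (relevant = 2, partially relevant = 1, not relevant = 0) and the linear-gain DCG formulation are assumptions for illustration; infNDCG additionally estimates the measure from a sampled judgment pool rather than computing it exactly.

import math

def dcg_at_k(gains, k=10):
    # Discounted cumulative gain over the top-k graded gains.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_judged_gains, k=10):
    # DCG of the ranking normalized by the best achievable DCG for the query.
    ideal = dcg_at_k(sorted(all_judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades for the top-10 results of one query
# (assumed mapping: relevant = 2, partially relevant = 1, not relevant = 0).
ranked = [2, 1, 0, 2, 1, 0, 0, 1, 0, 0]
judged_pool = [2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranked, judged_pool, k=10), 4))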

Submission Instructions

The submission deadline is December 2, 2016.

Submission File Format

The format for run submissions is standard trec_eval format. Each line of the submission file should follow the form:

QUERY_NO Q0 DOCID RANK SCORE RUN_NAME

where QUERY_NO is the query number (T1–T15), Q0 is a required but ignored constant, DOCID is the Dataset identifier of the retrieved dataset, RANK is the rank (1–1000) of the retrieved dataset, SCORE is a floating point value representing the similarity score of the dataset, and RUN_NAME is an identifier for the run. The RUN_NAME is limited to 12 alphanumeric characters (no punctuation). The file is assumed to be sorted numerically by QUERY_NO, and SCORE is assumed to be greater for documents that should be retrieved first. For example, the following would be a valid line of a run submission file:

T1 Q0 251113 1 0.9999 my-run

The above line indicates that the run named "my-run" retrieves for test query number 1 document 251113 at rank 1 with a score of 0.9999.
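As a convenience, the following unofficial sketch checks a run file against the format described above: six whitespace-separated fields per line, a query number in T1-T15, the constant Q0, a rank between 1 and 1000, a floating-point score, and a short run name. It is an illustration only, not challenge-provided tooling.

# Unofficial run-file checker for the submission format described above.
import sys

VALID_QUERIES = {f"T{i}" for i in range(1, 16)}

def check_run(path):
    problems = []
    with open(path) as handle:
        for lineno, raw in enumerate(handle, start=1):
            fields = raw.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 6:
                problems.append(f"line {lineno}: expected 6 fields, found {len(fields)}")
                continue
            query_no, q0, docid, rank, score, run_name = fields
            if query_no not in VALID_QUERIES:
                problems.append(f"line {lineno}: unknown query number {query_no}")
            if q0 != "Q0":
                problems.append(f"line {lineno}: second field must be the constant Q0")
            if not rank.isdigit() or not 1 <= int(rank) <= 1000:
                problems.append(f"line {lineno}: rank must be an integer in 1-1000")
            try:
                float(score)
            except ValueError:
                problems.append(f"line {lineno}: score {score} is not a number")
            if len(run_name) > 12:
                problems.append(f"line {lineno}: run name longer than 12 characters")
    return problems

if __name__ == "__main__":
    for problem in check_run(sys.argv[1]):
        print(problem)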

Participants should use run names that contain a common group ID to reduce complications in judging (e.g., UTHealth-1, UTHealth-2, ...).

All participants should submit their results at this website: Submit your results here

In addition to submitting your results, you are required to submit a document describing the methods used in your system by December 5th. The description must be between 2 and 8 pages. The preferred format for the description file is PDF. Please upload your description file at the following webpage: Description file upload webpage

Please note that your system will not be evaluated for the subcontract if we do not receive the method description document from you.

Evaluation

Due to time and personnel constraints, the top 10 results of each run were fully judged, along with a 5% sample of the remaining results (up to the top 100). This was decided because a large number of judgments had already been produced using multiple baseline systems; the post-hoc judgments largely target gaps that those systems did not cover, and the judging depth (top 10 fully judged plus a 5% sample of the remaining top 100) was set based on the time remaining.

Results

The complete annotations for the dataset pool, based on the results from the 10 teams that completed the final submission, are available here:

https://docs.google.com/spreadsheets/d/1GTnXRI2Yk_Jw6uHUhhk0aj18BDEQ3X661BXBHbhOKmo/edit#gid=672830835

Clarification

Participants may be contacted with requests for more information if the challenge committee deems it necessary for review of the systems.