Working Group 3: Descriptive Metadata for Datasets - DATS model

Chair

Susanna-Assunta Sansone, PhD (NIH BD2K bioCADDIE, CEDAR and ELIXIR-UK).

This working group has completed its work.

Goals

  • Deliver the DATS (DatA Tag Suite) model, describing the metadata and the structure for datasets, to underpin the NIH BD2K Data Discovery Index prototype, named DataMed.
    • The model is created by combining a bottom-up approach (mapping generic and bio-specific schemas, available as a collection in BioSharing) with a top-down approach (adding missing metadata elements from use cases).
    • Just as JATS is used by PubMed to index literature, DATS is needed to provide a scalable way to index data sources in the DataMed prototype.
  • Deliver a schema.org annotated JSON-LD serialization of the DATS model, also as part of the bioschemas.org initiative.
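
To make the idea of a schema.org annotated JSON-LD serialization concrete, here is a minimal sketch in Python. It is not the DATS serialization itself: the input field names (title, creators, download_url) and the example record are hypothetical, and only widely used schema.org terms (Dataset, Person, DataDownload) are assumed.

```python
import json


def to_jsonld(record: dict) -> dict:
    """Wrap a flat dataset record in a schema.org JSON-LD document.

    The input keys are hypothetical; they are not taken from the DATS model.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "identifier": record["identifier"],
    }
    # Optional elements are only emitted when the source provides them.
    if record.get("keywords"):
        doc["keywords"] = list(record["keywords"])
    if record.get("creators"):
        doc["creator"] = [
            {"@type": "Person", "name": name} for name in record["creators"]
        ]
    if record.get("download_url"):
        doc["distribution"] = {
            "@type": "DataDownload",
            "contentUrl": record["download_url"],
        }
    return doc


# Placeholder record, invented for illustration only.
example = {
    "title": "Example expression dataset",
    "description": "Placeholder description of a dataset.",
    "identifier": "example-0001",
    "keywords": ["example", "placeholder"],
    "creators": ["A. Researcher"],
    "download_url": "https://example.org/data.csv",
}

jsonld = to_jsonld(example)
print(json.dumps(jsonld, indent=2))
```

An indexer that understands schema.org can read such a document without knowing the internal layout of the source repository, which is the motivation for annotating the serialization.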

 

STATUS and OUTPUTS 

1. Specification, serializations, examples, and guidelines: DATS v2.2.

2. Papers: "DATS, the data tag suite to enable discoverability of datasets", Scientific Data, doi:10.1038/sdata.2017.59 (2017); and the sister article "Finding useful data across multiple biomedical data repositories using DataMed", Nature Genetics, doi:10.1038/ng.3864 (2017).

3. Slides: an overview of DATS, as presented at the bioCADDIE workshop on 7 August 2017.

 

WANT TO BE INDEXED by DATAMED?

If you represent or maintain a data repository and would like it to be indexed in DataMed, see the criteria for inclusion.

 


Detailed Overview

This Working Group (WG) operates in phases. Between development phases, the model will be implemented and tested (evaluation phases) with a number of different data sources. Use cases and competency questions will be used throughout to define the appropriate boundaries and level of granularity.

Phase 1 is scheduled to take place from May to December 2015; Phase 2 starts in early 2016 (following completion of the test phase) and has to be completed by mid-2016.

Members and synergies

This WG includes core bioCADDIE and CEDAR members and a group of invited experts (see below). The broader community will also be engaged and asked to comment on the output of this WG at several stages.

This WG operates as a joint activity between the NIH BD2K bioCADDIE and CEDAR centres and as part of the NIH Commons ecosystem. It is also complementary to the wider BD2K Metadata WG, co-chaired by Mark Musen (CEDAR) and George Alter (bioCADDIE). Whilst bioCADDIE focuses on (metadata for) searches, CEDAR focuses on defining common metadata standards.

This WG is also connected to ELIXIR activities in Europe and other global biomedical and broader metadata activities.

Dependency

This WG will define the appropriate boundaries and level of granularity via use cases and competency questions collected by the bioCADDIE Use Cases Workshop and the work of WG4: Use Cases and Testing Benchmarks.

 

Phase 1 (completed)

Activities

1. Review how other groups achieve data interoperability and exchange within a domain, aggregate across distributed databases, and address cross-platform discovery of research data and other research outcomes such as publications, data plans, software, etc.

2. Identify relevant metadata schemas and models; map them to identify common core metadata elements.

3. Review, refine, and enhance these core metadata elements for their relevance to the BD2K DataMed prototype use cases (Use Cases Workshop and WG4).

4. Abstract the required core metadata fields from the use cases and map these against the selected metadata standards, identifying elements that are common/core versus those that will be part of extensions.

5. Create a collection listing the reviewed schemas and models in BioSharing's Standards Information Resource.
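
The mapping step in activities 2 and 4 can be illustrated with a small Python sketch. It assumes the schemas have already been mapped to shared element names; the schema names, the element names and the min_share threshold are all hypothetical and are not taken from the DATS specification.

```python
from collections import Counter

# Hypothetical schemas, already mapped to shared element names.
mapped_schemas = {
    "schema_a": {"title", "identifier", "creator", "license"},
    "schema_b": {"title", "identifier", "creator", "organism"},
    "schema_c": {"title", "identifier", "license", "assay"},
}


def split_core_and_extension(schemas, min_share=1.0):
    """Split elements into core and extension sets.

    An element is treated as core when it occurs in at least `min_share`
    (a fraction between 0 and 1) of the schemas; everything else is
    treated as a candidate for an extension.
    """
    counts = Counter(
        element for elements in schemas.values() for element in elements
    )
    threshold = min_share * len(schemas)
    core = {element for element, n in counts.items() if n >= threshold}
    extension = set(counts) - core
    return core, extension


core, extension = split_core_and_extension(mapped_schemas)
print(sorted(core))       # ['identifier', 'title']
print(sorted(extension))  # ['assay', 'creator', 'license', 'organism']
```

Lowering min_share (for example to 0.6) moves elements shared by most, but not all, schemas into the core set. Where to draw that line is a judgment call that the use cases are meant to inform.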

Deliverables

1. Specification document 'candidate release v.1' with the core metadata requirement list (COMPLETED)

  • Solicit feedback from WG3 invited experts 

2. Specification document v.1 with the core metadata requirement list (COMPLETED)

  • Targeted audience: the bioCADDIE development team
  • Initial collection of core metadata standards initiatives in BioSharing

Evaluation phase (completed)

During this phase, the initial specification will be implemented and tested by the bioCADDIE Core Development Team with a number of data sources and other collaborators and interested parties. The results will lead to a revised core specification, simplifying, enriching and/or modifying the metadata elements as needed, and will inform the activities in Phase 2.

 

Phase 2 (completed)

Activities

1. Review comments and feedback received, and edit the model as needed.

2. Discuss responsibilities for brokering metadata specifications for data sets or sources (e.g. data resources that have no metadata or incomplete metadata, or that use a different label for the same element). To be done with WG6 and WG10.

  • Define best practices for maintenance of legacy specifications and how to deal with metadata versions at the source and at the DataMed prototype level.

3. Define a core and an extended DATS model.

4. Deliver a schema.org annotated JSON-LD serialization.
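
Activity 2 above mentions sources that use a different label for the same element or that have incomplete metadata. The following Python sketch shows one way such brokering could look. The alias table, the list of core fields and the sample record are hypothetical illustrations, not part of the DATS model or the DataMed pipeline.

```python
# Hypothetical alias table: source label (lower-cased) -> harmonized element.
ALIASES = {
    "title": "title",
    "name": "title",
    "dataset_name": "title",
    "identifier": "identifier",
    "id": "identifier",
    "accession": "identifier",
    "description": "description",
    "summary": "description",
}

# Hypothetical set of elements that every record is expected to carry.
CORE_FIELDS = ("title", "identifier", "description")


def harmonize(record: dict):
    """Rename source labels to harmonized names and report gaps.

    Returns (harmonized, unmapped, missing):
      harmonized: values keyed by harmonized element name
      unmapped:   source fields with no known alias, kept for review
      missing:    core fields the source did not provide
    """
    harmonized = {}
    unmapped = {}
    for label, value in record.items():
        key = ALIASES.get(label.strip().lower())
        if key is None:
            unmapped[label] = value
        elif key not in harmonized:
            # Keep the first value when two labels map to the same element.
            harmonized[key] = value
    missing = [field for field in CORE_FIELDS if field not in harmonized]
    return harmonized, unmapped, missing


# Placeholder record from an imaginary source.
source_record = {"Name": "Example dataset", "accession": "ABC123", "platform": "array"}
harmonized, unmapped, missing = harmonize(source_record)
print(harmonized)  # {'title': 'Example dataset', 'identifier': 'ABC123'}
print(unmapped)    # {'platform': 'array'}
print(missing)     # ['description']
```

Keeping unmapped fields and missing core fields as separate outputs makes it explicit which gaps have to be resolved at the source and which can be handled by the indexer.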

Deliverables

1. New Specification document(s)

  • Targeted audience: the bioCADDIE development team, which will carry out tests implementing it
  • Solicit feedback from WG3 invited experts and the open community

2. Best practice guidelines for interoperating with the DataMed prototype (to be done with WG6 and WG10)

  • Targeted audience: other BD2K centers, data sources, and the larger community

Group Members