Pilot Project 3.2 - Development of Citation and Data Access Metrics Applied to the RCSB Protein Data Bank and Related Resources

Leaders: Peter Rose, Ph.D., Protein Data Bank, UCSD, and Chun-Nan Hsu, Ph.D., Department of Biomedical Informatics, UCSD

Collaborators: Cathy Wu, Ph.D., and Cecilia Arighi, Ph.D. (non-funded)

The Protein Data Bank (PDB) is the worldwide repository of experimentally determined structures of proteins, nucleic acids, and complex assemblies, including drug-target complexes. The PDB annotates structures according to standards set by the wwPDB and provides unique identifiers and DOIs for its datasets. Journals generally require that structures be deposited in the PDB before publication. This mature process can serve as a model for other data initiatives. The PDB’s large corpus of data (~100,000 3D structures) and related citations provides an extensive test set for developing citation and data access metrics. An important aspect is the interplay of literature and data citations, and the relative importance of these two mechanisms in making data discoverable. The pilot project will apply various analysis methods to literature and data citations. The aim is to correlate metrics of citation networks with tangible impact indicators to determine empirically which metrics are most informative. Analysis of citation and data cascades in these networks will highlight putative pathways by which data and concepts led to high-impact scientific discoveries. Based on the results of these analyses, we will recommend data citation and provenance practices, approaches to data citation discovery, ways of linking citations and data, and data access metrics for the Data Discovery Index.
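As a rough illustration of what correlating a citation-network metric with an impact indicator could look like, the sketch below counts citations per dataset in a toy citation graph and computes a Spearman rank correlation against an impact value. This is not the project's method: the identifiers, the edge list, the impact numbers (imagined here as download counts), and all function names are hypothetical placeholders chosen for the example, and Spearman correlation is only one of many possible choices.

```python
from collections import defaultdict

# Hypothetical citation edges: (citing_item, cited_item).
# The identifiers are placeholders, not real PDB entries or papers.
CITATIONS = [
    ("paper_A", "struct_1"),
    ("paper_B", "struct_1"),
    ("paper_C", "struct_1"),
    ("paper_B", "struct_2"),
    ("paper_C", "struct_2"),
    ("paper_C", "struct_3"),
]

# Hypothetical impact indicator per cited item (imagined as download counts).
IMPACT = {"struct_1": 900, "struct_2": 400, "struct_3": 50}


def citation_counts(edges):
    """Count incoming citations (in-degree) for each cited item."""
    counts = defaultdict(int)
    for _citing, cited in edges:
        counts[cited] += 1
    return dict(counts)


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def metric_vs_impact(edges, impact):
    """Correlate citation counts with the impact indicator, item by item."""
    counts = citation_counts(edges)
    items = sorted(impact)
    metric = [counts.get(item, 0) for item in items]
    indicator = [impact[item] for item in items]
    return spearman(metric, indicator)


if __name__ == "__main__":
    print(metric_vs_impact(CITATIONS, IMPACT))
```

In this toy data the citation counts and the impact values are ordered identically, so the correlation is perfect. Repeating the comparison with other network metrics in place of `citation_counts` would be one way to see empirically which metric tracks a given impact indicator most closely.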

Pilot Google Drive Folder