A Provenance Model for Scientific Assertions about Chemical Substances

The aim of this project is to develop a metadata specification for capturing provenance information about chemical substances and drugs. Provenance refers to the information that qualifies a particular piece of data, e.g. who generated it, when it was generated, and what experimental methodology and apparatus were used to generate it.

Abstract

There is a growing variety of resources and databases about chemical substances on the Web. Data from these heterogeneous sources are often made available through public APIs (Application Programming Interfaces) that biomedical researchers can access to enrich their computational analyses and scientific workflows.

A significant problem often arises when building tools and services that integrate and serve data from multiple APIs: the evidence and provenance information for assertions in these sources is usually not specified and, even when it is, there is no accepted model specifying which provenance properties should be captured for chemical substance data. This makes it difficult to determine the veracity and quality of the data obtained, which can undermine the integrity of associated research findings.

We propose to develop a model for capturing provenance information about chemical substance data. We base this preliminary version of the model on data from 10 prominent chemical substance resources utilised by the BioThings API (specifically mychem.info). Our model is independent of any concrete data format and can be implemented in popular interchange formats such as JSON-LD and RDF (Resource Description Framework).
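
To make the idea of a format-independent model concrete, the following is a minimal sketch of how a single provenance record might be held in memory and then serialised to JSON-LD. It is an illustration under stated assumptions, not the model itself: the class name ProvenanceRecord, its field names, the example.org identifiers and the "method" term are all hypothetical, and the choice of the W3C PROV vocabulary (prov:wasAttributedTo, prov:generatedAtTime) is an assumption rather than something the project prescribes.

```python
import json
from dataclasses import dataclass


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record; field names are illustrative only."""

    assertion_id: str  # identifier of the assertion being qualified
    source: str        # who generated the data
    generated_at: str  # when it was generated (ISO 8601 timestamp)
    method: str        # experimental methodology used to generate it

    def to_jsonld(self) -> dict:
        """Build a JSON-LD document as a plain dictionary.

        The mapping to W3C PROV terms is an assumption made for this sketch.
        """
        return {
            "@context": {
                "prov": "http://www.w3.org/ns/prov#",
                # Placeholder term; no real vocabulary is implied.
                "method": "http://example.org/provenance#method",
            },
            "@id": self.assertion_id,
            "@type": "prov:Entity",
            "prov:wasAttributedTo": {"@id": self.source},
            "prov:generatedAtTime": self.generated_at,
            "method": self.method,
        }


# Placeholder values, not taken from any real resource.
record = ProvenanceRecord(
    assertion_id="http://example.org/assertion/1",
    source="http://example.org/agent/example-lab",
    generated_at="2020-01-01T00:00:00Z",
    method="example assay",
)

print(json.dumps(record.to_jsonld(), indent=2))
```

Keeping the record as a plain data class and confining the serialisation to one method reflects the format independence described above: a second method could emit another interchange format without changing the fields themselves.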