LODQuA: Large-scale RDF-based Data Quality Assessment Pipeline

Source code
This project is funded by the NCATS Biomedical Translator program.

Abstract

In recent years, the Linked Open Data (LOD) paradigm has emerged as a simple mechanism for using the Web as a medium for data and knowledge integration, where both documents and data are linked. However, data published using LOD principles is not exempt from quality shortcomings, which often hinder downstream uses where the data must be analyzed and processed. A key challenge today is to assess and report on the quality of datasets published on the Web, and to make that quality report available in a FAIR (Findable, Accessible, Interoperable, Reusable) manner. Although data quality is an important problem in LOD, few methodologies have been proposed to assess it, and those that exist focus on a single quality dimension, are cumbersome for non-experts to use, or both.

In this project, we introduce LODQuA, a large-scale quality assessment pipeline built specifically for Linked Open Data. LODQuA performs automated quality assessment of Linked Data datasets using 17 community-established metrics spread over 6 quality dimensions. It is packaged with Docker and consists of three containers, each covering a specific set of metrics: (i) FAIRsharing metrics, (ii) descriptive statistics, and (iii) computational metrics. Each container can be run in parallel across several datasets, which makes the pipeline highly reusable and scalable. Assessment results are reported using the W3C Data Quality Vocabulary (DQV), which allows uniform querying of quality results across multiple datasets. This is invaluable when comparing datasets in the same domain, as it helps users choose the highest-quality datasets for their use cases.
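To illustrate how a shared vocabulary makes results comparable, the following is a minimal sketch, not LODQuA's actual code. It serializes a few quality measurements using DQV terms (dqv:QualityMeasurement, dqv:computedOn, dqv:isMeasurementOf, dqv:value) and then ranks datasets on one metric. The example.org dataset and metric IRIs, the scores, and the helper names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

DQV = "http://www.w3.org/ns/dqv#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


@dataclass(frozen=True)
class Measurement:
    dataset: str  # IRI of the assessed dataset (hypothetical)
    metric: str   # IRI of the quality metric (hypothetical)
    value: float  # score produced for that metric


def to_turtle(measurements):
    """Serialize measurements as dqv:QualityMeasurement nodes in Turtle."""
    lines = [f"@prefix dqv: <{DQV}> .", f"@prefix xsd: <{XSD}> .", ""]
    for m in measurements:
        lines.append(
            f"[] a dqv:QualityMeasurement ;\n"
            f"    dqv:computedOn <{m.dataset}> ;\n"
            f"    dqv:isMeasurementOf <{m.metric}> ;\n"
            f'    dqv:value "{m.value}"^^xsd:double .'
        )
    return "\n".join(lines)


def rank_datasets(measurements, metric):
    """Return dataset IRIs ordered from highest to lowest score on one metric."""
    scored = [(m.value, m.dataset) for m in measurements if m.metric == metric]
    return [dataset for _, dataset in sorted(scored, reverse=True)]


# Made-up scores for two made-up datasets.
results = [
    Measurement(EX + "datasetA", EX + "metric/licensePresent", 1.0),
    Measurement(EX + "datasetB", EX + "metric/licensePresent", 0.0),
    Measurement(EX + "datasetA", EX + "metric/resolvableIRIs", 0.82),
    Measurement(EX + "datasetB", EX + "metric/resolvableIRIs", 0.95),
]

report = to_turtle(results)
ranking = rank_datasets(results, EX + "metric/resolvableIRIs")
print(report)
print(ranking)
```

Because every measurement has the same shape regardless of which dataset or metric produced it, a single query or filter can compare any number of datasets. In practice the same comparison would more likely be written as a SPARQL query over the published DQV reports.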

Project Team
