Completed Research Projects

KnowGraphs
Knowledge Graphs at Scale

Open Data Infrastructure for Social Science and Economic Innovations

ODISSEI

Main goals
The main goal of the project is to develop a synthetic data generator framework using artificial intelligence technologies while concurrently exploring ethical-legal perspectives on the trade-off between data privacy and the potential utilization of synthetic representations. We will study 1) the quality of synthetically generated data relative to real-world data as a function of privacy cost, 2) how well multi-attribute relations are preserved in the face of increased individual variation, and 3) the utility of synthetic data in certain kinds of social science research.
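
A rough illustration of the privacy/utility trade-off studied in 1) and 2): the sketch below perturbs a toy dataset's summary statistics with noise scaled by a privacy budget and measures how well an attribute correlation survives. It is not the project's generator and not a calibrated differential-privacy mechanism; all values are invented.

```python
# Sketch only: noisier model (smaller epsilon) = stronger privacy, lower utility.
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" dataset: two correlated attributes (think age and income).
real = rng.multivariate_normal(mean=[45, 30_000],
                               cov=[[100, 15_000], [15_000, 9_000_000]],
                               size=5_000)

def synthesize(data, epsilon, scale=1.0):
    """Perturb the fitted mean/covariance with Laplace noise ~ 1/epsilon and resample."""
    noisy_mean = data.mean(axis=0) + rng.laplace(0, scale / epsilon, size=2)
    noisy_cov = np.cov(data, rowvar=False) + rng.laplace(0, scale / epsilon, size=(2, 2))
    noisy_cov = (noisy_cov + noisy_cov.T) / 2               # restore symmetry
    w, v = np.linalg.eigh(noisy_cov)                        # project back to a valid
    noisy_cov = v @ np.diag(np.clip(w, 1e-6, None)) @ v.T   # covariance matrix
    return rng.multivariate_normal(noisy_mean, noisy_cov, size=len(data))

for epsilon in [0.01, 0.1, 1.0, 10.0]:
    synth = synthesize(real, epsilon)
    drift = abs(np.corrcoef(real, rowvar=False)[0, 1]
                - np.corrcoef(synth, rowvar=False)[0, 1])
    print(f"epsilon={epsilon:>5}: correlation drift = {drift:.3f}")
```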

Our contribution
ODISSEI's project team at Maastricht University consists of Michel Dumontier (task leader), Chang Sun, and Birgit Wouters.

Project page
ODISSEI logo

Biomedical Data Translator

Translator Red Knowledge (TReK)

Major Goals
The Biomedical Data Translator is a multi-year NIH/NCATS-supported project for the development of a comprehensive Biomedical Data Translator that integrates multiple types of existing data sources, including objective signs and symptoms of disease, drug effects, and intervening types of biological data relevant to understanding pathophysiology and developing treatments. 

Our contribution
Our team (Remzi Celebi, Michel Dumontier, Vincent Emonet, and Arif Yilmaz) will create a machine learning-based drug repositioning tool and a tool to author and contribute structured facts to the Translator ecosystem, as well as contribute to the core architecture.

 

Project Number: OT2TR003434-01S2

Name of PD/PI: Chunhua Weng, Casey Ta, Michel Dumontier

Source of Support: NIH/NCATS

Project/Proposal Start and End Date: 01/2022-11/2023

Project page
Biomedical Data Translator picture

European Open Science Cloud for the Life Sciences

EOSC-Life

EOSC-Life is an H2020/EOSC-funded initiative that brings together the 13 Life Science ‘ESFRI’ research infrastructures (LS RIs) to create an open, digital and collaborative space for biological and medical research.

The project will publish ‘FAIR’ data and a catalogue of services provided by participating RIs for the management, storage and reuse of data in the European Open Science Cloud (EOSC). This space will be accessible to European research communities.

Our contribution
Michel Dumontier is involved in WP6 Task 1: developing a wizard-like tool to inform different kinds of users about the FAIR principles.

Project Number: 824087

Name of PD/PI: Elixir

Source of Support: Horizon 2020

Project/Proposal Start and End Date: 03/2019-06/2023

Project page
EOSC-Life logo

Automatic Semantic Enhancement of Data2Services pipeline

Internship Semantic Mapping

Theme 1 Theme 3

Abstract

Nowadays, data come in many different formats and structures (e.g. CSV, XML, RDB). The size and amount of available data, together with their diversity, contribute to the proliferation of tools used for integrating them. As a result, data processing is becoming increasingly difficult. Semantically enhanced data can provide more useful information to the user than other available formats, so a tool that integrates semantically enhanced data should be accessible to experts and non-experts alike. Existing RDF tools automatically produce “semantic” mappings between the source data and the target data model. However, these tools cannot tap into existing terminologies and ontologies to annotate data with specific concepts and relations. To address this limitation, we will use (semi-)automated methods that combine i) NLP-based concept recognition, ii) profiling of the value space, range and lexical datatypes, and iii) machine learning methods for concept assignment.
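
A minimal sketch of the three-step annotation idea, with a toy terminology, a naive string lookup and regex-based profiling standing in for the project's NLP and machine learning components:

```python
# Sketch: annotate CSV columns with (hypothetical) ontology concepts and datatypes.
import re

TERMINOLOGY = {            # hypothetical label -> concept IRI mapping
    "sex": "http://purl.obolibrary.org/obo/PATO_0000047",
    "age": "http://purl.obolibrary.org/obo/PATO_0000011",
    "country": "https://schema.org/addressCountry",
}

def profile_values(values):
    """Step ii: profile the value space to guess a lexical datatype."""
    if all(re.fullmatch(r"-?\d+", v) for v in values):
        return "xsd:integer"
    if all(re.fullmatch(r"-?\d+\.\d+", v) for v in values):
        return "xsd:decimal"
    return "xsd:string"

def annotate_column(header, values):
    """Steps i + iii: recognise a concept for the header, assign a datatype."""
    concept = TERMINOLOGY.get(header.strip().lower())   # naive lookup
    return {"column": header,
            "concept": concept or "UNMAPPED (needs expert review)",
            "datatype": profile_values(values)}

columns = {"Sex": ["female", "male"], "Age": ["34", "57"], "Country": ["NL", "DE"]}
for header, values in columns.items():
    print(annotate_column(header, values))
```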

Project team: Andreea Grigoriu, Amrapali Zaveri and Michel Dumontier

Status: Completed 

Project page

Data2Services

Theme 1 Theme 2 Theme 3

Abstract

While data are becoming increasingly easy to find and access on the Web, significant effort and skill are still required to process the amount and diversity of data into convenient formats. Consequently, scientists and developers duplicate effort and are ultimately less productive in achieving their objectives. Here, we propose Data2Services, a new architecture to semi-automatically process diverse data into standardized data formats, databases, and services. Data2Services uses Docker to easily and faithfully execute data transformation pipelines. These pipelines involve the automated conversion of target data into a semantic knowledge graph that can be further refined to fit a particular data standard. The data can be loaded into a number of databases and are made accessible through native and auto-generated APIs.
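
A minimal sketch of the generic conversion step, assuming rdflib is available; the namespaces and the column-to-predicate rule are illustrative, not the project's actual mapping definitions:

```python
# Sketch: every CSV cell becomes a triple whose predicate is derived from the column name.
import csv, io
from rdflib import Graph, Literal, Namespace, RDF

DATA = Namespace("https://example.org/data/")     # hypothetical base IRIs
VOCAB = Namespace("https://example.org/vocab/")

csv_text = "id,name,indication\nDB00316,Acetaminophen,Pain\n"
graph = Graph()
for row in csv.DictReader(io.StringIO(csv_text)):
    subject = DATA[row["id"]]
    graph.add((subject, RDF.type, VOCAB.Record))
    for column, value in row.items():
        # Generic mapping: column header -> predicate, cell value -> literal.
        graph.add((subject, VOCAB[column], Literal(value)))

print(graph.serialize(format="turtle"))
```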

Project team: Alexander Malic, Vincent Emonet and Michel Dumontier

Status: Completed

This project is partially funded by the NCATS Biomedical Translator program.

Source code

 

Project page

FAIR is as FAIR does

Theme 1 Theme 2

 

Abstract

The FAIR guiding principles for data management and stewardship (FAIR = Findable, Accessible, Interoperable, Re-usable) have received significant attention, but little is known about how scientific protocols and workflows can be aligned with these principles. Here, we propose to develop the FAIR Workbench, which will enable researchers to explore, consume, and produce FAIR data in a reliable and efficient manner, to publish and reuse (computational) workflows, and to define and share scientific protocols as workflow templates. Such technology is urgently needed to address emerging concerns about the non-reproducibility of scientific research. We focus our attention on different types of workflows, using computational drug repositioning to illustrate fully computational workflows and related systematic reviews to illustrate mixed (manual/computational) workflows. We explore the development of FAIR-powered workflows to overcome existing impediments to reproducible research, including poorly published data, incomplete workflow descriptions, limited ability to perform meta-analyses, and an overall lack of reproducibility. We will demonstrate our technology in our use case of finding new drugs and targets for cardiovascular diseases, such as heart disease and stroke. As workflows lie at the heart of data science research, our work has broad applicability beyond the Life Science top sector.

Project team: Michel Dumontier, Remzi Celebi, Tobias Kuhn, Harald Schmidt, Ahmed Hassan and The Netherlands eScience Center

Status:  Completed

News

Source code

Project page

LEX2RDF

Theme 1 Theme 3 

Abstract

There has been significant effort in recent years to publish the metadata and information content of court decisions in online, public databases such as EUR-LEX (http://eur-lex.europa.eu). This has enabled data scientists and empirical legal researchers to investigate how court decisions evolve over time, what factors influence these decisions and whether the law is being consistently applied.
However, current case law databases use disparate terminology, data formats and data access methodologies. Furthermore, national databases remain isolated from each other as well as from European and international databases. This makes it difficult to answer case law research questions on a global scale. In addition, because some databases still use legacy data formats, researchers are left with less powerful and outdated data analytics tools.
We propose to develop an automated pipeline to convert case law data into the semantically rich RDF (Resource Description Framework) format. RDF is the W3C (World Wide Web Consortium) recommendation for representing linked data on the Web. Representing the information in RDF enables computers to understand what the data means because the terminology is defined in standardized vocabularies called “ontologies” that are also published on the Web. RDF also natively solves the problem of unlinked data because disparate terminology across databases is mapped or unified using ontologies. This enables the automatic integration of case law data across databases as well as the use of open source, actively supported RDF software to conduct advanced querying, visualisation and analytics on the data, regardless of its original source.
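
A minimal sketch of the idea, assuming rdflib is available; the namespaces and properties below are placeholders rather than the ontologies adopted in the project:

```python
# Sketch: represent one court decision as RDF and query its citations with SPARQL.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, XSD

CASE = Namespace("https://example.org/caselaw/")    # hypothetical namespaces
LAW = Namespace("https://example.org/lawvocab/")

g = Graph()
decision = CASE["ECLI_EU_C_2014_317"]
g.add((decision, DCTERMS.title, Literal("Example ruling")))
g.add((decision, DCTERMS.date, Literal("2014-05-13", datatype=XSD.date)))
g.add((decision, LAW.cites, CASE["ECLI_EU_C_2010_123"]))    # cross-database link

# Once disparate databases share vocabularies, one query can span all of them.
results = g.query("""
    SELECT ?decision ?cited WHERE {
        ?decision <https://example.org/lawvocab/cites> ?cited .
    }""")
for row in results:
    print(row.decision, "cites", row.cited)
```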

Project team: Kody Moodley, Pedro Hernandez Serrano, Amrapali Zaveri, Gijs van Dijck and Marcel Schaper 

Status:  Completed

Source code

Project page

Large-scale RDF-based Data Quality Assessment Pipeline

LODQuA

Theme 1 

Abstract

In recent years, the Linked Open Data (LOD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration, where both documents and data are linked. However, data published using LOD principles are not exempt from shortcomings in quality, and this often hinders their downstream use in settings where the data need to be analyzed and processed. Currently, a key challenge is to assess and report on the quality of datasets published on the Web and to make this quality report available in a FAIR (Findable, Accessible, Interoperable, Reusable) manner. Even though data quality is an important problem in LOD, few methodologies have been proposed to assess the quality of these datasets, and those that exist either focus on a specific quality dimension and/or are cumbersome for non-experts to use.

In this project, we introduce LODQuA, a large-scale quality assessment pipeline specifically for Linked Open Data. LODQuA performs automated quality assessment on Linked Data datasets using 17 community-established metrics spread over 6 quality dimensions. LODQuA is packaged using the Docker container management system and is composed of three containers for specific sets of metrics: (i) FAIRsharing metrics, (ii) Descriptive statistics and (iii) Computational metrics. Each container can be run in parallel for several datasets, making it highly reusable and scalable. The data quality assessment results are reported using the W3C standard Data Quality Vocabulary, which allows uniform querying of data quality results across multiple datasets. This is invaluable when comparing the data quality of multiple datasets in the same domain, helping users choose optimal quality datasets for their use cases.
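
A minimal sketch of reporting one measurement with the W3C Data Quality Vocabulary using rdflib; the metric and dataset IRIs are placeholders, not LODQuA's actual 17 metrics:

```python
# Sketch: publish a single quality measurement in DQV so it can be queried uniformly.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")
EX = Namespace("https://example.org/")            # hypothetical IRIs

g = Graph()
measurement = EX["measurement/1"]
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.computedOn, EX["dataset/drugbank"]))          # assessed dataset
g.add((measurement, DQV.isMeasurementOf, EX["metric/dereferenceableURIs"]))
g.add((measurement, DQV.value, Literal(0.93, datatype=XSD.double)))

print(g.serialize(format="turtle"))
```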

Project team: Amrapali Zaveri, Pedro Hernandez Serrano, Alexander Malic, Vincent Emonet and Michel Dumontier

Status:  Completed


This project is funded by the NCATS Biomedical Translator program
 


Source code

Project page

Crowdsourcing Biomedical Metadata Quality Assessment

MetaCrowd

Theme 1 

In this project, MetaCrowd, we utilize the power of non-experts via crowdsourcing as a means for metadata quality assessment, specifically for the Gene Expression Omnibus (GEO) dataset.

Abstract

To reuse the enormous amounts of biomedical data available on the Web, there is an urgent need for good quality metadata. This is extremely important to ensure that data are maximally Findable, Accessible, Interoperable and Reusable. The Gene Expression Omnibus (GEO) allows users to specify metadata in the form of textual key: value pairs (e.g. sex: female). However, since there is no structured vocabulary or format available, the 44,000,000+ key: value pairs suffer from numerous quality issues. Using domain experts for the curation is not only time consuming but also does not scale. Thus, in our approach, MetaCrowd, we apply crowdsourcing as a means for GEO metadata quality assessment. Our results show crowdsourcing is a reliable and feasible way to identify similar as well as erroneous metadata in GEO. This is extremely useful for data consumers and producers for curating and providing good quality metadata.
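
A minimal sketch of the task design and majority-vote aggregation, with invented worker answers rather than data collected from a real crowdsourcing platform:

```python
# Sketch: turn GEO key:value pairs into micro-tasks and aggregate worker votes.
from collections import Counter

pairs = [("sex", "female"), ("sex", "F"), ("age", "forty"), ("tissue", "liver")]

def make_task(key, value):
    return f"Is '{value}' a valid value for the metadata field '{key}'? (yes/no)"

# Hypothetical answers from three workers per task.
worker_answers = {
    ("sex", "female"):   ["yes", "yes", "yes"],
    ("sex", "F"):        ["yes", "no", "yes"],
    ("age", "forty"):    ["no", "no", "yes"],
    ("tissue", "liver"): ["yes", "yes", "yes"],
}

for pair in pairs:
    votes = Counter(worker_answers[pair])
    verdict, count = votes.most_common(1)[0]
    print(f"{make_task(*pair)} -> majority says {verdict} ({count}/3)")
```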

Project team: Amrapali Zaveri, Wei Hu and Michel Dumontier

Status:  Completed

Results

Publication

Project page

Medical Image Annotation via non-expert Crowdsourcing

MIA

Theme 1

Abstract

Lung cancer is the most deadly cancer in the world, claiming over 2.5 million lives yearly. For cases in which surgery is not an option, chemoradiotherapy is the standard treatment modality. However, numerous other treatment options exist, such as immunotherapy and a variety of systemic anti-cancer therapies. In order to personalize treatment, we extract quantitative imaging features from the tumor of the patient.
These quantitative image features can provide information as to which treatment would be most effective for this patient, based on patients treated in the past. In order to extract these quantitative image features, however, we need to first contour the tumor on the image. This is the most time-intensive step of the entire process of treatment personalization.
Involving experts, i.e. doctors, to annotate images with precise contours is the current gold standard; however, this does not scale, and with the ever-increasing amounts of clinical image data it becomes expensive and time consuming. Therefore, we require a scalable, affordable and quicker means of annotating large amounts of tumor images precisely.

Thus, in this project we harness the wisdom of non-experts via crowdsourcing to contour clinical images, specifically of lung cancer, to precisely identify tumors. If we can perform this task of contouring tumors in images properly, we will be able to identify the best treatment (in terms of survival and quality of life) for each patient. This will enable patients to be properly informed about each treatment option and has the potential to save lives and increase quality of life for cancer patients.
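
A minimal sketch of how several non-expert contours could be merged into a consensus contour by pixel-wise majority vote; the masks here are random placeholders, not real annotations:

```python
# Sketch: combine crowd-drawn binary tumour masks by majority vote per pixel.
import numpy as np

rng = np.random.default_rng(0)
image_shape = (64, 64)

# Hypothetical masks from three crowd workers (1 = "tumour" pixel).
worker_masks = [(rng.random(image_shape) > 0.5).astype(int) for _ in range(3)]

votes = np.sum(worker_masks, axis=0)
consensus = (votes >= 2).astype(int)          # at least 2 of 3 workers agree

print("pixels marked tumour by consensus:", int(consensus.sum()))
```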

Project team: Amrapali Zaveri, Arthur Jochems and Deniz Iren

Status:  Completed

Project page

A Provenance Model for Scientific Assertions about Chemical Substances

Theme 1 Theme 2 Theme 3

Abstract

There are a growing variety of resources and databases about chemical substances on the Web. Data from these heterogeneous sources are often made available through public APIs (Application Programming Interfaces) that can be accessed by biomedical researchers to enrich their computational analyses and scientific workflows.
A significant problem often arises when building tools and services for integration and provision of data from multiple APIs. That is, the evidence and provenance information for assertions in these sources are usually not specified, and, even if they are, there is no accepted model specifying which provenance properties should be captured for chemical substance data. This issue can make it difficult to determine the veracity and quality of the data obtained, which can hinder the integrity of associated research findings.
We propose to develop a model for capturing provenance information of chemical substance data. We base this preliminary version of the model on data from 10 prominent chemical substance resources utilised by the BioThings API (specifically mychem.info). Our model is independent of any concrete data format and implementable in popular interchange formats such as JSON-LD and RDF (Resource Description Framework).
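
A minimal sketch of what such a provenance record could look like in JSON-LD using W3C PROV-O terms; the identifiers and the statement itself are illustrative, not the final model developed in the project:

```python
# Sketch: one assertion about a chemical substance carrying PROV-O provenance in JSON-LD.
import json

assertion = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "ex": "https://example.org/",
    },
    "@id": "ex:assertion/123",
    "ex:statement": "Acetaminophen has the molecular formula C8H9NO2",
    "prov:wasDerivedFrom": {"@id": "https://mychem.info/"},      # upstream API
    "prov:generatedAtTime": {
        "@value": "2019-06-01T12:00:00Z",
        "@type": "http://www.w3.org/2001/XMLSchema#dateTime",
    },
    "prov:wasAttributedTo": {"@id": "ex:agent/integration-pipeline"},
}

print(json.dumps(assertion, indent=2))
```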

Project team: Kody Moodley, Amrapali Zaveri, Chunlei Wu and Michel Dumontier

Status: Completed

Project page

Crowdsourcing Experimental Design

CrowdED

Theme 1

Abstract

Crowdsourcing involves the creation of HITs (Human Intelligence Tasks), submitting them to a crowdsourcing platform and providing a monetary reward for each HIT. One of the advantages of using crowdsourcing is that the tasks can be highly parallelized, that is, the work is performed by a large number of workers in a decentralized setting. The design also offers a means to cross-check the accuracy of the answers by assigning each task to more than one person and relying on majority consensus, and to reward workers according to their performance and productivity. Since each worker is paid per task, the costs can increase significantly, irrespective of the overall accuracy of the results. Thus, one important question that arises when designing such crowdsourcing tasks is how many workers to employ and how many tasks to assign to each worker when dealing with large numbers of tasks. That is, the main research question we aim to answer is: `Can we a-priori estimate the optimal workers and tasks' assignment to obtain maximum accuracy on all tasks?'. We therefore introduce CrowdED, a two-staged statistical guideline for optimal crowdsourcing experimental design, which a-priori estimates the workers and tasks' assignment needed to obtain maximum accuracy on all tasks. We describe the algorithm and present preliminary results and discussions. We implement the algorithm in Python and make it openly available on Github, and provide a Jupyter Notebook and an R Shiny app for users to reuse, interact with and apply in their own crowdsourcing experiments.
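
A minimal sketch of the kind of a-priori estimate CrowdED targets, simulating majority voting with an assumed per-worker reliability and per-HIT reward (both invented for illustration, not taken from the project):

```python
# Sketch: estimate how expected accuracy and cost grow with workers per task.
import numpy as np

rng = np.random.default_rng(1)
n_tasks, reliability, reward_per_hit = 1_000, 0.7, 0.05   # assumed values

for workers_per_task in [1, 3, 5, 7, 9]:
    # Each worker answers a task correctly with probability `reliability`.
    answers = rng.random((n_tasks, workers_per_task)) < reliability
    correct = answers.sum(axis=1) > workers_per_task / 2   # majority vote correct
    accuracy = correct.mean()
    cost = n_tasks * workers_per_task * reward_per_hit
    print(f"{workers_per_task} workers/task: accuracy ~ {accuracy:.3f}, cost = ${cost:.0f}")
```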

Project team: Amrapali Zaveri, Pedro Hernandez Serrano and Michel Dumontier

Status: Completed

Publication

Source code

Project page

Legal Interference Program for Chatbots

LIPCHAT

Theme 1 Theme 3

Abstract

The student population at Maastricht University is among the most diverse of all European universities. As a result of the language barrier and lack of knowledge, students who rent accommodation in Maastricht and surrounding areas often face confusion in understanding their rights as tenants in the Netherlands, and are thus vulnerable to exploitation by insincere landlords and property owners.
We propose to develop a chatbot that is able to answer questions about a student's legal rights in a particular situation and to identify the responsible party when it comes to disputes (financial or otherwise). Machine Learning (and in particular Deep Learning) has become a popular methodology for generating responses to users' questions in chatbots. There are two general approaches: 1) selecting from a predefined set of responses and 2) generating the sentences in the response from scratch. We develop a hybrid approach combining attributes from both of these methods.
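
A minimal sketch of the retrieval half of such a hybrid approach, using TF-IDF similarity over a couple of invented question/answer pairs rather than the project's legal knowledge base:

```python
# Sketch: pick the predefined answer whose question is most similar to the user's question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = [
    ("Can my landlord raise the rent without notice?",
     "In the Netherlands, rent increases generally require written notice."),
    ("Who pays for repairs to the boiler?",
     "Major repairs are usually the landlord's responsibility."),
]

vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform([q for q, _ in faq])

def answer(user_question: str) -> str:
    similarity = cosine_similarity(vectorizer.transform([user_question]),
                                   question_vectors)
    return faq[similarity.argmax()][1]

print(answer("Is my landlord allowed to increase my rent suddenly?"))
```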

Project team: Kody Moodley

Status: Completed

 

Project page

A predictive machine learning model for Alzheimer's disease

Predict AD

Theme 2

Abstract

Despite the increasing availability of brain health related data, effective methods to predict the conversion from Mild Cognitive Impairment (MCI) to Alzheimer’s disease (AD) are still lacking. As currently available and emerging therapies have the greatest impact when provided at the earliest disease stage, the prompt identification of subjects at high risk of conversion to full AD (e.g. MCI patients) is of great importance in the fight against Alzheimer’s disease. In the current era of big data, advanced computation over large amounts of brain images, electronic health records, and wearable devices may enable new insights into the working mechanisms of the human brain. A predictive machine learning algorithm based only on non-invasively and easily collectable predictors potentially allows us to identify MCI subjects at risk of conversion to full AD. Although these new technologies also raise potential challenges and ethical questions, an applicable screening algorithm would ensure better selection of subjects to include in clinical trials for preventative treatments, and would allow early identification of the subjects who would benefit the most from such treatments.
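
A minimal sketch of the modelling step on synthetic stand-in predictors; the study itself used real cohort data and more elaborate models:

```python
# Sketch: train and cross-validate a classifier that flags MCI subjects at risk of conversion.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([
    rng.normal(70, 8, n),      # age (synthetic)
    rng.normal(26, 3, n),      # cognitive test score (synthetic)
    rng.integers(0, 2, n),     # binary genetic-style risk marker (synthetic)
])
# Synthetic outcome loosely tied to the predictors, for illustration only.
risk = 0.05 * (X[:, 0] - 70) - 0.3 * (X[:, 1] - 26) + 1.2 * X[:, 2]
y = (risk + rng.normal(0, 1, n) > 0.5).astype(int)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC ~ {scores.mean():.2f}")
```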

Project team: Nadine Rouleaux, Massimiliano Grassi, Michel Dumontier
Technical/infrastructure support: Alexander Malic Pedro Hernandez Serrano and Seun Adekunle.

Status:  Completed


Poster AD
Article

Project page

Analyzing partitioned FAIR data responsibly

VWData (FAIRHealth)

Abstract

Since health “Big Data” is extremely privacy sensitive, using it responsibly is key to establishing trust and unlocking the potential of this data for the health challenges facing Dutch society now and in the future. One of the unique characteristics of Big Data in health is that it is extremely partitioned across different entities. Citizens, hospitals, insurers, municipalities, schools, etc. all hold a partition of the data and nobody has the complete set. Sharing across these entities is not easy due to administrative, political, legal-ethical and technical challenges. In this project, we will establish a scalable technical and governance framework which can combine access-restricted data from multiple entities in a privacy-preserving manner. From Maastricht UMC+, we will use clinical, imaging and genotyping data from The Maastricht Study, an extensive (10,000 citizens) phenotyping study that focuses on the aetiology of type 2 diabetes. From Statistics Netherlands (CBS), which hosts some of the biggest and most sensitive datasets of the Netherlands, we will use data pertaining to morbidity, health care utilization, and mortality. Our driving scientific use case is to understand the relation between diabetes, lifestyle, socioeconomic factors and health care utilization, which will inform guidelines with major public health impact. The work plan is divided into two interlocking work packages. A technical WP will first involve the development of a technical framework to make The Maastricht Study and CBS data FAIR (Findable, Accessible, Interoperable, and Reusable). We will couple FAIR data to a federated learning framework based on the “Personal Health Train” approach to learn target associations from the data in a privacy-preserving manner. The second WP focuses on Ethics, Law and Society Issues (ELSI): we will first establish a governance framework ensuring that the legal and ethical basis for processing the data held by the chosen test sites is sufficient for this specific scientific case, and then develop a broader and scalable governance structure to define and underpin the responsible use of Big Data in health.
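
A minimal sketch of the "bring the analysis to the data" pattern behind the Personal Health Train, with invented numbers in place of The Maastricht Study and CBS data:

```python
# Sketch: each party runs a local task and returns only aggregates; the coordinator
# combines them without ever seeing individual records.
party_a = [5.6, 6.1, 7.0, 5.9]          # e.g. lab values held by one site (invented)
party_b = [6.4, 5.8, 6.9]               # e.g. values held by another site (invented)

def local_task(values):
    """Runs inside each party's own infrastructure; only aggregates leave."""
    return {"sum": sum(values), "count": len(values)}

results = [local_task(party_a), local_task(party_b)]
global_mean = sum(r["sum"] for r in results) / sum(r["count"] for r in results)
print(f"federated mean = {global_mean:.2f}")
```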

Project team: IDS: Michel Dumontier, Claudia van Oppen, Chang Sun, Lianne Ippel, Alex Malic and Seun Adekunle
Maastro: Andre Dekker and Johan van Soest
CBS: Bob van den Berg, Marco Puts, Susan van Dijk and Ole Mussmann
The Maastricht Study: Annemarie Koster and Carla van der Kallen

 Status:  Completed

Project information
Publication
Project Github
VWData Programme
VWData Videos

Project page

Optimal combination of humans and machine using active learning towards data quality assessment

OptimAL

Theme 1

Abstract 

Therapeutic intent, the reason behind the choice of a therapy and the context in which a given approach should be used, is an important aspect of medical practice. There is a need to capture and structure therapeutic intent for computational reuse, thus enabling more sophisticated decision-support tools and a possible mechanism for computer-aided drug repurposing. For example, olanzapine is indicated for agitation and for schizophrenia, but the actual indication stated on the label is treatment of agitation in the context of schizophrenia.
Automated methods such as NLP or text mining fail to capture relationships among diseases, symptoms, and other contextual information relevant to therapeutic intent. This is where human curation is required: manually labeling these concepts in the text provides the training data these methods need to improve their accuracy. However, acquiring labeled data at a large scale is expensive, which is why machine learning (ML) methods such as active learning algorithms can be used to limit the amount of human-labeled data required while still labeling data accurately.
Existing active learning algorithms help choose the data used to train the ML model so that it can perform better than traditional methods with substantially less training data. Humans provide the labels that train the model (labeling faster), the model decides what labels it needs to improve (labeling smarter), and humans again provide those labels. It’s a virtuous circle that improves model accuracy faster than brute-force supervised learning approaches, saving both time and money. The question that is not yet answered, however, is what the optimal balance is between the ML algorithm’s accuracy and the amount of human-labeled data required to achieve high accuracy. To address these issues, we propose to develop a prototype of OptimAL that combines machine learning with human curation in an iterative and optimal manner using active learning methods in order to i) create high quality datasets and ii) build and test increasingly accurate and comprehensive predictive models.
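
A minimal sketch of an uncertainty-sampling active learning loop on synthetic data, not the therapeutic-intent texts targeted by OptimAL:

```python
# Sketch: repeatedly query the labels the model is least certain about, then retrain.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=10, replace=False))    # small seed set
unlabeled = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y_true[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = np.abs(proba - 0.5)                  # closest to 0.5 = least certain
    query = [unlabeled[i] for i in np.argsort(uncertainty)[:10]]
    labeled += query                                   # the "human" supplies these labels
    unlabeled = [i for i in unlabeled if i not in query]
    acc = model.score(X[unlabeled], y_true[unlabeled])
    print(f"round {round_}: {len(labeled)} labels, accuracy on pool = {acc:.2f}")
```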

Team members: Amrapali Zaveri, Hidde van Scherpenseel, Pedro Hernandez Serrano, Remzi Celebi, Siamak Mehrkanoon and Michel Dumontier

Status:  Completed

Project page

FAIRness of the ELIXIR Core Data Resources Implementation Study

FAIRness of the current ELIXIR Core Resources Implementation Study: Application (and test) of newly available FAIR metrics, and identification of steps to increase interoperability.

The FAIRness of the current ELIXIR Core Resources implementation study (aka FAIRCDR) aims to put the FAIR (Findable, Accessible, Interoperable and Reusable) guiding principles into practice in the context of the ELIXIR Core Data Resources (CDRs). The FAIRCDR implementation study enabled the collaboration of expert data curators of the ELIXIR CDRs with experts on FAIRness assessments (from the so-called “FAIR metrics group”). FAIRness refers to a continuum of features, attributes and behaviors that a digital object exhibits to comply with the FAIR principles. Preliminary findings suggest that the participating ELIXIR CDRs already implement the FAIR principles to a large extent, yet the study also identified some points that can be improved. Moreover, the FAIRCDR implementation study provided the opportunity to test different forms of conducting FAIRness assessments, leading to better tools for FAIRness self-assessment.
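
A minimal sketch of one automatable FAIRness indicator: does an identifier resolve, and does it offer machine-readable metadata via content negotiation? The URL and Accept header are illustrative, and real assessments cover many more indicators:

```python
# Sketch: two simple checks loosely corresponding to Findability/Accessibility
# (the identifier resolves) and Interoperability (machine-readable metadata).
import requests

def check_identifier(url: str) -> dict:
    resolves = requests.get(url, timeout=10, allow_redirects=True)
    negotiated = requests.get(url, timeout=10,
                              headers={"Accept": "application/ld+json"})
    return {
        "identifier": url,
        "resolves (F/A)": resolves.status_code == 200,
        "returns JSON-LD (I)": "json" in negotiated.headers.get("Content-Type", ""),
    }

# Example identifier; any dataset or record IRI could be checked the same way.
print(check_identifier("https://identifiers.org/uniprot:P04637"))
```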

Team members: Ricardo de Miranda Azevedo, Pete McQuilton, Rob Hooft, Susanna Sansone, Michel Dumontier

Status:  Completed

Project page 

Project is funded by “Integration of Data Resources from ELIXIR Nodes: Increasing the sustainability of the ELIXIR Data Resource landscape” https://drive.google.com/file/d/0B60jEEGzhM72NWFBUzdEck1rNnc/view