In this project, we aim to investigate the optimal combination of humans (via microtask crowdsourcing) and machines (via machine learning) using active learning to assess the quality of drug data. The main research question we aim to address is: "What is the optimal combination of humans and machines for data quality assessment?". We believe that determining the optimal level of interaction between human and machine would help decide the pool of workers needed to achieve optimal results, thereby greatly reducing costs.


Therapeutic intent, the reason behind the choice of a therapy and the context in which a given approach should be used, is an important aspect of medical practice. There is a need to capture and structure therapeutic intent for computational reuse, thus enabling more sophisticated decision-support tools and a possible mechanism for computer-aided drug repurposing. For example, olanzapine is indicated for agitation and for schizophrenia, but the actual indication stated on the label is treatment of agitation in the context of schizophrenia.
Automated methods such as NLP and text mining fail to capture relationships among diseases, symptoms, and other contextual information relevant to therapeutic intent. Human curation is therefore required to manually label these concepts in the text, providing training data that improves the accuracy of these methods. However, acquiring labeled data is expensive at scale. Machine learning (ML) methods, such as active learning algorithms, can limit the amount of human-labeled data required while still labeling data accurately.
Existing active learning algorithms help choose the data used to train the ML model, so that it can outperform traditional methods with substantially less training data. Humans provide the labels that train the model (labeling faster), the model decides which labels it needs to improve (labeling smarter), and humans again provide those labels. It is a virtuous circle that improves model accuracy faster than brute-force supervised learning, saving both time and money. The question that is not yet answered, however, is what the optimal balance is between the ML algorithm's accuracy and the amount of human-labeled data required to achieve high accuracy.
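The train–query–label cycle described above can be sketched as follows. This is a minimal illustration using uncertainty sampling on synthetic data, not the project's actual pipeline; the seed size, query batch size, and classifier are all illustrative assumptions, and the "human" labels are simulated by reading from a known ground truth.

```python
# Sketch of an uncertainty-sampling active learning loop (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a corpus of candidate annotations.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

labeled = list(range(10))  # small seed set labeled by humans up front
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(5):  # each round: train, query, "annotate"
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])
    # Uncertainty sampling: query items whose top-class probability is lowest.
    uncertainty = 1 - proba.max(axis=1)
    query_idx = np.argsort(uncertainty)[-20:]  # 20 most uncertain items
    newly_labeled = [pool[i] for i in query_idx]
    labeled.extend(newly_labeled)  # in practice, crowd workers label these
    pool = [i for i in pool if i not in newly_labeled]

print(f"labeled set size: {len(labeled)}")  # 10 seed + 5 rounds x 20 = 110
```

In each round the model spends the human labeling budget only on the examples it is least sure about, which is what allows the accuracy-versus-labeling-cost trade-off in the research question to be measured directly.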
To address these issues, we propose to develop a prototype of OptimAL, which iteratively combines machine learning with human curation using active learning methods in order to i) create high-quality datasets and ii) build and test increasingly accurate and comprehensive predictive models.