Synthetic data, digital twins, and American money

Sense the Science at the Faculty of Science and Engineering 8

Artificial intelligence can become trustworthy in medicine if trained on high-quality data from a sufficiently large and diverse patient population. But what happens when data is scarce because a condition or trait is extremely rare? Michel Dumontier and his team in Maastricht and the United States are addressing this by combining real and synthetic data to develop reliable AI systems. In October, their project received an $8 million US grant.

 

How does a biochemist end up as a computer scientist? Michel Dumontier explains how it happened to him. "During my PhD research, I worked in a mass spectrometry facility until all of my colleagues left to start a company. I had two options: learn how to maintain mass spectrometry devices or do something completely different. My supervisor suggested I switch to bioinformatics." That switch eventually led to his role as a Distinguished Professor in Data Science at the Faculty of Science and Engineering at Maastricht University.

Complete control

Computer and data science indeed turned out to be something Dumontier likes. "In biochemistry, you investigate the parts and behaviour of a complex living system. Experiments often do not work, and it is not clear why. In computer science, you build the system yourself, so you have complete control. If it does not work, you did something wrong. This is very empowering and rewarding for someone who comes from the complete chaos of biology."

Despite leaving biochemistry, Dumontier continues to work in biology, specifically biomedicine. "Our goal is to develop methods and tools in data science and artificial intelligence that advance the science and the practice of medicine. Clinicians use their training, their experience, and their knowledge of clinical practice guidelines to care for individual patients. AI can help them by analysing large amounts of patient data to suggest possible recommendations, especially for rare or complex cases that are difficult to diagnose and treat."

Technology

Assume you create an AI system based on medical data from people in Maastricht that accurately predicts the outcome of treatments on an individual basis. Given the international profile of Maastricht, will that system still be accurate in predicting the outcome for people from other parts of the world? "From a legal and ethical standpoint, a trustworthy AI system must be capable of making accurate predictions for all people. From the point of view of technology, you can only achieve this by exposing the AI system to data from all of the diverse subgroups of people you wish to assist with the finished system."

Michel Dumontier at work with his students

Training an AI system with sufficient data is a daunting task, especially if you are looking for data from patients with rare diseases or rare genetic traits. Researchers frequently use synthetic data, which is computer-generated data that closely resembles real-life data. The use of synthetic data is debatable. Who guarantees that it accurately mimics nature?
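One simple way to picture both the idea and the check is the sketch below. It is not the method used by Dumontier's team, which the article describes only in general terms. It fits a basic statistical model to a small invented "real" cohort, draws new records from that model, and compares summary statistics between the two. The cohort, the variable choices (age and systolic blood pressure), and the function names are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit_bivariate_normal(rows):
    """Estimate means, standard deviations, and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, rng):
    """Draw n new records from the fitted model; no real record is copied."""
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mix z1 and z2 so the pair keeps the fitted correlation.
        mixed = model["rho"] * z1 + math.sqrt(1 - model["rho"] ** 2) * z2
        y = model["my"] + model["sy"] * mixed
        out.append((x, y))
    return out


rng = random.Random(0)

# Hypothetical "real" cohort of (age, systolic blood pressure) pairs,
# invented here purely for illustration.
real = []
for _ in range(500):
    age = rng.gauss(60, 8)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 10)))

model = fit_bivariate_normal(real)
synthetic = sample_synthetic(model, 5000, rng)
refit = fit_bivariate_normal(synthetic)

print(f"mean age: real {model['mx']:.1f}, synthetic {refit['mx']:.1f}")
print(f"correlation: real {model['rho']:.2f}, synthetic {refit['rho']:.2f}")
```

If the synthetic records reproduce the averages and correlations of the real ones, that is a first, limited sign that they mimic the original data. Real medical data has many more variables and far more complex relationships, so a check like this is only a starting point.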

More accurate

Dumontier describes the new method he and his colleagues use to generate synthetic data: "Traditionally, synthetic data is generated by telling a machine what to do. We do not tell our machines what to do a priori, but rather allow them to learn how to generate optimal synthetic data while accounting for all possible combinations of variables. In the end, by using synthetic data, our AI system can make much more accurate predictions on an individual level than it would have been able to do without the synthetic data."

Michel Dumontier sitting with his laptop

Dumontier’s ultimate goal is to develop an AI system that is so accurate that it can generate a digital twin of any person. "Doctors can use this twin to answer questions such as: What if I give treatment A or B? When will the real patient benefit the most? This will allow us to take personalised medicine to a new level, where we can test treatments within a digital twin to avoid bad decisions and improve the outcome for the real patient." 

 

Text: Patrick Marx

Photography: Brian Megens

Synthetic copy

Dumontier and his colleagues demonstrated that synthetic data can accurately mimic real data. They created a synthetic version of the Maastricht Studie’s data. "Valuable personal information about study participants is not included in this synthetic version. We also demonstrated that an AI system cannot trace data back to individuals. Researchers can now use these representative data to answer their research questions. Legal agreements are only required if they subsequently want access to the real data, thereby accelerating the pace of preliminary biomedical research."

 
