
Synthetic Data and LLMs: The New, High-powered Healthcare Research Assistant

Amid all the hype about generative artificial intelligence (AI) and large language models (LLMs), such as ChatGPT and Bard, healthcare professionals are eager to understand how these technologies can improve the delivery of patient care.

While LLMs certainly hold promise to transform the healthcare industry, one related concern is often overlooked: patient privacy. To achieve accuracy and deliver value to end users, LLMs must be trained on massive amounts of data. In healthcare, that means patient data, and patient data brings privacy concerns. If LLMs are trained on real patient data, that data could be compromised and patients' privacy violated.

To overcome these privacy concerns, many healthcare organizations are turning to synthetic data, which frees users from potential privacy pitfalls and helps researchers arrive more quickly at insights that improve care delivery. Synthetic data is non-reversible, artificially created data that replicates the statistical characteristics and correlations of real-world, raw data. Because it uses a statistical approach to build a completely new data set from discrete and non-discrete variables of interest, synthetic data contains no identifiable information.

The Promise of LLMs in Healthcare

An LLM is a form of artificial intelligence that can generate human-like text. Recently, LLMs gained recognition within medicine when they demonstrated the ability to pass medical exams. Separately, one study found that chatbots often provided more empathetic written responses to patient questions than physicians.  

Although healthcare applications of LLMs are generally in their infancy, the technology could support numerous important use cases, including clinical decision support, automated patient communication, personalized treatment plans, and coding and billing. Because LLMs can comprehend and summarize complex medical concepts, they may offer valuable insights when working alongside humans.

However, while general-purpose LLMs such as ChatGPT can perform many medically relevant tasks, they have not been exposed to medical records during self-supervised training and are not specifically tailored to perform any medical task, according to a JAMA Network article by Stanford researchers.  

How Synthetic Data Helps 

Synthetic data is a form of data anonymization. Think of it as a way to take private patient information and let researchers and other users work with the data without compromising patient privacy. The process delivers greater utility by building a statistical model of the original data and populating it with entirely new data points: novel, synthetic patients.

The result is a data set that has the same statistical properties as the original data but doesn't contain any individually identifiable patient information. Under the Health Insurance Portability and Accountability Act (HIPAA), 18 categories of patient identifiers must be protected. These include names, dates, email addresses, Social Security numbers, health plan beneficiary numbers, vehicle identification numbers, and any other unique identifying numbers, characteristics, or codes.

Despite the absence of these identifiers, synthetic data delivers the full utility of the original data because researchers can freely explore the information. Synthetic data can mimic the characteristics of a real data set without exposing sensitive patient information, making it a good option for analyzing large but sensitive samples of individual-level patient data. It differs from de-identified data in that it is built from scratch rather than derived from individual patient records, which means it cannot be de-anonymized. Unlike de-identified data, synthetic data puts a protective layer around the original data, preserving both the privacy and the underlying value of that data.
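To make the statistical approach concrete, here is a minimal, deliberately naive sketch in Python: fit a multivariate normal model to the numeric columns of a real table and sample the categorical columns from their observed frequencies, so the synthetic rows preserve the original correlations and proportions without copying any real record. The column names and the simple modeling shortcut are illustrative assumptions only, not a description of any particular vendor's method.

```python
# Illustrative sketch only: generate synthetic records that preserve the means,
# variances, correlations, and category proportions of a real table without
# reproducing any individual row. Real synthetic-data engines use far more
# sophisticated models; the columns below are invented demo data.
import numpy as np
import pandas as pd

def make_synthetic(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)

    # Numeric columns: fit a multivariate normal (mean vector + covariance
    # matrix) and draw brand-new samples from the fitted model.
    numeric = real.select_dtypes(include="number")
    synth = pd.DataFrame(
        rng.multivariate_normal(numeric.mean().to_numpy(),
                                numeric.cov().to_numpy(),
                                size=n_rows),
        columns=numeric.columns,
    )

    # Categorical columns: sample from the observed frequencies, so the
    # proportions match but no real patient record is carried over.
    for col in real.columns.difference(numeric.columns):
        freqs = real[col].value_counts(normalize=True)
        synth[col] = rng.choice(freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy())

    return synth

# Hypothetical usage with made-up columns:
real = pd.DataFrame({
    "age": [67, 72, 58, 80, 63, 71],
    "length_of_stay_days": [4, 7, 3, 9, 5, 6],
    "antibiotic": ["A", "B", "A", "B", "A", "B"],
})
synthetic = make_synthetic(real, n_rows=1000)
print(synthetic.describe(include="all"))
```

The principle, not the particular model, is the point: every synthetic row is drawn from a fitted statistical description of the data, never copied from a real patient.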

By combining synthetic data with LLMs, researchers gain the ability to quickly parse and summarize vast amounts of patient data without privacy issues. In this regard, the combination represents a high-powered research assistant that can sift through avalanches of data to isolate the signal from the noise.  

For example, a hospital researcher seeking to understand which antibiotic leads to better outcomes for pneumonia patients could query the hospital's electronic health records system and use technology to transform the resulting data set into synthetic data. The LLM could then be deployed as a "research assistant," in place of a human, to rapidly examine, massage, and interpret the data. Within seconds, the LLM could support a dynamic conversation and multiple queries against the same data set, a process that might take a human several hours or days. And because the data is synthetic, patient privacy is preserved.
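As a rough sketch of that workflow under stated assumptions: the functions query_ehr and ask_llm below are hypothetical placeholders for a hospital's own EHR and LLM systems, the cohort is invented demo data, and make_synthetic refers to the earlier sketch. The only point is the order of operations: query, synthesize, then converse with the LLM.

```python
# Illustrative workflow only: stubs stand in for real EHR and LLM systems, and
# make_synthetic from the earlier sketch is assumed to be in scope.
import pandas as pd

def query_ehr(sql: str) -> pd.DataFrame:
    """Stub for pulling a cohort from the hospital's EHR; returns fake demo rows."""
    return pd.DataFrame({
        "antibiotic": ["A", "B", "A", "B", "A", "B"],
        "length_of_stay_days": [4, 7, 3, 8, 5, 9],
    })

def ask_llm(prompt: str) -> str:
    """Stub for a call to whichever LLM service the organization has approved."""
    return "(LLM response would appear here)"

# 1. Pull the real cohort: pneumonia patients and the antibiotics they received.
real_cohort = query_ehr(
    "SELECT antibiotic, length_of_stay_days FROM encounters "
    "WHERE diagnosis = 'pneumonia'"
)

# 2. Replace it with a synthetic cohort before any analysis, so no identifiable
#    patient data is ever handed to the model.
synthetic_cohort = make_synthetic(real_cohort, n_rows=5000)

# 3. Let the LLM act as the "research assistant," iterating on the synthetic
#    data through follow-up questions in seconds rather than hours.
summary = synthetic_cohort.groupby("antibiotic")["length_of_stay_days"].describe()
print(ask_llm(
    "Given this summary of a synthetic pneumonia cohort, which antibiotic "
    f"is associated with shorter stays, and what caveats apply?\n{summary}"
))
```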

With an accelerated research process, healthcare organizations can more quickly deliver life-saving medications and interventions to the patients who need them, but they must account for patient privacy during research. Synthetic data, artificially created data that replicates the statistical characteristics and correlations of real-world data, is an essential component of this process. By combining synthetic data with LLMs acting as research assistants, researchers can surmount privacy barriers while decreasing the time needed to convert insights into new breakthroughs on the market.


Luz Erez is chief technology officer (CTO) of MDClone, a firm focused on unlocking healthcare data through exploration, discovery, and collaboration to improve patients' health. Prior to joining MDClone, Erez, an engineer and scientist, served as CTO, vice president of development, and chief engineer at a series of firms.