Health Data

Explainability of AI Models to Prevent Bias in Clinical Systems

Artificial intelligence (AI) has massive potential to change and improve outcomes in every industry, but we are only just beginning to implement this groundbreaking technological power in healthcare. AI systems can help with clinical decision support (CDS), early diagnosis of diseases, identifying previously unknown rare and genetic disorders, predicting patient behavior and adherence, finding new treatments, automating billing and other tasks, and many other important functions.

However, as with any technology, there are unique challenges to address with AI implementation. One of the key problems to identify and fix in the field of AI is bias. Generally speaking, a biased AI system has a disproportionately negative impact on people of a certain race, gender, orientation, culture, region, language, religion, etc.1 While bias in decision-making is not a new issue, AI-based predictions and decisions bring new types of challenges and, in many ways, AI exacerbates the issue by magnifying biases that already exist in people and cultures.

AI can provide enormous scalability to many processes and, by doing so, it can also proliferate the problems of bias on a larger scale than ever before. The issue of bias in AI falls within the broader area of AI ethics, which, in addition to bias, deals with accountability, transparency, and safety. While it’s important to address all of these areas when we address any one of these subjects, in this article, we will primarily focus on the issue of bias.

Examples of AI Bias

An AI system's outputs are probabilistic and depend on the data it has been trained on. Bias can seep into an AI system through the data used to train it as well as through the training process itself. Bias in the training data is the more common problem and is sometimes the harder one to solve. There can be several problems with training data, such as a lack of representative data, data skewed by the socioeconomic status of the subjects, too much noise, embedded cognitive biases, bad or stale data, and a lack of human oversight of the data.

As AI becomes more mainstream, we are finding problematic outcomes in many areas. One well-known study found that an algorithm concluded Black patients were healthier than white patients because the cost of healthcare had been used as a proxy for the need for healthcare. This is obviously an incorrect assumption, especially in the US, where healthcare visits are directly tied to having health insurance and to patients' socioeconomic status. Such predictions can have devastating effects on underserved patient populations through harmful policies and practices.2

In another recent study, significant discrepancies were found in mortality predictions for sepsis patients across social determinants of health such as race, marital status, insurance type, and language. In this case, the bias was primarily due to using a universal AI model trained on data that didn't adequately account for various segments of the population.3 This kind of bias in automated systems can also have a negative impact on the health and treatment options of underserved groups.

Structured and Unstructured Data Examples

Bias in AI systems can occur when using both structured and unstructured data for training. Examples of unstructured data include images and text, while structured data includes numerical or discrete categorical values in an electronic health record (EHR) system. Due to the nature of data generation, there is far more unstructured data than structured data, and because of the subjectivity and nuances in unstructured data, it carries a stronger potential for bias. For example, language used by a younger population may differ in significant ways from language used by an older population. In treatment areas such as mental health, where diagnosis and treatment rely heavily on patient experience and notes, biases based on gender and race have been shown to have a potentially harmful impact if AI is used to predict mental health status.4

Images and videos are another type of unstructured data, and existing image collections often lack the data to represent the diversity of skin color and of clothing based on cultural norms. For example, images taken with expensive cellphones by wealthy populations may be significantly higher quality than images taken with less expensive cellphones. If only high-quality images are used to train a computer vision AI model, we would be ignoring a large segment of representative images. Forty-five percent of all images in ImageNet, a large repository of images used for machine learning, come from the United States, while China and India together represent only 3 percent of all the images.5 There are relatively few images of dark-skinned subjects with skin cancer and dermatological diseases. This poses a serious challenge when building AI systems that can automatically detect skin cancer, Lyme disease, and other skin-related diseases.

Causes of Bias in AI Systems

Machine learning (ML) models in general, and deep learning models in particular, are statistical models that learn to associate a set of values in their input with an output value. The function that makes the association is entirely dependent on three factors: the type of model, the values of its tuning parameters, and the training set that the model learns from.

It is important to stress that ML models are unable to do any generalization using common sense. It follows that any bias present in the training data will be learned without complaint by the ML model. This is one of the most treacherous areas of ML, and researchers know that the results of their model are only as good as the data available for training. In particular, the quantity of samples, the distribution of data points in the model space and its representativeness, the accuracy of the labeling of the target variable, and any biases already contained in the data will all have an impact on the model that is created.

Each one of the above can be a source of bias in the resulting model. Let us imagine a scenario in which a model is being trained to recognize COVID-19 in images of lungs obtained through computed tomography (CT).

Perhaps the most common type of bias is selection bias, which happens when the training data contains examples across a certain attribute, for example a patient's age, but their distribution does not reflect the distribution found in the environment in which the model will be used. In other words, certain subclasses are either overrepresented or underrepresented. It is important to note that, in certain cases, the attribute for which there is bias is not one that is represented in the data. For example, there may be racial bias but no variable for race in the data, which makes the bias impossible to detect statistically. While not recording some types of data, like race, is often done with the best of intentions, it also means that if a race were under-represented, we would have no statistical means to find out.
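
One practical way to surface this kind of selection bias, when the attribute is recorded, is to compare the training data's distribution against published figures for the population the model will serve. Below is a minimal sketch in Python, assuming a pandas DataFrame of training records with an illustrative "age_band" column and hypothetical reference proportions; neither comes from the studies cited here.

```python
# A minimal sketch of a selection-bias check: does the training sample's
# age distribution match the deployment population? Values are illustrative.
import pandas as pd
from scipy.stats import chisquare

train = pd.DataFrame({
    "age_band": ["18-39"] * 700 + ["40-64"] * 250 + ["65+"] * 50
})

# Proportions we expect in the deployment population (hypothetical figures).
reference = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}

observed = train["age_band"].value_counts()
expected = pd.Series(reference) * len(train)   # scale to the same total count

# Align both series on the same category order before testing.
stat, p_value = chisquare(observed.reindex(expected.index, fill_value=0), expected)

print(observed / len(train))                   # training-set proportions
print(f"chi-square p-value: {p_value:.2e}")    # a tiny p-value flags a mismatch
```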

Another source of bias is data that contains too few samples to thoroughly represent the possibilities and variations encountered in real life. In this instance, the model will only work in certain cases. This is problematic because the training set and the test set (the one used to assess the accuracy of the model) are typically taken from the same dataset and split before learning, so testing the model in the laboratory will be completely oblivious to this bias. A certain fraction of the data is under-represented and, therefore, the model is less accurate on it. An extreme example is when a subset of real-world data is not represented at all. Suppose, for example, that the training images are all from older individuals. The model will appear to work well, but when used in practice and fed with images of individuals of all ages, it may perform poorly for younger patients.
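
One safeguard is to audit accuracy separately for each subgroup rather than relying on a single overall figure. The sketch below uses synthetic data and an illustrative "age band" attribute to show how an overall score can hide poor performance on an under-represented group; it is not the setup used in any of the studies cited here.

```python
# A minimal sketch of a per-subgroup accuracy audit on a held-out test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Heavily skewed sample: only 10 percent of patients are in the "younger" group.
age_band = rng.choice(["younger", "older"], size=n, p=[0.1, 0.9])
x = rng.normal(size=(n, 3))
# The true relationship is (hypothetically) different for the rare group.
y = (x[:, 0] + np.where(age_band == "younger", -x[:, 1], x[:, 1]) > 0).astype(int)

x_tr, x_te, y_tr, y_te, band_tr, band_te = train_test_split(
    x, y, age_band, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(x_tr, y_tr)
y_pred = model.predict(x_te)

print("overall accuracy:", round(accuracy_score(y_te, y_pred), 3))
for band in ["younger", "older"]:
    mask = band_te == band
    print(band, "accuracy:", round(accuracy_score(y_te[mask], y_pred[mask]), 3))
```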

Moreover, most ML classifiers work better when the sets of different classes are numerically balanced. For example, when training a model to recognize which images of lungs show signs of COVID-19, we should be sure to include a sufficient number of images of lungs from healthy patients.
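
When rebalancing the data itself is not possible, many classifiers can compensate with class weights. A minimal sketch, using synthetic data and scikit-learn's built-in "balanced" weighting, follows; the roughly 10 percent positive rate is a hypothetical value.

```python
# A minimal sketch of compensating for class imbalance with class weights.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)   # roughly 10 percent positive cases

# Inspect the weights scikit-learn would assign to rebalance the classes.
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), np.round(weights, 2))))

# Most scikit-learn classifiers accept the same idea directly:
model = LogisticRegression(class_weight="balanced").fit(X, y)
```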

A subset of this type of bias takes place when a category of data is excluded during the preprocessing stage. In the example above, a set of images might be excluded because they are of poor quality, but it could turn out that there was a significant statistical link between those images and the target variable; by excluding them, we would have introduced bias into the model. We should always ensure that any data excluded from a training set is statistically neutral with respect to the target variable. The problem is that this may not be at all obvious.
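
A simple sanity check is to test whether the exclusion rule is independent of the target label, for instance with a chi-square test of independence. The sketch below uses hypothetical counts of kept versus excluded scans.

```python
# A minimal sketch of checking whether an exclusion rule (for example, dropping
# poor-quality images) is statistically neutral with respect to the target.
import pandas as pd
from scipy.stats import chi2_contingency

# Rows: kept vs. excluded scans; columns: COVID-negative vs. COVID-positive labels.
table = pd.DataFrame(
    {"negative": [800, 60], "positive": [150, 40]},
    index=["kept", "excluded"],
)

chi2, p_value, dof, _ = chi2_contingency(table)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests exclusion is linked to the label
```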

Another way of introducing bias is when the conditions under which the training data was collected differ from those in the real world. For example, all images in a training set may be taken with a certain model of machine, or with particular machine settings, different from the machines that will provide data in the real world. This is a common cause of models working well in laboratories but then failing to perform when used in normal clinical settings (the problem of poor transferability).

The next source of bias is inconsistent labeling. If the training images are labeled by two doctors who apply different diagnostic criteria, for example, we may run into this problem. Finally, there may be underlying biases in the training set that simply reflect the biases and inequalities of the social world in which the set was collected. In our case, this would happen if our dataset came from doctors who were more reluctant to give patients of a certain gender or social status a positive COVID-19 diagnosis.6
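
Label inconsistency between annotators can be measured before training, for example with Cohen's kappa. The sketch below uses made-up labels for two hypothetical doctors reviewing the same set of scans.

```python
# A minimal sketch of checking label consistency between two annotators.
from sklearn.metrics import cohen_kappa_score

doctor_a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
doctor_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(doctor_a, doctor_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values well below 1 signal inconsistent criteria
```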

Often these biases affect underserved populations. For example, as certain regions are less likely to be equipped with sophisticated CT scan machines, data used to train the models will be structurally biased against these regions' populations.7 It is important to stress that any of these biases can take on a much darker and antisocial form when the biased variable is one of social relevance.

Addressing the Bad Data Problem

There are several aspects of data quality to consider, such as the sources of the data and its accuracy and recency. We should also understand whether the data is “balanced” and whether it accurately describes the population in question for an AI model. For example, in a dataset used to predict the probability of diabetes, we should look at the balance of genders, the frequency distributions of age and race, and other variables that may affect the probability of a diabetes diagnosis.
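
A lightweight profiling pass, run before any modeling, can make such imbalances visible. The sketch below assumes a pandas DataFrame with illustrative "gender," "race," "age," and "diabetes" columns; the data shown is invented purely for demonstration.

```python
# A minimal sketch of profiling a dataset before training a diabetes model.
import pandas as pd

df = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M", "F", "F"],
    "race":     ["A", "B", "A", "A", "B", "B", "A", "B"],
    "age":      [34, 51, 47, 62, 29, 55, 41, 68],
    "diabetes": [0, 1, 0, 1, 0, 1, 0, 1],
})

# How is each group represented, and how does the outcome rate vary by group?
print(df["gender"].value_counts(normalize=True))
print(df.groupby("race")["diabetes"].agg(["count", "mean"]))
print(pd.cut(df["age"], bins=[0, 40, 65, 120]).value_counts(normalize=True))
```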

Systemic and cultural biases creep into the data as well, and it can often be difficult to identify these biases. It is important to have diversity in the implementation team to look at various aspects of the data as well as the models that are created using the data.

Very interesting insights on bad data came out of the COVID-19 pandemic. In its scathing report on the effectiveness of AI tools to fight COVID-19, the Alan Turing Institute concluded that none of the hundreds of ML models developed in the hope of fighting the pandemic had any valuable impact. This followed other studies that also concluded that none of the hundreds of models developed were fit for actual clinical use.8-10

Researchers found that, in the urgency to create models, researchers had used data of dubious or unknown provenance, data that was mislabeled, and datasets spliced together from completely different sources. This included many instances of undetected duplication, leading models to be trained and tested on duplicates of the same instances, which made them look as if they were performing better than they actually were. In some cases, the data had uncertain origin and content. In one case, researchers unwittingly used images of healthy children's lungs, leading the ML to learn to simply identify children rather than COVID-19. In a study conducted at the University of Cambridge,11 a dataset was used that contained scans from patients both standing and lying down. However, patients who were lying down were generally those in more severe condition, and consequently the model learned simply to associate illness with position. In another case, the ML was detecting the font of the labels used by hospitals with a higher incidence of serious cases and associating that font with the seriousness of the condition.
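
Exact-duplicate leakage of the kind described above can often be caught with a simple content-hash check across the training and test folders before any modeling begins. The sketch below is a generic approach, not the method used in the cited studies, and the folder paths are placeholders.

```python
# A minimal sketch of detecting exact duplicates that leak between training and
# test image folders by hashing file contents. Paths are placeholders.
import hashlib
from pathlib import Path

def content_hashes(folder: str) -> dict[str, str]:
    """Map the MD5 digest of each file's bytes to its path."""
    return {
        hashlib.md5(p.read_bytes()).hexdigest(): str(p)
        for p in Path(folder).rglob("*") if p.is_file()
    }

train_hashes = content_hashes("data/train")
test_hashes = content_hashes("data/test")

leaked = set(train_hashes) & set(test_hashes)
for digest in leaked:
    print("duplicate:", train_hashes[digest], "<->", test_hashes[digest])
print(f"{len(leaked)} duplicated files across the split")
```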

The research also highlights several instances of bias introduced at the labeling stage. For example, many medical images were labeled according to a doctor's opinion rather than a test result, which embedded any bias the doctors had into the ground truth.

While these types of mistakes seem trivial and easy to fix, the problem is often even realizing that they are happening.

Explainability of AI Models

The most obvious way of combating bias is to improve data quality and sampling quality, as well as being careful not to introduce bias during data wrangling. Unfortunately, in most practical situations, and in particular in a time of emergency such as the recent pandemic, researchers have to rely on whatever data already exists, and in many cases there simply are not the resources to do better.

In these cases, ML is often trained with the best data available or collectable. It is fair to speculate that most ML systems are built in less-than-ideal situations and with less-than-ideal data. Yes, bias should and could be avoided at the source, but being able to detect it more easily down the line is extremely valuable. It provides reassurance that if something was overlooked in earlier phases, or simply could not be done better, there is still a failsafe mechanism to alert us that something went wrong.

In the discussion of which qualities of ML-based systems make it easier to discover bias, explainability and transparency play a major role.

While some ML algorithms, like decision trees, are inherently easy to explain, many others are “black boxes.” In other words, they provide an answer but no rationale for why they chose it. This is particularly true of deep learning, due to the number of parameters commonly used and the intricacy of its mechanisms. But, in many cases, it would obviously be very useful to have an idea of why the model chose an answer.

Recently, there has been a lot of research on XAI, or explainable AI, a set of techniques that set out to explain why an ML model provides a certain answer. These range from providing a list of the input variables that had the most influence on the decision, to providing the samples from the training set that most resemble the one in question, to providing a visual explanation, among many others.
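
As one concrete example of the first kind of technique, the sketch below ranks input variables by permutation importance, a model-agnostic measure of how much shuffling each variable degrades the model's performance. The model, features, and feature names are synthetic placeholders, not drawn from any of the studies cited here.

```python
# A minimal sketch of a model-agnostic explanation: permutation importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)     # only features 0 and 2 matter
feature_names = ["lab_value", "age", "heart_rate", "bmi"]  # illustrative names

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Print the variables ranked by how much shuffling them hurts accuracy.
for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda item: -item[1]):
    print(f"{name:12s} {score:.3f}")
```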

While XAI is commonly discussed in terms of the trust or confidence that humans have in a system, it also has a less discussed but important role in catching bias. If a system is a black box, it is hard to detect unwanted behaviors, while a system that offers more insight into its reasoning is also more open to scrutiny. This includes the ML communicating its limitations clearly. Often it is in the interest of the makers of ML systems to envelop their creations in mystique and to encourage users to adopt a sort of “magical thinking” about them. Any system bias will be much harder to discover in these conditions. Implementers of ML should adopt a policy of demystification and openness about the shortcomings and limitations of the technology. While mystery can help sell less critical types of applications, healthcare is an environment that values things that are clear, well understood, and perceived as robust and reliable. Eliminating suspicions of bias can go a long way toward fostering acceptance and correct use of ML.

Well-designed AI, on the other hand, can also help to overcome human bias. AI that works well is immune to the vagaries of human psychology. This means that when deployed in tandem with humans, a robust and transparent model can make humans more aware of the biases they may unconsciously adopt, presenting a different perspective and justifying its output in a way that makes the human question their own assumptions.

Transparency Is Key

AI is woven into the daily lives of every individual in the modern world, and it is important to approach implementing AI solutions in the proper manner rather than stay away from them altogether. The benefits of AI surpass the risks in most situations.

Although bias is a serious problem for the adoption of ML in clinical environments, we have seen that it is possible to prevent bias with good practices across the ML lifecycle. First, we can ensure that training and testing data are free from bias. Second, we can make sure that bias is not introduced during the data wrangling phase. Finally, we can adopt a policy of transparency about the capabilities of AI and adopt models that can be explained in plain language.

Explainability and transparency make AI systems less subject to bias themselves and help the humans who work with them become less subject to bias as well. Explainability and transparency are, therefore, two topics that should be kept at the forefront of the concerns of any manager looking into adopting AI in a clinical setting.

Notes

1. Daniel Zhang, Nestor Maslej, Erik Brynjolfsson, John Etchemendy, Terah Lyons, James Manyika, Helen Ngo, Juan Carlos Niebles, Michael Sellitto, Ellie Sakhaee, Yoav Shoham, Jack Clark, and Raymond Perrault, “The AI Index 2022 Annual Report,” AI Index Steering Committee, Stanford Institute for Human-Centered AI, Stanford University, March 2022.

2. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. https://doi.org/10.1126/science.aax2342

3. Wang, H., Li, Y., Naidech, A. et al. Comparison between machine learning methods for mortality prediction for sepsis patients with different social determinants. BMC Med Inform Decis Mak 22, 156 (2022). https://doi.org/10.1186/s12911-022-01871-0

4. Chen, I. Y., Szolovits, P., & Ghassemi, M. (2019). Can AI help reduce disparities in general medical and Mental Health Care? AMA Journal of Ethics, 21(2). https://doi.org/10.1001/amajethics.2019.167

5. Shankar, S., Halpern, Y., Breck, E., Atwood, J., Wilson, J., & Sculley, D. (2017, November 22). No classification without representation: Assessing geodiversity issues in open data sets for the developing world. arXiv.org. Retrieved July 20, 2022, from https://arxiv.org/abs/1711.08536v1

6. Daneshjou R, Smith MP, Sun MD, Rotemberg V, Zou J. Lack of Transparency and Potential Bias in Artificial Intelligence Data Sets and Algorithms: A Scoping Review. JAMA Dermatol. 2021;157(11):1362–1369. doi:10.1001/jamadermatol.2021.3129

7. Luengo-Oroz M, Bullock J, Pham KH, Lam CSN, and Luccioni A. From Artificial Intelligence Bias to Inequality in the Time of COVID-19. IEEE Technology and Society Magazine, vol. 40, no. 1, pp. 71-79, March 2021. doi: 10.1109/MTS.2021.3056282

8. Mateen B, Wooldridge M, von Borzyskowski I, Mazumder A. Data science and AI in the age of COVID-19. Alan Turing Institute Report, 2021. https://www.turing.ac.uk/sites/default/files/2021-06/data-science-and-ai-in-the-age-of-covid_full-report_2.pdf

9. Roberts M., Driggs D., Thorpe M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 199–217. 2021. https://doi.org/10.1038/s42256-021-00307-0

10. Wynants L, Van Calster B, Collins GS, Riley RD, Heinze G, Schuit E, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020;369:m1328. doi:10.1136/bmj.m1328

11. Roberts M., Driggs D., Thorpe M. 2021.


Abidur Rahman is the vice president of innovation at Eversana and the Head of the Cognitive Core AI.

Gabriel Scali is an affiliate professor at ESCP and director at Reckon Digital.