Unstructured Data: An Important Piece of the Healthcare Puzzle

By Kapila Monga and Harpreet Singh


With the deluge of Big Data and a plethora of new avenues for collecting information, structured data is no longer the only place from which meaningful information can be derived. Advances in computing power and infrastructure now allow users to handle datasets of greater volume, velocity, and variety, making it more feasible for businesses to analyze unstructured data sources. In healthcare, we are on the cusp of going mainstream with unstructured data.

Not all information that is useful for decision-making is available in a structured format; this fact is not new. We are at a point in healthcare where care costs are high and yet come with no guarantee of a correspondingly good quality of care; we need a fix. We need to know whether people are at risk before they get sick. We need to know how to manage and/or slow the pace of chronic conditions. Providers need to stay up to date with new research in the medical field so that they can provide the right care for patients. Last but not least, unstructured data has a pivotal role to play in fraud detection.

In some cases, structured data proves to be a gold mine of information. In others, using structured data to solve a business problem is like finding a needle in a haystack. For example, identifying opioid abuse from structured data alone only partially solves the problem: it can produce a high number of false positives and miss early indications. At these times, the answers to our questions may lie in unstructured datasets like audio files, physician free-text notes, call logs, videos, and images.

There are two emerging use cases for unstructured data in healthcare that are being discussed in the industry today:

  1. Opioid Abuse Identification: Let’s consider a scenario where two patients visit an emergency room with severe pain symptoms. One of the patients describes the pain, including the cause of the injury, and asks for the doctor’s opinion on how to treat the pain. The second patient takes an aggressive stance, asserts that his pain gets aggravated even by air, and asks the doctor to prescribe a specific opioid (with dosage specification as well). The second patient shows signs of being a “drug seeker.” Identifying signs that indicate a patient may be abusing opioids is an important step in helping to channel needed treatment options and resources to help the patient overcome the addiction. When we analyze the claims data or encounter data for a patient like the second one described here, we can detect retrospectively that the patient may be abusing opioid prescriptions. But to capture this behavior relatively early, we need to mine physician notes using advanced natural language processing (NLP) techniques.
  2. Social Determinants of Health Identification: It is a well-researched fact in healthcare that only 20 percent of health outcomes are determined by clinical care, and close to 40-60 percent of outcomes are determined by non-clinical social and economic factors such as homelessness, food insecurity, lack of transportation, domestic violence, and social violation. Essentially, it means that two patients with the same diagnosis, despite receiving the same clinical care, can have two very different paths to recovery based on their social determinant disparities. Information on these social determinants of health is not readily available in any structured dataset. Though the Z codes in ICD-10 can be used to identify social determinant issues, they are optional, and not all social determinants have Z codes. Unstructured datasets like physician notes, care manager notes, social worker notes, and behavioral contact center notes are of value in this scenario as sources of information on social determinants of health (SDOH). Timely identification of patients impacted by SDOH can enable health systems to help those patients by connecting them to appropriate community and/or philanthropic support groups. This will in turn help improve health outcomes and reduce overall care costs.
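
To make the second use case concrete, here is a minimal sketch of combining the optional structured signal (ICD-10 Z codes) with a fallback scan of free-text notes. It assumes the ICD-10-CM block Z55-Z65, which covers socioeconomic and psychosocial circumstances, as the structured SDOH range. The phrase lists, category names, and function names are hypothetical illustrations, not a validated vocabulary, and plain substring matching is used only to keep the sketch short.

```python
from typing import Iterable, Set

# Hypothetical phrase lists for illustration only; a real list would be
# built with subject matter experts.
SDOH_PHRASES = {
    "housing": ["homeless", "no stable housing"],
    "food": ["food insecurity", "skipping meals"],
    "transportation": ["no transportation", "lack of transportation"],
}


def is_sdoh_z_code(code: str) -> bool:
    """Return True if the code falls in the ICD-10-CM block Z55-Z65."""
    normalized = code.strip().upper()
    if len(normalized) < 3 or normalized[0] != "Z":
        return False
    if not normalized[1:3].isdigit():
        return False
    return 55 <= int(normalized[1:3]) <= 65


def sdoh_signals(diagnosis_codes: Iterable[str], notes: Iterable[str]) -> Set[str]:
    """Collect SDOH signals from coded diagnoses and from free-text notes."""
    signals = set()
    # Structured signal: present only if someone chose to code it.
    if any(is_sdoh_z_code(code) for code in diagnosis_codes):
        signals.add("z_code")
    # Unstructured fallback: naive substring search over all notes.
    text = " ".join(notes).lower()
    for category, phrases in SDOH_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            signals.add(category)
    return signals
```

In this sketch a patient with no Z code on the record can still be flagged from the notes, which is the gap the paragraph above describes. Naive substring matching would also fire on negated statements, so it is a starting point rather than a usable detector.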

Other examples where unstructured datasets within the healthcare industry can come to the rescue include:

  • Medical journals can be read by machines to extract the most relevant information to be made available for providers.
  • Contact center agent notes can be analyzed to identify drivers of positive or negative patient sentiment and to find opportunities for reducing call handling time, call volume, interactive voice response drop-out, and repeat calls.
  • Physician notes can be mined for readmission prediction, disease onset prediction, medical documentation accuracy, and more.

The core of any text mining solution is the ontology, or “bag of words,” that tells the solution what to mine for. For disease onset prediction, this “bag of words” could consist of disease-specific symptoms, early indicators, and relevant medications. Coming up with an ontology specific to the business problem typically takes the most time in any NLP exercise and requires strong collaboration between machine learning engineers and subject matter experts. At times, research based on medical journals is also done (using text mining as well) to come up with a comprehensive ontology. Once this bag of words is developed, it is critical to enable the engine to handle context, temporality, and negation while mining for words and information from this ontology.
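
As an illustration of why negation and temporality matter, the following is a minimal sketch of matching ontology phrases in a note while checking a few words before each match for cue terms. It is loosely in the spirit of rule-based trigger approaches such as NegEx, but it is not an implementation of any particular engine. The ontology entries, cue lists, window size, and all names are assumptions made for this example, and context (for instance, whether a symptom belongs to the patient or a family member) is not handled at all.

```python
import re
from typing import List, NamedTuple

# Hypothetical mini-ontology: concept name -> phrases to look for.
ONTOLOGY = {
    "chest_pain": ["chest pain", "chest tightness"],
    "shortness_of_breath": ["shortness of breath", "dyspnea"],
}

# Assumed cue lists; real systems use much larger, curated lists.
NEGATION_CUES = ["no", "denies", "without", "negative for"]
HISTORY_CUES = ["history of", "previous", "prior"]

WINDOW = 4  # number of tokens before the phrase that are checked for cues


class Mention(NamedTuple):
    concept: str
    negated: bool
    historical: bool


def _has_cue(preceding_tokens: List[str], cues: List[str]) -> bool:
    """Check whether any cue appears as whole words in the preceding tokens."""
    padded = " " + " ".join(preceding_tokens) + " "
    return any(f" {cue} " in padded for cue in cues)


def find_mentions(note: str) -> List[Mention]:
    """Find ontology concepts sentence by sentence, flagging negation and history."""
    mentions = []
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for concept, phrases in ONTOLOGY.items():
            for phrase in phrases:
                match = re.search(r"\b" + re.escape(phrase) + r"\b", sentence)
                if match is None:
                    continue
                preceding = sentence[:match.start()].split()[-WINDOW:]
                mentions.append(Mention(
                    concept=concept,
                    negated=_has_cue(preceding, NEGATION_CUES),
                    historical=_has_cue(preceding, HISTORY_CUES),
                ))
                break  # record each concept at most once per sentence
    return mentions
```

For a note such as "Patient reports chest pain since this morning. Denies shortness of breath. History of dyspnea in 2015.", the sketch reports chest pain as a current, affirmed finding, the first shortness-of-breath mention as negated, and the second as historical. A plain keyword count would have treated all three as positive evidence.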

Since there are seemingly infinite ways in which text can be written and interpreted, text mining is easier said than done. However, sufficient research has now been done on the foundations of building a text mining engine, and academia continues to work on making these systems more intelligent with every passing day.

Designing the text mining engine is just one part of putting it to use. A key consideration of using a text mining engine effectively is giving users the ability to modify the “bag of words” as needed—with the right level of governance applied. Often, the ontology defined when an engine is first implemented doesn’t capture all the correct phrases, or requires the addition of new phrases to remain relevant. Thus, flexibility is needed for the engine to remain usable over time.
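
One way to picture this kind of governed flexibility is a propose-and-approve workflow, sketched below. The class, its fields, and the rule that a phrase must be approved by someone other than its proposer are hypothetical choices for illustration; an actual deployment would define its own roles and review policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class GovernedOntology:
    """Ontology whose phrases only become active after a second person approves them."""
    approved: Dict[str, Set[str]] = field(default_factory=dict)
    pending: Dict[Tuple[str, str], str] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def propose(self, concept: str, phrase: str, proposed_by: str) -> None:
        """Queue a new phrase for review; it is not used for mining yet."""
        key = (concept, phrase.strip().lower())
        self.pending[key] = proposed_by
        self.audit_log.append(f"{proposed_by} proposed '{key[1]}' for {concept}")

    def approve(self, concept: str, phrase: str, reviewer: str) -> bool:
        """Activate a pending phrase; self-approval is rejected (assumed rule)."""
        key = (concept, phrase.strip().lower())
        proposer = self.pending.get(key)
        if proposer is None or proposer == reviewer:
            return False
        del self.pending[key]
        self.approved.setdefault(concept, set()).add(key[1])
        self.audit_log.append(f"{reviewer} approved '{key[1]}' for {concept}")
        return True
```

Keeping pending and approved phrases separate lets users suggest additions freely while the mining engine reads only the approved set, and the audit log records who changed what.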

Unstructured data in healthcare offers immense value when it comes to solving key industry problems. Exploiting the available unstructured datasets is a worthwhile consideration when using machine learning and artificial intelligence solutions to solve business problems.


Kapila Monga (Kapila.monga@gmail.com) and Harpreet Singh (dhaliwals.harpreet@gmail.com) are healthcare analytics professionals with Cognizant Technology Solutions. The opinions expressed in this article are the authors’ own and do not reflect the views of the authors’ employer or any other corporate entity.