By Kelly Canter, MHA, RHIT, CCS

In today’s environment, data is collected across everything we do. For every click, swipe, tap, or search we perform, at least one data point is generated. It is estimated that every human is generating 1.7 megabytes (MB) of data per second with our day-to-day technological usage, and healthcare data is no exception. Knowing this, we have to ask ourselves, how can the health information management profession tap into this data to help drive meaningful clinical and administrative organizational goals and outcomes? Carly Fiorina, the former CEO of Hewlett Packard, said it best: “The goal is to turn data into information and information into insight.” By understanding how to tap into and manipulate data to tell a story, we can transform information into insight and insight into action.

Learning the characteristics of data and the associated challenges that accompany it is the foundation to successful data-driven programs. Before using healthcare data outside of its traditional uses, it is best to have a comprehensive understanding of what data is available, how it can be accessed, and whether there are any assumptions that may need to be made. Once the basics of the data are understood, tools and methodology can be applied; hypotheses can be formed and tested; corrective actions can be performed; and continuous quality improvement can be monitored, supported, and reported on quantifiably.

Data Characteristics and Challenges

The most discussed data characteristics are volume, variety, velocity, and value. Each of these comes with its own unique set of challenges that need to be addressed at some point during an analysis. Volume is the amount of data, variety is the type, velocity is the speed at which the data is being received, and value denotes whether the data is meaningful.

As technology evolves, the volume of incoming data multiplies exponentially. It was estimated that in 2020, healthcare-generated data reached 25,000 petabytes, or the equivalent of 500 billion physical filing cabinets. From that perspective, the sheer amount of data being received makes finding useful data a challenge and the first obstacle to overcome. Wading through the vast amounts of data with an understanding of what is being received, collected, and stored in a readily accessible format is arguably the most important part of the data assessment journey.

When considering the format of the data, the challenge of data variety arises. In this context, there is structured and unstructured data. Structured data is data that is consistently reproducible, following the same guidelines or rules for each data point, much like all back-end revenue cycle administrative data. Unstructured data is all other forms, such as physician notes or any free text in a health record. Most healthcare data being generated by providers tends to be unstructured, which makes data mining difficult due to the need for manual review. However, there are tools and technology available that can perform text mining to assist with the manual review. Text mining is the process of transforming unstructured text into meaningful and actionable information, or even in some cases, structured data.

The most common technologies currently used for text mining are natural language processing (NLP), artificial intelligence (AI), and machine learning (ML). All three of these technologies generate usable data by converting text to a structured format, using different techniques but essentially the same process. The software pre-processes the text and converts it to structured data, and the end user selects what data is needed and mines the newly structured data. The data is then evaluated, conclusions are drawn, and a plan of action is designed and executed. This process continues for the life cycle of the project being monitored.
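
The pre-process, structure, then mine sequence described above can be sketched in a few lines. This is a deliberately simple keyword-matching illustration, not an NLP, AI, or ML product; the sample notes, the term list, and the function name are all hypothetical.

```python
import re

# Hypothetical free-text notes keyed by encounter ID (illustrative only).
NOTES = {
    "E001": "Pt with type 2 DM, reports blurred vision. Hyperglycemia noted.",
    "E002": "Follow-up for hypertension. No complaints today.",
    "E003": "DM2 with diabetic neuropathy; continue current meds.",
}

# Hypothetical terms of interest, each mapped to a structured field name.
TERMS = {
    "hyperglycemia": r"\bhyperglycemia\b",
    "neuropathy": r"\bneuropathy\b",
    "diabetes": r"\b(dm2?|diabetes|diabetic)\b",
}


def structure_notes(notes, terms):
    """Convert each note into a row of True/False flags, one per term."""
    rows = []
    for encounter_id, text in notes.items():
        row = {"encounter_id": encounter_id}
        for field, pattern in terms.items():
            row[field] = bool(re.search(pattern, text, flags=re.IGNORECASE))
        rows.append(row)
    return rows


# Step 1: pre-process the text into structured rows.
structured = structure_notes(NOTES, TERMS)

# Step 2: mine the newly structured data for the element of interest.
flagged = [row["encounter_id"] for row in structured if row["hyperglycemia"]]
print(flagged)
```

Plain keyword matching ignores negation ("no hyperglycemia") and abbreviations it was not told about, which is one reason flagged records would still need validation by a reviewer.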

Another consideration that arises when accessing the data is the speed and timing of the data coming in, or the velocity. In some systems, the data is updated in real time, meaning it is collected at the exact time the data point is generated, such as heart rate on a smartwatch. In others, the data is delayed due to many different factors, including lack of interfaces and manual data entry requirements. For example, ICD-10-CM codes are not generated at the point of care; some type of data entry must occur for the codes to attach to that encounter. Determining the velocity of the data being used is essential to a consistent and current data-mining project. The rule of thumb is to always compare similar time frames using the most current available data.

The last but probably most important characteristic is value. Is there value in the data being collected? As discussed, there is an immense amount of data being accumulated at an intense rate of speed. To put it in perspective, Nate Silver, the founder and editor-in-chief of FiveThirtyEight, a website that reports on statistical analysis of daily life, highlighted that “every day, three times per second, we produce the equivalent of the amount of data that the Library of Congress has in its entire print collection … But most of it is (like) cat videos on YouTube.” The end user must be able to find the value in the data they are reviewing and not be distracted by the vast amount of useless data. A data-driven project should have a purpose or a value to the organization and not just be a review for the sake of reviewing. Value in healthcare data can usually be identified by one or more of three different levers: operational efficiencies; business process enhancements; and clinical and/or administrative process improvement. If the data story being told does not drive improvement in those areas, the initial purpose for the data mining should be revisited to determine the “why” of the project.

Basic Tools

In many healthcare systems today, there is an analytics department whose function is to help end users define and design a data project and ultimately also do the data mining to provide the end user with a concise data set that they can then use to make decisions or action plans. This model has its pros and cons. The pros are that the analytics team are experts in addressing the challenges previously discussed. The cons can be that the team has a backlog of reporting to do, so a new project can be pushed months or years out waiting for data. In addition, data analysts are usually not subject matter experts and, without understanding what the end user needs, can sometimes design something less useful than anticipated. For small projects where data is readily available, or in spaces where a data analytics team is not available, there are a few common tools that can be very useful for a beginner data user to learn.

The first and most common is Microsoft Excel. Excel is a spreadsheet tool that includes calculation, graphing tools, pivot tables, and a macro programming language. One of the most useful tools relied upon is the pivot table. A pivot table is a powerful tool that can simplify large data sets by summarizing the data of a more extensive table, allowing extraction of the significance of the data set. The format can include all of the data elements or just the portions the user is interested in. Table 1 is an example of a pivot table that summarizes 65,534 lines of claims data to determine how many radiology CPT codes are associated with each of the five levels of Emergency Department Evaluation & Management codes. It should be noted that additional Excel functionality in the form of the COUNTIF function was applied to determine the number of CPT codes in the range 70000-79999 prior to pushing the data into a pivot table.

Table 1
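
For readers who want to see the logic behind a summary like Table 1 outside of Excel, the following is a minimal sketch of the same two steps: a COUNTIF-style range test followed by a pivot-style summary. The claim lines are made up, the function names are hypothetical, and the sketch assumes the five ED E&M levels are CPT 99281 through 99285, which the article does not state.

```python
from collections import defaultdict

# Hypothetical claim lines as (claim_id, CPT code); values are illustrative.
CLAIM_LINES = [
    ("C1", "99283"), ("C1", "71046"),
    ("C2", "99285"), ("C2", "70450"), ("C2", "71046"),
    ("C3", "99281"),
    ("C4", "99283"), ("C4", "73610"),
]

# Assumption: the five ED E&M levels are CPT 99281-99285.
ED_EM_CODES = {"99281", "99282", "99283", "99284", "99285"}


def is_radiology(cpt):
    """COUNTIF-style test: is the CPT code in the 70000-79999 range?"""
    return cpt.isdigit() and 70000 <= int(cpt) <= 79999


def radiology_by_em_level(lines):
    """Count radiology codes per ED E&M level, like a pivot table summary."""
    codes_by_claim = defaultdict(list)
    for claim_id, cpt in lines:
        codes_by_claim[claim_id].append(cpt)

    summary = {code: 0 for code in sorted(ED_EM_CODES)}
    for codes in codes_by_claim.values():
        em_levels = [c for c in codes if c in ED_EM_CODES]
        if len(em_levels) != 1:
            continue  # skip claims without exactly one ED E&M level
        summary[em_levels[0]] += sum(is_radiology(c) for c in codes)
    return summary


pivot = radiology_by_em_level(CLAIM_LINES)
for em_code, count in pivot.items():
    print(em_code, count)
```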

Excel is not the most user-friendly software for large data sets and is also not the best tool for data collection across users. Although Excel can function as a data collection tool, combining data from multiple end users is a manual process. In these instances, Microsoft Access may be the more practical program to use. Access is a database management system that helps to store and manage data easily, giving the option for multiple databases or spreadsheets to be automatically combined, analyzed, and manipulated together. It can handle much larger data sets than Excel but interfaces seamlessly with it. It also manipulates data differently in that basic query wizards, Visual Basic, and structured query language (SQL) queries are all common functionality used within an Access database.

In addition to its data storage and manipulation capabilities, Access is also useful for collecting data from unstructured text when NLP is not available. Forms can be created for end users to collect data and feed a database in real time. Not only is this a viable and cost-effective manual replacement for NLP, it can also serve to collect the data used to train AI and ML programs. For both AI and ML, the data the software trains on needs to be as accurate as possible so it can learn correctly. If the data training the software is error prone, the resulting algorithm from AI or ML will be just as error prone; therefore, accurate data collection is imperative in the design phase.

As mentioned, Access is capable of handling SQL. SQL is a specialized language for requesting, updating, and deleting information in databases such as Access, Oracle, and Microsoft SQL Server. It is widely used across industries and academia to work with enormous, complex relational databases, allowing users to manipulate data using tables, columns, rows, and fields. In this day and age of data, a basic understanding of SQL is recommended for anyone interested in any type of data-analysis position. SQL can be as basic as its base commands of ORDER BY, GROUP BY, COUNT, etc., and as complicated as multiple layers of subqueries and selections. Specialized training can be obtained to gain advanced SQL skills.
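
To give a feel for the base commands named above, here is a small sketch that runs a COUNT, GROUP BY, and ORDER BY query. It uses Python's built-in sqlite3 module with an in-memory database rather than Access, purely so the example is self-contained; the table name, column names, and rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for Access or another relational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (patient_id TEXT, service_year INTEGER, hcc_category TEXT)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        ("P1", 2018, "Chronic"),
        ("P2", 2018, "Acute"),
        ("P3", 2018, "Chronic"),
        ("P4", 2019, "Chronic"),
        ("P5", 2019, "None"),
    ],
)

# Count patients per year and category, largest groups first within each year.
query = """
    SELECT service_year, hcc_category, COUNT(*) AS patient_count
    FROM claims
    GROUP BY service_year, hcc_category
    ORDER BY service_year, patient_count DESC, hcc_category
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

The same SELECT statement would look much the same in other SQL dialects, though each database has its own quirks.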

Clinical Data Telling a Story

Once it has been determined what data will be analyzed, for what purpose, and what tools will be used, it is time to allow the data to tell the story. In the example to be discussed, a multiprovider practice was noticing an upward trend in its diabetic hierarchical condition category (HCC) reporting for acute and chronic complications year over year for the previous three years. The practice manager reviewed the Healthcare Effectiveness Data and Information Set (HEDIS) reporting measure for Comprehensive Diabetes Care (CDC) and noted that the current score was 91.2 percent, indicating that the best practices for preventative and screening services were being performed across the practice uniformly. Although the bump in the risk adjustment factor (RAF) scores is helpful financially, it is not necessarily the outcome one would expect from best practice standards. With that in mind, she wanted to understand the root cause and, if possible, implement an action plan that would close any care gaps causing this trend.

When looking at the structured data in Table 2, the obvious trends can be spotted. The total number of diabetic patients seen by the provider group is increasing year over year and, with that, the number of complications. However, the percentage of patients with a chronic complication rises 9 percentage points, from 44 percent in 2017 to 53 percent in 2019, indicating that the diabetic population’s disease management appears to be not as effective as it once was. The hypotheses or proposed explanations for the root cause were:

  1. Better and more accurate documentation and code capture
  2. New patients are being diagnosed with complications
  3. Established patient noncompliance
  4. Provider practice habits
  5. Patient access to or price of meds

Table 2

Diabetic HCC    2017           2018           2019
Acute            320   21%      370   23%      400   22%
Chronic          660   44%      800   50%      960   53%
None             520   35%      430   27%      450   25%
Total           1500           1600           1810
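
The percentages in Table 2 can be recomputed from the patient counts, which is a quick way to check a summary before drawing conclusions from it. The counts below are copied from Table 2; the function and variable names are hypothetical.

```python
# Patient counts copied from Table 2.
TABLE_2 = {
    2017: {"Acute": 320, "Chronic": 660, "None": 520},
    2018: {"Acute": 370, "Chronic": 800, "None": 430},
    2019: {"Acute": 400, "Chronic": 960, "None": 450},
}


def shares(counts):
    """Return each category as a rounded percentage of the year's total."""
    total = sum(counts.values())
    return {category: round(100 * n / total) for category, n in counts.items()}


percent_by_year = {year: shares(counts) for year, counts in TABLE_2.items()}

# Change in the chronic share, in percentage points, from 2017 to 2019.
chronic_change = percent_by_year[2019]["Chronic"] - percent_by_year[2017]["Chronic"]

for year, pct in percent_by_year.items():
    print(year, pct)
print("Chronic share change, 2017 to 2019:", chronic_change, "percentage points")
```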


To rule out hypothesis number 1, that the increase in RAF score is due to better and more holistic documentation and code capture, there are two data elements required: the claims from 2017 to 2019 with the targeted ICD-10-CM diabetes mellitus (DM) codes and the documentation for those claims. In this case, there is both structured and unstructured data to work with. Using the structured data, the patients who have diabetes coded at least once per year should be removed, and the patients who had it coded in a previous year but are missing in the subsequent years and those who had it diagnosed later in the patient relationship should be targeted. NLP can be used to expedite the documentation review of the unstructured data in this next step, but in the cases where it is not available, manual chart review is required. The records should be reviewed for missed DM codes and date of diagnosis to determine newly diagnosed patients. The findings in this case were that there were 188 new DM patients from 2017 to 2019, 35 of whom were diagnosed with borderline and/or pre-diabetes that evolved into the full disease process. There were 52 missed opportunities in 2017 that qualified as no complications and four that qualified as chronic, but none after the targeted DM coding and documentation education was implemented in January of 2018. Hypothesis number 1 is not the case; better documentation and subsequent code capture is not the cause of the chronic conditions trending up.
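
The structured-data step described above, removing patients coded every year and targeting the rest for chart review, can be sketched as a simple classification. The patient IDs and coded years are hypothetical, as are the group labels, and the sketch assumes every patient was with the practice for the whole study period.

```python
# Hypothetical structured data: the years in which each patient had a
# targeted DM code on at least one claim.
DM_CODED_YEARS = {
    "P1": {2017, 2018, 2019},  # coded every year
    "P2": {2017},              # coded once, then missing
    "P3": {2018, 2019},        # first coded later in the relationship
    "P4": {2017, 2019},        # gap in 2018
}
STUDY_YEARS = [2017, 2018, 2019]


def classify(coded_years, study_years):
    """Sort a patient into one of the review groups described in the text."""
    if all(year in coded_years for year in study_years):
        return "exclude"          # coded at least once per year
    if min(coded_years) > study_years[0]:
        return "diagnosed_later"  # review for date of diagnosis
    return "missing_later"        # review for missed DM codes


review_groups = {
    patient_id: classify(years, STUDY_YEARS)
    for patient_id, years in DM_CODED_YEARS.items()
}

# Everyone not excluded goes on the chart review list.
chart_review_list = sorted(
    patient_id for patient_id, group in review_groups.items() if group != "exclude"
)
print(review_groups)
print(chart_review_list)
```

A patient first coded in 2018 and then missing in 2019 would fall into both target groups; this sketch labels that patient "diagnosed_later", and either way the chart still lands on the review list.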

Hypothesis number 2 is that the newly diagnosed patients are being initially diagnosed with complications. First and foremost, the claims data and corresponding documentation for the newly diagnosed DM patients should be acquired. Using this data, determine the status of each patient's DM upon presentation for those newly diagnosed in 2017, 2018, and 2019: acute, chronic, or stable. Also review specifically those who were documented as borderline or pre-diabetes for insight into their disease progression. Immediately after the chart review, it was obvious that this was not the root cause of the RAF score lift. Of the 188 new DM patients identified, 49 presented with an acute condition and 75 presented with a chronic condition, adding 124 new patients to the numerator. Those patients only account for 9 percent of the increase in complicated DM HCC reporting, which is not the main driver of the shift. However, a subproject to explore additional ways to improve early detection and prevention is warranted, including what could be done to mitigate the complications in new patients. For example, is appropriate screening and/or education being conducted on at-risk patients, or are there underlying socioeconomic or geographic issues?
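
The article does not state the denominator behind the 9 percent figure. One reading that reproduces it, offered here only as an assumption, divides the 124 new patients with complications by the 1,360 patients with acute or chronic complications in 2019 (400 plus 960 in Table 2). The variable names are hypothetical.

```python
# Figures from the chart review of newly diagnosed DM patients.
new_dm_patients = 188
new_acute = 49
new_chronic = 75
new_complicated = new_acute + new_chronic  # 124 new patients in the numerator

# Assumption: the denominator is the 2019 acute + chronic count in Table 2.
complicated_2019 = 400 + 960

share = new_complicated / complicated_2019
print(f"{new_complicated} of {complicated_2019} = {share:.1%}")

# New patients who presented without an acute or chronic complication.
new_uncomplicated = new_dm_patients - new_complicated
print("New patients without complications:", new_uncomplicated)
```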

The other 91 percent is now what needs to be solved for, as that is the main problem at hand. The first step was to trend the year-over-year acuity of the established DM patient population. This resulted in the graph in Figure 1, which shows that the main year-over-year shift is that chronic complications are increasing in this population.

Figure 1

The next step was to examine the documentation and collect the data that would help understand what the data was showing. The data elements collected during chart review were:

  • What is the most common chronic condition reported?
  • Who are the physicians?
  • What medications were prescribed?
  • What is the insurance carrier?

These elements would drive the results of the three remaining hypotheses: patient noncompliance, physician diagnostic behaviors, and insurance coverage.

Once the data was collected in Access, multiple pivot tables were applied to determine the answers to the above questions. The story the data revealed is that hyperglycemia is the most common condition being reported as a chronic complication. There was no pattern to the physician treating the patients with hyperglycemia; however, there was a correlation with the medication the patients were prescribed. At this point, the data had revealed it was a clinical issue, and the clinicians and providers should be brought in to further determine the root cause. The questions to explore to formulate an action plan to improve the quality of care and patient outcomes for the clinic’s DM population are:
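
The pivot-table pass over the chart review data can be sketched as a series of simple counts: find the most common condition, then break that condition down by physician and by medication. The rows, physician labels, and drug names are placeholders, not the practice's data, and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical chart-review rows; names and values are placeholders.
REVIEW_ROWS = [
    {"condition": "hyperglycemia", "physician": "Dr. A", "medication": "Drug X"},
    {"condition": "hyperglycemia", "physician": "Dr. B", "medication": "Drug X"},
    {"condition": "hyperglycemia", "physician": "Dr. C", "medication": "Drug X"},
    {"condition": "neuropathy", "physician": "Dr. A", "medication": "Drug Y"},
    {"condition": "hyperglycemia", "physician": "Dr. B", "medication": "Drug Y"},
    {"condition": "nephropathy", "physician": "Dr. C", "medication": "Drug Z"},
]


def pivot_count(rows, field, where=None):
    """Count rows by one field, optionally filtered, like a one-way pivot."""
    return Counter(
        row[field]
        for row in rows
        if where is None or all(row[key] == value for key, value in where.items())
    )


# Most common chronic condition reported.
top_condition, top_count = pivot_count(REVIEW_ROWS, "condition").most_common(1)[0]

# Break that condition down by physician and by medication.
by_physician = pivot_count(REVIEW_ROWS, "physician", {"condition": top_condition})
by_medication = pivot_count(REVIEW_ROWS, "medication", {"condition": top_condition})

print(top_condition, top_count)
print(dict(by_physician))
print(dict(by_medication))
```

Counts like these are only a first look. A concentration under one medication would still need to be compared against how often that drug is prescribed to patients without the complication.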

  • Is the drug being prescribed appropriately? (We missed an opportunity to pull the dose and route data in our initial pull.)
  • Is it being taken as prescribed? If not, why?
  • Is patient education needed?
  • How many in the uncomplicated diabetic population are also on this drug?

After those questions are answered, an action plan to address the care gaps can be implemented and followed up on by tracking the DM RAF scores month over month to see if there is improvement among the DM population.
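
Month-over-month tracking can be as simple as subtracting each month's score from the one before it. In the sketch below the monthly RAF values are placeholders rather than results from the practice, and the function name is hypothetical.

```python
# Hypothetical average DM RAF scores by month (placeholder values).
MONTHLY_RAF = [
    ("2020-01", 1.42),
    ("2020-02", 1.40),
    ("2020-03", 1.41),
    ("2020-04", 1.37),
]


def month_over_month(series):
    """Return (month, change from the prior month), rounded to 2 decimals."""
    return [
        (month, round(score - series[i][1], 2))
        for i, (month, score) in enumerate(series[1:])
    ]


changes = month_over_month(MONTHLY_RAF)
for month, change in changes:
    print(month, f"{change:+.2f}")
```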

The Importance of Good Data Practices

As more and more healthcare data is generated, it is imperative that good data practice standards and behaviors are formed. Technological advancements have opened the gateway to leveraging unstructured data quickly and easily in addition to the traditional structured data. This allows a more complete story to be told using the data, which is therefore more beneficial to an action-based organization wanting to use real data to drive quality solutions and outcomes.

Kelly Canter is a revenue cycle solution design manager at Optum.
