By James Hoover

Two pressing problems have emerged from using electronic health records (EHRs) over the past decade: the accumulation of small master patient index (MPI) data errors and the resulting inability of EHRs to locate patient records.

At best, data cleanups are temporary. Errors continue to occur one at a time—dozens per month, and hundreds or more per year. Each error seems small, almost minor: a simple typing error, transposed characters, a misheard word. Accumulated errors cause EHRs to fail to find medical records or patients to be confused with other patients; they result in duplicate or overlaid records.

Well-informed health information management (HIM) professionals may know that the reported industry average for duplicate records is eight percent. In the average EHR, there is a one in 12.5 chance of selecting a record that has a duplicate. Furthermore, eight percent is only the average. Many healthcare providers have not kept up with duplicates over time, and as duplicate rates increase, so do the odds of selecting one. An increase to 10 percent or 15 percent raises the odds of picking a duplicate record to one out of 10 or roughly one out of seven, respectively.
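The odds quoted above are simply the reciprocal of the duplicate rate. A minimal sketch of that arithmetic follows; the function name `one_in_n` is illustrative and not taken from any EHR product.

```python
def one_in_n(duplicate_rate: float) -> float:
    """Return N such that a randomly selected record has a 1-in-N chance
    of being a record that has a duplicate."""
    if not 0 < duplicate_rate <= 1:
        raise ValueError("duplicate_rate must be in (0, 1]")
    return 1 / duplicate_rate


# Rates discussed in the text: the reported average and two higher rates.
for rate in (0.08, 0.10, 0.15):
    print(f"{rate:.0%} duplicate rate: about 1 in {one_in_n(rate):.1f}")
```

The smaller N becomes, the more likely a search lands on a record that has a duplicate.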

With COVID-19, the costs of selecting a duplicate record have never been higher. Consider the consequences if the wrong version of a duplicate record is selected and it’s the version that is missing a COVID-19 test positive result. Searching patient records should not be like gambling. The same applies to any infectious disease, such as influenza or Zika.

Addressing the Problem

Finding duplicates and user training are necessary, but these efforts miss the fundamental problem. Perfectly formed MPI queries will still fail when errors are in the demographic fields for the desired records. It is the MPI version of “death by a thousand cuts.”

Users must enter just the right amount of text. Enter too much, and the record being sought will not be found because it does not match exactly. Enter too little, and too many results are returned. Wildcard matching (using a partial string such as “SMI” to match a longer string such as “SMITH”), a little finesse, and data in just the right fields might yield the patient records being sought if the user has the time, skill, and a bit of luck. EHRs assume (mostly) perfect data, and the responsibility to find the patient records is on the user.

Given the industry’s duplicate record rates, it’s clear that the current EHR search approach is not working well. We cannot expect frontline healthcare staff to become IT specialists to learn how to trick EHRs into returning the correct patient record—or records, in the case of duplicates.

Finding and fixing minor errors to prevent duplicate records requires an active quality assurance process to isolate and correct common spelling errors, missing or malformed data, incorrect defaults, and other issues. Periodic audits help identify new types of data defects and should be incorporated into data governance programs.

With modern science, it would seem that search should overcome typing errors to find the patients we intended to find, even with typos. Spell checkers know what we meant, why not EHRs? Thus, to take into account typing errors, search results should return records that look like something we might have been seeking. A perfect solution should look effortless and not require any action on the user’s part—it should just work.

Key Ideas—Normalization and Deep Cleaning

Few cleanup efforts target the root cause, which is errors within data elements. Data normalization and deep cleaning both target imperfections within the EHR data to improve medical record matching and patient search.

Data entered into and stored within an EHR are inherently flawed, and failures to accurately locate patient records are our proof. Search and matching are substantially improved when using normalized data. Normalization ensures text entered as “Smith-Garcia” also matches “1SMITH&GARCIA.” Typical normalization processes include capitalizing all letters, removing extra spaces, removing punctuation, and standardizing data formatting.
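As a rough illustration, the normalization steps listed above might be sketched as follows. The function name `normalize_name` is hypothetical, and two choices are assumptions rather than anything the article specifies: accented letters are folded to their base letters, and any character other than A to Z (including stray digits) is treated as noise in a name field.

```python
import re
import unicodedata


def normalize_name(raw: str) -> str:
    """Return a normalized form of a name for matching purposes only."""
    # Assumption: fold accents so that "García" and "Garcia" compare equal.
    decomposed = unicodedata.normalize("NFKD", raw)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Capitalize all letters.
    upper = unaccented.upper()
    # Assumption: punctuation, digits, and symbols are noise; replace with a space.
    spaced = re.sub(r"[^A-Z]+", " ", upper)
    # Remove extra spaces.
    return " ".join(spaced.split())


print(normalize_name("Smith-Garcia"))
print(normalize_name("1SMITH&GARCIA"))
```

Both inputs reduce to the same string, so an exact comparison on the normalized values succeeds where a comparison on the raw values fails.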

Deep cleaning is related to data normalization and helps to ensure data inside the fields are as intended, not as typed. Potential typing errors may be flagged by standard spell-checking algorithms. Using frequency analysis techniques, potential misspellings of common names can be flagged, invalid pseudo default values can be corrected, and missing information can be merged in from other data sources. It should be noted that normalized and deep cleaned data values should be stored in auxiliary fields, not in the fields as typed by patient access so as to not change the values of the designated record set without human involvement.
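One possible reading of the frequency analysis step is sketched below: names that occur rarely are compared against names that occur often, and near misses are flagged for review. The function name and the `min_common` and `cutoff` values are illustrative assumptions, and the standard library's `difflib` stands in for whatever matching technique a real product would use. In keeping with the point above, the sketch only returns suggestions and never alters the values as typed.

```python
from collections import Counter
from difflib import get_close_matches


def flag_possible_misspellings(names, min_common=3, cutoff=0.85):
    """Map each rare name to the common name it most closely resembles.

    `names` is a list of already-normalized last names. A name seen at least
    `min_common` times counts as common (threshold chosen for illustration).
    """
    counts = Counter(names)
    common = [name for name, count in counts.items() if count >= min_common]
    suggestions = {}
    for name, count in counts.items():
        if count >= min_common:
            continue  # common names are assumed to be spelled as intended
        close = get_close_matches(name, common, n=1, cutoff=cutoff)
        if close:
            suggestions[name] = close[0]
    return suggestions


sample = ["GARCIA"] * 4 + ["GARCA"] + ["SMITH"] * 3 + ["NGUYEN"]
print(flag_possible_misspellings(sample))
```

A flagged value would be written to an auxiliary field or queued for human review, since a rare name may simply be a legitimate rare name.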

Phonetic translations for patient names should use approaches that take the potential language of origin into account, as letters may be pronounced differently from language to language. Many patients register under nicknames, and thus normalization includes guessing given names. For example, many first names in English and Spanish are derived from longer names such as Ro from Rose or Beto from Roberto. Systems can take advantage of known nicknames and look for both the “first name” and the “given name guess” when searching for patients.
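The nickname idea can be sketched with a simple lookup table. The two entries below come from the examples in the text; the table itself, and the function names, are hypothetical, and a production list would be far larger and language-aware.

```python
# Hypothetical nickname table; the two entries are the examples from the text.
NICKNAMES = {
    "RO": "ROSE",
    "BETO": "ROBERTO",
}


def given_name_guess(first_name: str) -> str:
    """Return the likely given name, or the name itself if no nickname is known."""
    key = first_name.strip().upper()
    return NICKNAMES.get(key, key)


def first_name_candidates(first_name: str) -> set:
    """Both the first name as entered and its given name guess."""
    key = first_name.strip().upper()
    return {key, given_name_guess(key)}


def first_names_match(a: str, b: str) -> bool:
    """True when the two names share the entered value or the given name guess."""
    return bool(first_name_candidates(a) & first_name_candidates(b))


print(first_names_match("Beto", "Roberto"))
print(first_names_match("Beto", "Rose"))
```

Searching on both candidates means a patient registered as “Beto” can still be found by a user who types “Roberto,” and the reverse.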

Similarly, addresses can be normalized using USPS standards, and street names can be spell checked against a list of names for all areas serviced by the provider. Corrected addresses are also valuable for proximity analysis when implementing contact tracing from patients’ home, work, and family addresses.

Precautionary duplicates are those records created from an abundance of caution when a patient is unable to validate the information in older records. Corrected address data improves the chance of patient confirmation and reduces precautionary duplicates.

Data validation should check the validity of phone or Social Security number data and flag invalid defaults. Search results will dramatically improve when the same granular data cleanups and normalizations are performed on search criteria entered by healthcare workers. The cleaner the data, the better the search results.
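A minimal sketch of such validation checks follows, assuming US data. The Social Security number rules used here (no area number of 000, 666, or 900 to 999; no group number of 00; no serial number of 0000) and the North American rule that area codes and exchanges cannot begin with 0 or 1 are well-known structural rules. The placeholder list and the function names are hypothetical.

```python
import re

# Hypothetical list of keyboard-pattern defaults; repeated digits are caught separately.
PLACEHOLDER_SSNS = {"123456789"}


def ssn_problem(raw):
    """Return a short reason if the SSN looks invalid, or None if it passes."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) != 9:
        return "wrong length"
    if len(set(digits)) == 1 or digits in PLACEHOLDER_SSNS:
        return "placeholder value"
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return "invalid area number"
    if group == "00":
        return "invalid group number"
    if serial == "0000":
        return "invalid serial number"
    return None


def phone_problem(raw):
    """Return a short reason if a US phone number looks invalid, or None."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return "wrong length"
    if len(set(digits)) == 1:
        return "placeholder value"
    if digits[0] in "01" or digits[3] in "01":
        return "invalid area code or exchange"
    return None


print(ssn_problem("999-99-9999"), phone_problem("(555) 555-5555"))
```

Passing these checks shows only that a value is well formed, not that it belongs to the patient. Running the same checks on search criteria keeps an invalid default typed by a user from matching invalid defaults stored in the MPI.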

To Err is Human

Search systems must assume poor MPI data quality. Simple text search is terrific if working with curated data, such as dictionaries. Medical records, however, are not clean. Patient data is entered under trying circumstances and in busy environments. Exacerbating matters, EHRs often have many points of registration: specialty clinics, labs, medical practices, jails, and so forth. Each registration point may have divergent policies, practices, politics, and training, yet all still contribute to the same MPI impacting data quality and patient safety.

Even under the best conditions, MPI data contain errors mostly due to the human factor. The industry must accept data errors will always exist, and MPI search functions must assume errors in their processing.

Webpages, like medical records, are full of spelling and grammar errors, yet relevant content is returned within a fraction of a second by internet search. Healthcare search, which requires the same robustness and relevance that internet searches are capable of, should borrow some techniques from search engines.

Internet Search

In 1995, Digital Equipment Corporation (DEC) introduced the first modern search engine, Alta Vista, three years before Google was formed. Internet search works amazingly well for the large amount of unstructured data on the web. With hundreds of billions of webpages, the scale of the web is five to six orders of magnitude above EHR search, and the search domain of EHR is limited strictly to patient demographics. Yet EHR search has much more difficulty finding patients with much less data, even when the subject of the search is present. The dichotomy of web versus EHR search performance is striking. Health information technology (IT) must improve.

Internet search also understands context. Google and other internet search services infer structure on web data by building “knowledge graphs” that impute contexts such as education, music, travel, or sports, to the information. Synonyms and topic freshness are considered in the relevance of search results. Analogously, giving context to MPI searches may be valuable for name matching.

New approaches should infer semantic connections between data fields when searching patient records. When a nickname was entered in the first name field of the patient’s medical record, searching on the given name guess together with the first name value produces better record matches. Guarantor and next of kin (NOK) field comparisons can be combined with frequency distributions to take advantage of matches on these fields for uncommon names.
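One way to combine a field comparison with a frequency distribution is to give a match on an uncommon name more weight than a match on a common one. The sketch below uses an inverse-frequency weight, borrowed from text retrieval; that choice, the function names, and the sample counts are all assumptions made for illustration, not a method described by the author.

```python
import math
from collections import Counter


def rarity_weight(name, name_counts, total):
    """Weight that grows as a name becomes rarer in the MPI.

    Add-one smoothing gives unseen names the highest weight and avoids
    division by zero.
    """
    return math.log((total + 1) / (name_counts.get(name, 0) + 1))


def nok_match_score(nok_a, nok_b, name_counts, total):
    """Score a next-of-kin last name comparison: 0 on mismatch, rarity weight on match."""
    if nok_a != nok_b:
        return 0.0
    return rarity_weight(nok_a, name_counts, total)


# Invented frequency distribution of normalized last names.
counts = Counter({"SMITH": 900, "GARCIA": 98, "ZELENKOVIC": 2})
total = sum(counts.values())
print(nok_match_score("SMITH", "SMITH", counts, total))
print(nok_match_score("ZELENKOVIC", "ZELENKOVIC", counts, total))
```

Two records that share a rare next-of-kin name are far more likely to belong to the same patient than two records that share a very common one, and the score reflects that.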

Despite its complexity, internet search is also easy to use. When was the last time you took a class on how to use Google, Bing, or Yahoo? Without training, new users harness the power of internet search every day, looking for song lyrics, news, or products, and never consider data errors. MPI search needs to be similar.

Fault Tolerance

Some systems use special indexing techniques to improve fault tolerance in search results. All data is normalized then decomposed into text fragments to capture the “essence” of the text. This technique introduces a deep fault tolerance when compared to standard equivalence matching as in older EHR search methods.

The approach is highly tunable: each field has its own fault tolerance sensitivity. In practice, date of birth (DOB) should be less sensitive to variations than the last name field. Properly tuned, poorly spelled words can match given enough surrounding data.

Matching “Robert Montague” with “Rotert Mantiglue” is potentially desirable and providing this level of matching flexibility will greatly assist healthcare workers in locating difficult-to-find patient records due to underlying data errors. Fault-tolerant MPI search enables more relevant records to be found, therefore reducing duplicates and overlays.
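The article does not say which fragmenting technique is used. A common stand-in is overlapping character pairs (bigrams) compared with the Dice coefficient, which the sketch below assumes. The per-field thresholds are invented for illustration and reflect one reading of the tuning described above, in which the DOB field must match more closely than the name field.

```python
def fragments(text, n=2):
    """Break normalized text into overlapping n-character fragments."""
    cleaned = " ".join(text.upper().split())
    if not cleaned:
        return set()
    padded = f" {cleaned} "  # padding lets word boundaries count as fragments
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=2):
    """Dice coefficient over fragment sets: 0.0 (nothing shared) to 1.0 (identical)."""
    frags_a, frags_b = fragments(a, n), fragments(b, n)
    if not frags_a or not frags_b:
        return 0.0
    return 2 * len(frags_a & frags_b) / (len(frags_a) + len(frags_b))


# Hypothetical tuning: names tolerate more variation than dates of birth.
FIELD_THRESHOLDS = {"name": 0.5, "dob": 0.9}


def fields_match(query, record, thresholds=FIELD_THRESHOLDS):
    """Report, field by field, whether similarity clears that field's threshold."""
    return {
        field: similarity(query[field], record[field]) >= threshold
        for field, threshold in thresholds.items()
    }


query = {"name": "Robert Montague", "dob": "1961-04-12"}
record = {"name": "Rotert Mantiglue", "dob": "1961-04-12"}
print(round(similarity(query["name"], record["name"]), 3))
print(fields_match(query, record))
```

The two spellings share only a little over half of their fragments, so the name alone is weak evidence. It is the agreement on surrounding fields such as DOB that makes the pair a plausible match.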

As part of the normalization process, MPI data can index text fragments for all relevant search fields. Queries will be normalized and fragmented similarly. The time for the healthcare industry to adopt internet-grade search is long overdue.
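Indexing fragments rather than whole values can be sketched as a small inverted index: each fragment points to the records that contain it, and a query is fragmented the same way before lookup. The class name, the use of character bigrams, the Dice scoring, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def fragments(text, n=2):
    """Normalize text and break it into overlapping n-character fragments."""
    cleaned = " ".join(text.upper().split())
    if not cleaned:
        return set()
    padded = f" {cleaned} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class FragmentIndex:
    """Inverted index from text fragment to the record ids that contain it."""

    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)  # fragment -> record ids
        self.sizes = {}                   # record id -> fragment count

    def add(self, record_id, text):
        frags = fragments(text, self.n)
        self.sizes[record_id] = len(frags)
        for frag in frags:
            self.postings[frag].add(record_id)

    def search(self, query, limit=5):
        """Return (record id, score) pairs sharing fragments with the query, best first."""
        query_frags = fragments(query, self.n)
        shared = defaultdict(int)
        for frag in query_frags:
            for record_id in self.postings.get(frag, ()):
                shared[record_id] += 1
        scored = [
            (record_id, 2 * count / (len(query_frags) + self.sizes[record_id]))
            for record_id, count in shared.items()
        ]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:limit]


index = FragmentIndex()
index.add("MRN1", "Rotert Mantiglue")
index.add("MRN2", "Linda Chavez")
index.add("MRN3", "Maria Lopez")
print(index.search("Robert Montague"))
```

Only records that share at least one fragment with the query are scored, so the misspelled record surfaces at the top while unrelated records are never examined.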

Modern Search in Healthcare

Provider staff is diverse, coming from different educational, cultural, and language backgrounds. Search engines must do the hard work of finding patient data no matter how bad the underlying data quality is or how poorly the search criteria are entered.

Modern search involves mathematical calculations and is hidden from users. To users, search is as simple to use as the internet. The search user interface can be a simple one-field data entry area such as an internet search interface or multi-field search forms as in current EHRs. With fault-tolerant search, multi-field forms are not necessary but can be used to mimic current EHR search methods.

From the perspective of search, medical records become data patterns made up of text fragments with semantics applied to each search region as desired—for example, linking maiden names to last names.

Beyond Search

In healthcare, search technology is usually not considered beyond patient lookup. Current healthcare search is unable to locate patients except under ideal circumstances. By adopting technologies from other industries, such as fault-tolerant matching, healthcare search can suddenly tackle other, more challenging health IT problems.

Capturing the essence of any patient data opens up unique population health solutions. This search technique can be leveraged for blind clustering of patients after grouping text fragments from social determinants, vitals, demographics, allergies, labs, medicines, attending physicians, or any other factor. Patient correlations previously hidden, in part by data anomalies, will provide insights into treating the patient population as a whole.

Fault-tolerant address cleanup and matching can give providers early warnings of potential carriers of infectious diseases even if the patient is asymptomatic. Relatedly, normalized data coupled with semantic inferences applied to name fields are beneficial for opioid usage and benefits eligibility checking.

Beyond these examples, fault-tolerant matching is the foundation for “query by example” solutions. Document clustering is the original use case for text fragmenting technologies and is directly applicable to grouping clinical documents with template language or copy/pasted text and is very useful for clinical documentation improvement.

Final Words

EHRs are large, complex systems and cannot excel in every functional category. No amount of user training or data cleaning will ever create an error-free MPI, and older technologies that assume clean data have outlived their usefulness for healthcare searches. Search and matching techniques need to assume poor data quality to reduce duplicate and overlaid records.

Modern pattern matching techniques are better suited to the characteristics of MPI data. Pattern matching on text fragments rather than entire words works well for several classes of healthcare IT problems. There is an opportunity to improve health data quality at the same time that patient safety, satisfaction, and revenue cycle are improved.

Lastly, the Office of the National Coordinator for Health IT’s EHR interoperability regulations were published in May, and the anti-blocking provisions are substantial for providers, vendors, and the newly qualified healthcare information networks (QHINs). Failing to find a patient record, returning a partial patient record, or worse, returning another patient’s record, all carry potentially hefty fines for “interfering” with a patient’s access to electronic data. There has never been a better time to update patient search technologies.


James Hoover is executive vice president, strategy, of Medarcus and co-chair of the health data integrity committee, California Health Information Management Association.
