Health Data

The Role of Observational Research and the Common Data Model

Large-scale observational research has unlocked new ways to study health and healthcare delivery. By using real-world data, researchers and health systems can evaluate patterns not captured in traditional clinical trials. Yet, data quality remains a critical challenge for health information (HI) professionals.

The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), developed and maintained by the Observational Health Data Sciences and Informatics (OHDSI) consortium, has become a widely accepted framework for representing large-scale observational data. By transforming electronic health records (EHRs), claims, and other clinical data into a standardized structure, the CDM enables researchers and health systems to run consistent analyses across institutions and countries. This shared model supports studies on everything from adverse drug events to population-level comorbidities, offering opportunities to generate insights that are difficult to capture through traditional clinical trials.

To model data in this format, the consortium and its members have developed a series of scripts that execute the extract, transform, and load (ETL) processes required to populate the CDM. In simpler terms, ETL is the process of moving data from one system into another in a consistent, standardized format. The consortium also maintains a vocabulary service that maps local source codes to standard vocabularies, with SNOMED CT® serving as the primary standard vocabulary for clinical conditions.
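To make this concrete, the following is a minimal sketch of how an ETL step can resolve a local ICD-10-CM code to its OMOP standard concept (typically a SNOMED CT concept for conditions) using the CDM vocabulary tables CONCEPT and CONCEPT_RELATIONSHIP and the "Maps to" relationship. The SQLite file name and the idea of a separate local vocabulary extract are assumptions for illustration; in practice the vocabulary tables usually live in the CDM database itself.

```python
import sqlite3

# Resolve a local ICD-10-CM code to its OMOP standard concept(s) via the
# "Maps to" relationship. Assumes a local SQLite copy of the OMOP vocabulary
# tables named "vocab.db" (illustrative only).

MAPS_TO_SQL = """
SELECT std.concept_id, std.concept_name, std.vocabulary_id
FROM concept src
JOIN concept_relationship rel
  ON rel.concept_id_1 = src.concept_id
 AND rel.relationship_id = 'Maps to'
JOIN concept std
  ON std.concept_id = rel.concept_id_2
 AND std.standard_concept = 'S'
WHERE src.vocabulary_id = 'ICD10CM'
  AND src.concept_code = ?;
"""

def map_icd10cm_to_standard(conn: sqlite3.Connection, icd10cm_code: str):
    """Return the standard (usually SNOMED CT) concepts an ICD-10-CM code maps to."""
    return conn.execute(MAPS_TO_SQL, (icd10cm_code,)).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect("vocab.db")  # assumed local vocabulary extract
    for row in map_icd10cm_to_standard(conn, "H44.50"):
        print(row)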

Expanding Relevance for HI Professionals

OMOP CDM is more than a research framework—it is becoming part of the operational fabric of healthcare. The same skills that HI professionals use daily in coding, compliance, auditing, and data governance are directly relevant to OMOP adoption. Through the AHIMA Data Quality Management (DQM) model, these competencies can be mapped across the CDM lifecycle: terminology mapping supports the collection of standardized data, ETL quality checks strengthen warehousing, and secondary data validation ensures reliable analysis.

As organizations increasingly rely on artificial intelligence (AI), automation, and secondary use of EHR data, the stakes of getting data quality right are higher than ever. HI professionals are uniquely positioned to protect the integrity of clinical intent by ensuring that terminologies, mappings, and pipelines reflect the nuance of provider documentation. In doing so, they not only enable trustworthy research but also support reimbursement, compliance, and governance functions that impact everyday operations.

This convergence of traditional HI expertise with OMOP-driven analytics highlights an expanded professional role—one that bridges clinical accuracy, regulatory requirements, and data science methods to meet the demands of a rapidly evolving healthcare data ecosystem.

Here are some of the key benefits of using the CDM:

  • For HI professionals: more accurate risk adjustment and new career opportunities in informatics and data governance.
  • For hospitals and health systems: stronger population health analytics, easier participation in research consortia, and improved quality reporting.
  • For healthcare as a whole: standardized models enable cross-country comparisons, accelerate drug safety studies, and create faster pathways for generating real-world evidence.

Why Data Quality Matters

The value of the CDM depends on the accuracy of the data that feeds into it. If terminology mappings are incomplete or imprecise, critical clinical details can be lost in translation. For HI professionals, these gaps directly affect the reliability of downstream analytics and the trustworthiness of the research built on them.

While OHDSI’s approach provides a strong foundation, relying solely on the CDM mapping tables can reduce granularity. Nuances such as laterality, specificity, or emerging diagnosis concepts may be lost when data passes through limited crosswalks. This is especially important since physicians select diagnoses using interface terminology within the EHR that might not map cleanly to a single SNOMED CT® code.
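One way to surface this kind of granularity loss is a simple data-quality check that flags standard concepts onto which many distinct ICD-10-CM codes collapse, since laterality and other specifics often disappear at exactly those points. The sketch below is a minimal example that assumes the same local vocabulary extract as above; the threshold is arbitrary.

```python
import sqlite3

# Flag standard concepts that absorb many distinct ICD-10-CM codes, a quick
# way to spot places where specificity or laterality may be collapsing.

COLLAPSE_SQL = """
SELECT std.concept_id,
       std.concept_name,
       COUNT(DISTINCT src.concept_code) AS n_source_codes
FROM concept src
JOIN concept_relationship rel
  ON rel.concept_id_1 = src.concept_id
 AND rel.relationship_id = 'Maps to'
JOIN concept std
  ON std.concept_id = rel.concept_id_2
WHERE src.vocabulary_id = 'ICD10CM'
GROUP BY std.concept_id, std.concept_name
HAVING COUNT(DISTINCT src.concept_code) > 5   -- arbitrary review threshold
ORDER BY n_source_codes DESC;
"""

conn = sqlite3.connect("vocab.db")  # assumed local vocabulary extract
for concept_id, name, n_codes in conn.execute(COLLAPSE_SQL).fetchmany(20):
    print(f"{concept_id}\t{n_codes} source codes\t{name}")
```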

Enhancing these mappings creates an opportunity to improve the usability and reliability of insights generated from OMOP-based studies.

Challenges at Multiple Levels

Here are some of the challenges when using the CDM:

  • For HI professionals: potential loss of clinical detail in mappings, the need for upskilling in structured query language (SQL) and data science, and increased responsibility for ensuring governance and compliance.
  • For hospitals and health systems: complex ETL processes, resource and cost constraints, and variability in source data quality.
  • For healthcare as a whole: ongoing need to align OMOP with existing standards like Fast Healthcare Interoperability Resources (FHIR), gaps in terminology coverage that affect comparability, and challenges in achieving consistent adoption across organizations and countries.

To address the potential loss of clinical detail in mappings, our study team incorporated Intelligent Medical Objects (IMO) terminology services directly into the ETL process. Rather than mapping each diagnosis to an ICD-10-CM code first, this approach leveraged IMO’s Precision Normalize™ API to expand mappings at the source. We hypothesized that this method would enrich the relationships among interface terminology items, ICD-10-CM codes, and SNOMED CT codes before loading into the CDM.
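The sketch below illustrates the general shape of that enhanced ETL step. It is not the actual implementation: the Precision Normalize™ API is a commercial service whose interface is not described here, so `precision_normalize` is a hypothetical placeholder standing in for that terminology service call.

```python
from typing import TypedDict

class NormalizedTerm(TypedDict):
    source_term: str          # interface terminology item as documented in the EHR
    icd10cm_codes: list[str]  # one-to-many ICD-10-CM relationships
    snomed_codes: list[str]   # one-to-many SNOMED CT relationships

def precision_normalize(source_term: str) -> NormalizedTerm:
    """Hypothetical wrapper for the terminology service call (not the real API)."""
    raise NotImplementedError("replace with the actual terminology service integration")

def enhanced_etl_rows(source_term: str) -> list[dict]:
    """Expand one interface term into all of its standard mappings before loading,
    rather than reducing it to a single ICD-10-CM code first (the 'legacy' path)."""
    normalized = precision_normalize(source_term)
    return [
        {"source_term": source_term,
         "icd10cm_codes": normalized["icd10cm_codes"],
         "snomed_code": snomed}
        for snomed in normalized["snomed_codes"]
    ]
```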

The team tested the method by running two ETLs on a synthetic dataset based on more than 163,000 unique diagnoses from Weill Cornell Medicine’s EHR. One ETL used the “legacy” ICD-10-first approach; the other used IMO-enhanced mapping. The team then measured, for each ICD-10-CM code, how many additional patients the enhanced pipeline captured.
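A comparison like this can be expressed as a small analysis over the two ETL outputs. The sketch below assumes each run produced a table of (ICD-10-CM code, patient ID) pairs; the file and column names are illustrative, not those used in the study.

```python
import pandas as pd

# Compare distinct patient counts per ICD-10-CM code between the two ETL runs.
legacy = pd.read_csv("legacy_etl_diagnoses.csv")      # columns: icd10cm_code, person_id
enhanced = pd.read_csv("enhanced_etl_diagnoses.csv")  # same layout

def patients_per_code(df: pd.DataFrame) -> pd.Series:
    """Count distinct patients captured for each ICD-10-CM code."""
    return df.groupby("icd10cm_code")["person_id"].nunique()

comparison = (
    pd.concat(
        {"legacy": patients_per_code(legacy), "enhanced": patients_per_code(enhanced)},
        axis=1,
    )
    .fillna(0)
    .astype(int)
)
comparison["additional_patients"] = comparison["enhanced"] - comparison["legacy"]

# Codes where the enhanced pipeline captured more patients, largest gains first
print(comparison[comparison["additional_patients"] > 0]
      .sort_values("additional_patients", ascending=False)
      .head(10))
```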

What the Research Shows

The results were striking. The IMO-enhanced ETL identified 4,473 ICD-10-CM codes for which additional patients were captured compared with the legacy approach. Some codes showed increases of thousands of patients. For example, Z40.00 (Encounter for prophylactic removal of unspecified organ) jumped from 3 patients to 6,336 with the enhanced process. In practice, this means thousands more patients were correctly identified, which can have a major impact on reporting and quality programs.

These findings suggest that leveraging advanced terminology services within the ETL process more faithfully represents clinical intent—critical for research, quality reporting, and population health analytics.

Table 1. Five ICD-10 codes most impacted by alternate ETL method

ICD-10 code | Diagnosis name | Patients (legacy ETL) | Patients (enhanced ETL)
Z40.00 | Encounter for prophylactic removal of unspecified organ | 3 | 6,336
H44.50 | Unspecified degenerated conditions of globe | 1 | 8,519
Z00.71 | Encounter for examination for period of delayed growth in childhood with abnormal findings | 0 | 783
H44.9 | Unspecified disorder of globe | 0 | 6,922
S76.001A | Unspecified injury of muscle, fascia and tendon of right hip, initial encounter | 1 | 162

Lessons for HI Professionals

This study was conducted using a synthetic dataset derived from Weill Cornell Medicine’s EHR, which allowed us to evaluate the ETL approach without involving patient-level data. While the results are promising, they may not fully capture the complexity of real-world clinical documentation. Work continues with real patient EHR data and with exploring how these techniques can be integrated into HI operations.

As AI and automation reshape healthcare, HI professionals can leverage their existing strengths in coding, compliance, and governance to remain essential connectors in new realms—including research-facing systems like OMOP. Their expertise in ensuring documentation integrity, validating mappings, and safeguarding compliance translates directly into roles such as clinical data quality specialists, ETL/pipeline reviewers, and terminology analysts, all of whom help maintain data quality across the lifecycle.

To support these responsibilities, HI professionals can expand their skills through training and practical experience in areas such as:

  • Understanding the OMOP CDM—its purpose, structure, and role in research and operations
  • ETL processes—loading data into the CDM and validating extract-transform-load pipelines
  • Standards and vocabularies—SNOMED CT, LOINC, RxNorm, ICD-10-CM, and their mappings within OMOP
  • Data science foundations—SQL, Python, R for querying and validating large datasets
  • Cohort analytics tools—designing and validating patient cohorts using platforms like ATLAS (see the query sketch after this list)
  • Interoperability frameworks—FHIR and USCDI for aligning OMOP with broader industry standards
  • Governance and privacy practices—embedding compliance and ethical use into secondary data workflows
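
As a taste of the SQL and cohort skills listed above, here is a minimal sketch that counts distinct patients with a given standard condition concept directly against the CDM's CONDITION_OCCURRENCE table. The database file name and the example concept ID are assumptions for illustration; ATLAS builds far richer cohort definitions on the same tables.

```python
import sqlite3

# Minimal cohort-style query against an OMOP CDM instance: count distinct
# patients whose condition_occurrence records carry a given standard concept.
# "omop_cdm.db" is an assumed local CDM extract.

COHORT_COUNT_SQL = """
SELECT COUNT(DISTINCT person_id) AS n_patients
FROM condition_occurrence
WHERE condition_concept_id = ?;
"""

def count_patients_with_condition(conn: sqlite3.Connection, concept_id: int) -> int:
    (n_patients,) = conn.execute(COHORT_COUNT_SQL, (concept_id,)).fetchone()
    return n_patients

if __name__ == "__main__":
    conn = sqlite3.connect("omop_cdm.db")  # assumed local CDM extract
    # 201826 is the concept_id commonly used in OHDSI examples for type 2
    # diabetes mellitus; substitute the concept relevant to your own cohort.
    print(count_patients_with_condition(conn, 201826))
```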

OHDSI and partner organizations offer training resources that cover many of these topics, providing hands-on guidance for professionals who want to strengthen their organization’s data quality practices using OMOP.


This article was written by Evan Sholle, MS, David Haines, MBA, Chandan Ravishankar, BS, Tejaswini Viswanath, BS, Merlin Simoes, BS, Daniel Timke, MS, Sajjad Abedian, MS, Curtis Cole, MD, FACP, Frank Naeymi-Rad, PhD, Jennifer Mueller, MBA, RHIA, SHIMSS, FACHE, FAHIMA, FACHDM, and Trayanna Whiting, MSN, MS, RN, NI-BC, CHDA. Sholle, Abedian, and Cole are with the Department of Population Health Sciences at Weill Cornell Medicine in New York. Haines, Ravishankar, Timke, and Naeymi-Rad are with Leap of Faith in Libertyville, IL. Viswanath and Simoes are with the Illinois Institute of Technology in Chicago. Mueller is AHIMA Senior Vice President, Health Information Career Advancement, and Whiting is Education Director-Data Analytics & Informatics at AHIMA.