For decades, healthcare researchers have faced a frustrating obstacle before they can even begin their work: deciphering what their data actually means. Medical records use proprietary codes that vary from hospital to hospital, requiring researchers to spend hundreds of hours manually mapping these codes to standardized formats before analysis can begin.
It's a problem that has persisted despite federal mandates and efforts from major technology companies, until now. Gary Farrell, a Backend Developer at the University of Pittsburgh Cloud Innovation Center (CIC), powered by AWS, has developed an AI-powered solution that could unlock millions of dollars in research capacity across the country.

The Challenge: Standardized vs. Proprietary Codes
When Dr. Christopher Horvat, associate professor of critical care medicine, of pediatrics, and of biomedical informatics, approaches a new research project, he faces the complex task of data mapping. Electronic health records (EHRs) store critical patient information like lab results, vital signs, and medications using codes that are unique to each healthcare system.
“Take something as simple as a sodium level. If you search the EHR, you’ll find 329 different variables with ‘sodium’ in the name. One might look right, but unless you actually query and validate it, you could be looking at a urine sodium, a sodium supplement, or finally the blood sodium you meant to find,” explains Horvat.
This data mapping challenge affects researchers across the country. Federal legislation has mandated that healthcare systems adopt standardized coding vocabularies including LOINC (Logical Observation Identifiers Names and Codes) for lab results and clinical observations, SNOMED (Systematized Nomenclature of Medicine) for clinical terminology, and RxNorm for medications, but implementation has been slow. Meanwhile, researchers continue to spend significant portions of their project timelines on data preparation rather than actual research.
“We’ve spent hundreds of hours over the years making sure we’re actually getting the data we think we need,” says Horvat. “This complexity is the single biggest barrier to meaningful use of EHR data, and a major reason why hospitals still struggle to share data reliably.”
The Solution: AI-Powered Mapping
Dr. Horvat brought this challenge to the Pitt CIC with a vision: could artificial intelligence and cloud computing automate the initial mapping process?
Working with synthetic data derived from thousands of real patient encounters, Farrell developed a two-step solution. First, embedding-based semantic similarity narrows the full vocabulary of standard codes to a small candidate pool, and the correct code remains in that pool more than 98% of the time. A large language model (LLM) then evaluates these candidates using the average values or categories associated with the proprietary code, along with the relative frequency of each standard code in real-world usage when available, to create a standardized mapping.
“By using a two-step pipeline, we were able to drastically increase our accuracy while cutting our costs, evaluating 30 candidates instead of thousands,” Farrell says.
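For readers who want a feel for the two-step idea, here is a minimal Python sketch. It is not the EHR Code Mapper's code: the bag-of-words `embed` function stands in for a real embedding model, `pick_best` is a simple rule standing in for the LLM's judgment, and the codes, value ranges, and frequency ranks are invented for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class StandardCode:
    code: str             # hypothetical identifier, NOT a real LOINC code
    display: str
    typical_range: tuple  # (low, high); illustrative, not clinical reference values
    frequency_rank: int   # 1 = most commonly used in practice (assumed)


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(local_name, vocabulary, k=30):
    """Step 1: keep the k standard codes most similar to the local name."""
    query = embed(local_name)
    ranked = sorted(vocabulary,
                    key=lambda c: cosine(query, embed(c.display)),
                    reverse=True)
    return ranked[:k]


def pick_best(candidates, mean_value):
    """Step 2 placeholder: the described pipeline asks an LLM here.

    This stub prefers candidates whose assumed range contains the observed
    mean value, then breaks ties by frequency rank.
    """
    def score(c):
        low, high = c.typical_range
        return (not (low <= mean_value <= high), c.frequency_rank)
    return min(candidates, key=score)


VOCABULARY = [
    StandardCode("EX-1", "sodium serum or plasma", (120, 160), 1),
    StandardCode("EX-2", "sodium urine", (20, 220), 2),
    StandardCode("EX-3", "sodium chloride tablet dose", (500, 2000), 3),
    StandardCode("EX-4", "potassium serum or plasma", (3, 6), 1),
]

# A local variable named "Sodium Level" whose recorded values average 139.
candidates = shortlist("Sodium Level", VOCABULARY, k=3)
best = pick_best(candidates, mean_value=139.0)
print(best.code, best.display)
```

Name similarity alone ranks the urine test highest in this toy example; the value and frequency signals in the second step are what steer the choice toward the blood test.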
The solution does more than map codes: it outputs results following FHIR (Fast Healthcare Interoperability Resources) standards, the format required for seamless data exchange across healthcare systems. It also provides reasoning and a “frequency rank” for each mapping, allowing researchers to quickly validate the suggestions rather than starting from scratch.
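As a rough illustration of what a FHIR-style mapping record can look like, the Python sketch below builds a FHIR R4 ConceptMap resource as a plain dictionary. The article does not say which FHIR resource the tool emits, so the choice of ConceptMap is an assumption, as are the function name, the local code `LAB123`, the `urn:example:` system, and the decision to carry the reasoning and frequency rank in the `comment` field.

```python
import json


def to_concept_map(local_system, local_code, local_display,
                   target_code, target_display, reasoning, frequency_rank):
    """Build a minimal FHIR R4 ConceptMap for one local-to-LOINC mapping."""
    return {
        "resourceType": "ConceptMap",
        "status": "draft",  # a suggestion awaiting researcher review
        "group": [{
            "source": local_system,
            "target": "http://loinc.org",
            "element": [{
                "code": local_code,
                "display": local_display,
                "target": [{
                    "code": target_code,
                    "display": target_display,
                    "equivalence": "equivalent",
                    # Assumption: reasoning and rank stored as free text.
                    "comment": f"{reasoning} (frequency rank: {frequency_rank})",
                }],
            }],
        }],
    }


concept_map = to_concept_map(
    local_system="urn:example:local-ehr-codes",  # hypothetical local code system
    local_code="LAB123",                         # hypothetical proprietary code
    local_display="Sodium Level",
    target_code="2951-2",
    target_display="Sodium [Moles/volume] in Serum or Plasma",
    reasoning="Mean value is consistent with a blood sodium measurement",
    frequency_rank=1,
)
print(json.dumps(concept_map, indent=2))
```

Marking the record as `draft` reflects the workflow described above, in which a researcher validates each suggestion before relying on it.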

True to the CIC's mission of solving real-world challenges for the public good, the EHR Code Mapper is available as an open-source solution on GitHub. This means any healthcare researcher can access, use, and build upon this work.
“This project broadened my horizons to see that every field has needs for computer and data science. Healthcare wasn’t on my radar before, but now I see that any industry dealing with data has problems I can help solve,” says Farrell.
Broader Applications
While this solution was designed for healthcare research, the underlying approach of using AI to map proprietary or inconsistent data formats to standardized vocabularies has applications far beyond medicine. Any field dealing with data standardization challenges, from manufacturing quality control systems to financial transaction codes, could benefit from similar intelligent mapping techniques. The combination of embeddings and large language models could help organizations across industries unlock the value of their siloed data systems.
Supporting Artifacts
Interested in seeing how the EHR Code Mapper could benefit your research? Access the code, a demo, and technical documentation on GitHub.
Have your own project idea? The Pitt Cloud Innovation Center accepts project proposals from University of Pittsburgh staff and faculty in health sciences and athletics. Submit your idea today to see how cloud innovation can accelerate your work.
The University of Pittsburgh Cloud Innovation Center, powered by AWS, builds impactful, scalable solutions using cloud computing, artificial intelligence, and machine learning. With a focus on health sciences and athletics, the Pitt CIC delivers open-source proof-of-concept solutions that address real-world challenges in the fields of health science and sports analytics.
Learn more at digital.pitt.edu/cic and follow along with us on LinkedIn.