Before healthcare researchers can analyze patient data, they face a critical obstacle: removing all protected health information (PHI) to protect patient privacy. This de-identification process isn't just a formality; it's a legal requirement for research under HIPAA.
For researchers working with millions of clinical notes, this becomes a massive operational challenge. Dr. Gilles Clermont, Professor of Critical Care Medicine and Vice Chair for Research Operations, brought this problem to the University of Pittsburgh Cloud Innovation Center (CIC), powered by AWS. The result is a solution that cuts processing time from weeks to hours while giving researchers the flexibility to configure privacy rules for different studies.
The Challenge: Balancing Privacy, Speed, and Flexibility
When Dr. Clermont approached the CIC, his team spent roughly a week processing millions of clinical notes, relying on a static list of PHI or context-specific information. The existing system worked, but the team knew it could be improved.
"Processing millions of clinical notes for de-identification took us about a week, and that was just for one configuration. Every time a new research study came along with different privacy requirements, we had to start over from scratch. It wasn't just time-consuming; it was limiting what research we could realistically take on."
— Dr. Gilles Clermont, Professor of Critical Care Medicine, Mathematics, Clinical and Translational Science, and Industrial Engineering
The CIC student interns had three main problems to solve. First, the new solution needed to be scalable: processing millions of records took at least a week, which could delay research timelines. Second, it needed to be flexible. Because different studies require different privacy levels (Safe Harbor de-identification vs. a Limited Data Set), the solution needed to switch between configurations without reprocessing the data. Third, it needed to reliably differentiate between clinical notes in various formats from different EHR systems.
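The second requirement, switching privacy configurations without reprocessing, can be pictured as "detect once, redact many times." The sketch below is a minimal illustration in Python, not the CIC team's implementation: the `Span` type, the `PROFILES` table, and the category names are all hypothetical, and the two profiles are heavily simplified (a real Limited Data Set has more detailed rules than "keep dates").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected identifier: character offsets plus a category label."""
    start: int
    end: int
    category: str


# Hypothetical, simplified profiles: which categories each one redacts.
PROFILES = {
    "safe_harbor": {"NAME", "DATE", "MRN", "PHONE"},
    "limited_data_set": {"NAME", "MRN", "PHONE"},  # dates are retained
}


def redact(text: str, spans: list[Span], profile: str) -> str:
    """Apply a profile to previously detected spans (assumed non-overlapping)."""
    categories = PROFILES[profile]
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.category not in categories:
            continue  # this profile allows the identifier to stay
        pieces.append(text[cursor:span.start])
        pieces.append(f"[{span.category}]")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "John Doe seen on 03/14/2021, MRN 1234567."
# Detection runs once; its output is stored alongside the note.
spans = [Span(0, 8, "NAME"), Span(17, 27, "DATE"), Span(33, 40, "MRN")]

print(redact(note, spans, "safe_harbor"))
print(redact(note, spans, "limited_data_set"))
```

Because the expensive detection step is separated from the cheap rendering step, a new study with different rules would only need a new profile entry rather than another pass over the notes.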
The Solution: Configurable, Scalable PHI De-Identification
CIC student interns Ava Luu and Misran Mohammed developed the PHI De-Identification tool, a comprehensive de-identification system that addresses each of these challenges through its configurable design and cloud-native architecture.
The solution features a user-friendly frontend that allows reviewers to view original and redacted text side by side, edit the redaction when needed, and approve the note.

"When working with healthcare data, privacy protection is a top priority. Our main focus was ensuring the solution was reliable with exhaustive testing and evaluation."
— Ava Luu, Student Developer, Pitt CIC
Behind the scenes, the system uses a PHI identification mechanism that detects the 18 categories of identifiers listed under HIPAA's Safe Harbor method, including names, addresses, dates, medical record numbers, and other sensitive identifiers.
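The article does not describe how detection works internally, so the following is only a rough sketch of the general idea: scan a note and emit labeled spans. It uses plain regular expressions for a few pattern-friendly identifier types. The `PATTERNS` table, the `detect_phi` function, and the specific formats (for example, an "MRN" prefix followed by 6 to 10 digits) are assumptions made for illustration; identifiers such as names and addresses generally need more than regex matching.

```python
import re

# Hypothetical patterns for a handful of identifier categories.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def detect_phi(text: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, category, matched_text) tuples sorted by position."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), category, match.group()))
    return sorted(found)


sample = "Seen 03/14/2021. MRN: 1234567. Call 412-555-0100."
for start, end, category, value in detect_phi(sample):
    print(f"{category:<6} {start:>3}-{end:<3} {value}")
```

Keeping the output as offsets and categories, rather than rewriting the text immediately, is what lets later steps (review, profile-specific redaction) work from the same detection results.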
The solution is built on an AWS serverless architecture that directly addresses the original challenges. Parallel processing allows multiple records to be handled simultaneously, dramatically reducing processing time; the system seamlessly handles structured and unstructured metadata from different EHR systems; and human-in-the-loop validation presents detected PHI to researchers for confirmation.
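To make the combination of parallel processing and human-in-the-loop validation concrete, here is a small local sketch. It is not the deployed AWS architecture: a thread pool from Python's standard library stands in for serverless fan-out, a single date regex stands in for the real detector, and the record fields (`status`, `reviewer`) and function names are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Stand-in detector: one date pattern instead of a full PHI model.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def process_note(note: tuple[int, str]) -> dict:
    """Detect spans in one note and queue the result for human review."""
    note_id, text = note
    spans = [(m.start(), m.end()) for m in DATE.finditer(text)]
    return {"note_id": note_id, "spans": spans, "status": "pending_review"}


def process_batch(notes: list[tuple[int, str]], max_workers: int = 4) -> list[dict]:
    """Process notes concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_note, notes))


def approve(record: dict, reviewer: str) -> dict:
    """Return a copy of the record marked as confirmed by a reviewer."""
    return {**record, "status": "approved", "reviewer": reviewer}


notes = [(1, "Visit 01/02/2020"), (2, "No dates here")]
records = process_batch(notes)
records[0] = approve(records[0], reviewer="reviewer_a")
print(records)
```

Each note is independent of the others, which is why this kind of workload spreads naturally across many workers, while the `pending_review` state keeps a person in the loop before anything is released.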
“Anyone building AI for healthcare at scale must balance accuracy and speed – sacrificing one for the other isn’t an option when patient privacy is at stake. Clinical notes are large, messy, and full of edge cases across different EHR systems. We built on AWS’s HIPAA-eligible infrastructure and created a synthetic dataset from open-source FHIR bundles to validate both. Realistic notes let us measure detection accuracy against known PHI while stress-testing the system at its limits.”
Misran Mohammed, Student Developer, Pitt CIC
"This tool will fundamentally change our data preparation workflow. We'll be able to process large datasets once with high confidence, rather than rerunning them for every request. This will significantly increase our research capacity by eliminating the long wait researchers currently face for each request," says Dr. Clermont.
Broader Applications: Privacy-Preserving Research Across Disciplines
While developed for clinical research, this solution has applications across any field that handles sensitive personal information. Legal research, social services, educational studies, and government programs all face similar challenges in balancing data utility with privacy protection.
The configurable nature of the PHI De-Identification tool means it can be adapted to privacy frameworks beyond HIPAA, and the architecture is designed to scale to very large datasets.
Supporting Artifacts
Interested in implementing this solution for your research? Access the code, technical documentation, and demo on GitHub.
Have your own project idea? The Pitt Cloud Innovation Center accepts project proposals from University of Pittsburgh staff and faculty in health sciences and athletics. Submit your idea today to see how cloud innovation can accelerate your work.
The University of Pittsburgh Cloud Innovation Center, powered by AWS, builds impactful, scalable solutions using cloud computing, artificial intelligence, and machine learning. Pitt CIC delivers open-source proof-of-concept solutions that address real-world challenges in health sciences and sports analytics.
Learn more: digital.pitt.edu/cic