
Extraction of Structured Data From Electronic Health Records Using Natural Language Processing


An Electronic Health Record (EHR) is a health record that documents each patient visit, called an encounter, along with supplemental documentation such as laboratory results, radiology results, and patient handouts. Each visit record contains the patient’s demographics, medical history, family history, vital signs, diagnoses, medications, treatment plans, immunizations, allergies, radiology images, laboratory and test results, and administrative and billing data.

Providers periodically send medical record documents to the insurance company in PDF, TIFF, or TXT format as proof of services provided. The information for each visit is appended, and all visits are sent together as a single file. An EHR provides a patient’s complete medical history across multiple providers. Its collaborative nature is its main advantage: it is made to be shared among healthcare professionals (including labs, emergency rooms, and pharmacies), enabling patients to take their records with them from one clinic to another. Each EHR typically contains hundreds of pages. A Medicare patient with one chronic disease sees an average of nine to 14 different providers in a given year. Providers capture this data in their EHR and then share it with the payer.


An EHR contains valuable information about the patient: demographics, visit date and time, family history, diagnoses, detected diseases and their corresponding ICD-10 codes, prescribed drugs, procedure codes, provider information, and so on. However, because it is entered as notes by the physician, this data is usually in an unstructured, free-text format. EHR systems from different vendors use different formats, making it difficult for healthcare providers to access and share patient data. Interoperability standards like FHIR (Fast Healthcare Interoperability Resources) are becoming popular, but supporting them requires converting the historical free-text patient data held in traditional EHR systems to FHIR format. Also, because EHR data is unstructured and its formats are non-standard, it is not possible to run predictive analytics directly on the patient data or to aggregate it for patient population analytics. To support such analytics, the text needs to be converted to structured data by interpreting the conversational-style language used by the Provider in the EHR.

A typical Progress Note within an EHR looks like this –

Recent Natural Language Processing (NLP) advances, augmented with deep learning and novel Transformer-based architectures, offer new avenues for extracting structured data from unstructured clinical records. This structured data can then be used for descriptive and predictive analytics. The various natural language processing techniques used to extract structured EHR data are listed below.

File Pre-processing

The first step in extracting structured data from an EHR is to extract the full text while preserving its context.

The digital or scanned PDF or the TIFF file is converted into page images.
Page images are sharpened and straightened using de-skewing.
Text and associated metadata are extracted from the image using Optical Character Recognition (OCR), image processing, and core NLP. The OCR engine metadata includes bounding boxes for blocks, paragraphs, lines, and words. The image processing metadata includes headers, footers, layouts (single or split paragraphs), tables, figures, and signatures. The NLP metadata contains sentence boundaries.
The OCR text from different blocks on a page is aligned based on coordinates. An intermediate file is created with all the metadata embedded and the OCR output.
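
As a rough illustration of coordinate-based alignment, the sketch below sorts OCR blocks into reading order. The `Block` fields, the `reading_order` function, and the `line_tolerance` value are assumptions made for this example; a real OCR engine exposes its own bounding-box structure.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x: float       # left edge of the bounding box
    y: float       # top edge of the bounding box
    height: float  # box height, used to decide whether two blocks share a row


def reading_order(blocks, line_tolerance=0.5):
    """Sort blocks top-to-bottom, then left-to-right within each row."""
    rows = []
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        # A block joins the current row when its vertical offset from the
        # row's first block is small relative to that block's height.
        if rows and abs(block.y - rows[-1][0].y) <= line_tolerance * rows[-1][0].height:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b.x)]


blocks = [
    Block("BP: 120/80", x=300, y=98, height=20),
    Block("Vitals", x=40, y=100, height=20),
    Block("Chief Complaint", x=40, y=40, height=20),
]
ordered_text = [b.text for b in reading_order(blocks)]
```

Grouping into rows before sorting by `x` keeps two blocks on the same visual line together even when their top edges differ by a few pixels after de-skewing.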

Document Layout Detection

The documents can have different layouts based on the EHR systems they came from. Page images are sent through the layout detector (based on image processing), and header, footer, title, fonts, figures, tables, signatures, and stamp areas are detected. This data is embedded as metadata in the intermediate file.

Document Boundary and Type Detection

Each EHR file consists of multiple documents such as Progress Notes, Prescriptions, Lab reports, and Physician letters. The Progress Note contains the patient visit information and is the most crucial document. Document boundaries are detected from the page numbers in headers and footers, and from the title that marks the start of the following document. All pages with continuously incrementing page numbers are considered part of a single document. The document type is detected by matching the title against a repository of titles for each document type.
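
The page-number rule can be sketched as follows. This is a simplified illustration, not the production logic: `split_documents` is a hypothetical helper, it assumes the printed page number has already been read from the header or footer, and it ignores the title-based signal.

```python
def split_documents(page_numbers):
    """Group page indices into documents by runs of consecutive page numbers.

    page_numbers[i] is the number printed on page i of the file,
    or None when no number was detected on that page.
    """
    documents = []
    previous = None
    for index, number in enumerate(page_numbers):
        continues = (
            bool(documents)
            and number is not None
            and previous is not None
            and number == previous + 1
        )
        if continues:
            documents[-1].append(index)
        else:
            # Any break in the numbering starts a new document.
            documents.append([index])
        previous = number
    return documents


printed = [1, 2, 3, 1, 2, 1]
boundaries = split_documents(printed)
```

In this sketch a page without a detected number always starts a new document, which is one reason a title match is needed as a second signal.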

Table Detection and Cell Extraction

Tables are information-rich structured objects. However, OCR engines tend to jumble table data, and the structure needs to be recovered. To add back the structure, table and cell boundaries are detected, and the word boundaries from OCR are matched to reproduce the table structure in the intermediate file.
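
One simple way to match OCR words to detected cells is to test which cell contains the centre of each word box. The sketch below assumes that cell boundaries are already available as rectangles; the function name and the data shapes are illustrative only.

```python
def assign_words_to_cells(words, cells):
    """Place each OCR word in the cell that contains the centre of its box.

    words: list of (text, (x0, y0, x1, y1))
    cells: dict mapping (row, col) -> (x0, y0, x1, y1)
    """
    table = {key: [] for key in cells}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for key, (cx0, cy0, cx1, cy1) in cells.items():
            if cx0 <= cx < cx1 and cy0 <= cy < cy1:
                table[key].append((x0, text))
                break
    # Words inside a cell are joined left-to-right by their x position.
    return {key: " ".join(t for _, t in sorted(items)) for key, items in table.items()}


cells = {
    (0, 0): (0, 0, 100, 20), (0, 1): (100, 0, 200, 20),
    (1, 0): (0, 20, 100, 40), (1, 1): (100, 20, 200, 40),
}
words = [
    ("mg/dL", (150, 22, 190, 38)),
    ("Glucose", (5, 22, 60, 38)),
    ("Test", (5, 2, 40, 18)),
    ("Result", (105, 2, 160, 18)),
    ("110", (105, 22, 140, 38)),
]
table = assign_words_to_cells(words, cells)
```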

Progress Note Detection

The Progress Note is the most essential document, as it contains all the information on the patient’s visit. Progress Notes from different EHR files look different and have different titles. A repository of titles for Progress Notes and other documents is used to match the document titles and detect Progress Note documents and their boundaries. The section names on a page also indicate that the document is a Progress Note. In addition, a TF-IDF approach that looks at the characteristic words on a page is used to detect whether the page belongs to a particular document type.
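
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch that scores a page against one reference text per document type using cosine similarity. The reference word lists, labels, and function names are invented for the example, and the smoothing in the IDF formula is one of several common variants.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document, plus the IDF table."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: (freq / len(doc)) * idf[term] for term, freq in Counter(doc).items()}
        for doc in documents
    ]
    return vectors, idf


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical reference vocabulary for two document types.
reference = {
    "progress_note": "chief complaint history of present illness assessment plan".split(),
    "lab_report": "specimen collected result reference range units flag".split(),
}
vectors, idf = tfidf_vectors(list(reference.values()))


def classify(page_tokens):
    """Return the document type whose reference vector is closest to the page."""
    counts = Counter(t for t in page_tokens if t in idf)
    total = sum(counts.values()) or 1
    page = {t: (c / total) * idf[t] for t, c in counts.items()}
    scores = {label: cosine(page, vec) for label, vec in zip(reference, vectors)}
    return max(scores, key=scores.get)


label = classify("patient presents with chief complaint of cough assessment and plan".split())
```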

Section Detection

Progress Notes in an EHR contain sections such as Chief Complaint, Discharge Summary, Present Illness, Personal Histories, Family Histories, Physical, Laboratory Exams, Radiological report, Impressions, and Recommendations. These sections have headings, but each EHR system follows its own naming convention and hierarchy for section headings. For example, the chief complaint section may be indicated by the headings “chief complaint,” “presenting complaint(s),” “presenting problem(s),” “reason for encounter,” or even the abbreviation “CC” in different EHRs. A repository that maps section headings to normalized section names is maintained, and a common section name is derived using this mapping. A linear-chain Conditional Random Field (CRF) model is used to recognize the boundaries of sections sequentially.
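
The heading-normalization step can be sketched as a lookup table. The aliases below are the ones listed above; the normalized name `chief_complaint` and the function `normalize_heading` are assumptions for illustration, and the CRF boundary model is not shown.

```python
import re

# Maps cleaned heading text to a normalized section name.
SECTION_ALIASES = {
    "chief complaint": "chief_complaint",
    "presenting complaint": "chief_complaint",
    "presenting problem": "chief_complaint",
    "reason for encounter": "chief_complaint",
    "cc": "chief_complaint",
}


def normalize_heading(heading):
    """Return the normalized section name, or None for an unknown heading."""
    key = heading.lower()
    key = re.sub(r"\(s\)", "", key)         # "complaint(s)" -> "complaint"
    key = re.sub(r"[^a-z ]", "", key)       # drop punctuation such as ":"
    key = re.sub(r"\s+", " ", key).strip()  # collapse repeated spaces
    return SECTION_ALIASES.get(key)


normalized = [normalize_heading(h) for h in ("Presenting Complaint(s):", "CC", "Vitals")]
```

Headings that return `None` are candidates for adding to the repository.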

Patient Demographics Detection

Patient demographics like age, gender, location, ethnicity, and race are essential factors in individual and population health analysis. These attributes are extracted from the Progress Note using Named Entity Recognition.

Provider Information Detection

Provider name, designation, and degree are essential attributes extracted from EHR. These are extracted similarly to demographics using custom Named Entity Recognition.

Concept Recognition and Detection

Natural Language Processing breaks down the text into smaller components, identifies the links between the pieces, and explores how they are combined to create meaning. Named Entity Recognition (NER) is one of the vital entity detection methods in NLP. It automatically detects concepts in free text and classifies tokens (words) into pre-defined categories.

Entities are extracted from biomedical documents using a query database and named entity recognition, and each entity is then linked to a concept in a biomedical database such as UMLS (Unified Medical Language System). UMLS is a set of files and software that combines many health and biomedical vocabularies and standards to enable interoperability between computer systems. UMLS contains 1.2M different concepts; on average, two different names are assigned to each concept. For example, the concept with id C0006826 has 16 different assigned names, including cancer, tumor, malignant neoplasm, malignancy, and disease. On average, 90% of these names link to more than one concept in UMLS. Consequently, a detected entity cannot be reliably linked to a biomedical concept based on its name alone. A concept database (CDB) and a vocabulary (VCB) file are required for linking the extracted biomedical entity to a database.

The Concept Database (CDB) is built from a biomedical dictionary (e.g., UMLS or SNOMED-CT) and stores all the linking-related information such as name, sub-name, Concept Unique Identifier (CUI), and type ID. All the concepts we want to identify and link to are stored there.

The Vocabulary (VCB) is a list of all terms that might appear in the texts we wish to annotate. In our case, it is primarily used for spell-checking.
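
The toy sketch below shows why a name alone is not enough for linking, and how a vocabulary can correct a misspelled mention before lookup. Only C0006826 and its names come from the text above; the CUI `X0000001` and its names are made-up placeholders, and `candidates` is a hypothetical helper rather than any library’s API.

```python
import difflib

# Toy concept database: CUI -> set of names.
# "X0000001" is an invented placeholder CUI used to create an ambiguous name.
CDB = {
    "C0006826": {"cancer", "tumor", "malignant neoplasm", "malignancy"},
    "X0000001": {"tumor", "swelling"},
}
# Toy vocabulary: every name known to the concept database.
VOCAB = sorted({name for names in CDB.values() for name in names})


def candidates(mention):
    """Return (spell-corrected mention, sorted list of candidate CUIs)."""
    term = mention.lower().strip()
    if term not in VOCAB:
        # Fall back to the closest vocabulary entry, if one is similar enough.
        close = difflib.get_close_matches(term, VOCAB, n=1, cutoff=0.8)
        term = close[0] if close else term
    cuis = sorted(cui for cui, names in CDB.items() if term in names)
    return term, cuis


unambiguous = candidates("malignency")  # misspelled mention
ambiguous = candidates("Tumor")         # one name, two candidate concepts
```

When more than one CUI comes back, the surrounding context has to decide between them, which is the disambiguation step a linking model performs.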

A Knowledge graph can be built using Named Entities (nodes) and Relation Classification (edges). Such a knowledge graph can be used for various purposes, like predictive analytics.

Here is the clean text OCR portion of an EHR –

After NER has detected all the medical terms –

After linking, each detected entity is associated with a concept in the biomedical database (UMLS). The concept IDs are, in turn, associated with disease codes such as ICD-10 codes.

Drug-Disease Mapping

The extracted disease entity is linked to its targeted drug entity if the drug is prescribed. Drug-disease associations, drug features, and disease semantic information are used to detect this. The drug detection model detects multiple drug-related concepts such as dosage, drug names, duration, form, frequency, route of administration, and strength.
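
To show the kind of structured output such a drug model produces, here is a rule-based stand-in using a regular expression. It is not the detection model described above: the pattern, the small lists of units and abbreviations, and the example sentence are all assumptions, and it only handles one fixed word order.

```python
import re

# Matches "<drug> <strength> <route> <frequency>", e.g. "metformin 500 mg PO BID".
PRESCRIPTION = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+"
    r"(?P<strength>\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml))\s+"
    r"(?P<route>PO|IV|IM|SC)\s+"
    r"(?P<frequency>QD|BID|TID|QID|PRN)",
    re.IGNORECASE,
)


def parse_prescription(line):
    """Return the drug attributes found in a line, or None when nothing matches."""
    match = PRESCRIPTION.search(line)
    return match.groupdict() if match else None


parsed = parse_prescription("Start metformin 500 mg PO BID for diabetes")
```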

Negation and History Detection

The challenge with clinical NLP is that the meaning of a clinical entity is heavily influenced by modifiers such as negation. A mention of a disease in a biomedical document does not necessarily imply that the patient suffers from that disease. Because documents describe a diagnostic process, a document may detail a test performed to determine whether a patient has a condition, discuss the arguments for and against a diagnosis, or state that a family member has the condition. Detecting negated and historical entities in EHRs is therefore essential for reliable clinical decision support and knowledge extraction.
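
A very small trigger-phrase approach, in the spirit of rule-based negation detectors, illustrates the idea. The trigger lists, the five-word window, and the function name are assumptions for this sketch; a real system needs far larger lists and handles scope more carefully.

```python
import re

NEGATION_TRIGGERS = ["no", "denies", "without", "negative for", "no evidence of"]
FAMILY_TRIGGERS = ["family history of", "mother has", "father has"]


def classify_mention(sentence, entity, window=5):
    """Label an entity 'negated', 'family', or 'affirmed' from the words before it."""
    text = sentence.lower()
    position = text.find(entity.lower())
    if position < 0:
        raise ValueError("entity not found in sentence")
    # Only the few words immediately before the entity are inspected.
    preceding = " ".join(text[:position].split()[-window:])

    def has(triggers):
        return any(re.search(rf"\b{re.escape(t)}\b", preceding) for t in triggers)

    if has(FAMILY_TRIGGERS):
        return "family"
    if has(NEGATION_TRIGGERS):
        return "negated"
    return "affirmed"


labels = [
    classify_mention("Patient denies chest pain.", "chest pain"),
    classify_mention("Family history of diabetes.", "diabetes"),
    classify_mention("Patient has hypertension.", "hypertension"),
]
```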

Here is the output of negation detection –

Signature and Handwritten Text Detection

Payers treat a Progress Note as valid only if the Provider has signed it. The Provider can sign the document physically or electronically. An electronic signature is indicated by the text “Electronically signed by” at the end of the progress note.
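
Detecting the electronic signature line is a simple text match. The sketch below assumes the signer’s name follows the phrase on the same line; the pattern and the function `find_signer` are illustrative, and physical signatures need image-based detection that is not shown here.

```python
import re

SIGNATURE = re.compile(
    r"electronically signed by[:\s]+(?P<signer>[^\n]+)",
    re.IGNORECASE,
)


def find_signer(note_text):
    """Return the name after 'Electronically signed by', or None if unsigned."""
    match = SIGNATURE.search(note_text)
    return match.group("signer").strip() if match else None


note = "Plan: follow up in 2 weeks.\nElectronically signed by: Jane Doe, MD\n"
signer = find_signer(note)
```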

Conclusion

Electronic Health Records contain a great deal of free-form, conversational, unstructured information about the patient. Once this information is converted to structured data, it can support a variety of use cases.

Patient and Provider Data Analytics

The data extracted from the EHR, combined with claims, enrolment, and member data, can be used to draw insights into the patient journey and historical trends using descriptive analytics. This data can also be used for predictive analytics to understand patients’ proclivity towards potential diseases and rehospitalization. The data from many patients can be used for population analytics and to explore Social Determinants of Health.

HEDIS Hybrid Measurements

HEDIS quality measurements are used to track value-based care by providers, and ratings are assigned to providers based on compliance. A few of the hybrid measures require attributes that are captured only in EHR documents. Combining these attributes with data from claims allows Payers to track hybrid HEDIS measures.

Payment Integrity

EHR contains details of services provided during a visit. These details can be compared against the claims data to ensure that the payment claims by providers match the services provided during the patient visit.
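
At its simplest, this comparison is a set difference between the codes on the claim and the codes extracted from the visit record. The codes below are placeholders rather than real procedure codes, and `compare_services` is a hypothetical helper.

```python
def compare_services(claimed_codes, documented_codes):
    """Compare procedure codes on a claim with those extracted from the EHR."""
    claimed, documented = set(claimed_codes), set(documented_codes)
    return {
        # Billed but not found in the visit documentation.
        "unsupported": sorted(claimed - documented),
        # Documented in the visit but missing from the claim.
        "unbilled": sorted(documented - claimed),
    }


report = compare_services(
    claimed_codes=["PROC-A", "PROC-B", "PROC-C"],
    documented_codes=["PROC-A", "PROC-C", "PROC-D"],
)
```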

Risk Adjustment Coding

HCC (Hierarchical Condition Category) codes are assigned to some ICD-10 codes that represent high-risk diseases. These codes allow future healthcare costs for patients to be estimated. Under some government programs, Payers receive additional compensation for treating high-risk patients. Risk Adjustment Coding using EHR data documents high-risk patients and their treatments and allows payers to be compensated accordingly.
