A natural language processing pipeline for identifying pediatric long COVID symptoms and functional impacts in freeform clinical notes: a RECOVER study

Sep 8, 2025, JAMIA Open

Using language analysis to find long COVID symptoms and effects in children's medical notes

AI simplified

Abstract

A natural language processing pipeline analyzing health records showed moderate accuracy when its assertions were compared with those of subject matter experts.

  • The analysis included 48,287 outpatient notes from 10,618 pediatric patients.
  • Notes were evaluated if written 28 to 179 days after a COVID-19 diagnosis.
  • The pipeline identified 21 symptoms and 4 functional impact categories related to Long COVID.
  • Long COVID concept categories appeared more frequently in patients with Long COVID compared to those with acute COVID.
  • Differences were observed between symptoms identified in clinical notes versus structured data.
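
The note-selection window described above can be sketched as a simple date check. This is a minimal illustration, assuming both bounds (28 and 179 days) are inclusive, which the summary does not state; the function and constant names are hypothetical.

```python
from datetime import date, timedelta

# Window bounds taken from the summary; treating both as inclusive is an assumption.
WINDOW_START_DAYS = 28
WINDOW_END_DAYS = 179

def in_window(note_date: date, diagnosis_date: date) -> bool:
    """Return True if the note falls 28 to 179 days after the COVID-19 diagnosis."""
    days_since_diagnosis = (note_date - diagnosis_date).days
    return WINDOW_START_DAYS <= days_since_diagnosis <= WINDOW_END_DAYS

# Hypothetical dates, not study data
diagnosis = date(2021, 1, 1)
early_note = diagnosis + timedelta(days=10)
eligible_note = diagnosis + timedelta(days=60)
print(in_window(early_note, diagnosis), in_window(eligible_note, diagnosis))
```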

Key numbers

7,771
Patients with Long COVID identified
Out of 476,231 pediatric COVID patients in the RECOVER database.
48,287
Notes analyzed
From 10,618 pediatric patients across 12 institutions.
0.90
F1 score for high-confidence assertions
Based on 2,043 high-confidence assertions evaluated.

Key figures

Figure 1.
Steps in a natural language processing pipeline for identifying Long COVID symptoms in clinical notes
Frames the structured process used to extract Long COVID symptom features from unstructured clinical text
  • Single schematic panel
    Pipeline starts with note selection and text cleaning, followed by a Spark NLP pipeline with sentence detection, tokenization, , three named entity recognition (NER) models, and an assertion model; then applies regular expression term filters and assertion filters
Figure 2.
Long COVID vs COVID: prevalence and odds ratios of symptoms and functional impacts.
Highlights which symptoms like myalgia and physical impairments are notably more common in Long COVID versus COVID patients.
  • Single panel
    Adjusted odds ratios (aOR) and 95% confidence intervals (CI) for 25 Long COVID features comparing Long COVID to COVID patients; features with aOR above 1 and CIs not crossing 1 are significantly more prevalent in Long COVID.
Figure 3.
Prevalence of symptoms in clinical notes versus structured data
Highlights that many Long COVID symptoms appear more frequently in clinical notes than structured data, emphasizing notes’ value
  • Single panel
    Adjusted odds ratios (aOR) with 95% confidence intervals (CI) for Long COVID features comparing prevalence in notes versus structured data; features with aOR > 1 and CIs not crossing 1 are more common in notes, including Irritability (aOR 32.03), Appetite Loss (15.1), and Fever (11.71); features with aOR < 1 and CIs not crossing 1 are more common in structured data, including Physical Impairments (0.1) and Skin Symptoms (0.09)
Figure 4.
Feature distribution between structured and unstructured data across patient symptoms and impacts
Highlights varying agreement levels between structured and unstructured data, emphasizing the value of combining both for symptom identification
  • Single heatmap
    Heatmap shows agreement levels between structured (rows) and unstructured (columns) data for 25 symptom and functional impact features; diagonal cells have highest agreement (darker red) indicating matching features across data types
Figure 5.
Prevalence of patient symptoms by data source: unstructured only, structured only, or both.
Highlights how unstructured data captures many symptoms missed by structured records, emphasizing data source complementarity.
  • Single panel
    Bars show patient counts for each symptom concept from unstructured data only (orange), structured data only (blue), or both sources (yellow). Pain has the highest counts, with many patients identified only in unstructured data.
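
The three-way split shown in Figure 5 amounts to set arithmetic over patient identifiers. The sketch below is illustrative only: the function name and the toy patient IDs are hypothetical and are not drawn from the study.

```python
def partition_by_source(unstructured_ids, structured_ids):
    """Split patient IDs into unstructured-only, structured-only, and both."""
    unstructured = set(unstructured_ids)
    structured = set(structured_ids)
    return {
        "unstructured_only": unstructured - structured,
        "structured_only": structured - unstructured,
        "both": unstructured & structured,
    }

# Hypothetical toy IDs for one symptom concept (not study data)
notes_patients = ["p1", "p2", "p3", "p4"]   # concept found by NLP in notes
coded_patients = ["p3", "p4", "p5"]         # concept found in structured codes

groups = partition_by_source(notes_patients, coded_patients)
counts = {name: len(ids) for name, ids in groups.items()}
print(counts)  # {'unstructured_only': 2, 'structured_only': 1, 'both': 2}
```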

Full Text

What this is

  • This research developed a natural language processing (NLP) pipeline to analyze pediatric clinical notes for identifying Long COVID symptoms and functional impacts.
  • The study analyzed 48,287 outpatient progress notes from 10,618 pediatric patients across 12 institutions.
  • It compared symptom prevalence between patients with Long COVID and those with acute COVID but no Long COVID, showing that unstructured notes capture symptoms that structured data miss.
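
The final stages of the pipeline, as described in Figure 1, apply regular expression term filters and assertion filters to the extracted mentions. A minimal sketch of that idea follows. The term patterns, assertion labels, and mention format are all hypothetical; the study's actual term lists and model outputs are not given in this summary.

```python
import re

# Hypothetical term patterns per concept (assumed for illustration only)
TERM_PATTERNS = {
    "Fatigue": re.compile(r"\b(fatigue|tired(ness)?|exhaust(ed|ion))\b", re.IGNORECASE),
    "Headache": re.compile(r"\b(headaches?|migraines?)\b", re.IGNORECASE),
}

# Hypothetical assertion labels to keep; real label names may differ
KEEP_ASSERTIONS = {"present"}

def filter_mentions(mentions):
    """Keep mentions that match their concept's term pattern and are asserted as present."""
    kept = []
    for mention in mentions:
        pattern = TERM_PATTERNS.get(mention["concept"])
        if pattern is None or not pattern.search(mention["text"]):
            continue  # term filter: drop text that does not match the concept's terms
        if mention["assertion"] not in KEEP_ASSERTIONS:
            continue  # assertion filter: drop negated or otherwise non-present mentions
        kept.append(mention)
    return kept

# Toy mentions standing in for NER + assertion model output (not study data)
mentions = [
    {"concept": "Fatigue", "text": "ongoing fatigue", "assertion": "present"},
    {"concept": "Headache", "text": "headaches", "assertion": "absent"},
    {"concept": "Fatigue", "text": "sleep schedule", "assertion": "present"},
]
kept = filter_mentions(mentions)
print([m["text"] for m in kept])  # ['ongoing fatigue']
```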

Essence

  • The NLP pipeline identified 25 clinical concepts related to Long COVID, showing higher prevalence in patients diagnosed with Long COVID compared to those without. This underscores the value of analyzing clinical notes to capture nuanced patient experiences.
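
The figures report these prevalence differences as adjusted odds ratios with 95% confidence intervals. The summary does not describe the adjustment model, so the sketch below shows only the simpler unadjusted odds ratio with a Wald interval from a 2x2 table. The counts are hypothetical, and the result is not comparable to the study's adjusted estimates.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald 95% CI from a 2x2 table.

    a: Long COVID patients with the feature     b: Long COVID patients without it
    c: comparison patients with the feature     d: comparison patients without it
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Wald approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts (not study data)
odds_ratio, lower, upper = odds_ratio_ci(a=30, b=70, c=10, d=90)
print(f"OR={odds_ratio:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

As in the figures, an interval that lies entirely above 1 would indicate the feature is more common in the first group.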

Key takeaways

  • The NLP pipeline demonstrated moderate accuracy in identifying symptoms, with an F1 score of 0.80 overall and 0.90 when considering only high-confidence assertions. This suggests the pipeline can extract relevant clinical information from unstructured notes, most reliably when restricted to high-confidence assertions.
  • Patients with Long COVID exhibited markedly more symptoms and functional impairments in clinical notes compared to those with acute COVID. This finding emphasizes the need for comprehensive data analysis to understand the full impact of Long COVID in pediatric populations.
  • The study revealed that many patients identified through the NLP pipeline were not captured by structured EHR data alone, suggesting that traditional coding may overlook critical symptoms and impairments associated with Long COVID.
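
For readers unfamiliar with the F1 score cited above, it is the harmonic mean of precision and recall. The sketch below computes it from true positive, false positive, and false negative counts. The counts are hypothetical, chosen only so the arithmetic is easy to follow; the summary does not report the study's underlying counts.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall; returns 0.0 when undefined."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of pipeline assertions that experts agreed with
    recall = tp / (tp + fn)     # share of expert assertions that the pipeline found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts (not study data): precision 0.8, recall 0.8
example_f1 = f1_score(tp=80, fp=20, fn=20)
print(round(example_f1, 2))  # 0.8
```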

Caveats

  • The study focused on feature prevalence rather than incidence, which may not adequately capture the evolving nature of symptoms over time. Future research should address this limitation.
  • The NLP pipeline's accuracy was evaluated on a limited sample, which may not fully represent the broader pediatric population. A more extensive validation is necessary for generalizability.
  • Potential biases in the NLP model could arise from the specific language used in clinical notes, which may not be consistent across different institutions or patient demographics.

Definitions

  • Long COVID: An infection-associated chronic condition post-SARS-CoV-2 infection, lasting at least 3 months with varied symptoms affecting multiple organ systems.
