The Biomedical Advanced Research and Development Authority (BARDA), in collaboration with other Department of Health and Human Services (HHS) partners, on April 6 announced a total of $200,000 in awards to the two winners of the Pediatric COVID-19 Data Challenge.
The Pediatric COVID-19 Data Challenge is sponsored by BARDA, in partnership with the National Institutes of Health’s National Center for Advancing Translational Sciences (NCATS), the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD), and the Health Resources and Services Administration’s (HRSA) Maternal and Child Health Bureau. Administration and quantitative analysis of the challenge were managed by Sage Bionetworks.
The Challenge was designed to help equip healthcare providers with the information and tools they need to identify pediatric patients at risk, implement earlier interventions, and improve patient outcomes. The goal was to unlock critical insights hidden in the healthcare data ecosystem.
The healthcare ecosystem continuously generates detailed data in electronic health records (EHRs) through the various patient touchpoints with medical providers. This real-world data represents incalculable potential to help providers triage care in overcrowded emergency departments (EDs) and clinics and provide real-time trends across the nation.
“Data captured within EHR systems offer robust and comprehensive views of a patient’s medical journey,” noted Sandeep Patel, PhD, Director of BARDA’s Division of Research Innovation and Ventures (DRIVe). “Leveraging this commonly available healthcare data to better identify risk of disease severity, especially in the context of a pandemic like COVID-19, is one of the many ways organizations can drive innovation in care delivery.”
Patel also noted that the challenge’s focus on near real-time analysis of data could not have been timelier for its target patient population.
“We launched the Pediatric COVID-19 Data Challenge during a critical time in the pandemic, when pediatric patients largely didn’t have access to vaccinations and were highly vulnerable as a result,” Patel said. “This challenge provided an opportunity to develop and test algorithms from commonly available data that could empower clinicians with better insights to predict severe outcomes and hospitalizations more accurately so they can make critical decisions to reduce hospital burden and improve pediatric patient outcomes. We are excited about the potential for the results of this challenge to create new capabilities that can be available in the future.”
Forging Multidisciplinary Partnerships to Fuel the Creation of Innovative Computational Models
The challenge asked participants to develop, train, and validate computational models to predict and identify pediatric patients at risk for hospitalization, ventilation, and cardiovascular interventions, using the de-identified electronic health record data available through the NCATS National COVID Cohort Collaborative (N3C) Data Enclave, which is the largest COVID dataset in the U.S. This de-identified NCATS data set combined diverse data types, such as demographics, diagnoses, medications, procedures, laboratory results, vital signs and county-level social determinants of health.
Participants were asked to address two tasks using the data provided in the N3C Data Enclave to develop their computational models. More than 200 participants joined, 88 teams were formed, and 55 models were submitted for both tasks. Participants worked in both small and large teams and included academic institutions, large and small businesses, and citizen scientists. Submitted models were scored for model performance and generalizability, feature interpretation, method clarity, timeliness of predictions, clinical utility, and reproducibility, among other evaluation metrics. The evaluation of the computational models was also an unprecedented collaboration across government. Program officials, subject matter experts, clinicians, and data scientists from four agencies interrogated the most promising models to identify the most quantitatively and qualitatively useful model to assess pediatric COVID-19 severity. The highest-scoring model in each task was selected as the winner.
A Model for Predicting Need for Hospitalization
In Task 1, teams developed computational models to predict the need for hospitalization among pediatric patients who test positive for COVID-19 in an outpatient setting.
The winning team of Task 1 was the Department of Biostatistics & Medical Informatics (BMI) at the University of Wisconsin-Madison. UW-Madison will be awarded $100,000 for their high-performing gradient boosting method and handcrafted features extracted from multisite EHR data.
The team tailored a widely used machine learning approach (gradient boosting), reduced the dimensionality of EHR data, and enhanced model interpretability by summarizing patients’ medical conditions and drug exposures using medically meaningful concepts such as International Classification of Diseases (ICD-10) and Anatomical Therapeutic Chemical (ATC) codes. Their model achieved the best quantitative performance of the scored models. The team also restricted laboratory inputs to a subset of COVID-19 related measurements, using recent values recorded prior to the patient’s COVID-19 diagnosis, and customized the model training and tuning procedure so that the model was resistant to sample size bias, making it more generalizable across multiple sites.
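As a rough illustration of how summarizing diagnoses by coded concept can shrink the number of model inputs, the Python sketch below rolls detailed ICD-10 codes up to their three-character categories and counts them per patient. The function names and the example patient are hypothetical, and the grouping level is an assumption made for illustration; the source does not describe the exact roll-up the UW-Madison team used.

```python
from collections import Counter


def icd10_category(code: str) -> str:
    """Return the three-character ICD-10 category, e.g. 'J45.909' -> 'J45'."""
    return code.replace(".", "").upper()[:3]


def summarize_conditions(codes):
    """Count a patient's diagnoses per ICD-10 category (one feature per category)."""
    return Counter(icd10_category(c) for c in codes)


# Hypothetical patient record: four detailed codes collapse to two features.
patient_codes = ["J45.909", "J45.20", "E11.9", "j45.901"]
features = summarize_conditions(patient_codes)
print(dict(features))  # {'J45': 3, 'E11': 1}
```

Grouping codes this way trades detail for fewer, denser features, and each feature still maps to a condition category a clinician can recognize.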
Post-challenge, the team is interested in refining the model to tie into therapeutic interventions for high-risk groups and incorporate additional information such as clinical notes into the model.
“We really appreciate all the efforts frontline workers do to protect us from COVID-19. As biostatisticians and data scientists, we also want to make a little contribution to the fight against COVID-19,” said Guanhua Chen, PhD, a member of the University of Wisconsin team. “We hope models like ours can be further refined and implemented in practice to improve health care delivery.”
A Model for Predicting Need for Respiratory and Cardiovascular Interventions
Vir Biotechnology, Inc., the winning team of Task 2, was awarded $100,000 for their high-performing gradient boosted tree classifier, capable of extracting patterns from the complex set of EHRs. The team focused on laboratory measurements, disease conditions and past medical interventions, applying manual data cleaning, creating new aggregate variables, and further harmonizing the data model.
Not only did this group have the highest quantitative score, they also employed a missingness-aware classifier that learns from patterns of data availability and avoids imputing missing data, and they checked for overfitting by evaluating their trained classifier. When the model was evaluated in a simulated live clinical scenario, it maintained its high performance.
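One common way to let a classifier learn from data availability without imputing values is to leave missing measurements as NaN and add an explicit indicator for each one. The Python sketch below shows that idea only; the lab names and the helper function are hypothetical, and the source does not say how the Vir Biotechnology team encoded missingness.

```python
import math


def with_missing_indicators(row, lab_names):
    """Keep missing labs as NaN (no imputation) and add a 0/1 availability flag."""
    out = {}
    for name in lab_names:
        value = row.get(name)
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        out[name] = float("nan") if missing else float(value)
        out[f"{name}_missing"] = 1 if missing else 0
    return out


# Hypothetical patient: one lab recorded, one recorded as empty, one never ordered.
labs = ["crp", "wbc", "ferritin"]
encoded = with_missing_indicators({"crp": None, "wbc": 7.2}, labs)
print({k: v for k, v in encoded.items() if k.endswith("_missing")})
```

Tree-based learners that accept NaN inputs can then split on both the measured value and the fact that a test was never ordered, which may itself carry clinical signal.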
The team hopes to further evaluate the model in clinics and create standards and privacy-preserving analytics to foster a new generation of decision support tools. They envision similar models in the future with the ability to accurately forecast the burden of disease for patients and hospital systems to become critical components of pandemic preparedness and real-time response.
Honorable Mentions Emphasize that These Challenges Impact Everyone, but Solutions Can Come from Anyone
A team from the Oregon Health & Science University received an Honorable Mention for Feature Interpretability & Design. The team used a common set of predictors, including demographics, laboratory values and associated diagnosis codes, to build an ensemble classifier that combined individual predictions from logistic regression, random forest, gradient boosted tree, and artificial neural network models. They used Shapley Additive Explanations (SHAP) values to provide individual-level and population-level explanations for model predictions.
This high-performing approach provides clinicians with an outcome prediction and an individualized explanation with predictors for intervention. The team began to explore how the model could be applied to patient populations to help clinicians prioritize allocation of monoclonal antibodies and would like to further optimize their model to address different sub-populations that may have underlying biases (e.g., racial or socioeconomic disparities), as well as validate their model further to provide early intervention to high-risk children to prevent severe outcomes.
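To make the ensemble idea concrete, the Python sketch below combines risk estimates from four model types by weighted averaging, often called soft voting. The probabilities are made-up values for a single hypothetical patient, and averaging is an assumption for illustration; the source does not state how the Oregon Health & Science University team combined its individual predictions.

```python
def soft_vote(model_probs, weights=None):
    """Combine per-model risk probabilities into one weighted average."""
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs) or sum(weights) <= 0:
        raise ValueError("weights must match the models and sum to a positive value")
    return sum(w * p for w, p in zip(weights, model_probs)) / sum(weights)


# Hypothetical predicted probabilities of hospitalization for one patient.
probs = {
    "logistic_regression": 0.20,
    "random_forest": 0.35,
    "gradient_boosted_tree": 0.40,
    "neural_network": 0.25,
}
risk = soft_vote(list(probs.values()))
print(round(risk, 2))  # 0.3
```

Passing unequal weights would let a stronger model count for more in the combined estimate.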
A retired physicist and electrical engineer at Wind City Applied Research in rural New Hampshire, B. L. Cragin, PhD, received an Honorable Mention for Clinical Utility. Cragin noticed that model features derived from existing electronic health record codesets, as defined in the National COVID Cohort Collaborative Data Enclave, gave consistently better performance than those based on machine-selected codes, allowing him to also benefit from the extensive clinical expertise of that community.
In developing his model, he applied XGBoost, an open-source “extreme gradient boosting” algorithm that has proven to be a top performer in earlier predictive modeling challenges. The XGBoost code also facilitated the introduction of a modern Shapley value analysis technique that generalizes a model’s feature importance vector to a feature importance matrix, each row of which applies to an individual case or patient. This allows clinicians to identify specific population sub-cohorts for which any given feature is expected to be an especially good indicator of increased risk. Cragin hopes to establish an informal association with an existing academic or industry research team and join their effort to make a marketable product.
“For the past 10 years, I have spent time in my retirement furthering data models for the public good,” Cragin said. “By leveraging existing feature sets from the National COVID Cohort Consortium Data Enclave’s existing codesets, I was able to use clinical information created by others to develop a model that is both quantitatively and qualitatively useful for clinicians.”
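The per-patient feature importance matrix can be illustrated with a small Python sketch that computes exact Shapley values by enumerating feature subsets. The toy risk model, its coefficients, the feature meanings, and the all-zeros baseline are all hypothetical; this is not Cragin's model or XGBoost's own implementation, which uses a faster tree-specific algorithm.

```python
from itertools import combinations
from math import factorial


def shapley_row(model, x, baseline):
    """Exact Shapley values for one patient; absent features take baseline values."""
    n = len(x)

    def value(subset):
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical risk score with an interaction between features 0 and 2."""
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] + 0.4 * z[0] * z[2]


patients = [[1, 1, 1], [1, 0, 0], [0, 1, 1]]
baseline = [0, 0, 0]
# One row per patient, one column per feature.
matrix = [shapley_row(toy_risk, p, baseline) for p in patients]
for row in matrix:
    print([round(v, 3) for v in row])
```

Each row sums to that patient's score minus the baseline score, so the matrix shows which features drive risk for which patients rather than a single global ranking.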
A team from ARI Science received an Honorable Mention for Computational Methodology. The team took into account clinical and laboratory indicators from pre-visit and during-visit data, which were normalized by age, gender and other demographic attributes and fed into Random Forest, Neural Network, Regression-based, Naïve Bayes and Neighborhood-based artificial intelligence (AI) models to create ensembles of predictions. The team hopes that the sub-model of their ensemble of ensembles can identify the highest-risk children even prior to COVID-19 infection.
“The way we designed our AI-based disease severity prediction algorithm can be applied to specific age cohorts, to unvaccinated populations pre-COVID exposure, and to other diseases due to the flexible AI architecture we created,” said Joy Alamgir of ARI Science. “One can think of our ensemble of ensemble approach as plug-and-play for structured clinical data.”
A Brighter, Collective Future for Pediatric COVID-19 Patients
Although geared toward driving innovation in care delivery for pediatric COVID-19 patients specifically, the computational models submitted by design teams have the potential to be further developed and validated for use in ED settings, and could be applicable against future public health threats. Together with our partners, DRIVe continues to build out an ecosystem of restless innovation, driven by industry and the entrepreneurial community, to address the nation's greatest health security threats.
“We have a great deal to learn about how COVID-19 infection affects children,” said Alison Cernich, PhD, deputy director of the NICHD. “Our hope is that the winning computational models will allow us to prepare for the most severely ill cases so that we can refine the interventions needed to help them.”