NSF-funded COVID-19 genomic database grows faster than expected

May 7, 2020

Scientists don't yet know how the SARS-CoV-2 virus, which causes COVID-19, will evolve. But a multicenter effort, funded in part by the National Science Foundation (NSF), to collect and analyze large volumes of genomic data in the fight against COVID-19 is proceeding faster than expected.


"These important scientific findings underline the value of NSF's nearly 40-year investment in advanced cyberinfrastructure resources and services to enable national and international research collaborations that address critical problems," said Manish Parashar, director of NSF's Office of Advanced Cyberinfrastructure.


About 100 organizations worldwide, mainly academic labs and genome sequencing facilities, have already contributed genomic data to the study of the pandemic. Genomic data are critical because they help identify how the virus is evolving and, in turn, how it might be stopped.
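
As a rough illustration of what such comparisons involve, the sketch below flags single-nucleotide differences between two already-aligned sequences. The toy sequences and the find_substitutions helper are made up for this example; real pipelines rely on dedicated alignment and variant-calling tools over genomes roughly 30,000 bases long.

```python
# Minimal sketch: flag nucleotide substitutions between a reference
# genome and a sample genome. The sequences are short, made-up
# stand-ins, not real SARS-CoV-2 data.

def find_substitutions(reference: str, sample: str):
    """Return (position, ref_base, sample_base) for each mismatch.

    Assumes the two sequences are already aligned and equal in length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos + 1, ref_base, alt_base)  # 1-based positions, as in VCF files
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base and alt_base != "N"  # skip ambiguous calls
    ]

reference = "ATGGCGTACCTTGACA"  # toy reference, not a real viral sequence
sample    = "ATGGCGTACCTCGACA"  # toy sample with one substitution

for pos, ref, alt in find_substitutions(reference, sample):
    print(f"position {pos}: {ref} -> {alt}")  # prints "position 12: T -> C"
```

Tracking where and how often such substitutions accumulate across thousands of samples is what lets researchers reconstruct the virus's evolutionary paths.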


"The community wasn't expecting this much data this quickly," said Sergei Pond, a biologist at Temple University in Philadelphia.


Pond and Anton Nekrutenko of Penn State are collaborating on the Galaxy project, one of the world's largest and most successful web-based bioinformatics platforms. Galaxy employs the Bridges system at the Pittsburgh Supercomputing Center for genome assembly jobs that require large amounts of shared memory. These systems are allocated through the Extreme Science and Engineering Discovery Environment (XSEDE), an NSF-funded program that awards supercomputer resources and expertise to researchers.


"Galaxy uses open source tools and public cyberinfrastructure for transparent, reproducible analyses of viral datasets -- it's free and promotes good practices," said Nekrutenko, a biochemist and molecular biologist. "We run hundreds of thousands of analyses per month, and we're spiking now in terms of usage and viral analyses."


Since 2013, the Texas Advanced Computing Center (TACC) has powered data analyses for a large percentage of Galaxy users, allowing researchers to solve tough problems quickly and seamlessly in cases where their computers or campus clusters are not sufficient.


"We have quite a user base, and have numerous instances across the world -- the biggest instance here in the US is run out of the Texas Advanced Computing Center," Nekrutenko said.


The researchers perform the majority of their analyses on TACC's Stampede2 and Jetstream supercomputers, using parallel processing and big data analytics.
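
The article does not detail the pipelines themselves, but many large-scale sequence analyses follow an embarrassingly parallel pattern: one independent task per sample, fanned out across cores or nodes. The sketch below shows that pattern with Python's standard multiprocessing module; the per-sample task (GC content on made-up sequences) is a stand-in, not Galaxy code.

```python
# Minimal sketch of per-sample parallelism: each sample is analyzed
# independently, so tasks can be distributed across CPU cores. The
# task and data here are illustrative stand-ins.

from multiprocessing import Pool
import random

def gc_content(sample: tuple[str, str]) -> tuple[str, float]:
    """Per-sample task: fraction of G and C bases in the sequence."""
    name, seq = sample
    return name, (seq.count("G") + seq.count("C")) / len(seq)

def make_toy_sample(i: int) -> tuple[str, str]:
    """Generate a made-up 1,000-base sequence standing in for real data."""
    rng = random.Random(i)  # seeded so the example is reproducible
    return f"sample_{i}", "".join(rng.choice("ACGT") for _ in range(1000))

if __name__ == "__main__":
    samples = [make_toy_sample(i) for i in range(8)]
    with Pool() as pool:  # defaults to one worker process per CPU core
        for name, gc in pool.map(gc_content, samples):
            print(f"{name}: GC fraction {gc:.3f}")
```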


Read more: https://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=300525&org=NSF&fro...