A large collaborative effort led by researchers at Argonne National Laboratory combines artificial intelligence with physics-based drug docking and molecular dynamics simulations to rapidly home in on the most promising molecules to test in the lab.
Doing so turns the challenge into a data-oriented, or machine-learning-oriented, problem, said Arvind Ramanathan, a computational biologist in the Data Science and Learning Division at the U.S. Department of Energy's (DOE) Argonne National Laboratory and a senior scientist at the University of Chicago Consortium for Advanced Science and Engineering (CASE).
"We're trying to build infrastructure to integrate AI and machine learning tools with physics-based tools," Ramanathan said. "We bridge those two approaches to get a better bang for the buck."
The project is using several of the most powerful supercomputers on the planet to run millions of simulations, train the machine learning system to identify the factors that might make a given molecule a good candidate, and then do further explorations on the most promising results. The Texas Advanced Computing Center (TACC) and its Frontera supercomputer in particular have been critical for the team's work, Ramanathan said.
The team began by exploring one of the smaller of the 24 proteins that the COVID-19 virus produces, ADRP (adenosine diphosphate ribose 1″-phosphatase). Scientists do not entirely understand the protein's function, but it is implicated in viral replication.
Their deep-learning plus physics-based method is allowing them to reduce 1 billion possible molecules to 250 million; 250 million to 6 million; and 6 million to a few thousand. Of those, they selected the 30 or so with the highest "score" in terms of their ability to bind strongly to the protein and disrupt its structure and dynamics — the ultimate goal.
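The narrowing described above can be pictured as a funnel in which each stage scores the surviving molecules and keeps only the best. The following is a minimal Python sketch of that idea, not the team's code. The three scoring functions are hypothetical placeholders that return arbitrary but repeatable numbers. The library size and cutoffs are toy values, far smaller than the project's.

```python
import zlib
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str], float]


def _toy_score(stage: str, molecule: str) -> float:
    """Deterministic placeholder score in [0, 1); carries no chemical meaning."""
    return zlib.crc32(f"{stage}:{molecule}".encode()) / 2**32


# Hypothetical scorers, ordered from cheapest to most expensive to evaluate.
def surrogate_score(molecule: str) -> float:
    return _toy_score("ml-surrogate", molecule)


def docking_score(molecule: str) -> float:
    return _toy_score("docking", molecule)


def md_score(molecule: str) -> float:
    return _toy_score("molecular-dynamics", molecule)


def run_funnel(
    candidates: Sequence[str], stages: Sequence[Tuple[Scorer, int]]
) -> List[str]:
    """Apply each (scorer, keep) stage in order, keeping the top `keep` each time."""
    survivors = list(candidates)
    for scorer, keep in stages:
        survivors = sorted(survivors, key=scorer, reverse=True)[:keep]
    return survivors


if __name__ == "__main__":
    # Toy library; the names are made up for illustration.
    library = [f"mol-{i}" for i in range(10_000)]
    stages = [(surrogate_score, 2_500), (docking_score, 60), (md_score, 30)]
    shortlist = run_funnel(library, stages)
    print(len(shortlist), "molecules shortlisted")
```

Ordering the stages from cheapest to most expensive means the costly physics-based scoring runs only on molecules that survived the earlier, faster filters.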
They recently shared their results with experimental collaborators at the University of Chicago and the Frederick National Laboratory for Cancer Research for testing in the lab. They will soon publish their data in an open-access report so that thousands of teams can analyze the results and gain insights. Results of the lab experiments will further inform the deep learning models, helping fine-tune predictions for future protein-drug interactions.
The team has since moved on to the COVID-19 main protease, which plays an essential role in translating the viral RNA, and will soon begin work on larger proteins, which are more challenging to compute but may prove important.
The team's work uses DeepDriveMD (Deep-Learning-Driven Adaptive Molecular Simulations for Protein Folding), a cutting-edge toolkit jointly developed by Ramanathan's team at Argonne and Shantenu Jha's team at Rutgers University/Brookhaven National Laboratory (BNL), originally as part of the Exascale Computing Project.
Ramanathan and his collaborators are not the only researchers applying machine and deep learning to the COVID-19 drug discovery problem. But he says their approach is rare in the degree to which AI and simulation are tightly integrated and iterative, and not just used post-simulation.
"We built the toolkit to do the deep learning online, enabling it to sample as we go along," Ramanathan said. "We first train it with some data, then allow it to infer on incoming simulation data very quickly. Then, based on the new snapshots it identifies, the approach automatically decides if the training needs to be revised."
The system first establishes the binding stability of potential molecules in a fairly simple way, then adds more and more complex elements, like water, or performs finer analyses of the energy profile of the system. "Information is added at different funneling points and based on the results, it might need to revise the docking or machine learning algorithms."
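The train, infer, and revise cycle Ramanathan describes can be sketched in a few lines of Python. This is only an illustration under loud assumptions, and it does not use the DeepDriveMD API. A simple statistical outlier detector stands in for the deep learning model, each simulation snapshot is reduced to a single number, and the retraining threshold is a made-up parameter.

```python
import statistics
from typing import Iterable, List, Tuple


class NoveltyModel:
    """Toy stand-in for a learned model: flags values far from the training mean."""

    def __init__(self, z_cutoff: float = 3.0) -> None:
        self.z_cutoff = z_cutoff
        self.mean = 0.0
        self.std = 1.0

    def train(self, data: List[float]) -> None:
        self.mean = statistics.fmean(data)
        # Fall back to 1.0 so a constant training set does not divide by zero.
        self.std = statistics.pstdev(data) or 1.0

    def is_novel(self, snapshot: float) -> bool:
        return abs(snapshot - self.mean) / self.std > self.z_cutoff


def adaptive_loop(
    initial: List[float],
    batches: Iterable[List[float]],
    retrain_fraction: float = 0.2,  # hypothetical threshold
) -> Tuple[NoveltyModel, int]:
    """Train once, infer on each incoming batch, retrain when novelty is high."""
    model = NoveltyModel()
    seen = list(initial)
    model.train(seen)
    retrain_count = 0
    for batch in batches:
        if not batch:
            continue
        novel = [s for s in batch if model.is_novel(s)]  # fast inference step
        seen.extend(batch)
        if len(novel) / len(batch) > retrain_fraction:
            model.train(seen)  # revise the model with everything seen so far
            retrain_count += 1
    return model, retrain_count


if __name__ == "__main__":
    start = [0.0, 1.0, -1.0, 0.5, -0.5]
    incoming = [[0.1, -0.2, 0.3, 0.0], [10.0, 11.0, 9.5, 0.2]]
    _, retrains = adaptive_loop(start, incoming)
    print("retrained", retrains, "time(s)")
```

In this toy run, the first batch resembles the training data and leaves the model alone, while the second batch contains mostly unfamiliar snapshots and triggers a retraining.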
Its complex workflows are carefully orchestrated across multiple supercomputers using RADICAL-Cybertools, advanced workload execution and scheduling tools developed by computational experts at Rutgers/BNL.
"The workflows have sophisticated requirements," said Shantenu Jha, chair of BNL's Center for Data-Driven Discovery and the lead of RADICAL. "Thanks to TACC's technical support, we were able to achieve both the desired levels of throughput and scale on Frontera and Longhorn within a couple of days and start production runs."
The team had some advantages in getting the research off the ground. The U.S. Department of Energy operates some of the most advanced X-ray crystallography labs in the world, and collaborates with many others. Those labs were able to quickly extract the 3D structures of many of the COVID-19 proteins, the first step in doing computational modeling to explore how such proteins respond to drug-like molecules.
Though AI is frequently considered a black box, Ramanathan says their methods do not just blindly generate a list of targets. DeepDriveMD deduces what common aspects of a protein make it a better candidate, and communicates those insights to researchers to help them understand what is actually happening in the virus with and without drug interactions.
"Our deep learning models can hone in on chemical groups that we think are critical for interactions," he said. "We don't know if it's true, but we find docking scores are higher and believe it captures important concepts. This is not just important for what happens with this virus. We're also trying to understand how viruses work generally."
Once a drug-like small molecule is found to be effective in the lab, further testing (computational and experimental) is required to go from a promising target to a cure.
"Developing vaccines takes such a long time because molecules need to be optimized for function. They must be studied to determine that they're not toxic and don't do other harm, and also that they can be produced at scale," Ramanathan said.
All of these further steps, the researchers believe, can be accelerated by the use of a hybrid AI- and physics-based modeling approach.