Materials scientists at Lawrence Berkeley National Laboratory (LBNL) – scientists who normally spend their time researching things like high-performance materials for thermoelectrics or battery cathodes – have built a text-mining tool in record time to help the global scientific community synthesize the mountain of scientific literature on COVID-19 being generated every day.
The tool, live at covidscholar.org, uses natural language processing techniques not only to quickly scan and search tens of thousands of research papers, but also to help draw insights and connections that might otherwise not be apparent. The hope is that the tool could eventually enable “automated science,” said Berkeley Lab scientist Gerbrand Ceder, one of the project leads.
COVIDScholar was developed in response to a March 16 call to action from the White House Office of Science and Technology Policy that asked artificial intelligence experts to develop new data and text mining techniques to help find answers to key questions about COVID-19.
The Berkeley Lab team had a prototype of COVIDScholar up and running within about a week. Now, a little more than a month later, it has collected more than 61,000 research papers – about 8,000 of them specifically about COVID-19 and the rest about related topics, such as other viruses and pandemics in general – and is attracting more than 100 unique users every day, all by word of mouth.
This week the team released an upgraded version ready for public use, which gives researchers the ability to search for “related papers” and sort articles using machine-learning-based relevance tuning.
The team has built automated scripts to grab new papers (including preprint papers), clean them up, and make them searchable. At the most basic level, COVIDScholar acts as a simple search engine, albeit a highly specialized one.
“Google Scholar has millions of papers you can search through,” said John Dagdelen, a UC Berkeley graduate student and Berkeley Lab researcher who is one of the lead developers. “However, when you search for ‘spleen’ or ‘spleen damage’ – and there’s research coming out now that the spleen may be attacked by the virus – you’ll get 100,000 papers on spleens, but they’re not really relevant to what you need for COVID-19. We have the largest single-topic literature collection on COVID-19.”
In addition to returning basic search results, COVIDScholar will also recommend similar abstracts and automatically sort papers into subcategories, such as testing or transmission dynamics, allowing users to do specialized searches.
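The article does not say how COVIDScholar ranks similar abstracts. As a rough illustration of the general idea only, the sketch below scores abstracts against one another with TF-IDF weights and cosine similarity, a common baseline for "related papers" features. The function names and the three sample abstracts are hypothetical and are not taken from the tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency; rare terms weigh more.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(docs, index, k=3):
    """Return indices of the k documents most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Hypothetical abstracts, invented for this illustration.
ABSTRACTS = [
    "Rapid antigen testing for coronavirus in outpatient clinics",
    "Evaluation of antigen testing accuracy for coronavirus",
    "Modeling transmission dynamics of influenza in schools",
]

if __name__ == "__main__":
    print(most_similar(ABSTRACTS, 0, k=1))
```

In this toy collection the second abstract ranks as the closest match to the first because the two share the testing vocabulary, while the third is about a different topic. A production system would more likely rely on learned text representations than on raw word overlap.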
Now, the team is taking another step toward “automated science.” For example, they can train algorithms to look for unnoticed connections between concepts.
“You can use the generated representations for concepts from the machine learning models to find similarities between things that don’t actually occur together in the literature, so you can find things that should be connected but haven’t been yet,” Dagdelen said.
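What Dagdelen describes can be pictured with a small example. The sketch below assumes each concept has already been mapped to a numeric vector by some trained model, then lists concept pairs whose vectors are close together even though the pair never appears in the same paper. The vectors, the concept names, the co-occurrence set and the 0.9 threshold are all made up for illustration. They do not come from COVIDScholar and carry no scientific meaning.

```python
import math
from itertools import combinations

# Hypothetical toy embeddings; a real system would learn these from text.
EMBEDDINGS = {
    "spleen damage": [0.9, 0.1, 0.3],
    "viral infection": [0.8, 0.2, 0.4],
    "testing": [0.1, 0.9, 0.0],
    "transmission dynamics": [0.7, 0.1, 0.6],
}

# Hypothetical record of concept pairs already mentioned together in a paper.
CO_OCCURRING = {frozenset(("viral infection", "transmission dynamics"))}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unlinked_but_similar(embeddings, co_occurring, threshold=0.9):
    """List concept pairs that are close in vector space but never co-occur.

    Returns (concept_a, concept_b, similarity) tuples, most similar first.
    """
    candidates = []
    for a, b in combinations(sorted(embeddings), 2):
        if frozenset((a, b)) in co_occurring:
            continue  # already connected in the literature, nothing new
        score = cosine(embeddings[a], embeddings[b])
        if score >= threshold:
            candidates.append((a, b, score))
    return sorted(candidates, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    for a, b, score in unlinked_but_similar(EMBEDDINGS, CO_OCCURRING):
        print(f"{a} <-> {b}: {score:.2f}")
```

Pairs that pass the filter are only candidates for a human expert to examine. A high similarity score on its own is not evidence of a real connection.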
COVIDScholar’s developers are also working with researchers in Berkeley Lab’s Environmental Genomics and Systems Biology Division and UC Berkeley’s Innovative Genomics Institute to identify new connections based on genetic links between diseases and human phenotypes.
The entire tool runs on the supercomputers of the National Energy Research Scientific Computing Center (NERSC), a Department of Energy Office of Science user facility located at Berkeley Lab. The online search engine and portal are powered by the Spin cloud platform at NERSC, and lessons learned from successfully operating the Materials Project, which serves millions of data records per day to users, informed the development of COVIDScholar. This synergy across disciplines – from biosciences to computing to materials science – is what made the project possible.
Also key is that the group had already built essentially the same tool for materials science, called MatScholar, a project supported by the Toyota Research Institute and Shell. The group published a study in Nature last year showing that an algorithm with no training in materials science could uncover new scientific knowledge.
COVIDScholar is supported by Berkeley Lab’s Laboratory Directed Research and Development (LDRD) program. The materials science work that served as the foundation for this project is supported by the Energy & Biosciences Institute (EBI) at UC Berkeley, the Toyota Research Institute, and the National Science Foundation.
Read more: https://newscenter.lbl.gov/2020/04/28/machine-learning-tool-could-provid...