CHAMPAIGN, Ill., 11/24/19: In a project to build an index containing the names of all biological species found on Earth, Illinois Natural History Survey (INHS) informatician Dmitry Mozzherin and the HathiTrust team scanned one-tenth of all published human knowledge for occurrences of scientific names in less than a day.
The ability to scan the entire body of published work ever produced is within our grasp, Mozzherin said.
Using the HathiTrust digital library, Mozzherin is creating an index to list billions of entries for scientific exploration. The entries link the scientific name to the volume and page where the name is found in publications.
To create the index, Mozzherin developed an app that ran for 9 hours on 50 computers. The run scanned 18 million books and journals, which, according to Google, is one-tenth of all published work. Index users looking for information about a species can enter its name to receive a list of papers, books, and articles on that species.
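As a rough illustration of the kind of index described here, the following minimal Python sketch maps each scientific name to the volumes and pages where it appears. It is not the project's actual code: the record layout, the volume identifiers, and the function names `build_index` and `lookup` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical occurrence records: (scientific name, volume ID, page number).
# The volume IDs and page numbers are invented for illustration only.
occurrences = [
    ("Aedes aegypti", "vol-001", 12),
    ("Aedes aegypti", "vol-002", 340),
    ("Puma concolor", "vol-001", 87),
]


def build_index(records):
    """Map each scientific name to the (volume, page) pairs where it occurs."""
    index = defaultdict(list)
    for name, volume, page in records:
        index[name].append((volume, page))
    return dict(index)


def lookup(index, name):
    """Return every (volume, page) location recorded for a name."""
    return index.get(name, [])


index = build_index(occurrences)
print(lookup(index, "Aedes aegypti"))
```

A name that never appears in the records simply returns an empty list, which is how a user query for an unindexed species would behave in this sketch.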
“We wanted to have the infrastructure that allows us to connect all known information about a species through its scientific name and make it globally accessible,” Mozzherin said. “We have taken it there, so that’s good news.”
Throughout the ongoing project, both the massive number of publications to be scanned and the quality of the index have been concerns. Species names also present inherent challenges.
“One problem is that names are not stable,” Mozzherin said. “People make mistakes. Scientists call the same species a different name or discover a ‘new’ species many times. In this project, we’re trying to make sense out of the multiple names and different spellings chosen for the exact same species.”
One application for the index is identifying volumes that may be medically relevant, for example, by finding all volumes containing the scientific name for the mosquito that carries the Zika virus. Publications can also be grouped into clusters and prioritized by the extent to which they cover data on a species of interest.
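One simple way to picture this prioritization is to rank volumes by how often a name occurs in them. The Python sketch below does that under stated assumptions: occurrence count stands in for "extent of coverage," and the records, volume IDs, and the function name `rank_volumes` are hypothetical. The project itself may measure coverage differently.

```python
from collections import Counter

# Hypothetical occurrence records: (scientific name, volume ID, page number).
records = [
    ("Aedes aegypti", "vol-A", 3),
    ("Aedes aegypti", "vol-B", 10),
    ("Aedes aegypti", "vol-B", 11),
    ("Aedes aegypti", "vol-B", 54),
    ("Puma concolor", "vol-A", 7),
    ("Aedes aegypti", "vol-A", 9),
]


def rank_volumes(records, name):
    """Rank volumes by the number of pages on which the name occurs.

    Occurrence count is only a stand-in for coverage (an assumption).
    """
    counts = Counter(volume for n, volume, _page in records if n == name)
    # most_common() returns (volume, count) pairs, highest count first.
    return counts.most_common()


ranking = rank_volumes(records, "Aedes aegypti")
print(ranking)
```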
The program for finding names in the HathiTrust library is available in the GitHub repository https://github.com/gnames/htindex. This project is funded by the National Science Foundation. Mozzherin also received an award from the HathiTrust Research Center for his advanced collaborative support project.
----
Media contact: Dmitry Mozzherin, mozzhei@illinois.edu
Tricia Barker, Associate Director for Strategic Communications, 217-300-2327, tlbarker@illinois.edu