
HathiTrust Research Center receives NEH Digital Humanities Advancement Grant to build open research tools

New tools will allow users to more easily interact with the HathiTrust Digital Library's collection

BLOOMINGTON, Ind. — January 13, 2022 — The National Endowment for the Humanities has awarded the HathiTrust Research Center (HTRC) a $325,000 grant. This award enables development of a next-generation web-based, interactive visualization and analysis tool that allows users to more easily interact with the HathiTrust Digital Library's collection, which is made up of more than 17 million volumes.

Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction (TORCHLITE) will be directed by John Walsh (HTRC director and associate professor of information and library science at the Indiana University Luddy School of Informatics, Computing, and Engineering) with partners from the University of Illinois. 

“Our goal is to drastically increase the capability of tools that scholars use to interact with the HathiTrust Digital Library,” said Walsh. “Our focus has historically been on creating and providing research capabilities for exploring unprecedented amounts of data. With TORCHLITE, we’ll improve the immediacy, interactivity, and user friendliness of accessing and analyzing the data.” 

These scholarly data reside in the Extracted Features (EF) dataset, which contains metadata and statistical information from the full HathiTrust corpus. The dataset documents every word on every page, including the number of times each word appears, which allows many forms of full-text analysis, even on copyrighted materials. The EF dataset contains nearly 3 trillion tokens representing more than 6 billion pages, making it arguably the largest open dataset of its kind readily available to digital humanities and other scholars in the world.

TORCHLITE will enable retrieval of specific volume-level metadata elements, such as title, publisher, date of publication, genre, and page count, as well as page-level metadata, such as the algorithmically determined language of a page and page-level counts of tokens, parts of speech, lines, and sentences.

“Through the HTRC, Indiana University continues to help researchers turn information into knowledge,” said HTRC founder and Pervasive Technology Institute Executive Director Beth Plale. “The Extracted Features dataset synthesizes into one place important features about the volumes that make up the HathiTrust Digital Library. The tools developed through TORCHLITE will make it easier for researchers to find and analyze the material they seek. This is an important development in advancing our understanding of the world’s knowledge as captured in our research libraries.”

In addition to creating interactive, easy-to-use tools and dashboards, the project will promote broad community engagement through a workshop featuring a hackathon, which will encourage individual users to develop their own tools using the project’s application programming interface (API).

ABOUT THE HTRC 

The HTRC is the official research arm of the HathiTrust, a consortium that centrally collects image and text representations of library holdings digitized by the Google Books project and other mass-digitization efforts. HathiTrust’s mission is to contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.

ABOUT THE NATIONAL ENDOWMENT FOR THE HUMANITIES 

Created in 1965 as an independent federal agency, the National Endowment for the Humanities supports research and learning in history, literature, philosophy, and other areas of the humanities by funding selected, peer-reviewed proposals from around the nation. Additional information about the National Endowment for the Humanities and its grant programs is available at: www.neh.gov.