BHL's Big Data Challenges

Published on Aug 20, 2023

As strong as the BHL community’s shared vision has proven to be, the backlog of user requests, metadata curation work, and the ever-present digitization request queue now represent many lifetimes of work for a disproportionately small group of engaged BHL staff. To rise to the looming challenges presented by climate change, BHL’s data management strategies will need to pivot to embrace automation, crowdsourcing, machine learning, and the adoption of emerging semantic web standards. A recent review of BHL's information architecture by consulting firm Index Data yielded this final prescient insight:

“Aggregating biodiversity information is too big a job to be left to a relatively small cadre of information professionals.” (Taylor, Wilson, Hammer, & Gorrell, 2019)

Today, BHL faces three big data challenges that must be solved to make the data in its corpus truly open, actionable, FAIR, and 5-star.

Challenge 1 — Correcting and Transcribing OCR Text

When a book is digitized, an Optical Character Recognition (OCR) engine creates an unstructured text file alongside the page image and metadata files. Currently, BHL’s OCR corpus is sizable: 289,000+ digital objects, comprising 60 million+ pages, amounting to 40+ gigabytes of data, silently awaiting conversion, normalization, and crowdsourcing.

According to the 5-star rating scheme (Hausenblas, 2012), unstructured OCR text only ranks as 2-star data:

  • It is not machine-readable;

  • It does not contain URIs;

  • It does not link to anything; and 

  • It is error-ridden.

BHL’s Technical Coordinator and Lead Developer and the BHL Transcription Upload Tool Working Group (TUTWG) are working together to improve the quality of OCR text in the BHL corpus by

  • reprocessing the corpus with Tesseract, an open-source OCR engine; 

  • experimenting with cutting-edge handwritten text recognition (HTR) engines for handwritten materials; and 

  • analyzing transcription platforms for their ability to extract data while hastening partner transcription initiatives.

Nevertheless, a sustainable, scalable workflow to liberate machine-readable data locked in BHL’s OCR text files has yet to be forged.

BHL’s uncorrected OCR, particularly in data-rich archival materials, is “dark data.” Image: (Dearborn & Kalfatovic, 2022)

Challenge 2 — Improving Search Precision and Retrieval

BHL users have asked for many search enhancements that depend on metadata that does not yet exist. Library resource description is constrained to collection-, title-, or item-level metadata for books, journals, and archival materials. More granular record types, frequently referred to as “named entities,” such as articles, species, images, events, locations, nomenclatural acts, taxa, authors, and publishers, are described to a lesser extent. BHL’s suite of technical features, such as full-text search, taxonomic intelligence, and data model upgrades for scholarly articles, has improved access to previously uncatalogued content. Nevertheless, limited search functionality for under-described entities means that a plethora of unique information in BHL’s collection is still quite difficult to retrieve. To make all of BHL’s content discoverable and reusable, beyond books and journals, programmatic metadata clean-up and enrichment must become the strategic focus.

Challenge 3 — Linking BHL Data to Global Knowledge Bases

The final tenet of the 5-star data scheme asks that “you link your data to other data to provide context.” BHL’s 2017 User Needs Assessment, conducted by National Digital Stewardship Resident (NDSR) Pamela McClanahan, found that users want exactly this: “linking” was consistently cited as a top BHL user request (McClanahan, 2018). For this reason, BHL’s Persistent Identifier Working Group (BHL-PIWG) has been actively registering DOIs (Digital Object Identifiers) for journal articles in BHL, thereby bringing this content into the modern linked (5-star) network of scholarly research.

How do PIDs work? Image: (UCSB Research Data Services, 2020)

PIWG Chair and Manager at Biodiversity Heritage Library Australia, Nicole Kearney, explains the benefits of persistent identifiers further:

“In modern online publishing, PID assignment and linking happens at the point of publication: DOIs (Digital Object Identifiers) for publications, ORCIDs (Open Researcher and Contributor IDs) for people, and RORs (Research Organization Registry IDs) for organisations. The DOI system provided by Crossref (the DOI registration agency for scholarly content) delivers reciprocal citations, enabling convenient clicking from article to article, and citation tracking, enabling authors and institutions to track the impact and reach of their research output. Publications that lack PIDs, which include the vast majority of legacy literature, are hard to find and sit outside the linked network of scholarly research” (Kearney et al., 2021).
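To make the three identifier types in the quote above concrete, the following is a minimal Python sketch of how bare PIDs become resolvable links. The resolver prefixes (doi.org, orcid.org, ror.org) are the public resolvers for each scheme, and the ORCID check character follows the ISO 7064 MOD 11-2 rule. The helper names (`pid_to_url`, `is_valid_orcid`, `orcid_check_char`) are hypothetical and are not part of any BHL or Crossref tooling; the DOI in the usage note is a made-up placeholder.

```python
import re

# Public resolver prefixes for the three identifier types named in the quote.
RESOLVERS = {
    "doi": "https://doi.org/",
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
}


def orcid_check_char(base_digits: str) -> str:
    """Compute the ORCID check character (ISO 7064 MOD 11-2) for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the 0000-0000-0000-000X shape and the trailing check character."""
    if not re.fullmatch(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_char(digits[:15]) == digits[15]


def pid_to_url(kind: str, value: str) -> str:
    """Turn a bare identifier into a resolvable URL (hypothetical helper)."""
    kind = kind.lower()
    value = value.strip()
    if kind not in RESOLVERS:
        raise ValueError(f"unsupported identifier type: {kind}")
    if kind == "orcid" and not is_valid_orcid(value):
        raise ValueError(f"malformed ORCID iD: {value}")
    return RESOLVERS[kind] + value
```

For example, `pid_to_url("doi", "10.1234/example")` yields a `https://doi.org/...` link, and `pid_to_url("orcid", "0000-0002-1825-0097")` accepts ORCID's documented sample iD because its check character matches. Legacy publications without a registered PID have nothing to pass into a helper like this, which is the gap the quote describes.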

Additionally, BHL’s Cataloging and Metadata Committee is proactively adding persistent identifiers to BHL author records, including a harvest of 88,000+ URIs from Wikidata in 2022 (Dearborn & Leachman, 2023). The hard work of interlinking BHL’s persistent identifiers with external authoritative identifiers is underway, and a comprehensive URI policy for all entities in BHL is currently being drafted collaboratively by BHL working groups. Luckily, a powerful information broker has emerged to hasten BHL’s progress: Wikidata.
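As a rough illustration of what pulling author URIs from Wikidata can look like, the sketch below builds a SPARQL query and a request URL for the public Wikidata Query Service. It is not a description of how the 2022 harvest was actually carried out. It assumes that Wikidata property P4081 is the “BHL creator ID” property; the function names and the `limit` parameter are hypothetical. The code only constructs the query and the URL and performs no network request.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P4081 is Wikidata's "BHL creator ID" property.
BHL_CREATOR_PROPERTY = "P4081"


def build_author_query(limit: int = 100) -> str:
    """Build a SPARQL query listing Wikidata items that carry a BHL creator ID."""
    return (
        "SELECT ?item ?bhlCreatorId WHERE {\n"
        f"  ?item wdt:{BHL_CREATOR_PROPERTY} ?bhlCreatorId .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET URL asking the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

Each `?item` returned by such a query is itself a Wikidata URI, so a result row pairs an external authoritative identifier with the matching BHL author record, which is the kind of interlinking described above.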
