Name Disambiguation

Published on Aug 20, 2023

The problem of identity in the historical record is a perennial issue for the scientific community. An article published in Earth Science Informatics assessing various identifier schemes reports:

“The problem of identity has vexed humanity throughout all of recorded history. A wide variety of methods; from assigned identifiers to taxonomic techniques and beyond; have historically been used to resolve the issue of whether this thing, whatever or whomever it may be, is what it purports to be” (Duerr et al., 2011).

To help solve the problem, Wikidata proves immensely useful, acting as a powerful identifier broker and universally accessible global name authority registry. Wikidata’s appeal over other authority services comes down to the registration process: it is easy, instant, and almost anyone can do it. This lack of gatekeeping challenges traditional hierarchical notions of “authority”; in typical Wikimedian spirit, the de facto “authority” is always the semi-anonymous, self-correcting crowd.

Collecting identifiers linked to BHL Creator IDs is a tactic employed by BHL’s Cataloging and Metadata Committee and the Tech Team. This strategy helps to eliminate duplicate author names.

Who’s who? Metadata aggregated from hundreds of contributors results in duplicative names in the BHL database. Image: Dearborn, 2022.

Curating and collecting persistent identifiers for people and organizations in BHL goes beyond disambiguation. There is ample evidence that the curation (and proliferation) of BHL Creator IDs uncovers the contributions of under-represented groups, such as women scientific illustrators like Mary K. Spittal (Duerr et al., 2011).

Serendipitous interaction on Twitter surfaces Mary K. Spittal, scientific illustrator, in BHL and Wikidata. Image: (Marshall, 2021)

Mary K. Spittal is no longer obscured by missing data points. With a BHL Creator ID added to both BHL and Wikidata, those records have been interlinked, and the publications Spittal helped create now surface through a simple Google search. Spittal and her illustrations are now part of Wikidata’s growing biodiversity knowledge graph.

Mary K. Spittal, once an obscure data point, now has meaningful semantic interlinkages to other Wikidata entities. Image: Dearborn, 2022

Despite many promising developments on the horizon for identity management, the un-disambiguated data that is in BHL must be dealt with now. BHL researchers continue to encounter information dead-ends, and the big data aggregators that consume BHL data for their discovery layers (OCLC, DPLA, CrossRef, OpenAlex) only replicate the same problem for their end-users.

In the recent paper “People are Essential to Linking Biodiversity Data,” BHL’s Persistent Identifier Working Group Chair, Nicole Kearney, BHL advisors Dr. Rod Page and Siobhan Leachman, et al. argue that providing users with the basic ability to differentiate between people is fundamentally important:

“Person data are almost always associated with entities such as specimens, molecular sequences, taxonomic names, observations, images, traits and publications […] these entities are also useful in validating data, integrating data across collections and institutional databases and can be the basis of future research into biodiversity and science” (Groom et al., 2020).

To help tame the tumult of duplicate author names that arrive in BHL as name strings from numerous sources, BHL’s Cataloging and Metadata Committee has piloted three workflows as part of the group’s long-standing Author Merge project:

  1. OpenRefine Wikidata Extension

  2. Wikidata’s Mix’n’match Tool

  3. Round Tripping Persistent Identifiers from Wikidata

Workflow 1 - OpenRefine Wikidata Extension

Diana Duncan, former BHL Cataloging and Metadata Chair and Lead Cataloger at the Field Museum, has documented an OpenRefine reconciliation workflow whose purpose is to match BHL free-text author name data with identifiers from external services like VIAF, ORCID, and Wikidata (Appendix 2: Interviews). Of these services, Duncan likes the Wikidata extension best because it provides on-screen images, extra metadata points, and the ability to edit queries on the fly. These features greatly expedite the name reconciliation process.

Reconciling BHL Author Names using OpenRefine and the Wikidata Reconciliation service. Image: Dearborn, 2021.
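
OpenRefine talks to reconciliation services by sending batches of queries over HTTP. The following is a minimal sketch of what such a batch might look like for BHL name strings, assuming the service follows the generic reconciliation protocol (a form parameter named `queries` holding JSON). The endpoint URL, the function names, and the sample name strings are illustrative assumptions, not part of Duncan's documented workflow; `Q5` is the Wikidata item for "human". The sketch only builds the request body and does not contact any service.

```python
import json
from urllib.parse import urlencode

# Assumed public endpoint for a Wikidata reconciliation service; verify before use.
RECON_ENDPOINT = "https://wikidata.reconci.link/en/api"


def build_reconciliation_queries(names, type_qid="Q5", limit=3):
    """Build one reconciliation query per name string, keyed q0, q1, ..."""
    return {
        f"q{i}": {"query": name, "type": type_qid, "limit": limit}
        for i, name in enumerate(names)
    }


def encode_request_body(queries):
    """Form-encode the batch as the `queries` parameter of a POST request."""
    return urlencode({"queries": json.dumps(queries)})


# Illustrative name-string variants of the kind found in aggregated metadata.
queries = build_reconciliation_queries(["Spittal, Mary K.", "Spittal, M. K."])
body = encode_request_body(queries)
```

Restricting candidates to humans and keeping the candidate limit small mirrors what a cataloger does on screen: a short list of plausible people per name string, each of which still needs human confirmation.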

For each reconciled batch of roughly a thousand names, BHL staff merge records and add administrative notes and the matched URIs to BHL’s administrative back-end.

Merging authors in the BHL Administrative Dashboard and adding URIs and extra data points is a painstaking investigative and curation process for BHL’s catalogers. Image: Dearborn, 2021.
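
One way to see why reconciliation helps with merging: author records that reconcile to the same external URI are natural merge candidates. The sketch below illustrates that idea only. The record layout, field names, creator IDs, and URIs are hypothetical placeholders, not BHL's actual schema or data, and the actual merge remains a manual decision by a cataloger.

```python
from collections import defaultdict


def merge_candidates(records):
    """Group creator IDs by matched URI; keep URIs shared by two or more records."""
    by_uri = defaultdict(list)
    for rec in records:
        uri = rec.get("wikidata_uri")
        if uri:  # records without a match are left for manual review
            by_uri[uri].append(rec["creator_id"])
    return {uri: sorted(ids) for uri, ids in by_uri.items() if len(ids) > 1}


# Hypothetical reconciled batch; IDs and URIs are placeholders.
batch = [
    {"creator_id": 101, "name": "Spittal, Mary K.", "wikidata_uri": "http://www.wikidata.org/entity/Q_EXAMPLE_A"},
    {"creator_id": 102, "name": "Spittal, M. K.", "wikidata_uri": "http://www.wikidata.org/entity/Q_EXAMPLE_A"},
    {"creator_id": 103, "name": "Spittal, M.", "wikidata_uri": None},
]

candidates = merge_candidates(batch)
```

Here records 101 and 102 would be flagged for review as a possible duplicate pair, while record 103 stays untouched because it has no confirmed match.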

Workflow 2 - Wikidata’s Mix’n’match Tool

Wikidata’s Mix’n’match tool, created by veteran Wikimedian and MediaWiki developer Magnus Manske, provides a low-barrier way of interlinking BHL Creator IDs with their corresponding Wikidata items.

Records from over 3,400 institutional datasets are being converted to linked open data and assigned identifiers by the Mix’n’match tool. Two of BHL’s datasets are going semantic: BHL’s Creator IDs and Bibliography IDs. Since July 2017, over 37,000 BHL Creator IDs have been processed.

Siobhan Leachman, a Wikimedian and active BHL volunteer, is a big advocate of the tool. Leachman has made over 15,000 author name matches alone! In addition to gamifying Wikidata statement creation, the tool supports catalog updates, status reports, and tracks us

The Mix’n’match status report for BHL Creator IDs. Image: (Manske, n.d.)

To make matches, the interface presents an auto-match for the player to inspect more closely. The player then does reconnaissance work to confirm the match.

The Mix’n’match interface. Image: (Manske, n.d.)

In a recent interview, Leachman endorsed Mix’n’match as the perfect entry point for BHL staff who have no experience with Wikidata: it’s easy, it’s fun, and the linking factor is incredibly powerful (Appendix 2: Interviews).

Workflow 3 - Round Tripping Persistent Identifiers from Wikidata

In a recent blog post, “Biodiversity Heritage Library is Round Tripping Persistent Identifiers with the Wikidata Query Service,” BHL’s Committee members documented the workflow that enabled BHL’s harvest of 88,507 author identifiers (Dearborn & Leachman, 2023).

The BHL-Wikidata Round Trip was an experimental data pipeline piloted by BHL Committee Members in Spring 2022, made possible by BHL Staff and Wikimedian curation efforts. Image: (Dearborn & Leachman, 2023)
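
To give a rough sense of what a round trip through the Wikidata Query Service can involve, here is a minimal sketch that builds a SPARQL query for items carrying a BHL Creator ID and flattens a response in the standard SPARQL JSON results format. It is not the query used in the blog post. The property numbers are assumptions to verify on Wikidata before use (P4081 for BHL creator ID, P214 for VIAF ID, P496 for ORCID iD), and the function names and the sample response are hypothetical. No network request is made.

```python
# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_creator_id_query(limit=100):
    """Select items with a BHL creator ID, plus VIAF and ORCID where present."""
    # Assumed property IDs: P4081 (BHL creator ID), P214 (VIAF ID), P496 (ORCID iD).
    return f"""
SELECT ?item ?bhlCreatorId ?viaf ?orcid WHERE {{
  ?item wdt:P4081 ?bhlCreatorId .
  OPTIONAL {{ ?item wdt:P214 ?viaf . }}
  OPTIONAL {{ ?item wdt:P496 ?orcid . }}
}}
LIMIT {limit}
""".strip()


def rows_from_sparql_json(result):
    """Flatten SPARQL JSON results into one plain dict per row."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


# Hypothetical response fragment; the values are placeholders, not real data.
sample = {
    "head": {"vars": ["item", "bhlCreatorId", "viaf", "orcid"]},
    "results": {"bindings": [
        {
            "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE_A"},
            "bhlCreatorId": {"type": "literal", "value": "12345"},
            "viaf": {"type": "literal", "value": "000000000"},
        }
    ]},
}

rows = rows_from_sparql_json(sample)
```

Because the VIAF and ORCID patterns are optional, a row simply lacks those keys when Wikidata has no such statement, which keeps the harvest from silently dropping authors who have only a BHL Creator ID.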

This workflow later resulted in the development of an author sidebar on the BHL platform to expose the author data in BHL’s database to users doing name research and disambiguation work. So far, the feedback has been overwhelmingly positive.

BHL has added author data to BHL’s front-end to help Wikimedians do their Wikidata work more effectively. Image: (Dearborn & Leachman, 2023)
