In the past decade, open data initiatives have cropped up worldwide under increasing pressure from citizens and governments to make public data freely available on the web (Attard, Orlandi, Scerri, & Auer, 2015). In its 2018 report, “Digitization of the World,” the International Data Corporation predicts that the “global datasphere will grow from 33 zettabytes (ZB) in 2018 to 175 ZB by 2025.” For reference, one zettabyte is a trillion gigabytes (International Data Corporation, 2018). Information standards, schemas, and open formats are the conduits that have been put in place to help make this deluge of data interoperable, usable, and connected.
In parallel with the open data movement, the field of biodiversity informatics has burgeoned (Peterson, Soberón, & Krishtalka, 2015). Computational analysis methods are connecting disparate biodiversity data and allowing scientists to infer new patterns about our natural world. In a rapidly changing digital landscape, traditional modes of information delivery and exchange no longer suffice, and new models are replacing outdated ones.
Despite valiant efforts, eliminating the structural barriers that hinder the true potential of data has proven difficult. Much of the world’s data remains locked up, siloed, and underutilized. Several studies estimate that only 1–5% of data is ever analyzed (Burn-Murdoch, 2017), and in 2016, IBM reported that over 80% of data is dark data, a figure expected to grow to 93% by 2020 (Trice, 2016).
Unstructured data — “dark data” — accounts for a majority of all data generated. How much untapped potential and hidden knowledge lies within BHL’s 60-million-page textual corpus, waiting to be unlocked? Image: (Schroeer, 2017)
In 2001, Scientific American published a pithy article entitled “The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities.” In it, Sir Tim Berners-Lee offers a glimpse of a future in which the banal bureaucracy of our daily lives all but disappears, handled by computational agents acting on information from the semantic web. In this new information paradigm, the old web has been retrofitted to gain logic, and all data is interoperable and connected. Berners-Lee predicted that someday soon, documents and data would not merely be displayed but also processed, understood, and acted upon by both humans and machines (Berners-Lee, Hendler, & Lassila, 2001).
Thirty years ago, Berners-Lee asked that documents be put on the web and linked using HTML (Hypertext Markup Language); now he asks that data be put on the web and linked using RDF (Resource Description Framework) (W3C, 2014). His ask has become the rallying cry for the Semantic Web Community, who assert humanity’s unadulterated right to all publicly available data sets, chanting:
“Raw Data Now!”
(Berners-Lee, 2009)
To realize the potential of the semantic web, data will need to become more than open; it will need to become 5-star Open Data.
The 5-star data rating scheme. Image: (Hausenblas, 2012)
According to Berners-Lee, unlike documents on the web, which are rendered by a web browser using HTML (Hypertext Markup Language), data on the web should be described using the Resource Description Framework (RDF). RDF is a data model whose statements follow a subject–predicate–object pattern, much like a simple sentence. One atomic unit of data in RDF is called a statement, also referred to as a triple, and takes this form:
RDF Syntax: <subject> <predicate> <object>
Example: <Margaret Mead> <employer> <American Museum of Natural History>
The predicate connects the subject and the object to form a graph that can be visualized as a network of nodes. Each component of the triple is represented by a resolvable uniform resource identifier (URI). These URIs allow data to be interlinked across the web, thereby eliminating silos and connecting globally distributed data stores.
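To make this concrete, here is a minimal sketch using the Python rdflib library; the library choice and the example.org URIs are illustrative assumptions rather than part of any production workflow. It builds the Margaret Mead triple above and prints it in raw subject–predicate–object form:

```python
from rdflib import Graph, Namespace

# Hypothetical namespace used only for illustration; real linked data would
# reuse published URIs (for example, Wikidata or Library of Congress identifiers).
EX = Namespace("http://example.org/")

g = Graph()

# One RDF statement (triple): <subject> <predicate> <object>
g.add((EX.Margaret_Mead, EX.employer, EX.American_Museum_of_Natural_History))

# Serialize as N-Triples to see the raw <subject> <predicate> <object> form.
print(g.serialize(format="nt"))
```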
Margaret Mead’s Wikidata knowledge graph. Image: Wikidata Query Service, 2022
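Graphs like the one pictured above can be queried programmatically. The sketch below, which assumes the SPARQLWrapper Python package and uses Wikidata’s “employer” property (P108), asks the public Wikidata Query Service who employed Margaret Mead; the query text is an illustration, not the query behind the image:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public SPARQL endpoint behind the Wikidata Query Service.
endpoint = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="bhl-linked-data-example/0.1 (illustrative user agent)",
)

# Find the item labeled "Margaret Mead" and follow its employer (P108) links.
endpoint.setQuery("""
SELECT ?employerLabel WHERE {
  ?person rdfs:label "Margaret Mead"@en ;
          wdt:P108 ?employer .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["employerLabel"]["value"])
```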
According to the semantic vision roadmap, if data practitioners describe data in RDF and abide by the 5-star data scheme, the world’s dark data will come into the light, data silos will evaporate, and a new era for human knowledge will dawn.
Formulated in response to the dark data problem, specifically in the sciences, the FAIR data initiative aims to ensure scientific reproducibility and society’s ability to derive maximum benefit from public research investments. Naturally, humans are important stakeholders who stand to gain a great deal from FAIR data. However, machines are increasingly the primary target of “FAIRified” data:
“…‘computational stakeholders’ are increasingly relevant, and demand as much, or more, attention as their importance grows. One of the grand challenges of data-intensive science, therefore, is to improve knowledge discovery through assisting both humans, and their computational agents, in the discovery of, access to, and integration and analysis of, task-appropriate scientific data and other scholarly digital objects” (Wilkinson, 2016).
The scientific community has been working to better steward scholarly data and make science reproducible. Published in 2016, “The FAIR Guiding Principles for scientific data management and stewardship” have now been widely endorsed around the world (Wilkinson, 2016). FAIR is a mnemonic acronym for its four guiding principles: Findable, Accessible, Interoperable, and Reusable.
Checklist for FAIR organizational compliance. Image: (Open Access and Fair Principles, n.d.)
While the scientific community has been focused on making science reproducible, open, and FAIR, libraries have taken a brass-tacks approach to dark data. A decade after Berners-Lee presented his new vision for the web to the world, the Library of Congress and the semantic consulting firm Zepheira co-developed an RDF ontology known as BIBFRAME (the Bibliographic Framework Initiative), with the goal of creating a bibliographic standard, expressed in the W3C RDF standard, that would replace the current cataloging standard, MARC (Machine Readable Cataloging Format) (Miller, Ogbuji, Mueller, & MacDougall, 2012).
Long cherished for its granularity and stability, MARC has served as the de facto bibliographic data standard since the 1960s. Its success has made library cataloging a collaborative, efficient, global venture. Nevertheless, modern librarians lament that MARC is not web-friendly. Roy Tennant, the author of the now-infamous article “MARC Must Die,” writes:
“Libraries exist to serve the present and future needs of a community of users. To do this well, they need to use the very best that technology has to offer. With the advent of the web, XML, portable computing, and other technological advances, libraries can become flexible, responsive organizations that serve their users in exciting new ways. Or not. If libraries cling to outdated standards, they will find it increasingly difficult to serve their clients as they expect and deserve (Tennant, 2021).”
When the first version of BIBFRAME was completed in the Fall of 2012, the new data format was introduced as “the foundation for the future of bibliographic description that happens on, in, and as part of the web and the networked world we live in” (Miller, Ogbuji, Mueller, & MacDougall, 2012). To facilitate the conversion process from MARC to RDF, the Library of Congress has created many conversion tools to assist organizations with their data transformation needs (Library of Congress, n.d.).
BIBFRAME is informed by the FRBR model (Functional Requirements for Bibliographic Records) and maintains the granularity of MARC. The standard is expressed in RDF’s expected triple format, and the vocabulary consists of three core classes: work, instance, and item (Schreur, 2018). In the side-by-side comparison below, the same bibliographic record is shown in BIBFRAME and MARC: one is web-friendly and contains computational logic; the other does not.
A bibliographic record for The Arcturus Adventure is expressed in BIBFRAME as a knowledge graph (left); the same bibliographic record is expressed in MARC (right). Image: Dearborn, 2022
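As an illustration of the work–instance–item pattern (not a reproduction of the record pictured above), here is a minimal rdflib sketch. It assumes the BIBFRAME 2.0 namespace published by the Library of Congress; the example.org identifiers and the choice of properties are hypothetical:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# BIBFRAME 2.0 vocabulary published by the Library of Congress.
BF = Namespace("http://id.loc.gov/ontologies/bibframe/")
# Hypothetical local identifiers, used here only for illustration.
EX = Namespace("http://example.org/bhl/")

g = Graph()
g.bind("bf", BF)

work = EX["arcturus-adventure-work"]
instance = EX["arcturus-adventure-instance"]
item = EX["arcturus-adventure-item"]

# The three core BIBFRAME classes: an abstract Work, a published Instance
# of that work, and an Item held by a particular library.
g.add((work, RDF.type, BF.Work))
g.add((instance, RDF.type, BF.Instance))
g.add((item, RDF.type, BF.Item))

# Relationships tying the three levels together.
g.add((instance, BF.instanceOf, work))
g.add((item, BF.itemOf, instance))

# A title attached to the instance.
title = EX["arcturus-adventure-title"]
g.add((instance, BF.title, title))
g.add((title, RDF.type, BF.Title))
g.add((title, BF.mainTitle, Literal("The Arcturus Adventure")))

# Print the graph as Turtle, a human-readable RDF serialization.
print(g.serialize(format="turtle"))
```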
Adoption of BIBFRAME has gained recent momentum with a $1.5 million grant awarded by the Andrew W. Mellon Foundation to fund Linked Data for Libraries, dubbed LD4L Labs (2016–2018), distributed to Columbia, Cornell, Harvard, the Library of Congress, Princeton, and Stanford. These six institutions are leading the charge by
developing standards, guidelines, and infrastructure to communally produce metadata as linked open data;
developing end-to-end workflows to create linked data in a technical services production environment;
extending the BIBFRAME ontology to describe library resources in specialized domains and formats; and
engaging the broader library community to ensure a sustainable and extensible environment (LD4P Partners, 2016).
To achieve these ambitious goals, the LD4P partners have organized a global community of practice around linked open data. Wikidata is playing a key role by providing librarians with an open, free, collaborative testing ground for semantic experimentation. The next two phases of the grant, Linked Data for Production: Pathway to Implementation (LD4P2) and Linked Data for Production: Closing the Loop (LD4P3), aim to mainstream BIBFRAME production with Sinopia, a global linked data creation environment (Stanford University Libraries Digital Library Systems & Services, 2018).
Working in close collaboration with LD4P, Casalini Libri, an Italian library services firm, is converting entire institutional catalogs from MARC to BIBFRAME through its Share-VDE conversion service. Additionally, known named entities are being harmonized across catalogs with the assignment of globally persistent identifiers, called Supra Uniform Resource Identifiers (SUIDs).
The SUID for Matthew Alexander Henson, an African American Arctic explorer, is https://svde.org/agents/951654264154896. Image: (Henson & Share-VDE, n.d.)
In September 2021, the new Share-VDE 2.0 beta went live, with plans to bring the system into production in May 2023. The beta site already allows users to search nine catalogs in six languages, across 35,702,678 works. The user interface features accessible themes for vision-impaired persons, custom domain names, and branded skins for member institutions (About Share-VDE, 2021). Converted records are delivered back to adopting institutions as triple-store databases, and delta loads of records from participating institutions are slated for 2023.
At the Smithsonian Institution, the Share-VDE BIBFRAME conversion effort is being led by Descriptive Data Manager Jackie Shieh, an ardent linked data advocate who has worked with the Casalini Libri team to usher nearly 2 million records from the Smithsonian Libraries and Archives catalog into the semantic web. In a recent blog post, she writes:
“SVDE’s back-end technology prepared the Smithsonian Libraries and Archives data for indexing, clustering, searching and representation. Their work helps data providers like the Smithsonian reap the benefits of linked data and connect library collections on the web in a user-friendly and informative manner” (Shieh, 2022).
As the library world moves headlong towards BIBFRAME, it would be wise for BHL and its partners to proactively follow suit. One catalog at a time, Casalini Libri is helping libraries break free from the shackles of MARC and de-silo library data across the globe with a centralized semantic portal, Share-VDE.
The BIBFRAME initiative, FAIR principles, and 5-star linked open data have the potential to spark a gestalt leap for information, and ultimately for human knowledge. These specifications are about harnessing, using, and stewarding data in ways that allow humans to find meaning in the zettabytes of noise.