Wikisource is a digital library of freely licensed source texts and historical documents, including poetry, government documents, constitutions of many countries, and general literature. But Wikisource is not just a digital library of freely licensed texts; it is also a free, powerful transcription platform. Originally called “Project Sourceberg,” Wikisource launched in 2003 to host a small collection of historically significant source texts that underpin article creation on Wikipedia. Today, there are 74 language editions of Wikisource; in the aggregate, they hold 988,369 items and have 381 active users (Appendix 3: Statistics).
Wikisource volunteers, sometimes called “Wikisourcers,” perform complex, librarian-like tasks such as cataloging, uploading, proofreading (transcribing), validating, and categorizing historical texts from the public domain.
The ability to proofread, transcribe, and validate texts is driven by the Proofread Page MediaWiki extension.
Wikisource Feature Set:
Documentation: a robust user documentation center
Multiple export formats: EPUB 3, EPUB 2, HTML, MOBI, PDF, RTF, and plain text (see the download sketch after this list)
Statistics: granular metrics at the project, page, and user levels
Version control: every edit is stored as a revision, with diffs, rollback, and review as needed
Language editions: 74 communities to tap into
Interwiki links: the ability to link titles, authors, pages, and words in a text to corresponding items from Wikidata, Wikimedia Commons, and Wikipedia
Optical character recognition (OCR) engines: Tesseract and Google Cloud Vision
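As a taste of the export feature, the sketch below pulls an EPUB of a Wikisource text over HTTP using the community-maintained WSExport tool. This is a minimal sketch: the endpoint, parameter names, and example title are assumptions for illustration, not guaranteed interfaces.

```python
import requests

# Download a rendered export of a Wikisource text via the WSExport tool.
# NOTE: the base URL, parameter names, and example title below are
# assumptions for illustration; the live tool may differ.
WSEXPORT_URL = "https://ws-export.wmcloud.org/"

def download_export(title: str, lang: str = "en", fmt: str = "epub-3") -> bytes:
    """Fetch a Wikisource text rendered in the requested format."""
    params = {"lang": lang, "format": fmt, "page": title}
    resp = requests.get(WSEXPORT_URL, params=params, timeout=120)
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    epub = download_export("On the Origin of Species (1859)")  # hypothetical title
    with open("origin_of_species.epub", "wb") as f:
        f.write(epub)
```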
In 2023, the BHL Technical Team prioritized its OCR text files in the aggregate, putting considerable effort into improving the data quality of BHL’s 60-million-page textual corpus. BHL’s Technical Coordinator, Joel Richard, has been running an OCR reprocessing project to improve legacy OCR text using the same open-source OCR engine that Wikisource uses, Tesseract [1].
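As a rough sketch of what a single reprocessing pass might look like, assuming page images sit in a local directory (the paths and language setting are illustrative, and this is not BHL’s actual pipeline):

```python
from pathlib import Path

import pytesseract  # thin Python wrapper around the Tesseract OCR engine
from PIL import Image

# Re-OCR a directory of scanned page images with Tesseract.
# Paths and the language setting are illustrative only.
PAGES_DIR = Path("scans")      # hypothetical input directory of page images
OUT_DIR = Path("ocr_text")
OUT_DIR.mkdir(exist_ok=True)

for page in sorted(PAGES_DIR.glob("*.tif")):
    text = pytesseract.image_to_string(Image.open(page), lang="eng")
    (OUT_DIR / f"{page.stem}.txt").write_text(text, encoding="utf-8")
```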
For the BHL Technical Team, the OCR reprocessing work is motivated by two main goals:
Surfacing and interlinking scientific species names found in BHL text
Perhaps BHL’s biggest selling point over other platforms in the biodiversity literature space is the strength of its collaboration with the Global Names project. Dmitry Mozzherin, Scientific Informatics Leader at Marine Biological Laboratory, and Dr. Geoff Ower, Database Manager at the Catalogue of Life (CoL), are the genius duo behind a suite of web services that:
find and index biological scientific names found in texts, and
interconnect found names to over 200 other taxonomic and biodiversity knowledge bases on the web [2] (see the sketch below).
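To make this concrete, here is a minimal sketch of running a chunk of OCR text through the public GNfinder service; the endpoint and the payload/response field names are assumptions based on the public API and may differ.

```python
import requests

# Find scientific names in OCR text with the Global Names GNfinder service.
# NOTE: the endpoint and the payload/response field names below are
# assumptions based on the public API and may differ.
GNFINDER_URL = "https://finder.globalnames.org/api/v1/find"

ocr_text = (
    "Among the specimens collected were Parus major and "
    "Turdus migratorius, both observed near the riverbank."
)

resp = requests.post(GNFINDER_URL, json={"text": ocr_text}, timeout=60)
resp.raise_for_status()
for hit in resp.json().get("names", []):
    print(hit.get("name"))
```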
In-text species name recognition, powered by Global Names’ algorithms and APIs, has given the BHL Technical Team a major reason to prioritize the quality of BHL’s OCR text: better text means BHL can surface and interlink even more previously hidden scientific names from its corpus for the whole world.
Improving text output accuracy by leveraging rapidly advancing OCR engine technology
BHL’s OCR reprocessing project is working through 120,000 legacy OCR files to improve its OCR text outputs, which will in turn surface more scientific names found in the text and improve BHL’s full-text search for other named-entity queries.
Despite these exciting gains in the accuracy of BHL’s OCR text files, the BHL Technical Team estimates the project will take nearly two years to complete because of the processing power and computational resources required. Additionally, BHL’s Transcription Upload Tool Working Group (BHL-TUTWG) reports that the sub-corpus of handwritten content in BHL is roughly one million pages. This may seem like a small percentage of the corpus. However, it includes BHL’s collection of scientific field notes, which are rich sources of taxonomic names, species occurrence records, and ecological, geolocation, climate, and weather data.
Unfortunately, archival, handwritten, and incunabula content will likely benefit much less from the BHL OCR reprocessing work. Surfacing species names, climate data, and biological data in these materials will instead require additional technical resources, most notably a Handwritten Text Recognition (HTR) engine; BHL’s Scientific Field Notes Collection is full of data that would benefit from HTR processing.
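As one hedged illustration of what HTR processing could look like, the sketch below runs a scanned field-note page through Google Cloud Vision’s document text detection, which handles handwriting as well as print. This is not a BHL or BHL-TUTWG pipeline; the file path is illustrative, and valid Google Cloud credentials are required.

```python
from google.cloud import vision

# Recognize handwritten text on a scanned field-note page using
# Google Cloud Vision's document text detection (handles handwriting).
# The file path is illustrative; credentials must be configured.
client = vision.ImageAnnotatorClient()

with open("field_note_page.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)
if response.error.message:
    raise RuntimeError(response.error.message)

print(response.full_text_annotation.text)
```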
Further extraction of this data could directly support conservation and climate change research. Katie Mika, Data Services Librarian at Harvard University and former BHL National Digital Stewardship Resident, has been collaborating with the BHL Technical Team to pilot a data extraction pipeline aimed at freeing this valuable data from BHL’s field notes [3] (Appendix 2: Interviews).
BHL staff have had their eye on Wikisource since meeting Andrea Zanni, former Wikisource sysop and President of Wikimedia Italia, at Wikimania 2012. He envisioned a library of the future, which he called the “hyper-library”: a place where texts interlink with other texts, with other Wikimedia sister projects, and with the broader Web. This interlinking allows us to collectively create a virtual library space where the serendipitous discovery of new knowledge happens.
To bring Andrea’s vision to life and establish Wikisource as the go-to GLAM transcription platform, the Wikimedia Foundation needs to make further investments in Wikisource.
It is important to note that Wikisource faces steep competition. There are already many platforms for text transcription, and choosing the right one is a daunting, often paralyzing decision for capacity-constrained GLAMs. Compared to other transcription software on the market, Wikisource’s user interface feels clunky and crowded. However, what it lacks in aesthetics it makes up for with a powerful feature set.
Wikisource’s strategic investment should focus on five key areas:
Overhauling its user interface,
Creating a GLAM project-based toolkit,
Building a structured data extraction extension,
Improving the Wikisource API and/or OAI-PMH feed to facilitate text ingestion between databases (a baseline sketch follows this list), and
Galvanizing, retaining, and growing the Wikisource community.
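For context, the sketch below fetches the current wikitext of a transcribed Wikisource page through the existing MediaWiki Action API, the baseline that any improved ingestion API or OAI-PMH feed would need to beat; the page title is illustrative.

```python
import requests

# Fetch the current wikitext of a Wikisource page via the MediaWiki
# Action API. The page title below is illustrative only.
API_URL = "https://en.wikisource.org/w/api.php"

params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "titles": "Moby-Dick (1851) US edition/Chapter 1",  # hypothetical title
    "format": "json",
    "formatversion": "2",
}

resp = requests.get(API_URL, params=params, timeout=60)
resp.raise_for_status()
page = resp.json()["query"]["pages"][0]
print(page["revisions"][0]["slots"]["main"]["content"][:500])
```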
Underpinning all of these key areas is the Wikisource community. The Wikimedia Foundation will need to work to galvanize, retain, and grow this community, whose members are incredibly sophisticated yet underutilized. Without needing to be semantic web experts, Wikisourcers already intuitively understand interlinking’s impact on knowledge discovery:
“The power and potential of [Wikisource] is mind-blowing: where one work refers to another (via wiki-linking), thereby contributing to the ‘great conversation.’ One author, one book leads to another author, another book, another idea.” [4]
Compared to other sister projects, Wikisource exhibits a much higher attrition rate.
Retaining an active user base should be at the forefront of strategic planning work for Wikisource. (Image: Wikisource Statistics, 2023)
BHL has a lot of homework to do as well. BHL must broaden its perspective on OCR text files: this data is not just a driver of full-text search on the BHL platform or something that makes the BHL book viewer more useful. BHL’s OCR text in the aggregate is BHL’s greatest asset: 500+ years of human scientific knowledge as a 60-million-page dataset. The untapped potential of this dataset boggles the mind. To make it useful to computational researchers and proliferate its contents on the web, BHL needs to continue to invest in the following (a minimal corpus-access sketch appears after the list):
Improving overall text recognition accuracy,
Hastening the upload of already transcribed materials from BHL partners,
Extracting tabular data and depositing it into impactful knowledge bases, and
Exploring new features and ways to interlink named entities in the text internally and externally to other related resources.
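As a small taste of treating the corpus as a dataset, the sketch below pulls OCR text for one digitized item from BHL’s public API. The operation name, parameters, and response fields are assumptions based on BHL’s API v3 and may differ; a free API key is required.

```python
import requests

# Pull OCR text for one digitized item from the BHL API.
# NOTE: the operation, parameter names, and response fields below are
# assumptions based on BHL's public API v3 and may differ.
API_URL = "https://www.biodiversitylibrary.org/api3"

params = {
    "op": "GetItemMetadata",
    "id": "12345",          # hypothetical item ID
    "pages": "t",
    "ocr": "t",
    "format": "json",
    "apikey": "YOUR_API_KEY",
}

resp = requests.get(API_URL, params=params, timeout=60)
resp.raise_for_status()
item = resp.json()["Result"][0]
for page in item.get("Pages", []):
    print(page.get("OcrText", "")[:200])
```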