To get at this dark data, Lead Developer Mike Lichtenberg and BHL’s Transcription Upload Tool Working Group have been testing the accuracy of three cutting-edge HTR engines that can handle a multitude of pre-modern typescripts and human handwriting styles:
Google Cloud Vision.
Text engines have come a long way in just the last few years. HTR engines now use machine learning techniques that finally allow BHL to extract text from handwritten documents, incunabula, and early black letter typescripts like Fraktur.
Incunabula, Fraktur, and handwriting present additional challenges that must be overcome if BHL is to offer full-text search across its entire corpus. Image: Dearborn, 2023
A sticking point for BHL is that these cutting-edge services come at a cost, and it remains to be determined whether their outputs are “good enough” to justify that cost for the Consortium.
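One common way to put a number on whether an engine's output is good enough is character error rate (CER): the edit distance between the engine's text and a hand-checked transcription, divided by the length of that transcription. The sketch below is only an illustration of that metric. This post does not say which measure the Working Group uses, and the function names, engine labels, and sample strings are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # Iterate over the longer string so the stored rows stay short.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the trusted reference text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical data: a hand-checked line and two made-up engine outputs.
    reference = "Collected near the river bank"
    outputs = {
        "engine_a": "Collected near the river bank",
        "engine_b": "Colleded near the rivor bauk",
    }
    for name, text in outputs.items():
        print(f"{name}: CER = {character_error_rate(reference, text):.3f}")
```

A lower CER means fewer corrections per character of text, which gives one way to weigh an engine's price against the manual cleanup its output would still need.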
While these experiments are in progress, BHL partner institutions can use Google Cloud Vision free of charge by way of Wikisource. Wikisource, which has deployed the Wikimedia OCR extension, offers GLAM communities both Tesseract OCR and Google Cloud Vision HTR. For institutions that would like to speed up the transcription of digitized archival materials in their collections, or simply improve the OCR for published works set in older typescripts, Wikisource provides a ready option.
A side-by-side comparison of outputs from various text engines in Wikisource. HTR isn’t perfect by any means, but it goes a long way toward delivering better data. The results will only improve as HTR engines advance. Image: Wikisource & Dearborn, 2023