Computational companionship for large-scale digitization efforts
The Digital Age has brought with it large-scale digitization of historical records. Scholars of history and other disciplines now often face hundreds of thousands of readily available and potentially relevant documents, complete or fragmentary, but lack the computer aids needed to find the sought-after needles in the proverbial haystack of online images. The problems are even more acute when documents are handwritten, since optical character recognition does not produce searchable text of adequate quality for such material.
In this talk I will describe our experience working with digitized images of the Cairo Genizah manuscripts, which are spread across more than seventy collections worldwide. Using computer-vision and machine-learning algorithms, we have been able to classify Genizah manuscripts automatically by script style and thus identify tens of thousands of new “joins,” that is, matches between leaves in the same hand that were originally part of the same manuscript. These findings have opened numerous new directions for humanities research on this collection and are revolutionizing Genizah studies, even though these documents and fragments were already studied extensively throughout the twentieth century.
The tools we developed are applicable to other collections as well, and I will also present results obtained with images of the Dead Sea Scrolls and of Tibetan Buddhist texts and manuscripts.