Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin

Introduction by OpenMethods Editor (Joris van Zundert):

The following will not be easy reading for many a philologist and classicist, but it is well worth trying to bridge the distance between paleography and natural language processing (NLP). Thibault Clérice explains how text without spaces – typically found in late classical and early medieval documents – can be parsed into separate words when turning source material into computer-processable information.

Tokenization of modern and old Western European languages is fairly straightforward, as it mostly relies on markers such as spaces and punctuation. However, when dealing with older sources such as manuscripts written in scripta continua, epigraphic texts from antiquity, or medieval manuscripts, (1) such markers are mostly absent, and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification of each character as either a word boundary or part of a word-internal sequence, proves effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
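
To make the approach concrete, here is a minimal sketch of the idea in PyTorch: embed each character, run a 1-D convolution over the sequence, and classify every position as a word boundary or as word-internal. This is not the Boudams implementation itself; the class name, hyperparameters, and toy vocabulary are assumptions made for the sake of illustration.

```python
# Minimal sketch (NOT the Boudams code): character embeddings -> 1-D convolution
# -> per-character linear classification into "word boundary" vs "inside a word".
import torch
import torch.nn as nn

class CharSegmenter(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden: int = 128, kernel: int = 5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Convolution over the character sequence; padding keeps the length unchanged.
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=kernel, padding=kernel // 2)
        # Linear layer maps each position to 2 classes: in-word (0) vs. boundary (1).
        self.classify = nn.Linear(hidden, 2)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(char_ids)        # (batch, seq, emb_dim)
        x = x.transpose(1, 2)           # (batch, emb_dim, seq), as Conv1d expects
        x = torch.relu(self.conv(x))    # (batch, hidden, seq)
        x = x.transpose(1, 2)           # (batch, seq, hidden)
        return self.classify(x)         # (batch, seq, 2): per-character logits

# Toy usage: segmenting "ladame" into "la dame" means predicting class 1 after "a".
vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz", start=1)}
model = CharSegmenter(vocab_size=len(vocab) + 1)
ids = torch.tensor([[vocab[c] for c in "ladame"]])
logits = model(ids)                     # model is untrained, so output is random
boundaries = logits.argmax(dim=-1)      # 1 marks positions where a space is inserted
print(boundaries)
```

Trained on a corpus where spaces have been removed (with the original spacing kept as labels), such a model can generate its own training data from any spaced text, which is essentially what the released tooling offers.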

Source: Thibault Clérice, "Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin", Journal of Data Mining & Digital Humanities, #6264, https://jdmdh.episciences.org/6264/pdf

Author: Joris van Zundert

Drs. Joris J. van Zundert (1972) is a senior researcher and developer in humanities computing. He holds a research position in the department of literary studies at the Huygens Institute for the History of The Netherlands, a research institute of the Royal Netherlands Academy of Arts and Sciences (KNAW). His main interest as a researcher and developer is in the possibilities of computational algorithms for the analysis of literary and historic texts, and in the nature and properties of humanities information and data modeling. His current research focuses on the interaction between computer science and the humanities, and on the tensions between hermeneutics and 'big data' approaches.