Introduction by OpenMethods Editor (Joris van Zundert):
The following will not be easy reading for many a philologist and classicist, but it is well worth trying to bridge the distance between paleography and natural language processing (NLP). Thibault Clérice explains how text written without spaces, typically found in late classical and early medieval documents, can be split into separate words when source material is turned into computer-processable information.
Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, epigraphy from antiquity, or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear categorization of each position as word-boundary or in-word-sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
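
To make the word-boundary versus in-word-sequence framing concrete, here is a minimal Python sketch of the labelling scheme only. It is not the released software and contains no neural network: the function names `make_training_pair` and `apply_labels`, the label values, and the choice to mark the last character of each word as the boundary are all assumptions made for illustration. It shows how a spaced text could be turned into a training pair (unspaced characters plus one label per character) and how per-character predictions could be turned back into tokens.

```python
from typing import List, Tuple

# Hypothetical label values; the released software may encode them differently.
IN_WORD = 0   # character is followed by more characters of the same word
BOUNDARY = 1  # character is the last character of a word (assumed convention)


def make_training_pair(spaced: str) -> Tuple[str, List[int]]:
    """Turn a spaced text into (unspaced text, one label per character)."""
    chars: List[str] = []
    labels: List[int] = []
    for word in spaced.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(BOUNDARY if i == len(word) - 1 else IN_WORD)
    return "".join(chars), labels


def apply_labels(unspaced: str, labels: List[int]) -> List[str]:
    """Rebuild tokens from per-character labels, e.g. a classifier's output."""
    if len(unspaced) != len(labels):
        raise ValueError("expected exactly one label per character")
    tokens: List[str] = []
    current: List[str] = []
    for ch, label in zip(unspaced, labels):
        current.append(ch)
        if label == BOUNDARY:
            tokens.append("".join(current))
            current = []
    if current:
        # Keep trailing characters even if no final boundary was predicted.
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    text, labels = make_training_pair("arma virumque cano")
    print(text)
    print(labels)
    print(apply_labels(text, labels))
```

In a setup like the one described, a character-level model would be trained to predict `labels` from `text`, and a function like `apply_labels` would be applied to its predictions on unseen scripta continua.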