Natural Language Processing techniques applied to historical languages have been attracting increasing interest in the academic community. Many online resources, such as annotated corpora, are already available, as are various methodologies and tools. However, digital philologists working with these languages need a dedicated pipeline: a sequence of working steps and applications through which an effective text analysis can be carried out.
This need is addressed by the Classical Language Toolkit (CLTK – http://cltk.org/), as illustrated by Patrick J. Burns in his contribution “Building a Text Analysis Pipeline for Classical Languages” (https://www.degruyter.com/downloadpdf/books/9783110599572/9783110599572-010/9783110599572-010.pdf).
The article first introduces the concept of a pipeline. It then describes the main pipeline frameworks currently in place, from the most general, like GATE (General Architecture for Text Engineering), to those most oriented to digital philology, like Stanford CoreNLP and the Natural Language Toolkit (NLTK). It also provides a comprehensive overview of the tools, from lemmatizers to part-of-speech taggers, that digital classicists can already use for their purposes. Finally, it introduces the features of the Python-based Classical Language Toolkit (CLTK), which aims to develop a pipeline specifically targeted at historical languages.
“CLTK has made progress in recent years collecting corpora for a wide variety of historical languages covering ancient, classical, and medieval Eurasia and building out the basic resources to support these languages across the entire text analysis pipeline”.
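The pipeline idea discussed in the article, a chain of processing steps in which each step builds on the output of the previous one, can be illustrated with a minimal Python sketch. This is not CLTK's API: the stage functions, the `run_pipeline` helper, and the tiny Latin lexicon `TOY_LEMMAS` are all hypothetical names introduced here for illustration, and a real lemmatizer would rely on far richer resources than a lookup table.

```python
import re
from typing import Any, Callable, Dict, List

# A document travels through the pipeline as a plain dictionary;
# each stage reads what earlier stages produced and adds its own result.
Document = Dict[str, Any]
Stage = Callable[[Document], Document]

# Hypothetical toy lexicon (illustration only): Latin word form -> lemma.
TOY_LEMMAS = {"est": "sum", "divisa": "divido", "partes": "pars"}


def normalize(doc: Document) -> Document:
    """Lowercase the raw text so later lookups are case-insensitive."""
    doc["normalized"] = doc["text"].lower()
    return doc


def tokenize(doc: Document) -> Document:
    """Split the normalized text into word tokens, dropping punctuation."""
    doc["tokens"] = re.findall(r"\w+", doc["normalized"])
    return doc


def lemmatize(doc: Document) -> Document:
    """Look each token up in the toy lexicon; unknown forms are kept as-is."""
    doc["lemmas"] = [TOY_LEMMAS.get(token, token) for token in doc["tokens"]]
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Document:
    """Apply the stages in order, passing the document from one to the next."""
    doc: Document = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "Gallia est omnis divisa in partes tres.",
    [normalize, tokenize, lemmatize],
)
print(result["lemmas"])
# ['gallia', 'sum', 'omnis', 'divido', 'in', 'pars', 'tres']
```

The order of the stages matters: the lemmatizer assumes tokens already exist, which is why a toolkit that covers every step for a given language is more convenient than a set of unrelated tools.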
Patrick J. Burns, “Building a Text Analysis Pipeline for Classical Languages”: https://www.degruyter.com/downloadpdf/books/9783110599572/9783110599572-010/9783110599572-010.pdf
The Classical Language Toolkit: http://cltk.org/