Automatic annotation of incomplete and scattered bibliographical references in Digital Humanities papers

The reviewed article presents the BILBO project and illustrates how several machine-learning techniques can be applied to build suitable reference corpora and efficient annotation models. It thereby proposes solutions to the problem of extracting and processing useful information from bibliographic references in digital documents, whatever bibliographic style they follow. The project is supported by a Google Grant for Digital Humanities (2011).

This project is part of the research on bibliographic references, which focuses on providing automatic links between related references in citations of academic articles, i.e. cross-links. Among the existing approaches and methods, Conditional Random Fields (henceforth CRFs) are employed here for labelling and extracting fields from research papers, since previous studies have shown this method to perform better and most publicly accessible on-line services are based on it. The project defines effective labels and features for CRF learning in order to obtain a discriminative probabilistic model able to automatically label the reference fields of the 38 sample articles selected from Revues.org (part of the OpenEdition on-line platform), the oldest French online academic platform, which offers more than 300 journals in the humanities and social sciences.
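To make the labelling step more concrete, the following minimal sketch shows how a linear-chain model of the CRF kind picks the best label sequence for a tokenised reference, using Viterbi decoding. It is only an illustration: the weights are set by hand rather than learned, the "date" label and the names emit, TRANS and viterbi are assumptions made for this example, and nothing here reproduces the model described in the article.

```python
# Toy linear-chain decoder. All scores are hand-set assumptions, not learned CRF weights.
LABELS = ["surname", "forename", "title", "date"]

# Hypothetical transition scores: reward label pairs that are common in references.
TRANS = {
    ("surname", "forename"): 1.0,
    ("forename", "title"): 1.0,
    ("title", "title"): 0.5,
    ("title", "date"): 1.0,
}


def emit(token, label):
    """Hypothetical per-token score of assigning `label` to `token`."""
    score = 0.0
    if token.isdigit() and len(token) == 4:
        score += 3.0 if label == "date" else -1.0
    if token[:1].isupper() and label in ("surname", "forename"):
        score += 1.0
    if token[:1].islower() and label == "title":
        score += 1.0
    return score


def viterbi(tokens, labels, emit, trans):
    """Return the label sequence with the highest total additive score."""
    if not tokens:
        return []
    # best[i][y]: best score of any labelling of tokens[:i + 1] that ends in y
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans.get((p, y), 0.0))
            scores[y] = best[-1][prev] + trans.get((prev, y), 0.0) + emit(tok, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tokens = ["Kim", "Young", "automatic", "annotation", "2012"]
print(list(zip(tokens, viterbi(tokens, LABELS, emit, TRANS))))
```

In a trained CRF the emission and transition scores would be weighted sums of features estimated from an annotated corpus; only the decoding step is sketched here.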

One crucial aspect of this approach is that, unlike previous applications of the technique, it aims to apply CRFs to Digital Humanities bibliographic data. Such data are not restricted to well-structured references in a simple format, such as the bibliography at the end of a scientific article. They also include much less structured bibliographical parts in a variety of formats, including implicit references integrated in the body of the text and less formulaic references contained in footnotes. For the latter, another machine-learning technique, the Support Vector Machine (henceforth SVM), is used to select those footnotes that contain relevant bibliographical information.
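The footnote selection step can be pictured as a binary classification task. The sketch below is a hedged illustration only: it trains a small linear classifier with hinge loss (the loss used by linear SVMs) in plain Python, standing in for a real SVM library, on four invented footnotes. The features (presence of a year, presence of a page marker, share of capitalised tokens) and all function names are assumptions chosen for the example, not the features used in the article.

```python
import re


def footnote_features(note):
    """Map a footnote to a small numeric vector (hypothetical features)."""
    tokens = note.split()
    n = max(len(tokens), 1)
    return [
        1.0 if re.search(r"\b(1[5-9]|20)\d{2}\b", note) else 0.0,  # contains a year
        1.0 if re.search(r"\bp{1,2}\.\s*\d", note) else 0.0,       # "p. 45" or "pp. 3"
        sum(t[:1].isupper() for t in tokens) / n,                  # share of capitalised tokens
        1.0,                                                       # bias term
    ]


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularised hinge loss (labels are +1 / -1)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            for j in range(len(w)):
                grad = lam * w[j] - (label * x[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
    return w


def predict(w, x):
    """Return +1 for a bibliographical footnote, -1 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Invented toy footnotes: +1 contains a reference, -1 does not.
notes = [
    ("See Dupont, Histoire de Paris, Paris, Seuil, 1998, p. 45.", 1),
    ("Martin J., Les villes, 2005.", 1),
    ("This point is discussed further in the next section.", -1),
    ("The author thanks the reviewers for their comments.", -1),
]
X = [footnote_features(text) for text, _ in notes]
y = [label for _, label in notes]
weights = train_linear_svm(X, y)

print(predict(weights, footnote_features("Cf. Bernard, Le temps, 2010, pp. 3-9.")))
```

Footnotes classified as bibliographical would then be passed on to the CRF labeller, while the others would be left out of the reference corpus.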

In short, this project, whose progress is illustrated by a series of experiments, demonstrates the usefulness and high accuracy of CRF techniques. Applying them involves finding the most effective set of features (of three types: input, local and global features) for a given corpus of well-structured bibliographical data annotated with labels such as surname, forename or title. Moreover, the project not only proves efficient on such traditional, well-structured bibliographical data sets, but also makes an original contribution to the processing of more complicated, less structured references, such as those contained in footnotes, by applying SVM with new features for sequence classification.
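As a rough illustration of what the three feature types might look like for one reference, the sketch below builds a feature dictionary per token. The grouping used here is an assumption: the token itself is treated as the input feature, surface properties of the token as local features, and properties of the whole reference as global features. The feature names and the functions tokenize and token_features are invented for this example and are not taken from the article.

```python
import re

YEAR = re.compile(r"(1[5-9]|20)\d{2}")


def tokenize(reference):
    """Split a reference string into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", reference)


def token_features(tokens, i):
    """Return hypothetical input, local and global features for tokens[i]."""
    tok = tokens[i]
    if i == 0:
        position = "start"
    elif i == len(tokens) - 1:
        position = "end"
    else:
        position = "middle"
    return {
        # input feature: the (lowercased) token itself
        "token": tok.lower(),
        # local features: surface properties of this token only
        "capitalized": tok[:1].isupper(),
        "all_digits": tok.isdigit(),
        "year_like": bool(YEAR.fullmatch(tok)),
        "punctuation": bool(re.fullmatch(r"[^\w\s]", tok)),
        # global features: properties that depend on the whole reference
        "position": position,
        "ref_has_year": any(YEAR.fullmatch(t) for t in tokens),
    }


tokens = tokenize("Kim Y., Automatic annotation, 2012.")
features = [token_features(tokens, i) for i in range(len(tokens))]
for tok, feats in zip(tokens, features):
    print(tok, feats)
```

Feature dictionaries of this kind are a common input format for CRF toolkits, which then learn one weight per feature and label pair.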

Source: Kim Young-Min, Bellot Patrice, Faath Elodie, Dacos Marin, “Automatic annotation of incomplete and scattered bibliographical references in Digital Humanities papers”, in CORIA 2012, p. 329-340. Available online: http://www.asso-aria.org/coria/2012/329.pdf