GROBID: when data extraction becomes a suite

Introduction: GROBID is a well-known open source tool in the field of Digital Humanities, originally built to extract and parse bibliographical metadata from scholarly works. The acronym stands for GeneRation Of BIbliographic Data.
Shaped by use cases and adaptations to a range of DH and non-DH settings, the tool has progressively evolved into a suite of technical features that is now applied to various kinds of material, such as journals, dictionaries and archives.

When history meets technology. impresso: an innovative corpus-oriented perspective.

Historical newspapers, already available in many digitized collections, may represent a significant source of information for the reconstruction of events and backgrounds, enabling historians to cast new light on facts and phenomena, as well as to advance new interpretations. With teams based in Lausanne, at the University of Zurich and at C2DH Luxembourg, the ‘impresso – Media Monitoring of the Past’ project aims to offer an advanced corpus-oriented answer to the growing need to access and consult collections of digitized historical newspapers.
[…] Thanks to a suite of computational tools for data extraction, linking and exploration, impresso aims to move beyond the traditional keyword-based approach by applying advanced techniques, from lexical processing to semantically deepened n-grams, from data modelling to interoperability.

Research COVID-19 with AVOBMAT

Introduction: Our guidelines for nominating content explicitly exclude databases. This database is an exception, not because of the burning issue of COVID-19, but because of the exemplary variety of digital humanities methods with which its data can be processed. AVOBMAT makes it possible to process 51,000 articles with almost every conceivable approach (Topic Modeling, Network Analysis, N-gram viewer, KWIC analyses, gender analyses, lexical diversity metrics, and so on). It is thus much more than a simple database: it is a welcome stage for the Who's Who (or What's What?) of OpenMethods.

Automatic annotation of incomplete and scattered bibliographical references in Digital Humanities papers

The reviewed article presents the BILBO project and illustrates how several machine-learning techniques can be applied to build reference corpora and to construct efficient annotation models. It thereby proposes solutions to the problem of extracting and processing useful information from bibliographic references in digital documents, regardless of their bibliographic style. The article demonstrates the usefulness and high accuracy of CRF techniques, which depend on finding the most effective set of features (of three types: input, local and global) for a given corpus of well-structured bibliographical data (with labels such as surname, forename or title). Moreover, the approach is not only efficient on such traditional, well-structured bibliographical data sets. It also makes an original contribution to the processing of more complicated, less-structured references, such as those found in footnotes, by applying SVM with new features for sequence classification.
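To make the idea of per-token features for a CRF-style sequence labeller more concrete, here is a minimal sketch in Python. It is not BILBO's actual feature set: the function names (`token_features`, `reference_features`) and every feature key are assumptions chosen for illustration, and the example reference is invented. The resulting list of dictionaries is the kind of input a CRF toolkit would typically consume, paired with labels such as surname, forename or title.

```python
def token_features(tokens, i):
    """Return an illustrative feature dict for the token at position i.

    All feature names are hypothetical; they are not taken from BILBO.
    """
    tok = tokens[i]
    feats = {
        # Surface form of the token itself
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_digits": tok.isdigit(),
        "looks_like_year": len(tok) == 4 and tok.isdigit(),
        "has_punct": any(not ch.isalnum() for ch in tok),
        "is_initial": len(tok) == 2 and tok[0].isupper() and tok[1] == ".",
        # Position of the token within the whole reference (0.0 to 1.0)
        "rel_position": round(i / max(len(tokens) - 1, 1), 2),
    }
    # Neighbouring tokens, with boundary markers at the edges
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def reference_features(tokens):
    """Build one feature dict per token of a tokenized reference."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Invented, pre-tokenized reference used only as sample input
example_tokens = ["Smith", ",", "J.", "2010", "."]
example_features = reference_features(example_tokens)
```

Each feature dictionary would be aligned with one gold label during training, so the model can learn, for instance, that a four-digit token near the end of a reference is likely a date.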

Web Scraping with Python for Beginners | The Digital Orientalist

Introduction: In this blog post, James Harry Morris introduces the method of web scraping. Step by step, starting with the installation of the packages, readers are shown how to extract relevant data from websites using only the Python programming language and how to convert it into a plain text file. Each step is presented transparently and comprehensibly, so that the article is a prime example of OpenMethods and gives readers the equipment they need to work with amounts of data too large to handle manually.
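The core of the workflow, turning a page's HTML into plain text, can be sketched as follows. This is not the code from the blog post: it swaps in Python's standard-library `html.parser` so that it runs without installing anything, and the names `TextExtractor`, `html_to_text`, `save_text` and `SAMPLE_HTML` are assumptions made for illustration. The sample page is inline so that the sketch does not depend on a network connection.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def save_text(text, path):
    """Write the extracted text to a plain text file (UTF-8)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Invented sample page standing in for a downloaded website
SAMPLE_HTML = """
<html><head><title>Sample page</title><style>p { color: red; }</style></head>
<body><h1>Heading</h1><p>First paragraph.</p><script>var x = 1;</script></body></html>
"""

extracted = html_to_text(SAMPLE_HTML)
```

For a live page, the HTML string could be obtained with `urllib.request.urlopen(url).read().decode("utf-8")` before passing it to `html_to_text`, and `save_text(extracted, "output.txt")` would then produce the plain text file.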

A World of Possibilities: a corpus-based approach to the diachrony of modality in Latin

Introduction: Hosted at the University of Lausanne, “A world of possibilities. Modal pathways over an extra-long period of time: the diachrony in the Latin language” (WoPoss) is a project under development exploiting a corpus-based approach to the study and reconstruction of the diachrony of modality in Latin.
Following specific annotation guidelines applied to a varied set of texts spanning the period from the 3rd century BCE to the 7th century CE, the team led by Francesca Dell’Oro aims to analyze the patterns of modality in the Latin language through a close consideration of lexical markers.