GROBID: when data extraction becomes a suite

GROBID: when data extraction becomes a suite

Introduction: GROBID is an already well-known open source tool in the field of Digital Humanities, originally built to extract and parse bibliographical metadata from scholarly works. The acronym stands for GeneRation Of BIbliographic Data.
Shaped by use cases and adoptions to a range of different DH and non-DH settings, the tool has been progressively evolved into a suite of technical features currently applied to various fields, like that of journals, dictionaries and archives.
[Click ‘Read more’ for the full post!]

When history meets technology. impresso: an innovative corpus-oriented perspective.

When history meets technology. impresso: an innovative corpus-oriented perspective.

Historical newspapers, already available in many digitized collections, may represent a significant source of information for the reconstruction of events and backgrounds, enabling historians to cast new light on facts and phenomena, as well as to advance new interpretations. Lausanne, University of Zurich and C2DH Luxembourg, the ‘impresso – Media Monitoring of the Past’ project wishes to offer an advanced corpus-oriented answer to the increasing need of accessing and consulting collections of historical digitized newspapers.
[…] Thanks to a suite of computational tools for data extraction, linking and exploration, impresso aims at overcoming the traditional keyword-based approach by means of the application of advanced techniques, from lexical processing to semantically deepened n-grams, from data modelling to interoperability.
[Click ‘Read more’ for the full post!]

Research COVID-19 with AVOBMAT

Research COVID-19 with AVOBMAT

Introduction: In our guidelines for nominating content, databases are explicitly excluded. However, this database is an exception, which is not due to the burning issue of COVID-19, but to its exemplary variety of digital humanities methods with which the data can be processed.AVOBMAT makes it possible to process 51,000 articles with almost every conceivable approach (Topic Modeling, Network Analysis, N-gram viewer, KWIC analyses, gender analyses, lexical diversity metrics, and so on) and is thus much more than just a simple database – rather, it is a welcome stage for the Who is Who (or What is What?) of OpenMethods.

Mining ethnicity: Discourse-driven topic modelling of immigrant discourses in the USA, 1898–1920

Mining ethnicity: Discourse-driven topic modelling of immigrant discourses in the USA, 1898–1920

Introduction: The article illustrates the application of a ‘discourse-driven topic modeling’ (DDTM) to the analysis of the corpus ChronicItaly comprising several newspapers in Italian language, appeared in the USA during the time of massive migration towards America between the end of the XIX century and the first two decades of the XX (1898-1920).

The method combines both Text Modelling (™) and the discourse-historical approach (DHA) in order to get a more comprehensive representation of the ethnocultural and linguistic identity of the Italian group of migrants in the historical American context in crucial periods of time like that immediately preceding the eruption and that of the unfolding of World War I.

Pipelines for languages: not only Latin! The Italian NLP Tool (Tint)

Pipelines for languages: not only Latin! The Italian NLP Tool (Tint)

The StandforCore NLP wishes to represent a complete Java-based set of tools for various aspects of language analysis, from annotation to dependency parsing, from lemmatization
to coreference resolution. It thus provides a range of tools which
can be potentially applied to other languages apart from English.

Among the languages to which the StandfordCore NLP is mainly applied there is Italian, for which the Tint pipeline has been developed as described in the paper “Italy goes to Stanford: a collection of CoreNLP modules for Italian” by Alessio Palmero Apostolo and Giovanni Moretti.

On the Tint webpage the whole pipeline can be found and downloaded: it comprises tokenization and sentence splitting, morphological analysis and lemmatization, part-of-speech tagging, named-entity recognition and dependency parsing, including wrappers under construction. [Click ‘Read more’ for the whole post.]

Narrelations — Visualizing Narrative Levels and their Correlations with Temporal Phenomena

Narrelations — Visualizing Narrative Levels and their Correlations with Temporal Phenomena

Introduction: Introduction by OpenMethods Editor (Christopher Nunn): Information visualizations are helpful in detecting patterns in large amounts of text and are often used to illustrate complex relationships. Not only can they show descriptive phenomena that could be revealed in other ways, albeit slower and more laborious, but they can also heuristically generate new knowledge. The authors of this article did just that. The focus here is, fortunately, on narratological approaches that have so far hardly been combined with digital text analyzes, but which are ideally suited for them. To eight German novellas a variety of interactive visualizations were created, all of which show: The combination of digital methods with narratological interest can provide great returns to Literary Studies work. After reading this article, it pays to think ahead in this field.

Analyzing Documents with TF-IDF | Programming Historian

Analyzing Documents with TF-IDF | Programming Historian

Introduction: The indispensable Programming Historian comes with an introduction to Term Frequency – Inverse Document Frequency (tf-idf) provided by Matthew J. Lavin. The procedure, concerned with specificity of terms in a document, has its origins in information retrieval, but can be applied as an exploratory tool, finding textual similarity, or as a pre-processing tool for machine learning. It is therefore not only useful for textual scholars, but also for historians working with large collections of text.