Linked Data from TEI (LIFT): A Teaching Tool for TEI to Linked Data Transformation

TEI is among the most widely used standards by which scholarly editors produce digital editions across literary fields. LIFT is a Python-based tool that programmatically extracts information from digital texts annotated in TEI, modelling the persons, places, events and relations marked up in them as a Knowledge Graph that reuses ontologies and controlled vocabularies from the Digital Humanities domain.
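To give a flavour of what such a transformation involves, here is a minimal sketch of the TEI-to-knowledge-graph idea (not LIFT's actual code): it pulls persName elements out of a TEI file with lxml and serializes them as RDF with rdflib. The file name, base URI and the FOAF vocabulary choice are illustrative assumptions.

    from lxml import etree
    from rdflib import Graph, Literal, Namespace, RDF

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    BASE = Namespace("http://example.org/entity/")   # hypothetical base URI

    g = Graph()
    g.bind("foaf", FOAF)

    tree = etree.parse("edition.xml")                # hypothetical TEI file
    for i, pers in enumerate(tree.iterfind(".//tei:persName", TEI_NS)):
        person = BASE[f"person{i}"]
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(" ".join(pers.itertext()).strip())))

    print(g.serialize(format="turtle"))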

“Creating specialized corpora from digitized historical newspaper archives: An iterative bootstrapping approach”

Every scholar in the digital humanities and/or social sciences has probably already faced the challenge of consulting large digital newspaper archives in order to extract detailed information about a topic. Computational methods and tools currently available can undoubtedly make a great contribution; applying them, however, can pose several difficulties, especially when dealing with large collections of items.
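As a generic illustration of the technique named in the title (a sketch under simplifying assumptions, not the authors' exact pipeline), the function below bootstraps a specialized corpus from seed keywords, promoting frequent new terms into the query on each round:

    from collections import Counter

    def bootstrap_corpus(articles, seeds, rounds=3, new_terms_per_round=5):
        """articles: iterable of raw text strings; seeds: initial keywords."""
        query, corpus = set(seeds), set()
        for _ in range(rounds):
            # Retrieve articles matching any current query term.
            hits = {a for a in articles if any(t in a.lower() for t in query)}
            if hits <= corpus:                 # nothing new retrieved: converged
                break
            corpus |= hits
            # Promote frequent unseen words to the query. A real pipeline
            # would filter stop words or rank candidates by tf-idf instead.
            counts = Counter(w for a in corpus for w in a.lower().split())
            fresh = [w for w, _ in counts.most_common(50) if w not in query]
            query |= set(fresh[:new_terms_per_round])
        return corpus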

The First of May in German Literature

Introduction by OpenMethods Editor (Erzsébet Tóth-Czifra): Research on date extraction from literature brings us closer to answering the big question of “when literature takes place”. As Frank Fischer’s blog post, First of May in German literature, shows, beyond mere quantification this line of research also yields insights into the cultural significance of certain dates. In this case, the significance of the 1st of May in German literature (as reflected in the “Corpus of German-Language Fiction” dataset) was determined with the help of a freely accessible dataset and the open access tool HeidelTime. The brief description of the workflow is a smart demonstration of the potential of open DH methods and of sharing data in sustainable ways.

Bonus one: the post starts out by briefly touching upon some of Frank’s public humanities activities.

Bonus two: a mention of the Tiwoli (“Today in World Literature”) app, a fun side product built on top of the date extraction research.
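For readers curious about the mechanics, here is a minimal sketch of the counting step, assuming HeidelTime has already tagged the corpus with standard TimeML TIMEX3 annotations (the file name is a placeholder):

    import re
    from collections import Counter

    # TIMEX3 tags with fully specified dates, e.g. value="1910-05-01"
    TIMEX_DATE = re.compile(r'<TIMEX3[^>]*\bvalue="\d{4}-(\d{2})-(\d{2})"')

    day_counts = Counter()
    with open("corpus_tagged.xml", encoding="utf-8") as fh:
        for month, day in TIMEX_DATE.findall(fh.read()):
            day_counts[f"{month}-{day}"] += 1          # pool all years

    print(day_counts["05-01"])           # how often the First of May occurs
    print(day_counts.most_common(10))    # the most-mentioned days overall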

OpenMethods Spotlights #3 Keeping a smart diary of research processes with NeMO and the Scholarly Ontology

In this episode, we look behind the scenes of two ontologies, NeMO and the Scholarly Ontology (SO), with Panos Constantopoulos and Vayianos Pertsas, who tell us the story behind these ontologies and explain how they can be used to ease or upcycle your daily work as a researcher. We discuss the value of knowledge graphs, how NeMO and SO connect with the emerging DH ontology landscape and beyond, why Open Access is a precondition for populating them, the Greek DH landscape… and much more!

GROBID: when data extraction becomes a suite

Introduction: GROBID is an already well-known open-source tool in the field of Digital Humanities, originally built to extract and parse bibliographic metadata from scholarly works. The acronym stands for GeneRation Of BIbliographic Data.
Shaped by use cases and adoption across a range of DH and non-DH settings, the tool has progressively evolved into a suite of technical features that is currently applied in various fields, such as journals, dictionaries and archives.
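To illustrate how the suite is typically consumed, the following sketch posts a PDF to a locally running GROBID service over its REST API; it assumes the default port 8070 and the processHeaderDocument endpoint, so check your GROBID version's documentation for the exact routes:

    import requests

    GROBID_URL = "http://localhost:8070/api/processHeaderDocument"

    with open("paper.pdf", "rb") as pdf:              # placeholder input PDF
        response = requests.post(GROBID_URL, files={"input": pdf}, timeout=60)

    response.raise_for_status()
    print(response.text)   # TEI XML: title, authors, affiliations, abstract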

Web Scraping with Python for Beginners | The Digital Orientalist

Introduction: In this blog post, James Harry Morris introduces the method of web scraping. Step by step, from the installation of the required packages onwards, he explains how readers can extract relevant data from websites using only the Python programming language and convert it into a plain text file. Each step is presented transparently and comprehensibly, making this article a prime example of what OpenMethods stands for: it gives readers the equipment they need to work with amounts of data that could no longer be handled manually.
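The gist of such a workflow fits in a few lines; the sketch below assumes the requests and beautifulsoup4 packages (the post walks through its own package choices in detail) and uses a placeholder URL:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.org/some-article"          # placeholder target page
    html = requests.get(url, timeout=30).text

    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)  # drop all markup

    with open("article.txt", "w", encoding="utf-8") as out:
        out.write(text)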

Analyzing Documents with TF-IDF | Programming Historian

Introduction: The indispensable Programming Historian offers an introduction to Term Frequency – Inverse Document Frequency (tf-idf) by Matthew J. Lavin. The procedure, which measures how specific a term is to a document, has its origins in information retrieval, but it can also serve as an exploratory tool, as a measure of textual similarity, or as a pre-processing step for machine learning. It is therefore useful not only for textual scholars but also for historians working with large collections of text.
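For a quick impression of the procedure, here is a minimal sketch using scikit-learn's TfidfVectorizer (the lesson details its own setup; the toy corpus below is illustrative). Terms that are frequent in one document but rare across the collection receive the highest weights:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "the trial began in the morning",
        "the verdict of the trial shocked the press",
        "morning fog covered the harbour",
    ]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(docs)           # documents x terms

    # Highest-weighted terms for the first document:
    terms = vectorizer.get_feature_names_out()
    weights = matrix[0].toarray().ravel()
    for idx in weights.argsort()[::-1][:3]:
        print(terms[idx], round(weights[idx], 3))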

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution – ACL Anthology

Introduction: Studying character n-grams is by now a classical choice in authorship attribution. While there has been some discussion about the optimal length of these n-grams, we still have few clues about which specific types of n-grams are most helpful for efficiently identifying the author of a text. This paper partly fills that gap by showing that most of the information gained from studying character n-grams comes from affixes and punctuation.
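As a generic illustration of the approach (not the paper's experimental setup), the sketch below classifies texts by author using character trigram features; scikit-learn's char_wb analyzer builds n-grams only from text inside word boundaries, which emphasizes exactly the affix-like n-grams the paper highlights:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [                                   # toy placeholder samples
        "It was a truth universally acknowledged, she thought, once more.",
        "Call me what you will; the sea had always called me louder still.",
    ]
    train_labels = ["author_A", "author_B"]

    model = make_pipeline(
        # 'char_wb' pads n-grams at word edges with spaces, so prefixes and
        # suffixes are captured as distinct features.
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_texts, train_labels)
    print(model.predict(["The sea called, louder than any truth she knew."]))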