Analyzing Documents with TF-IDF | Programming Historian

Analyzing Documents with TF-IDF | Programming Historian

Introduction: The indispensable Programming Historian comes with an introduction to Term Frequency – Inverse Document Frequency (tf-idf) provided by Matthew J. Lavin. The procedure, concerned with specificity of terms in a document, has its origins in information retrieval, but can be applied as an exploratory tool, finding textual similarity, or as a pre-processing tool for machine learning. It is therefore not only useful for textual scholars, but also for historians working with large collections of text.

Approaching Linked Data

Approaching Linked Data

Introduction: Linked Data and Linked Open Data are gaining an increasing interest and application in many fields. A recent experiment conducted in 2018 at Furman University illustrates and discusses some of the challenges from a pedagogical perspective posed by Linked Open Data applied to research in the historical domain.

“Linked Open Data to navigate the Past: using Peripleo in class” by Chiara Palladino describes the exploitation of the search-engine Peripleo in order to reconstruct the past of four archeologically-relevant cities. Many databases, comprising various types of information, have been consulted, and the results, as highlighted in the contribution by Palladino, show both advantages and limitations of a Linked Open Data-oriented approach to historical investigations.

Do humanists need BERT?

Do humanists need BERT?

Introduction: Ted Underwood tests a new language representation model called “Bidirectional Encoder Representations from Transformers” (BERT) and asks if humanists should use it. Due to its high degree of difficulty and its limited success (e.g. in questions of genre detection) he concludes, that this approach will be important in the future but it’s nothing to deal with for humanists at the moment. An important caveat worth reading.

‘Voyant Tools’

‘Voyant Tools’

Introduction: Digital humanists looking for tools in order to visualize and analyze texts can rely on ‘Voyant Tools’ (https://voyant-tools.org), a software package created by S.Sinclair and G.Rockwell. Online resources are available in order to learn how to use Voyant. In this post, we highlight two of them: “Using Voyant-Tools to Formulate Research Questions for Textual Data” by Filipa Calado (GC Digital Fellows and the tutorial “Investigating texts with Voyant” by Miriam Posner.

Evaluating named entity recognition tools for extracting social networks from novels

Evaluating named entity recognition tools for extracting social networks from novels

Introduction: Named Entity Recognition (NER) is used to identify textual elements that gives things a name. In this study, four different NER tools are evaluated using a corpus of modern and classic fantasy or science fiction novels. Since NER tools have been created for the news domain, it is interesting to see how they perform in a totally different domain. The article comes with a very detailed methodological part and the accompanying dataset is also made available.

The Uncanny Valley and the Ghost in the Machine

The Uncanny Valley and the Ghost in the Machine

Introduction: There is a postulated level of anthropomorphism where people feel uncanny about the appearance of a robot. But what happens if digital facsimiles and online editions become nigh indistinguishable from the real, yet materially remaining so vastly different? How do we ethically provide access to the digital object without creating a blindspot and neglect for the real thing. A question that keeps digital librarian Dot Porter awake and which she ponders in this thoughtful contribution.

From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources

From Hermeneutics to Data to Networks: Data Extraction and Network Visualization of Historical Sources

Introduction: This lesson by Marten Düring from the “Programming Historian-Website” gently introduces novices to the topic to Network Visualisation of Historical Sources. As a case study it covers not only the general advantages of network visualisation for humanists but also a step-by-step explanation of the process from extraction of the data until the visualization (using the Palladio-tool). This lesson has also been translated into Spanish and includes many useful references for further reading.

Towards Scientific Workflows and Computer Simulation as a Method in Digital Humanities – Digitale Bibliothek – Gesellschaft für Informatik e.V.

Introduction: The explore! project tests computer stimulation and text mining on autobiographic texts as well as the reusability of the approach in literary studies. To facilitate the application of the proposed method in broader context and to new research questions, the text analysis is performed by means of scientific workflows that allow for the documentation, automation, and modularization of the processing steps. By enabling the reuse of proven workflows, the goal of the project is to enhance the efficiency of data analysis in similar projects and further advance collaboration between computer scientists and digital humanists.

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution – ACL Anthology

Introduction: Studying n-grams of characters is today a classical choice in authorship attribution. If some discussion about the optimal length of these n-grams have been made, we have still have few clues about which specific type of n-grams are the most helpful in the process of efficiently identifying the author of a text. This paper partly fills that gap, by showing that most of the information gained from studying n-grams of characters comes from the affixes and punctuation.

If These Crawls Could Talk: Studying and Documenting Web Archives Provenance

If These Crawls Could Talk: Studying and Documenting Web Archives Provenance

Introduction: With Web archives becoming an increasingly more important resource for (humanities) researchers, it also becomes paramount to investigate and understand the ways in which such archives are being built and how to make the processes involved transparent. Emily Maemura, Nicholas Worby, Ian Milligan, and Christoph Becker report on the comparison of three use cases and suggest a framework to document Web archive provenance.