Analyzing Documents with TF-IDF | Programming Historian

Introduction: The indispensable Programming Historian offers an introduction to Term Frequency – Inverse Document Frequency (tf-idf), written by Matthew J. Lavin. The procedure, which measures how specific a term is to a document within a collection, has its origins in information retrieval, but it can also be applied as an exploratory tool, as a way of finding textual similarity, or as a pre-processing step for machine learning. It is therefore useful not only for textual scholars, but also for historians working with large collections of text.
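To make the idea of term specificity concrete, here is a minimal sketch in plain Python. It assumes one common weighting (raw term count multiplied by the natural log of the number of documents divided by the number of documents containing the term); this is not necessarily the variant used in the lesson, and libraries often apply smoothed or normalized versions. The function name `tf_idf` and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumption: tf is the raw count and idf is ln(N / df).
    Other weighting schemes exist and give different numbers.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {term: count * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return scores


# Toy collection (hypothetical data).
docs = [
    "the whale hunted the ship",
    "the ship sailed home",
    "the garden bloomed",
]
scores = tf_idf(docs)

# Print the terms of the first document, most specific first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

Under this weighting, a word that occurs in every document (here "the") scores zero however often it appears, while a word confined to a single document receives the highest weight.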

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution – ACL Anthology

Introduction: Studying character n-grams is today a classic choice in authorship attribution. While the optimal length of these n-grams has been discussed, we still have few clues about which specific types of n-grams are most helpful for efficiently identifying the author of a text. This paper partly fills that gap by showing that most of the information gained from character n-grams comes from affixes and punctuation.
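As a rough illustration of what "types" of character n-grams can mean, the sketch below extracts character trigrams and sorts them into coarse groups by their position in a word. The category names (`prefix`, `suffix`, `mid_word`, `whole_word`, `punct`, `space`) and the function `typed_ngrams` are simplifications invented here; they are not the paper's exact taxonomy or method.

```python
import string
from collections import Counter


def typed_ngrams(text, n=3):
    """Count character n-grams by a coarse, position-based category.

    Assumption: these categories are an illustrative simplification.
    """
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if any(ch in string.punctuation for ch in gram):
            counts["punct"] += 1
        elif " " in gram:
            counts["space"] += 1
        else:
            # The n-gram lies wholly inside one word; check its edges.
            at_start = i == 0 or not text[i - 1].isalnum()
            at_end = i + n == len(text) or not text[i + n].isalnum()
            if at_start and at_end:
                counts["whole_word"] += 1
            elif at_start:
                counts["prefix"] += 1
            elif at_end:
                counts["suffix"] += 1
            else:
                counts["mid_word"] += 1
    return counts


counts = typed_ngrams("Walking, talking.")
print(dict(counts))
```

In an attribution experiment, per-category counts like these could be used as separate feature groups, which makes it possible to compare how much each group contributes.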

If These Crawls Could Talk: Studying and Documenting Web Archives Provenance

Introduction: With Web archives becoming an increasingly important resource for (humanities) researchers, it is paramount to investigate and understand how such archives are built and how to make the processes involved transparent. Emily Maemura, Nicholas Worby, Ian Milligan, and Christoph Becker report on a comparison of three use cases and suggest a framework for documenting Web archive provenance.

Transkribus & Magazines: Transkribus’ Transcription & Recognition Platform (TRP) as Social Machine…

Introduction: This article proposes a collaboration between FactMiners and the Transkribus project that would help the Transkribus team develop the “sustainable virtuous” ecosystem they describe as a Transcription & Recognition Platform: a Social Machine for Job Creation & Skill Development in the 21st Century.

Towards Semantic Enrichment of Newspapers: A Historical Ecology Use Case

Introduction: Ecologists benefit greatly from historical sources of information on human-animal interaction. But how does one cope with the plethora of different descriptions for the same animal in the historical record? A Dutch research group reports on how to aggregate ‘Bunzings’, ‘Ullingen’, and ‘Eierdieven’ (‘Egg-thieves’) into a useful historical ecology knowledge base.