“Creating specialized corpora from digitized historical newspaper archives: An iterative bootstrapping approach”
Most scholars in the digital humanities and social sciences have probably faced the challenge of consulting large digital newspaper archives to extract detailed information about a topic. The computational methods and tools currently available can undoubtedly help; applying them, however, poses several difficulties, especially when dealing with large collections of items.
In his contribution, Joshua W. Black presents an innovative approach that addresses the shortcomings of both the traditional keyword-based technique and text mining, especially where relevant material is hard to detect and process because of OCR errors.
Developed on a portion of the ‘Papers Past’ newspaper archive held by the National Library of New Zealand, the method described by Black follows a multi-step path: a preprocessing stage, carried out on a dataset in METS/ALTO XML format, followed by corpus exploration and, finally, a labelling stage. As the article demonstrates, iterating this loop makes it possible to pinpoint and select the items relevant to a given research topic with increasing precision.
“In the case study, three iterations of the methods were sufficient to generate a specialized corpus of philosophical writing in early colonial New Zealand newspapers. After three iterations, the method achieved a balance of both selectiveness and accuracy […]. This project enables both model construction and model criticism” (p. 792).
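The loop of exploring, labelling and repeating can be illustrated with a minimal Python sketch. It is not Black's implementation: a toy log-odds word scorer stands in for whatever model the article uses, and the names `train`, `score`, `bootstrap` and `annotate` (a stand-in for the researcher's manual labelling), as well as the sample texts, are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep alphabetic runs only; a crude way to skip some OCR noise.
    return re.findall(r"[a-z]+", text.lower())


def train(labelled):
    """Per-token log-odds (relevant vs. not) with add-one smoothing."""
    pos, neg = Counter(), Counter()
    for text, relevant in labelled:
        (pos if relevant else neg).update(tokenize(text))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        t: math.log((pos[t] + 1) / pos_total) - math.log((neg[t] + 1) / neg_total)
        for t in vocab
    }


def score(weights, text):
    # Average token weight; unseen tokens contribute nothing.
    tokens = tokenize(text)
    return sum(weights.get(t, 0.0) for t in tokens) / max(len(tokens), 1)


def bootstrap(seed, pool, annotate, rounds=3, batch=1):
    """Retrain, rank the unlabelled pool, label the top items, repeat."""
    labelled, remaining = list(seed), list(pool)
    for _ in range(rounds):
        if not remaining:
            break
        weights = train(labelled)
        remaining.sort(key=lambda t: score(weights, t), reverse=True)
        candidates, remaining = remaining[:batch], remaining[batch:]
        # `annotate` is hypothetical: in practice a person reads each item.
        labelled.extend((text, annotate(text)) for text in candidates)
    return [text for text, relevant in labelled if relevant]


# Invented sample data, for illustration only.
seed = [("lecture on philosophy and ethics", True),
        ("sheep prices at the market", False)]
pool = ["a lecture on moral philosophy",
        "wool and sheep market report",
        "ethics and philosophy of mind",
        "shipping news from the port"]
corpus = bootstrap(seed, pool, annotate=lambda text: "philosophy" in text)
```

Each round retrains on everything labelled so far, so the ranking of the remaining items should sharpen as the labelled set grows. That is the sense in which the procedure bootstraps itself from a small seed.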
The concept of bootstrapping is already well known in statistics, computational linguistics, physics and other fields. Susan Carey highlights it as a “metaphor” (p. 59) that many have applied to the learning process, with particular regard to language and counting. She argues that “in thinking about how bootstrapping might work, we are led to a fuller appreciation of the role of language in supporting the cultural transmission of knowledge” (p. 68).
References
Black, Joshua Wilson. “Creating specialized corpora from digitized historical newspaper archives: An iterative bootstrapping approach.” Digital Scholarship in the Humanities 38, no. 2 (June 2023): 779–797. https://doi.org/10.1093/llc/fqac079
Carey, Susan. “Bootstrapping & the Origin of Concepts.” Daedalus 133, no. 1 (2004): 59–68. http://www.jstor.org/stable/2002789.