Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods – Journal of Digital History

https://openmethods.dariah.eu/2022/01/27/topic-specific-corpus-building-a-step-towards-a-representative-newspaper-corpus-on-the-topic-of-return-migration-using-text-mining-methods-journal-of-digital-history/
OpenMethods introduction to: Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods - Journal of Digital History
2022-01-27 14:04:34
Marinella Testori, Erzsébet Tóth-Czifra
Blog post, Corpus building, Enhanced publications, Executable papers, Journal of Digital History, LDA, NewsEye, Reproducibility, Scholarly communication, WSD

Pushing the boundaries of the scholarly paper to align it better with digital research workflows has been a recurrent topic on OpenMethods. In this post, we highlight a new publication venue for digital historians, the Journal of Digital History, where digital scholarship is presented in three layers: a narrative, a hermeneutic and a data layer. These publications are therefore complex digital scholarly outputs that open a wider window on DH research and enable readers to follow the whole research process and to execute, or even reproduce, certain steps.

We showcase this innovative publication method by highlighting a methodology paper from the first issue, Sarah Oberbichler’s and Eva Pfanzelter’s Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods.

About the Journal of Digital History:

As highlighted on its website (https://journalofdigitalhistory.org/en/about), the Journal of Digital History aims at “promoting a new form of data-driven scholarship” in the field of history by exploiting digital methods.

Developed by the C²DH (Luxembourg Centre for Contemporary and Digital History) in conjunction with De Gruyter, the Journal publishes peer-reviewed articles structured around the interconnection of a hermeneutic, a data and a narrative layer of reading.

Published via a Jupyter Notebook (https://journalofdigitalhistory.org/en/notebook-viewer-form) and the jupyter_contrib_nbextensions system (https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html), each article is written, and can be consulted, in the light of its methodological features (hermeneutic layer), its data and code structure (data layer) and its storytelling (narrative layer).

To provide an example, let’s see the article “The Nameless Crowds: Using Quantitative Data and Digital Tools to Study the Ancient Vocabulary of the Crowd in Tacitus” by Louis Autin (https://journalofdigitalhistory.org/en/article/JJszM3GwAYDs#h34). On the right, there is the list – Narrative, Hermeneutics, Data – of the layers available; by clicking on each one of them, it is possible to visualize the article itself accordingly.

As Andreas Fickers and Frédéric Clavert illustrate in the editorial that opens the Journal’s first issue, the Journal represents a practical application of the “arranged like a pyramid” vision of scholarly writing and publishing that Robert Darnton set out in 1999. In summary, just as history is a complex of intertwined facts and events, so its narration, thanks to digital methods and processes, can now be seen as “a complex process of human-machine interaction” (Clavert and Fickers, 2021).

Source: https://journalofdigitalhistory.org/en

About the highlighted paper, Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods:

This methodology paper addresses the limitations of keyword-search-based approaches to corpus building (i.e. receiving too many irrelevant results or unintentionally excluding too many relevant ones) in the context of a qualitative, discourse-driven analysis of return migration from the Americas to Europe between 1850 and 1950.

As an alternative, the enhanced paper proposes and describes text-mining methods for building a topic-specific corpus, namely a semi-supervised, similarity-based word sense disambiguation (WSD) approach. It combines Latent Dirichlet allocation (LDA), a probabilistic topic model that represents each topic as a probability distribution over terms and each document as a distribution over topics, with the Jensen-Shannon (JS) distance (the square root of the Jensen-Shannon divergence), which measures the similarity between two probability distributions.
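To make the similarity step more concrete, here is a minimal sketch of how a JS distance between topic distributions could be used to decide whether an unseen article resembles manually annotated ones. It is an illustration only, not the authors' pipeline (their actual code is available in the paper's hermeneutic layer). The topic distributions, the labels and the helper name `label_by_nearest` are hypothetical; in practice the distributions would come from a trained LDA model.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the JS divergence.

    With base-2 logarithms the result lies between 0 (identical) and 1 (disjoint).
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding errors

def label_by_nearest(doc_topics, labelled):
    """Hypothetical helper: return the label of the closest annotated document."""
    nearest = min(labelled, key=lambda item: js_distance(doc_topics, item[0]))
    return nearest[1]

# Assumed example data: three-topic distributions of two annotated articles
# and one unseen article (values are made up for illustration).
labelled = [
    ([0.80, 0.15, 0.05], "relevant"),
    ([0.10, 0.10, 0.80], "irrelevant"),
]
unseen = [0.70, 0.20, 0.10]

print(label_by_nearest(unseen, labelled))  # prints "relevant"
```

A single nearest neighbour is the simplest possible decision rule; comparing against several of the most similar annotated articles and taking a majority vote would be a natural variation.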

The methodology paper describes how the manual and automated steps of the workflow are harmonized and optimized for efficiency and precision in corpus compilation, aiming for a high degree of relevance while reducing the burden of manual filtering or large-scale manual annotation.

The corpus was built on top of the NewsEye database. For copyright reasons, it could not be shared along with the paper, but the code associated with each step of the corpus-building pipeline is interlinked and can be accessed directly from the text by switching from the narrative to the hermeneutic layer or by clicking on the source code.

Source: https://journalofdigitalhistory.org/en/article/4yxHGiqXYRbX

Humanities researchers often encounter the problem that their specialized corpora, created by keyword searches, either contain documents that are irrelevant for their research questions because the search queries were too broad, or they miss relevant documents because the search requests were too narrow. The reason for this lies in the complexity of language, which is characterized by ambiguity and concepts that are difficult, if not impossible, to trace by computational methods and thus keyword searches alone. This paper shows how text mining methods can support the building of a topic-specific corpus. Using the example of return migration issues, the aim is, on the one hand, to build a corpus that is as representative as possible and, on the other hand, to overcome the bias that comes with complex keyword searches that are influenced by the researcher’s prior knowledge. The paper begins with a discussion of the motivations for and the challenges of building research driven corpora, leads through the steps that were taken to obtain a satisfactory corpus that can be analysed further and gives an outlook on how the created corpus was used to conduct a qualitative, discourse-driven analysis on return migration from the Americas to Europe between 1850 and 1950.

References

Autin, L. (2021). The Nameless Crowds: Using Quantitative Data and Digital Tools to Study the Ancient Vocabulary of the Crowd in Tacitus. Journal of Digital History, jdh001. https://journalofdigitalhistory.org/en/article/JJszM3GwAYDs

Clavert, F., & Fickers, A. (2021). On pyramids, prisms, and scalable reading. Journal of Digital History, jdh001. https://journalofdigitalhistory.org/en/article/jXupS3QAeNgb

Darnton, R. (1999). The New Age of the Book. The New York Review of Books. https://nybooks.com/articles/1999/03/18/the-new-age-of-the-book/

Journal of Digital History https://journalofdigitalhistory.org/en/

Oberbichler, S., & Pfanzelter, E. (2021). Topic-specific corpus building: A step towards a representative newspaper corpus on the topic of return migration using text mining methods. Journal of Digital History, jdh001. https://journalofdigitalhistory.org/en/article/4yxHGiqXYRbX

Original date of publication: November, 2021.

InternetArchive link: https://web.archive.org/web/20220116042531/https://journalofdigitalhistory.org/en
