Introduction: In this blog post, James Harry Morris introduces the method of web scraping. Step by step, starting with the installation of the packages, he shows readers how to extract relevant data from websites using only the Python programming language and convert it into a plain text file. Each step is presented transparently and comprehensibly, which makes this article a prime example of OpenMethods and gives readers the tools they need to work with amounts of data too large to process manually.
Introduction: In this article, José Calvo Tello offers a methodological guide to data curation for creating a literary corpus for quantitative analysis. This brief tutorial covers all stages of the curation and creation process and guides the reader through practical cases from Hispanic literature. The author deals with every step in the creation of such a corpus: from digitization, metadata, and automatic processes for cleaning and mining the texts, to licenses, publishing, and archiving/long-term preservation.
Introduction: Hosted at the University of Lausanne, “A world of possibilities. Modal pathways over an extra-long period of time: the diachrony in the Latin language” (WoPoss) is a project under development exploiting a corpus-based approach to the study and reconstruction of the diachrony of modality in Latin.
Following specific annotation guidelines applied to a set of various texts from the time span between the 3rd century BCE and the 7th century CE, the team led by Francesca Dell’Oro aims to analyze the patterns of modality in the Latin language through close consideration of lexical markers.
Stanford CoreNLP aims to provide a complete Java-based set of tools for various aspects of language analysis, from annotation to dependency parsing, and from lemmatization to coreference resolution. It thus provides a range of tools that can potentially be applied to languages other than English.
Among the languages to which Stanford CoreNLP has been applied is Italian, for which the Tint pipeline has been developed, as described in the paper “Italy goes to Stanford: a collection of CoreNLP modules for Italian” by Alessio Palmero Aprosio and Giovanni Moretti.
On the Tint webpage the whole pipeline can be found and downloaded: it comprises tokenization and sentence splitting, morphological analysis and lemmatization, part-of-speech tagging, named-entity recognition and dependency parsing, including wrappers under construction.
Introduction: Linked Data and Linked Open Data are attracting increasing interest and finding application in many fields. An experiment conducted in 2018 at Furman University illustrates and discusses, from a pedagogical perspective, some of the challenges posed by Linked Open Data applied to research in the historical domain.
“Linked Open Data to navigate the Past: using Peripleo in class” by Chiara Palladino describes the use of the search engine Peripleo to reconstruct the past of four archaeologically relevant cities. Many databases, comprising various types of information, were consulted, and the results, as highlighted in Palladino's contribution, show both the advantages and the limitations of a Linked Open Data-oriented approach to historical investigations.
Introduction: Digital humanists looking for tools to visualize and analyze texts can rely on ‘Voyant Tools’ (https://voyant-tools.org), a software package created by S. Sinclair and G. Rockwell. Online resources are available for learning how to use Voyant. In this post, we highlight two of them: “Using Voyant-Tools to Formulate Research Questions for Textual Data” by Filipa Calado (GC Digital Fellows) and the tutorial “Investigating texts with Voyant” by Miriam Posner.
Introduction: Named Entity Recognition (NER) is used to identify textual elements that give things a name. In this study, four different NER tools are evaluated using a corpus of modern and classic fantasy or science fiction novels. Since NER tools have been created for the news domain, it is interesting to see how they perform in a totally different domain. The article comes with a very detailed methodological part, and the accompanying dataset is also made available.
Introduction: There is a postulated level of anthropomorphism at which people find the appearance of a robot uncanny. But what happens if digital facsimiles and online editions become nigh indistinguishable from the real thing while remaining materially so vastly different? How do we ethically provide access to the digital object without creating a blind spot for, and neglect of, the real thing? This is a question that keeps digital librarian Dot Porter awake, and one she ponders in this thoughtful contribution.
Introduction: This is a comprehensive account of a workshop on research data in the study of the past. It introduces a broad spectrum of aspects and questions related to the growing relevance of digital research data and methods for this discipline, and discusses the methodological and conceptual consequences involved, especially the need for a shared understanding of standards.
Introduction: This blog post describes how the National Library of Wales makes use of Wikidata to enrich its collections. It especially showcases new features for visualizing items on a map, including a clustering service and support for polygons and multipolygons. It also shows how polygons such as the shapes of buildings can be imported from OpenStreetMap into Wikidata, which is a great example of re-using existing information.