“Multilingual Research Projects: Non-Latin Script Challenges for Making Use of Standards, Authority Files, and Character Recognition”.

https://openmethods.dariah.eu/2023/07/08/multilingual-research-projects-non-latin-script-challenges-for-making-use-of-standards-authority-files-and-character-recognition/
OpenMethods introduction to: “Multilingual Research Projects: Non-Latin Script Challenges for Making Use of Standards, Authority Files, and Character Recognition”.
2023-07-08 13:41:15
Marinella Testori
Blog post
Analysis, Annotating, Capture, Content Analysis, Data Recognition, English, Structural Analysis, Text, ECPO, OCR

Introduction: Every one of us is accustomed to reading academic contributions written in the Latin alphabet, for which standard characters and formats already exist. But what about texts written in languages that use different, ideograph-based writing systems (for example, Chinese and Japanese)? What kinds of recognition techniques and metadata are needed to represent them in a digital context?

Matthias Arnold (2022) addresses this topic in the contribution featured here, which presents two experiments carried out by experts at the Heidelberg Research Architecture (HRA) and the Heidelberg Centre for Transcultural Studies (HCTS).

The first experiment tested the expansion and application of Ziziphus, a web editor for VRA Core 4 XML metadata described in Arnold (2014), to the development of metadata specific to non-Latin scripts. In the paper, Arnold details the main steps of the process as well as the solutions found for several challenges, such as the transcription of the original texts and the key question of the relation between language, script and transliteration.
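To make the distinction between language, script and transliteration more concrete, here is a minimal Python sketch. It is not the approach taken in Arnold's paper or in Ziziphus; it only shows one common way to record the three aspects, namely BCP 47 tags in the standard `xml:lang` attribute (`zh-Hant` for Chinese in traditional characters, `zh-Latn-pinyin` for its pinyin romanization). The namespace URI, the helper name `build_title_set`, and the sample title are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for VRA Core 4; verify against the schema you use.
VRA_NS = "http://www.vraweb.org/vracore4.htm"
# Standard namespace of the built-in xml: prefix (used for xml:lang).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_title_set(titles):
    """Build a titleSet element from (text, bcp47_tag) pairs.

    Hypothetical helper: each title carries a BCP 47 tag that combines
    language, script and (optionally) a transliteration variant.
    """
    ET.register_namespace("vra", VRA_NS)
    title_set = ET.Element(f"{{{VRA_NS}}}titleSet")
    for text, lang_tag in titles:
        title = ET.SubElement(title_set, f"{{{VRA_NS}}}title")
        title.set(f"{{{XML_NS}}}lang", lang_tag)
        title.text = text
    return title_set


if __name__ == "__main__":
    # Illustrative data only: one title in original script and in pinyin.
    sample = [
        ("玲瓏", "zh-Hant"),             # Chinese, traditional characters
        ("Linglong", "zh-Latn-pinyin"),  # same title, romanized in pinyin
    ]
    print(ET.tostring(build_title_set(sample), encoding="unicode"))
```

Keeping the original-script form and the transliteration as separate, individually tagged elements lets a catalogue search or display either form without guessing which writing system a given string uses.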

The second experiment considered how to optimize OCR processes in order to produce full text for the newspapers collected in the Early Chinese Periodicals Online (ECPO) database. The usual procedures proved unsuccessful in handling the complexity of the characters. Arnold explains how a combination of progressively more refined approaches, from machine learning to the latest neural networks, has shown the most promising results in meeting the challenges posed by text segmentation and semantic metadata.

References

Arnold, Matthias. 2022. “Multilingual Research Projects: Non-Latin Script Challenges for Making Use of Standards, Authority Files, and Character Recognition.” Digital Studies/Le champ numérique 12(1): 1–36. DOI: https://doi.org/10.16995/dscn.8110.

Arnold, Matthias. 2014. “Ziziphus – An Online Editor for VRA Core 4 XML Metadata.” Steady as she goes: Images and the Visual Resources Association: preserving the past while embracing the future. ARLIS/ANZ Conference, 15-17 October 2014, Auckland Art Gallery Toi o Tāmaki, New Zealand.

Early Chinese Periodicals Online (ECPO)

https://kjc-sv034.kjc.uni-heidelberg.de/ecpo/ecpo-database.php

Link to original content: Arnold, Matthias. 2022. “Multilingual Research Projects: Non-Latin Script Challenges for Making Use of Standards, Authority Files, and Character Recognition.” Digital Studies/Le champ numérique 12(1): 1–36. DOI: https://doi.org/10.16995/dscn.8110.
