https://openmethods.dariah.eu/2020/11/09/an-end-to-end-approach-for-extracting-and-segmenting-high-variance-references-from-pdf-documents-proceedings-of-the-18th-joint-conference-on-digital-libraries/
OpenMethods introduction to: An end-to-end approach for extracting and segmenting high-variance references from pdf documents
2020-11-09 13:01:00
Introduction: Digital text analysis depends on one important thing: text that can be processed with little effort. Working with PDFs often leads to great difficulties, as Zeyd Boukhers, Shriharsh Ambhore, and Steffen Staab describe in their paper. Their goal is to extract references from PDF documents. A highlight of their workflow is the very impressive precision rates it achieves. The paper thereby encourages further development of the process and its application as a "method" in the humanities.
Stefan Karcher
https://dl.acm.org/doi/abs/10.1109/JCDL.2019.00035
Blog post
Analysis
Bibliographic Listings
Content Analysis
Digital Humanities
English
Metadata
Research Activities
Research Objects
Research Results
Text
Text Bearing Objects
Computer-related introductions in 1993
Digital press
Electronic documents
Graphics file formats
ISO standards
Office document file formats
Open formats
Page description languages
Social sciences
Vector graphics
Introduction by OpenMethods Editor (Stefan Karcher):
Digital text analysis depends on one important thing: text that can be processed with little effort. Working with PDFs often leads to great difficulties, as Zeyd Boukhers, Shriharsh Ambhore, and Steffen Staab describe in their paper. Their goal is to extract references from PDF documents. A highlight of their workflow is the very impressive precision rates it achieves. The paper thereby encourages further development of the process and its application as a “method” in the humanities.
The benefit of combining the different steps in a coherent mechanism is demonstrated and validated with the obtained result. The presented approach is non-parameterized, where it takes the PDF document as input and outputs a list of segmented reference strings. As a result, the approach achieved a satisfactory result on different datasets overcoming state-of-the-art methods.
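To make the idea of a "segmented reference string" more tangible, here is a minimal sketch in Python. It does not reproduce the approach from the paper. It is a toy regular-expression heuristic for one simple citation style, and the function name `segment_reference`, the field names, and the sample reference are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for a single, simple citation style:
#   "Authors (Year). Title. Venue."
# This is an illustrative assumption, not the method described in the paper.
_REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>[^.]+)\.\s*(?P<venue>.+?)\.?$"
)


def segment_reference(reference: str) -> Optional[Dict[str, str]]:
    """Split one reference string into labelled fields, or return None
    if the string does not follow the assumed citation style."""
    match = _REFERENCE.match(reference.strip())
    return match.groupdict() if match else None


# Made-up sample reference, used only to show the output shape.
sample = "Doe, J. and Roe, R. (2015). A made-up title. Journal of Examples."
print(segment_reference(sample))
```

A hand-written pattern like this fails as soon as a reference deviates from the assumed style, for example when the year comes at the end or the title contains a full stop. That variability between citation styles is the kind of "high variance" the paper's title refers to.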
[We recommend using the browser extension “Unpaywall”, which was introduced at DARIAH Open, to quickly get a free version of the OA paper.]
Source: An end-to-end approach for extracting and segmenting high-variance references from pdf documents | Proceedings of the 18th Joint Conference on Digital Libraries
Original date of publication: 07.2019
InternetArchive link: https://web.archive.org/web/20200630092304/https://dl.acm.org/doi/abs/10.1109/JCDL.2019.00035