An end-to-end approach for extracting and segmenting high-variance references from PDF documents

Introduction: Digital text analysis depends on one important thing: text that can be processed with little effort. Working with PDFs often makes this very difficult, as Zeyd Boukhers, Shriharsh Ambhore, and Steffen Staab describe in their paper. Their goal is to extract references from PDF documents. The highlight of the workflow they describe is its very impressive precision rates. The paper thereby encourages further development of the process and its application as a “method” in the humanities.
