GROBID: when data extraction becomes a suite

OpenMethods introduction to: GROBID: when data extraction becomes a suite. Blog post by Marinella Testori, 2020-09-09. https://openmethods.dariah.eu/2020/09/09/grobid-when-data-extraction-becomes-a-suite/

Introduction by OpenMethods Editor (Marinella Testori): GROBID is a well-known open source tool in the field of Digital Humanities, originally built to extract and parse bibliographical metadata from scholarly works. The acronym stands for GeneRation Of BIbliographic Data.
Shaped by use cases and adaptations to a range of DH and non-DH settings, the tool has progressively evolved into a suite of technical features currently applied to various fields, such as journals, dictionaries and archives.

This is reflected in the GROBID modules that have been developed to tackle different tasks: the management of lexicographic resources (GROBID-Dictionaries) as well as of catalogues (GROBID-Cat); named-entity recognition and disambiguation (grobid-ner); and the parsing of information such as measurements (grobid-quantities), astronomical objects (grobid-astro), and training data (grobid-smecta).

The capabilities and potential of GROBID are illustrated in the official introduction on its documentation website (https://grobid.readthedocs.io/en/latest/Introduction/), which details the full list of functionalities. These include header and reference extraction, along with several kinds of parsing applied to references, names, dates, and so on. GROBID is deployed by both private and public organisations, such as ResearchGate, HAL Research Archive, the European Patent Office, INIST-CNRS, Mendeley, CERN (Invenio), Internet Archive, and many others.
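As a rough illustration of the reference-parsing functionality mentioned above, the sketch below prepares a request to a GROBID web service for one raw citation string. It assumes a GROBID server running locally at http://localhost:8070 and a `processCitation` endpoint that accepts a `citations` form field, as described in the GROBID service documentation. The helper name `build_citation_request` is hypothetical, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a GROBID server is running locally on its default port.
GROBID_URL = "http://localhost:8070"


def build_citation_request(raw_citation: str, base_url: str = GROBID_URL) -> Request:
    """Prepare (but do not send) a POST request asking GROBID to parse one citation."""
    # Form-encoded body; "citations" is the field name assumed from the service docs.
    body = urlencode({"citations": raw_citation}).encode("utf-8")
    return Request(
        url=f"{base_url}/api/processCitation",
        data=body,
        headers={"Accept": "application/xml"},
        method="POST",
    )


request = build_citation_request(
    "Laurent Romary, Patrice Lopez. GROBID - Information Extraction from "
    "Scientific Publications. ERCIM News, 2015, 100."
)
# With a server available, urllib.request.urlopen(request).read() would send it.
```

If the server is reachable, the response should be a TEI fragment describing the parsed reference; the exact content depends on the GROBID version and models installed.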

One key aspect of GROBID that ties it to a DH standard is that it uses TEI encoding throughout, both for the training corpus and for the parsed results.
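Because the results are TEI, they can be read with any namespace-aware XML parser. The following is a minimal sketch, assuming a TEI header that carries the title under `titleStmt` and the authors under `analytic`. The sample document is hand-written for illustration and is not actual GROBID output, and the function name `read_header` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, needed for every element lookup.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample shaped like a TEI header; NOT real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">GROBID - Information Extraction from Scientific Publications</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename>Laurent</forename><surname>Romary</surname></persName></author>
            <author><persName><forename>Patrice</forename><surname>Lopez</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml: str) -> dict:
    """Extract the main title and the author names from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append(f"{forename} {surname}".strip())
    return {"title": title, "authors": authors}


header = read_header(SAMPLE_TEI)
print(header["title"])
print(header["authors"])
```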

As can be seen, GROBID has thus become an ever-broadening suite of rich applications, open to new developments.

Scientific papers potentially offer a wealth of information that allows one to put the corresponding work in context and offer a wide range of services to researchers. GROBID is a high-performing software environment for extracting information such as metadata, bibliographic references or entities from scientific texts. Most modern digital library techniques rely on the availability of high-quality textual documents. In practice, however, the majority of full-text collections are in raw PDF or in incomplete and inconsistent semi-structured XML. To address this fundamental issue, the development of the Java library GROBID started in 2008 [1]. The tool exploits “Conditional Random Fields” (CRF), a machine-learning technique, to extract and restructure content automatically from raw and heterogeneous sources into uniform standard TEI (Text Encoding Initiative) documents.
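To give a feel for how a linear-chain CRF assigns labels to a token sequence, here is a toy sketch of the decoding step (the Viterbi algorithm). Everything in it is an assumption made for illustration: the three labels, the two features and all weights are hand-set, whereas a real CRF learns its weights from annotated training data. It does not reproduce GROBID's models or feature set.

```python
LABELS = ["author", "title", "date"]

# Hypothetical hand-set transition scores (previous label, current label).
TRANSITION = {
    ("author", "author"): 1.0, ("author", "title"): 0.5, ("author", "date"): -1.0,
    ("title", "author"): -2.0, ("title", "title"): 1.0, ("title", "date"): 0.5,
    ("date", "author"): -2.0, ("date", "title"): -2.0, ("date", "date"): 0.0,
}


def emission(token: str, label: str) -> float:
    """Score how well one token fits a label, using two toy features."""
    if token.isdigit() and len(token) == 4:      # looks like a year
        return 3.0 if label == "date" else -1.0
    if token[0].isupper():                       # capitalised word
        return 1.0 if label == "author" else 0.0
    return 1.0 if label == "title" else 0.0      # lower-case word


def viterbi(tokens: list) -> list:
    """Return the highest-scoring label sequence for a non-empty token list."""
    # best[t][label] = (score of the best path ending in label at t, previous label)
    best = [{lab: (emission(tokens[0], lab), None) for lab in LABELS}]
    for token in tokens[1:]:
        column = {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p][0] + TRANSITION[(p, lab)])
            score = best[-1][prev][0] + TRANSITION[(prev, lab)] + emission(token, lab)
            column[lab] = (score, prev)
        best.append(column)
    # Backtrack from the best final label to recover the whole path.
    label = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    return list(reversed(path))


tokens = ["Romary", "Lopez", "information", "extraction", "2015"]
labels = viterbi(tokens)
print(list(zip(tokens, labels)))
```

With these toy weights the two capitalised tokens come out as `author`, the two lower-case tokens as `title`, and the four-digit token as `date`. The labelled spans are the kind of structure that can then be serialised as TEI elements.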

References:

[1] Laurent Romary, Patrice Lopez. GROBID – Information Extraction from Scientific Publications. ERCIM News, ERCIM, 2015, Scientific Data Sharing and Re-use, 100. https://hal.inria.fr/hal-01673305

 

The GROBID suite:

Grobid: https://github.com/kermitt2/grobid

Grobid-Dictionaries: https://github.com/MedKhem/grobid-dictionaries

Grobid-Cat: https://github.com/MedKhem/grobid-cat

Grobid-NER: https://github.com/kermitt2/grobid-ner

Grobid-quantities: https://github.com/kermitt2/grobid-quantities

Grobid-astro: https://github.com/kermitt2/grobid-astro

Grobid-SMECTA: https://github.com/Vi-dot/grobid-smecta

 

You can see different applications of GROBID on the following portals:

ResearchGate: https://www.researchgate.net/

HAL Research Archive: https://hal.archives-ouvertes.fr/

the European Patent Office: https://www.epo.org/

INIST-CNRS: https://www.inist.fr/

Mendeley: https://www.mendeley.com/?interaction_required=true

CERN (Invenio): https://invenio-software.org/

Internet Archive: https://archive.org/

Original date of publication: 29.12.2017

InternetArchive link: https://web.archive.org/web/20200611021205/https://hal.inria.fr/hal-01673305

Cite this article as: Marinella Testori, OpenMethods introduction to: "GROBID: when data extraction becomes a suite," in OpenMethods, September 9, 2020, https://openmethods.dariah.eu/2020/09/09/grobid-when-data-extraction-becomes-a-suite/.

 

 
