GROBID: when data extraction becomes a suite
Introduction by OpenMethods Editor (Marinella Testori): GROBID is a well-known open-source tool in the field of Digital Humanities, originally built to extract and parse bibliographical metadata from scholarly works. The acronym stands for GeneRation Of BIbliographic Data.
Shaped by use cases and adaptations to a range of DH and non-DH settings, the tool has progressively evolved into a suite of technical features now applied in various fields, such as journals, dictionaries and archives.
This is reflected in the GROBID modules that have been developed to tackle different tasks: the management of lexicographic resources (GROBID-Dictionaries) and of catalogues (GROBID-Cat); named-entity recognition and disambiguation (grobid-ner); and the parsing of information such as measurements (grobid-quantities), astronomical objects (grobid-astro), and training data (grobid-smecta).
GROBID's capabilities are described in the introduction to its official documentation (https://grobid.readthedocs.io/en/latest/Introduction/), which gives the full list of functionalities, including header and reference extraction as well as dedicated parsing of references, names, dates, and so on. GROBID is deployed by both private and public organisations, such as ResearchGate, HAL Research Archive, the European Patent Office, INIST-CNRS, Mendeley, CERN (Invenio), Internet Archive, and many others.
One key aspect of GROBID, of particular interest to DH, is that it relies fully on TEI encoding, a DH standard, for both its training corpus and its parsed results.
As can be seen, GROBID has grown into a rich suite of applications that remains open to new developments and keeps broadening.
Scientific papers potentially offer a wealth of information that allows one to put the corresponding work in context and to offer a wide range of services to researchers. GROBID is a high-performing software environment for extracting such information, for example metadata, bibliographic references or entities, from scientific texts. Most modern digital library techniques rely on the availability of high-quality textual documents. In practice, however, the majority of full-text collections are in raw PDF or in incomplete and inconsistent semi-structured XML. To address this fundamental issue, the development of the Java library GROBID started in 2008 [1]. The tool exploits “Conditional Random Fields” (CRF), a machine-learning technique, to extract and restructure content automatically from raw and heterogeneous sources into uniform standard TEI (Text Encoding Initiative) documents.
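To give a sense of what working with TEI-encoded results can look like, here is a minimal Python sketch that reads a title and author names from a small TEI header. The sample document is invented for illustration and is much simpler than real GROBID output; the element paths (titleStmt/title, analytic/author/persName) and the helper name read_header are assumptions of this sketch, and only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; TEI documents declare it on the root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written TEI fragment used only for illustration.
# Real extraction results are considerably richer than this.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper Title</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Example</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Ben</forename>
                <surname>Sample</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml: str) -> dict:
    """Return the main title and author names found in a TEI header.

    Assumes the (simplified) layout of SAMPLE_TEI above.
    """
    root = ET.fromstring(tei_xml)

    # The title is taken from the first <title> inside <titleStmt>.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()

    # Each author is assembled from forename and surname, when present.
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        name = " ".join(p for p in (forename.strip(), surname.strip()) if p)
        if name:
            authors.append(name)

    return {"title": title, "authors": authors}


if __name__ == "__main__":
    print(read_header(SAMPLE_TEI))
```

Because the output is plain XML in a well-documented namespace, any standard XML library can consume it; the same approach would extend to bibliographic references, with paths adjusted to the actual structure of the files at hand.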
References:
[1] Laurent Romary, Patrice Lopez. GROBID – Information Extraction from Scientific Publications. ERCIM News, ERCIM, 2015, Scientific Data Sharing and Re-use, 100. https://hal.inria.fr/hal-01673305
The GROBID suite:
Grobid: https://github.com/kermitt2/grobid
Grobid-Dictionaries: https://github.com/MedKhem/grobid-dictionaries
Grobid-Cat: https://github.com/MedKhem/grobid-cat
Grobid-NER: https://github.com/kermitt2/grobid-ner
Grobid-quantities: https://github.com/kermitt2/grobid-quantities
Grobid-astro: https://github.com/kermitt2/grobid-astro
Grobid-SMECTA: https://github.com/Vi-dot/grobid-smecta
You can see different applications of GROBID on the following portals:
ResearchGate: https://www.researchgate.net/
HAL Research Archive: https://hal.archives-ouvertes.fr/
the European Patent Office: https://www.epo.org/
INIST-CNRS: https://www.inist.fr/
Mendeley: https://www.mendeley.com/
CERN (Invenio): https://invenio-software.org/
Internet Archive: https://archive.org/
Original date of publication: 29.12.2017
Internet Archive link: https://web.archive.org/web/20200611021205/https://hal.inria.fr/hal-01673305