DARIAH Code Sprint 2019 - Sciencesconf.org

24-26 Sep 2019 Berlin (Germany)

The DARIAH Code Sprint 2019

Overall information

The DARIAH code sprint was again organised by the DESIR project, an offspring project of DARIAH-EU tasked with developing sustainability approaches for the DARIAH research infrastructure in technological and organisational terms. The code sprint was an opportunity to bring together interested developers and DH-affiliated people, and not only those from the wider DARIAH community.

We had three tracks approaching the wider topic of bibliographical metadata from three angles: data extraction from PDFs (GROBID), data import and processing with BibSonomy, and data visualisation. The connecting thread was to work with the same bibliographical data throughout the process and to improve interoperability between the tools.

Although this was already our second code sprint, it was not addressed exclusively to participants of the first one. Everyone was welcome! Still, some background in coding in the DH, or in technological discussions more generally, was helpful.

The DARIAH Code Sprint 2019 took place in Berlin from 24 to 26 September 2019. More detailed information about the location can be found here.

The documentation of the first DARIAH Code Sprint can be found on this page under Documentation of the first Code Sprint.

Track descriptions

Track A: Extraction of bibliographical data and citations from PDF applying GROBID

As a result of the first Code Sprint, organised in 2018 by the DESIR project, this track built a tool covering the following functionalities:
1. Citation extraction from PDF files using GROBID;
2. Visualisation of the extracted information directly on the PDF files. This visualisation is intended to highlight important information in scientific articles (e.g., authors, title, tables, figures, keywords);
3. Inclusion of some additional information from external services (e.g., affiliation disambiguation, named entity recognition);
4. Integration of all extracted data on top of the PDF files in usable viewers.
The tool works as follows:
a. First, users upload a scientific article in PDF format;
b. Then, they click the service buttons as needed to see highlighted results that show:
- bibliographical extraction results;
- affiliation processing results;
- named-entity recognition.
For the second code sprint, Track A invited participants to enrich GROBID's functionalities by adding an acknowledgment parsing service.
The goal was to produce a new service for extracting acknowledgment text and formatting the results in XML/TEI or JSON.
The information that can be extracted from a given acknowledgment string includes, for example: educational institution, funding agency, grant name, grant number, individual, affiliation of individual, project name, research institution, other institution.
* Given an acknowledgment input text such as:
"We want to particularly acknowledge the patients enrolled in this study for their participation and the Basque Biobank for Research-OEHUN for its collaboration providing the human samples and the clinical information used in this project with appropriate ethics approval. Our gratefulness to Dr. Juan Burgos for the selection of the human samples and Dr. Felix Royo for helping with statistical analysis."
* Then, the result of the acknowledgment parsing service in JSON format (with a model trained using DeLFT) might look as follows:
{
    "software": "DeLFT",
    "date": "2019-08-01T13:47:12.506524",
    "model": "desirSecondCodeSprint",
    "texts": [
        {
            "text": "We want to particularly acknowledge the patients enrolled in this study for their participation and the Basque Biobank for Research-OEHUN for its collaboration providing the human samples and the clinical information used in this project with appropriate ethics approval. Our gratefulness to Dr. Juan Burgos for the selection of the human samples and Dr. Felix Royo for helping with statistical analysis.",
            "entities": [
                {
                    "text": "Basque Biobank for Research -",
                    "class": "",
                    "score": 1.0,
                    "beginOffset": 104,
                    "endOffset": 131
                },
                {
                    "text": "Dr . Juan Burgos",
                    "class": "",
                    "score": 1.0,
                    "beginOffset": 292,
                    "endOffset": 306
                },
                {
                    "text": "Dr . Felix Royo",
                    "class": "",
                    "score": 1.0,
                    "beginOffset": 351,
                    "endOffset": 364
                }
            ]
        }
    ],
    "runtime": 0.443
}
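
As a quick sanity check on output of this shape, the offsets can be mapped back onto the input text. The Python sketch below assumes, based only on the example above, that `beginOffset` and `endOffset` count characters in the untokenised `text` and that `endOffset` is inclusive. The helper name `entity_surface_forms` is hypothetical and not part of DeLFT or GROBID.

```python
import json

# Trimmed copy of the example response above ("class" and "score" omitted).
RESPONSE = json.loads("""
{
  "texts": [
    {
      "text": "We want to particularly acknowledge the patients enrolled in this study for their participation and the Basque Biobank for Research-OEHUN for its collaboration providing the human samples and the clinical information used in this project with appropriate ethics approval. Our gratefulness to Dr. Juan Burgos for the selection of the human samples and Dr. Felix Royo for helping with statistical analysis.",
      "entities": [
        {"text": "Basque Biobank for Research -", "beginOffset": 104, "endOffset": 131},
        {"text": "Dr . Juan Burgos", "beginOffset": 292, "endOffset": 306},
        {"text": "Dr . Felix Royo", "beginOffset": 351, "endOffset": 364}
      ]
    }
  ]
}
""")


def entity_surface_forms(entry):
    """Return the substring of entry["text"] covered by each entity.

    Assumption: endOffset is inclusive, hence the "+ 1" in the slice.
    """
    text = entry["text"]
    return [text[e["beginOffset"]:e["endOffset"] + 1] for e in entry["entities"]]


for surface in entity_surface_forms(RESPONSE["texts"][0]):
    print(surface)
```

The `text` field of each entity holds the tokenised form (for instance `Dr . Juan Burgos`), so it differs from the slice of the original string by the spaces around punctuation.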

Steps for track A:
1. Annotation process (using Doccano, https://github.com/chakki-works/doccano).
    The dataset was produced by automatically extracting, via GROBID, the "acknowledgments" sections of about 3500 Open Access scientific articles. In this step, Track A aims to recognise, and therefore annotate, entities such as funding organisations, project names, person names and grants.
2. Use the annotated datasets to build acknowledgment models. Besides training with GROBID (https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/), training with DeLFT (https://github.com/kermitt2/delft/) will also be introduced.
3. Add a new service to GROBID for parsing acknowledgment text.
4. Bring the acknowledgment parsing results into the demonstrator built in the first Code Sprint, including the visualisation on the PDF layout of the entities recognised in the acknowledgment section.
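
To illustrate how annotations from step 1 can feed model training in step 2, here is a minimal Python sketch that converts character-level span annotations into per-token BIO tags, a common input form for sequence labelling. The `(start, end, label)` span layout, the whitespace tokenisation and the function name `spans_to_bio` are simplifying assumptions. The export format of Doccano and the training formats of GROBID and DeLFT differ in detail and are not reproduced here.

```python
import re


def spans_to_bio(text, spans):
    """Turn character spans into one BIO tag per whitespace-separated token.

    spans: list of (start, end, label) tuples with an exclusive end offset.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies fully inside it.
            if start <= match.start() and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


sentence = "Our gratefulness to Dr. Juan Burgos for the selection of the human samples."
name = "Dr. Juan Burgos"
start = sentence.index(name)
# "individual" is one of the entity types listed in the track description.
example = spans_to_bio(sentence, [(start, start + len(name), "individual")])
for token, tag in example:
    print(f"{token}\t{tag}")
```

A real pipeline would use the same tokeniser as the model being trained, so that token boundaries and annotated spans stay aligned.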

Track B: Automatic Import of Bibliographic Data into BibSonomy

In this track we aimed to extend the tool for automatic import of bibliographic metadata into BibSonomy. The first version of the application was created at the DESIR workshop 2018. The main idea was to provide a way for researchers, especially from the Digital Humanities community, to store and share bibliographic data in a common place. The resulting bibliography of DH literature can contribute to the well-being of the discipline. Currently, users can upload a PDF file and have its metadata extracted automatically by an integrated version of GROBID. In a further step, users can correct the metadata and save it to BibSonomy. The system also reports which metadata is missing.
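
The missing-metadata report mentioned above could, in principle, work along the following lines. This Python sketch is not the tool's implementation (which is written in Java). The required fields follow the classic BibTeX entry-type conventions, and the names `REQUIRED_FIELDS` and `missing_fields`, the fallback for unknown entry types and the sample record are illustrative assumptions.

```python
# Required fields per entry type, following classic BibTeX conventions.
# A tuple means "at least one of these fields must be present".
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": [("author", "editor"), "title", "publisher", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
}


def missing_fields(entry_type, metadata):
    """List required fields that are absent or empty in the metadata dict."""
    missing = []
    # Assumption: unknown entry types only require a title.
    for field in REQUIRED_FIELDS.get(entry_type, ["title"]):
        alternatives = field if isinstance(field, tuple) else (field,)
        if not any(str(metadata.get(name, "")).strip() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


# Placeholder record, as it might come back from an automatic extraction step.
extracted = {"title": "An example article title", "author": "Doe, Jane", "year": ""}
print(missing_fields("article", extracted))
```

Fields that are present but empty count as missing here, since automatic extraction may return blank values.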

We wanted to extend the tool by adding further features, such as:
- Metadata extraction from text files: To allow usage of the tool beyond PDF files, we want to implement a feature that allows metadata extraction from text in general.
- Individual user login for BibSonomy: As it is now, users have to provide their BibSonomy credentials during the installation process. We want to implement a proper login feature for BibSonomy users.
- Improved User Interface: We want to further extend and optimize the existing user interface to allow optimal use of the tool's features.
- Web API: We aim to offer an open web API for accessing the tool's features.

Feel free to come up with your own ideas for improvement. We look forward to actively discussing all ideas at the beginning of the code sprint. The tool is written in Java and uses the Spring framework. Experience in web programming, particularly with web APIs and frameworks, is welcome.

BibSonomy is a social bookmarking system for researchers. It allows users to bookmark all types of bibliographic data, such as papers, books and articles. Its REST API enables collaborative storage and retrieval of bibliographic metadata.

GROBID is a tool that allows the extraction of metadata from PDF files. It comes with a simple and easy-to-use Web API.
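
As a rough illustration of that Web API, the Python sketch below builds a request to GROBID's `processCitation` service, which parses a raw reference string passed in the `citations` form field and returns TEI XML. It assumes a GROBID server running locally on its default port 8070. The function names and the sample citation are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a GROBID server is running locally on its default port.
GROBID_URL = "http://localhost:8070"


def build_citation_request(citation, base_url=GROBID_URL):
    """Build (but do not send) a POST request to the processCitation service."""
    body = urlencode({"citations": citation}).encode("utf-8")
    return Request(
        f"{base_url}/api/processCitation",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def parse_citation(citation, base_url=GROBID_URL):
    """Send the request and return the server's answer as a string."""
    with urlopen(build_citation_request(citation, base_url), timeout=30) as response:
        return response.read().decode("utf-8")


# Made-up reference string, used only to show the request shape.
request = build_citation_request("Doe, J. (2019). An example title. Journal of Examples, 1(2), 3-4.")
print(request.get_method(), request.full_url)
```

Calling `parse_citation(...)` requires a reachable server. Building the request separately keeps the example inspectable without one.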

Track C: Visualisation of time dependent graphs of relations

One of the major outcomes of Track C of the previous DESIR Code Sprint was the novel generic concept of time dependent graphs of relations and its visual presentation. Examples of such graphs are co-authorship and citation graphs, genealogy trees, or character interaction graphs. From the visual perspective, both the structure and the time characteristics of such graphs play a significant analytical role. Our web-based tool, developed throughout the DESIR project, can now visualise bibliographical datasets (e.g. imported via the BibSonomy API or loaded from a file) on top of the generic data model. Within this Code Sprint we focussed on extending our tool towards new data formats and use cases as well as new visual forms. The participants had the opportunity to work on mapping different data to the generic model of our graphs and/or on translating data formats to an intermediate RDF description (subject-predicate-object). Participants were encouraged to bring their own data. The new visual forms involved modifying the web application's user interface to include additional visualisations of metadata or aggregated information. Experience in Java and/or JavaScript programming was recommended.
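
To make the idea of a time dependent graph of relations more concrete, here is a small Python sketch that derives timestamped co-authorship edges from bibliographic records, takes a snapshot of the graph for a given year, and flattens edges into subject-predicate-object triples. The tool's actual data model is not described on this page, so the `TimedEdge` class, the function names and the toy records are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TimedEdge:
    """A relation between two nodes that carries a point in time."""
    source: str
    target: str
    relation: str
    year: int


def coauthorship_edges(records):
    """Create one 'coauthor' edge per unordered author pair and publication."""
    edges = []
    for record in records:
        for a, b in combinations(sorted(record["authors"]), 2):
            edges.append(TimedEdge(a, b, "coauthor", record["year"]))
    return edges


def snapshot(edges, year):
    """Return the edges that exist up to and including the given year."""
    return [edge for edge in edges if edge.year <= year]


def to_triples(edges):
    """Flatten edges to (subject, predicate, object) tuples, dropping the time."""
    return [(edge.source, edge.relation, edge.target) for edge in edges]


# Toy records with placeholder author names.
records = [
    {"title": "Paper 1", "authors": ["A", "B", "C"], "year": 2017},
    {"title": "Paper 2", "authors": ["A", "C"], "year": 2019},
]
edges = coauthorship_edges(records)
print(len(snapshot(edges, 2018)))
print(to_triples(snapshot(edges, 2019))[-1])
```

Keeping the year on each edge rather than on the graph as a whole lets a visualisation filter or animate the same structure over time.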
