24-26 Sep 2019 Berlin (Germany)

Documentation of the first Code Sprint

Overall information

The code sprint was organised by the DESIR project, an offspring of DARIAH-EU tasked with developing technological and organisational sustainability approaches for the DARIAH research infrastructure. The DESIR code sprint revolved around "Bibliographical metadata: Citations and References": three tracks centred on this topic, plus one more infrastructurally oriented track on AAI.

The code sprint took place from July 31st to August 2nd on the premises of the Humboldt University Berlin in a relaxed and productive environment. The aim of the code sprint was, on the one hand, to improve the cooperation between various DARIAH-related partners and institutions and, on the other, to develop service concepts around bibliographical data.

School of Library and Information Science

Picture by R. Jäschke, https://commons.wikimedia.org/wiki/File:Dorotheenstra%C3%9Fe_26,_Berlin,_Germany.jpg

We invited all DH developers with or without direct affiliation to DARIAH to join us for two days of hacking.

Keynote: Prof. Dr. Ralf Schenkel: Bibliographic Information Sources for Computer Science with Focus on Citations

The keynote for the DARIAH Code Sprint was given by Prof. Dr. Ralf Schenkel (Trier University): Bibliographic information systems need to rely on metadata provided by various sources, in various forms and with varying quality. The talk gave some insights into how the dblp bibliography, as an example of such a system, is maintained and improved. It also examined the coverage and quality of other sources of bibliographic information in the computer science domain, and presented initial results on citation extraction from the full text of publications using the SciParse software.

Please find the slides of the talk here: Keynote Prof. Dr. Ralf Schenkel

Track A: Extraction of bibliographical data and citations from PDF applying GROBID

GROBID is a machine learning library for extracting, parsing and restructuring raw documents, such as PDFs, into structured TEI-encoded documents, with a focus on technical and scientific publications. When fully processing a PDF, GROBID can assign 55 final labels that build relatively fine-grained units, ranging from traditional publication metadata to full-text structures. Metadata labels include title; author first, middle and last name; affiliation type; detailed address; journal; volume; issue; and page. Full-text structure labels include section title, paragraph, reference marker, head or foot note, and figure caption.

With its first developments starting in 2008, GROBID has become a state-of-the-art (Lipinski:2013) (Tkaczyk:2018) open source library for extracting metadata from technical and scientific documents in PDF format. Beyond simple bibliographic extraction tasks, the goal of the library is to reconstruct the logical structure of a raw document to enable large-scale advanced digital library processes. To achieve this, GROBID uses a fully automated solution relying on machine learning models (linear-chain Conditional Random Fields). The library is today integrated in various commercial and public scientific services such as ResearchGate, Mendeley, CERN Inspire and the HAL national publication repository in France, and is used on a daily basis by thousands of researchers and engineers. Since 2011, the library has been open source under an Apache 2 license.

GROBID can be considered a production-ready environment that includes a comprehensive web service API, batch processing, a Java API, a generic evaluation framework, and semi-automatic generation of training data. The GROBID web API provides a simple and efficient way to use the library. For production and benchmarking, it is strongly recommended to use this web service mode on a multi-core machine and to avoid running GROBID in batch mode.
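As a small illustration of the web service mode, the sketch below sends a raw citation string to a GROBID instance (assumed to run on the default port 8070; the `processCitation` endpoint and the `citations` form field are GROBID's, but the sample TEI at the end is hand-written for demonstration, not actual GROBID output):

```python
# Minimal sketch of talking to the GROBID web service from Python.
import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

GROBID_URL = "http://localhost:8070/api/processCitation"  # assumes a local instance
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def parse_citation(raw_citation, timeout=5):
    """Send one raw citation string to GROBID; returns TEI XML as a string."""
    data = urllib.parse.urlencode({"citations": raw_citation}).encode()
    with urllib.request.urlopen(GROBID_URL, data=data, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def extract_title(tei_xml):
    """Pull the first title out of a TEI <biblStruct>."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:title", TEI_NS)
    return title.text if title is not None else None

# Without a running server, the TEI handling can still be exercised
# on a hand-written sample:
sample = (
    '<biblStruct xmlns="http://www.tei-c.org/ns/1.0">'
    '<analytic><title level="a">A sample article</title></analytic>'
    '</biblStruct>'
)
print(extract_title(sample))  # A sample article
```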

In the scope of the code sprint we proposed a hands-on session in which participants were guided through PDF data extraction and processing. The workshop was framed according to the skills available among the audience. As a general guideline, we foresaw the following required skills: knowledge of Java, Python or JavaScript and the ability to communicate with a web service via HTTP.

The session covered the following topics (in order of priority; they were tackled depending on skills, time and interest):

  1. Extraction of citation data from PDFs (required skills: Java/Python, JavaScript, HTTP, XML/JSON)
  2. Visualisation of extracted information using GROBID references directly on the PDF (highlighting authors, title, tables, figures, keywords, etc.) (required skills: Java/Python, JavaScript, HTML, XML/JSON)
  3. Enrichment of the extracted information via external services (e.g. affiliation disambiguation, GPS coordinates, concept disambiguation) (required skills: Java/Python, JavaScript, HTML, XML/JSON)
  4. Creation of an enhanced view of a PDF that combines all data extracted in the previous tasks into a usable viewer (required skills: JavaScript, HTTP)

Track B: Import and export of bibliographical data from BibSonomy and ingest in managed collections

DH researchers can benefit from a broad overview of scholarly publications relevant to their work; thus, a bibliography of DH literature can contribute to the well-being of the discipline. For computer science, dblp is the de facto standard, allowing researchers to easily see who has contributed to the development of the field. Building such a resource is a big achievement, and we aimed at taking the first steps towards a DH bibliography: an easy-to-use web application to import and export bibliographic metadata for the digital humanities. We did not want to re-invent the wheel, as tools like Zotero and BibSonomy already exist. Instead, we focussed on simplifying data entry, e.g. by enabling import from ORCID or via drag'n'drop of PDF files (using technology developed in Track A), and on using BibSonomy as a backend for storing and organising literature references. With its REST API, BibSonomy enables collaborative storage and retrieval of bibliographic metadata. The choice of programming languages and frameworks was not fixed in advance. Experience in web programming, particularly using web APIs and frameworks, was a prerequisite.
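To give an idea of the backend side, the sketch below constructs (but does not send) an authenticated request for a user's BibTeX posts against the BibSonomy REST API; the user name and API key are placeholders, and the exact URL layout should be checked against the API documentation:

```python
# Sketch: building a BibSonomy REST API request with HTTP Basic auth
# (BibSonomy authenticates with the account name and its API key).
import base64
import urllib.request

API_BASE = "https://www.bibsonomy.org/api"

def build_posts_request(user, api_key, resourcetype="bibtex"):
    """Prepare a GET request for a user's bibliographic posts."""
    url = f"{API_BASE}/users/{user}/posts?resourcetype={resourcetype}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{api_key}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req

# Placeholder credentials; a real call would pass req to urlopen().
req = build_posts_request("jdoe", "0123-placeholder-key")
print(req.full_url)
```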

Track C: Visualisation of processed data with added dimensions for journals, topics, or reference dependency graphs

The visualization of data and results is gaining importance as a natural component of the research cycle. In DH applications, most of the visualization focus is on so-called information visualization: graphical approaches that represent usually high-dimensional and unstructured data, revealing hidden structure or internal relations, typically by means of graphs, charts, maps, etc. Although a number of information visualization toolkits and services exist, many approaches and tools from scientific visualization can also be applied to amplify cognition, especially for 3D or 4D interaction.

The task of this track during the code sprint was at least twofold: on the one hand, to elaborate specific visualization means on the boundary of infovis and scivis for bibliographical data (e.g. author networks with additional dimensions for journals, topics or reference dependency graphs); on the other hand, to conceptualize specific services that fit into the current DARIAH infrastructure landscape and the preconditions provided by the other tracks in the code sprint, e.g. using data from BibSonomy.

Existing building components of the generic visualization framework VisNow (http://visnow.icm.edu.pl) were used, combined with web frameworks. Experience in both standard Java programming and web programming was a prerequisite; experience with 3D graphics web engines was an additional asset.
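The "author network with an additional dimension" idea can be made concrete with a small sketch (illustrative only, not VisNow code): co-authorship edges are derived from bibliographic records and tagged with the journal, giving a graph a 3D view could spread along that extra axis:

```python
# Build a co-authorship edge list with the journal as an extra dimension.
from itertools import combinations
from collections import Counter

# Toy bibliographic records standing in for data from, e.g., BibSonomy.
records = [
    {"authors": ["A", "B", "C"], "journal": "J1"},
    {"authors": ["A", "B"], "journal": "J2"},
]

def coauthor_edges(records):
    """Count (author1, author2, journal) triples over all records."""
    edges = Counter()
    for rec in records:
        for pair in combinations(sorted(rec["authors"]), 2):
            edges[pair + (rec["journal"],)] += 1
    return edges

edges = coauthor_edges(records)
print(edges[("A", "B", "J1")])  # 1
```

The same A-B collaboration appears twice, but under different journals, so it yields two distinct edges; a purely 2D author network would collapse them into one.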

Track D: Securing Online Services in the DARIAH AAI using SAML/Shibboleth

Researchers who want to share their online services within DARIAH can take advantage of the DARIAH Authentication and Authorization Infrastructure (AAI). The DARIAH AAI enables researchers from eduGAIN to access DARIAH services by using the interoperable SAML standard. This workshop introduced the DARIAH AAI and enabled its participants to install, configure and test the Shibboleth Service Provider (SP) to integrate with an online service. The goal was to make the participants familiar with the Shibboleth SP and how it integrates with their web application. The workshop of course also provided an introduction to SAML from an SP perspective, gave recommendations for further open-source SP implementations, and compared SAML with other AAI technologies such as OAuth2 and OpenID Connect.
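Once the Shibboleth SP is installed, protecting a path of a web application typically comes down to a few lines of Apache configuration; a minimal sketch (the `/secure` path is a placeholder, and attribute mapping and IdP metadata still have to be set up in `shibboleth2.xml`):

```apache
# Require a Shibboleth session for everything under /secure.
<Location /secure>
    AuthType shibboleth
    ShibRequestSetting requireSession 1
    Require shib-session
</Location>
```

With this in place, an unauthenticated request to `/secure` is redirected through the SAML login flow before the application ever sees it.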
