The Dutch East India Company (VOC) and the General Missives

In this use case, researchers from VU University, the Huygens Institute and the Dutch Language Institute developed tools for analyzing 17th- and 18th-century Dutch texts.

Principal investigator

Lodewijk Petram

The project’s primary research data is the corpus of the General Missives, a series of reports detailing the activities of the Dutch East India Company and the events that occurred in the Asian regions where the VOC was active between 1610 and 1795.

First, researchers at the Dutch Language Institute applied optical character recognition (OCR) to digitize the texts. Using TEI (Text Encoding Initiative), a widely used international standard for representing text in digital form, they enriched the OCR output with metadata, such as author and date, at the level of each individual missive. They also added structuring elements to distinguish between transcriptions, the editors' summarizing notes, and footnotes. VU University then converted this data to NAF (NLP Annotation Format). Dirk Roorda subsequently cleaned the data, converting it to Text-Fabric and back to XML. To facilitate manual and automatic entity annotation, the NAF files were converted to the more lightweight CoNLL and XMI formats.
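To give an impression of why a CoNLL-style format is convenient for annotation, here is a minimal sketch of a two-column layout: one token per line, a tab, then a label, with a blank line between sentences. The column layout, the label names (`B-SHP`, `B-LOC`) and the sample sentence are assumptions made for illustration; the project's actual files may use different columns and a different tagset.

```python
from typing import List, Tuple

# A sentence is a list of (token, label) pairs.
Sentence = List[Tuple[str, str]]


def to_conll(sentences: List[Sentence]) -> str:
    """Serialize sentences as two-column, tab-separated CoNLL-style text."""
    blocks = []
    for sentence in sentences:
        lines = [f"{token}\t{label}" for token, label in sentence]
        blocks.append("\n".join(lines))
    # Sentences are separated by one blank line; the text ends with a newline.
    return "\n\n".join(blocks) + "\n"


def from_conll(text: str) -> List[Sentence]:
    """Parse two-column CoNLL-style text back into sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical example sentence with illustrative labels.
EXAMPLE: List[Sentence] = [
    [
        ("Het", "O"),
        ("schip", "O"),
        ("Batavia", "B-SHP"),
        ("vertrok", "O"),
        ("uit", "O"),
        ("Ceylon", "B-LOC"),
        (".", "O"),
    ],
]
```

Because every token sits on its own line, an annotator or a script can change a label without touching any surrounding markup, which is what makes the format lighter to work with than a full XML representation.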

One of the resources developed in this use case is a tool for Named Entity Recognition (NER). NER automatically recognizes named entities such as persons, locations or ships in a set of texts and categorizes them. In the companion publication to this tool (Arnoult et al., 2021), Sophie Arnoult, Lodewijk Petram and Piek Vossen compared language-specific and multilingual pretrained language models for NER and introduced a new NER model for Dutch based on the General Missives. The NER annotations are delivered in a Text-Fabric data module and are also integrated in the (cleaned) XML files.
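NER output is commonly represented as one label per token, with `B-` marking the beginning of an entity, `I-` its continuation and `O` everything else. The sketch below shows how such labels can be grouped into categorized entities. The BIO scheme, the label names (`PER`, `LOC`) and the example sentence are assumptions for illustration, not a description of the project's own tagset or code.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, label in zip(tokens, labels):
        # An I- label of the same type extends the entity that is open.
        if label.startswith("I-") and label[2:] == current_type:
            current.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if current:
            spans.append((" ".join(current), current_type))
        if label.startswith("B-"):
            current, current_type = [token], label[2:]
        else:
            # "O" labels and stray I- labels without a matching B- are skipped.
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Hypothetical example input.
TOKENS = ["Jan", "Pieterszoon", "Coen", "arriveerde", "in", "Batavia"]
LABELS = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]
```

Calling `bio_to_spans(TOKENS, LABELS)` on this example yields one person entity spanning three tokens and one location entity.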

To wrap up this use case, we aim to develop a manual or tutorial so that our NER tool can be used by a broader community of scholars. Furthermore, we intend to make the annotated corpus available for research and the NER model available for reuse.

Project info

Partners

Researchers

Lodewijk Petram

Senior Research Data Manager, Huygens Institute

Jesse de Does

Computational Linguist, Dutch Language Institute

Katrien Depuydt

Senior Researcher, Dutch Language Institute

Sophie Arnoult

PhD candidate, University of Amsterdam

Piek Vossen

Professor of Computational Lexicology, VU Amsterdam

Dirk Roorda

Researcher, KNAW Humanities Cluster

Joris van Zundert

WP6 Co-leader, CLARIAH NL

Julia Neugarten

Employee until October 2022, KNAW Humanities Cluster

Publications

Research paper: Arnoult et al., 2021. Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts.

The cleaned data is available in Text-Fabric and XML-TEI in the CLARIAH/wp6-missieven GitHub repository. There are also a few tutorials there, which can be viewed on NBViewer.

The dataset is available in the data section of the CLTL voc-missives GitHub repository.

One of the best-performing models we experimented with in the paper is available in the HuggingFace Transformers library: https://huggingface.co/CLTL/gm-ner-xlmrbase.
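As a rough sketch of how this model might be tried out, the snippet below loads it through the `pipeline` function of the Transformers library, using the model identifier from the URL above. It assumes that `transformers` and a backend such as PyTorch are installed and that the model can be downloaded. The helper names and the sample sentence are hypothetical, and the labels returned depend on the model's own configuration.

```python
from typing import Dict, List, Tuple

# Model identifier taken from the Hugging Face URL above.
MODEL_ID = "CLTL/gm-ner-xlmrbase"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported here so the rest of the module works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def summarize(entities: List[Dict]) -> List[Tuple[str, str]]:
    """Reduce aggregated pipeline output to (text, label) pairs."""
    return [(entity["word"], entity["entity_group"]) for entity in entities]


def demo() -> None:
    """Tag one hypothetical sentence and print the entities found."""
    tagger = load_tagger()
    text = "Het schip vertrok van Batavia naar Ceylon."
    for word, label in summarize(tagger(text)):
        print(f"{label}\t{word}")
```

Running `demo()` prints one line per recognized entity. Results on modern Dutch sentences such as this one may differ from results on the historical text the model was trained on.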
