The Dutch East India Company (VOC) and the General Missives
In this use case, researchers from VU University, the Huygens Institute and the Dutch Language Institute developed tools for analyzing 17th- and 18th-century Dutch texts.
The project’s primary research data is the corpus of the General Missives, a series of reports detailing the activities of the Dutch East India Company and the events that occurred in the Asian regions where the VOC was active between 1610 and 1795.
First, researchers at the Dutch Language Institute applied optical character recognition (OCR) to digitize the texts. Using TEI (Text Encoding Initiative), a widely used international standard for representing text in digital form, they enriched the OCR output with metadata, such as author and date, at the level of each individual missive. They also added structuring elements to distinguish between transcriptions, the editors' summarizing notes, and footnotes. VU University then converted this data to NAF (NLP Annotation Format). In addition, Dirk Roorda performed a cleaning operation and a conversion to Text-Fabric and back to XML. To facilitate manual and automatic entity annotation, the NAF files were converted to the more lightweight CoNLL and XMI formats.
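To illustrate why a CoNLL-style format is convenient for entity annotation, here is a minimal sketch of reading token-per-line data with BIO tags and collecting the entity spans. The sample text, the two-column tab-separated layout, and the labels PER, LOC and SHP are assumptions made for illustration; the project's actual files may use a different column layout and tag set.

```python
from typing import Iterator, List, Tuple

# Hypothetical sample (not taken from the corpus): one token per line,
# a tab, then a BIO tag; a blank line separates sentences.
SAMPLE = """\
Het\tO
schip\tO
Batavia\tB-SHP
vertrok\tO
naar\tO
Banda\tB-LOC
Neira\tI-LOC
.\tO

Jan\tB-PER
Pietersz\tI-PER
Coen\tI-PER
schreef\tO
"""

Sentence = List[Tuple[str, str]]


def read_sentences(text: str) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs."""
    sentence: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split("\t")
        sentence.append((token, tag))
    if sentence:
        yield sentence


def bio_to_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open span, then open a new one.
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current span.
            tokens.append(token)
        else:
            # An "O" tag ends the current span, if any.
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


for sent in read_sentences(SAMPLE):
    print(bio_to_spans(sent))
```

Because each token sits on its own line next to its tag, such files are easy to edit by hand and to feed to sequence-labeling tools, which is presumably what makes them lighter to work with than the richer NAF representation.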
One of the resources developed in this use case is a tool for Named Entity Recognition (NER). NER is a technology that automatically recognizes named entities like persons, locations or ships in a set of texts, and categorizes them. In the companion publication to this tool (Arnoult et al., 2021), Sophie Arnoult, Lodewijk Petram and Piek Vossen compared language-specific and multilingual pretrained language models for NER and introduced a new NER model for Dutch based on the General Missives. These NER annotations are delivered in a Text-Fabric data module and they are also integrated in the (cleaned) XML files.
To wrap up this use case, we aim to develop a manual or tutorial so that our NER tool can be used by a broader community of scholars. Furthermore, we intend to make the annotated corpus available for research and the NER model available for reuse.
Project info
Researchers
Senior Research Data Manager, Huygens Institute
Computational Linguist, Dutch Language Institute
Senior Researcher, Dutch Language Institute
PhD candidate, University of Amsterdam
Professor of Computational Lexicology, VU Amsterdam
Researcher, KNAW Humanities Cluster
WP6 Co-leader, CLARIAH NL
Employee until October 2022, KNAW Humanities Cluster
Publications
Research paper: Arnoult et al., 2021. Batavia asked for advice. Pretrained language models for Named Entity Recognition in historical texts.
The cleaned data is available in Text-Fabric and XML-TEI in the CLARIAH/wp6-missieven GitHub repository. There are also a few tutorials there, which can be viewed on NBViewer.
The dataset is available in the data section of the CLTL voc-missives GitHub repository.
One of the best-performing models we experimented with in the paper is available on the Hugging Face Hub, for use with the Transformers library: https://huggingface.co/CLTL/gm-ner-xlmrbase.
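As a rough sketch of how the model linked above might be tried out, the following uses the `pipeline` function of the Transformers library with the `token-classification` task. It assumes that `transformers` and a backend such as PyTorch are installed and that the model can be downloaded. The helper names `load_tagger`, `tag_text` and `group_by_label` are hypothetical, and the labels the model returns are not specified here, so inspect the output rather than relying on particular label names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

MODEL_ID = "CLTL/gm-ner-xlmrbase"  # model identifier taken from the URL above


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so group_by_label works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def group_by_label(entities: Iterable[dict]) -> Dict[str, List[str]]:
    """Group pipeline results by label, keeping the entity text in order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entity in entities:
        # With aggregation enabled, each result has "entity_group" and "word".
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


def tag_text(text: str) -> Dict[str, List[str]]:
    """Tag one passage and return the recognized entities per label."""
    tagger = load_tagger()
    return group_by_label(tagger(text))
```

Calling `tag_text` on a passage from a missive returns a dictionary mapping each label to the entity strings found. For longer texts it is probably better to call `load_tagger` once and reuse the pipeline, since loading the model is the expensive step.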