CLARIAH Highlight: Lodewijk Petram
This month, the Clariah Communications Team sat down with several researchers from Clariah's Work Package 6 to better understand the text annotation tools and systems they are developing within the Clariah infrastructure. Work Package 6 creates tools and data for scholars who want to work with textual data, covering the disciplines of literary studies, history, philosophy, and religious studies.
One of these interviews was with Lodewijk Petram, an economic historian and researcher based at the Huygens Institute in Amsterdam, the Netherlands. Petram told us about the VOC Clariah use case, in which researchers used a named entity recognition tool to identify entities such as persons, ships, organizations, and locations in digitized historical texts related to the Dutch East India Company (known in Dutch as the VOC). Founded in Amsterdam in 1602, the VOC was a major trading company and an economic powerhouse that produced many historical documents and records. However, it was also notorious for its brutal conduct in Asia: it exploited local peoples and resources, was involved in the slave trade, and implemented harsh measures to maintain control over its territories.
The VOC Clariah use case was inspired by the large collection of digitized source publications edited by the Huygens Institute from original documents in the National Archives. The aim was to add a new layer of functionality to the digital collections of the National Archives and to encourage scholars to use that functionality in their own research. The use case included building a named entity recognition tool that could identify and extract entities specifically related to the VOC in early modern Dutch texts: persons, ships, organizations, and locations. The tool was trained on manually annotated samples, which teach the computer certain language patterns and the contexts in which certain entities appear. The data preparation process, led by Dirk Roorda, involved separating different types of text (e.g. transcriptions, summaries, footnotes) so that the tool could treat them differently during annotation.
While development of the VOC use case has ended, Petram sees a long-term impact of its outcomes through the GLOBALISE project. GLOBALISE aims to develop a reusable methodology for working with the VOC archives, including entity identification and the contextualization of reference data. The project also explores potential collaborations with non-European scholars, museums, and art institutions that would like to research and showcase the historical significance of the VOC archives.
Fun facts about the VOC Use Case:
- The VOC entity recognition tool was able to distinguish between the names of ships and locations, which is impressive considering that many VOC ships were named after Dutch cities.
- The annotation of the VOC texts in early modern Dutch was best carried out using a multilingual language model, that is, a model that can work with and distinguish between different languages.
- For some territories in Asia, the VOC documents housed at the National Archives in The Hague are the earliest accounts from the early modern period. Through the VOC use case, these documents became more easily accessible for scholars to use in their research.
- GLOBALISE, a digital infrastructure project that builds in part on the work of the use case, is working with international researchers in Asia to better understand the international history and impact of the Dutch East India Company.
Collaboration and Interdisciplinary Work
During our interview, we also discussed the importance and challenges of interdisciplinary collaboration between humanities scholars and technical developers. Petram emphasized that effective communication and frequent feedback loops are crucial for understanding each other's domains and ensuring that the tools meet different scholars' needs. Dedicated teams working on a single project, seated at the same table, can facilitate this process. Working this way produces solutions to challenges that might not otherwise have come up and keeps the work moving forward. Petram also discussed the potential of tutorials and documentation, such as the Programming Historian website. This website helps humanities scholars adopt and use digital tools more effectively through accessibly written tutorials, authored by the developers themselves, in English, Spanish, French, and Portuguese.