Best paper award at DH Benelux 2019 for paper on temporal exploration of audiovisual sources

During the Digital Humanities Benelux 2019 conference in Liège, a paper created in the context of CLARIAH received the best paper award. This paper looked at novel and visual ways to support scholars' interpretation of audiovisual sources through temporal content exploration.

From 11 to 13 September 2019, a large number of Digital Humanities (DH) researchers from Belgium, Luxembourg and the Netherlands came together in Liège for the annual DH Benelux conference. On the basis of peer reviews conducted for DH Benelux 2019, five papers were nominated for the best paper award. According to the jury, two papers stood "head and shoulders above the rest", and therefore two best paper awards were given. One was bestowed upon Gerben Zaagsma and his paper Digital History and the Politics of Digitization. The other best paper award went to Hugo Huurdeman, Liliana Melgar, Roeland Ordelman and Julia Noordegraaf, for a paper created in the context of CLARIAH's Media Studies work package. This paper was entitled Supporting the Interpretation of Enriched Audiovisual Sources through Temporal Content Exploration.

The paper by Huurdeman et al. describes findings of the ReVI project, a pilot looking at enhancing the Resource Viewer of the CLARIAH Media Suite, where audiovisual materials can be played. Specifically, the ReVI project looked at optimal ways "to support the exploration of different types of content metadata of audiovisual sources, such as segment information or automatic transcripts." During the project, various design thinking sessions were conducted, and a prototype including temporal content visualizations of audiovisual materials was created and evaluated in a user study.

The findings of the user study showed a clear value of temporal visualizations and advanced annotation features for research purposes, as well as the continued importance of a data and tool criticism approach. New content exploration tools can benefit scholars doing research with audiovisual sources, for instance in media studies, oral history, film studies, and other disciplines which are increasingly using audiovisual media. The findings documented in the DH Benelux 2019 paper may serve as an inspiration for improving AV-media-based research tools. Concretely, it will also inform the further enhancement of the Resource Viewer of the CLARIAH Media Suite.

The conference presentation is available on SlideShare, and the paper abstract via the DH Benelux conference website.


Blog post written by Jan Odijk (Utrecht University, CLARIAH NL)

The original blog post was published on the CLARIN ERIC website.

On 23 and 24 May the CLARIN ParlaFormat workshop was held in Amersfoort, the Netherlands. This workshop was organized by the CLARIN Interoperability Committee, a subcommittee of CLARIN’s National Coordinators’ Forum.

Participants at the RCE in Amersfoort

There were 25 participants from 13 different countries, as well as participants from the CLARIN Interoperability Committee and from the CLARIN ERIC Board.

The goal of the workshop was to present an outline of a standard format (proposed name: parla-CLARIN, a subset of TEI) for parliamentary data to the research community, to assess the support for it, and to identify potential or real problems for its development and wide adoption. This proposal was prepared and presented by Tomaž Erjavec and Andrej Pančur (from CLARIN Slovenia).
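
As a rough illustration of what a TEI-based encoding of parliamentary proceedings can look like, the sketch below builds a minimal transcript fragment with Python's standard library. The element and attribute names follow general TEI conventions for spoken text (such as `<u who="...">`), not the actual parla-CLARIN schema, and the speaker identifiers are invented.

```python
# Illustrative sketch only: a minimal TEI-style encoding of parliamentary
# utterances, in the spirit of a TEI subset such as parla-CLARIN. The element
# and attribute names are generic TEI conventions, not the real schema.
import xml.etree.ElementTree as ET

def encode_utterance(speaker_id: str, text: str) -> ET.Element:
    """Wrap one spoken contribution in a TEI-like <u> element."""
    u = ET.Element("u", attrib={"who": f"#{speaker_id}"})
    seg = ET.SubElement(u, "seg")  # one segment of the utterance
    seg.text = text
    return u

# A tiny debate section with two (invented) speakers.
session = ET.Element("div", attrib={"type": "debateSection"})
session.append(encode_utterance("speaker.001", "I open the session."))
session.append(encode_utterance("speaker.002", "Point of order!"))

xml_string = ET.tostring(session, encoding="unicode")
print(xml_string)
```

Because the proposal is a subset of TEI, documents along these lines could in principle be validated and processed by any TEI-aware toolchain.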

The participants presented the formats they currently work with, indicated which aspects of these formats are important for them, and inquired whether these are covered by the new proposal. There was a very good and constructive atmosphere during the whole workshop, active contributions by all participants and lively discussions. In particular, there were extensive discussions on the existing standard Akoma Ntoso, which is in use in various parliaments, and what the relation should be between it and the newly proposed format.

Joint dinner at Sally's Indonesian Kitchen & Restaurant in Amersfoort

After the presentations by the various participants, Tomaž Erjavec and Andrej Pančur responded to some of the questions about and criticisms of parla-CLARIN, and explained how they will address these in the coming period.

The parla-CLARIN format will now be further developed. Information about it and contributions to it can be provided via GitHub. All participants will upload samples of their data there so that these can be taken into consideration during the further development of parla-CLARIN.

In about three months a revised version of parla-CLARIN will be made public, and we plan to organize a follow-up workshop with a shared task: all participants will convert their data into the parla-CLARIN format, report on problems encountered and share conversion scripts. If that workshop is successful, we plan to organize another follow-up workshop in which participants will address research questions that cover parliaments from multiple countries, which is then possible because of the uniform format of the parliamentary data.

The agenda and the presentations are available on the ParlaFormat event page.

The Rotterdam Exchange Format Initiative (REFI) launches a standard for sharing qualitative data across qualitative data analysis software packages.

By: Liliana Melgar and Marijn Koolen (CLARIAH project)

The Rotterdam Exchange Format Initiative (REFI) consists of a group of software developers and expert qualitative researchers who decided to join efforts in creating a standard for the exchange of data between qualitative data analysis software packages, also called CAQDAS or QDAS.

QDA software packages are designed to facilitate qualitative data analysis. This type of software has existed for more than thirty years (Silver and Patashnick, 2011). According to SoSciSo, an inventory of software used in social science research, there may be more than thirty packages of this type on the market. This abundance makes it difficult for qualitative researchers to choose a package for their research, and even more difficult to move their data out of, or between, these packages.

Representing CLARIAH, we attended the launching event of the project exchange format produced by the REFI group, and joined the discussions about the implications and next steps.

The REFI initiative and standard

The REFI initiative originated with the aim of solving the difficulties in exchanging data between QDA software packages. As Fred van Blommestein explains, the main reasons to facilitate exchange were to make it possible for users to switch to other software packages, to exchange data with colleagues, and to leave one package for another (avoiding lock-in), thus getting the benefit of the best features of each specific package, as well as to enable result verification (comparing results between packages). An extra reason for creating an exchange format, which was extensively discussed during the launching event, is research data archiving.

The idea of facilitating data exchange between QDA packages began during the KWALON conference in 2010. KWALON is an independent organization of researchers and lecturers at universities, colleges, research agencies and other organizations that deal with the methodology of qualitative social science research. The so-called “KWALON experiment” of 2010 was the first attempt to identify the issues in exchanging qualitative data between these applications. It involved five developers of Qualitative Data Analysis (QDA) software, all analysing the same dataset, provided by the conference organisers, regarding the financial crisis in the period 2008-2009; each developer used their own software for the analysis (an article about this experiment was published in 2011 in the journal FQS, “Forum: Qualitative Social Research”).

During the second KWALON conference, which took place in Rotterdam in 2016, Jeanine Evers, an active member of KWALON since 1995, asked the developers of the QDA packages if they were willing to work on an exchange format. The REFI group was then created and started working right after this conference. Developers from ATLAS.ti, F4 analyse, NVivo, QDA Miner, Quirkos, and Transana have been actively working on the standard, with some participation by developers from Dedoose and MAXQDA. The coordination of the REFI group is done by Fred van Blommestein, Jeanine Evers, Yves Marcoux, Elias Rizkallah, and Christina Silver.

The REFI initiative has produced two standards:

  • The first product was a “codebook exchange” format, launched in Montreal in March 2018. This format allows users of QDA packages to export their codebooks and import them into any of the programs that implement the format (more about codebooks, and the list of compatible software packages, can be found on the REFI website).
  • The second product, launched on March 18, 2019 in Rotterdam, is the “project exchange” format, which facilitates the export and import of the main components of a research project carried out with one of the participating software packages. As explained on the REFI website, those components include, among others: the source documents that are analyzed, the segments in those documents that researchers have identified and annotated, the codes and annotations they have assigned to these segments, memos with analytical notes, the links between codes, segments or memos, the cases, the sets/groups of entities, the visual representations of linked entities in the project, and user information.
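
To make the component list above concrete, here is a hypothetical sketch of how those entities relate to each other. All class and field names are invented for illustration; they do not mirror the actual REFI-QDA schema.

```python
# A hypothetical data model for the kinds of entities a project exchange
# format ties together (sources, coded segments, memos). Names are invented.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Code:
    name: str  # a label from the researcher's codebook

@dataclass
class Segment:
    source: str                 # identifier of the source document
    start: int                  # offset (character or time) where it begins
    end: int                    # offset where it ends
    codes: List[Code] = field(default_factory=list)

@dataclass
class Memo:
    text: str                   # analytical note
    linked_segments: List[Segment] = field(default_factory=list)

@dataclass
class Project:
    sources: List[str]          # documents under analysis
    segments: List[Segment]
    memos: List[Memo]

# A researcher codes one passage of one interview transcript:
crisis = Code("financial-crisis")
seg = Segment(source="interview_01.txt", start=120, end=180, codes=[crisis])
project = Project(sources=["interview_01.txt"], segments=[seg], memos=[])
print(len(project.segments))  # prints 1
```

The point of a shared exchange format is that every participating package serializes and reads this same structure, so a project coded in one tool can be reopened, verified, or archived with another.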


The launching event

The project exchange format was launched during a workshop event on March 20-21, 2019 in Rotterdam, where, besides the REFI group members, participants from the archival community and from infrastructure projects were invited to present and discuss the implications of these exchange formats.

Presenters included:

  • Ron Dekker, Director of CESSDA (the Consortium of European Social Science Data Archives), pointed to the limitation of some European projects, which end up with tools that cannot be sustained in the long term. He argued in favor of an integrated approach to research data infrastructures which provides a “minimum viable ecosystem” for federating existing initiatives and structures within a single, consolidated and seamless platform that would facilitate data provision and exchange between the four major stakeholders: member states, service providers, data producers, and data users.
  • Sebastian Karcher, from the Qualitative Data Repository (QDR) at Syracuse University, introduced us to the QDR repository, which curates, stores, preserves, publishes, and enables the download of digital data generated through qualitative and multi-method research in the social sciences. Sebastian presented the requirements and challenges in providing high-quality data services to researchers, which involve not only curation, but also good documentation, assistance, and training.
  • Louise Corti, from the UK Data Archive, founded at the University of Essex in 1967, introduced the collections, users, and main processes of the archive. She highlighted the importance of the QDA exchange standard, since QDA packages could now offer a “deposit” or “archive” button to their users.
  • Rico Simke, a software engineer from the Center for Digital Systems (CeDiS) of the library of Free University Berlin, described the rich qualitative collections that they host, among others the “Visual History Archive”, which contains 52,000 interviews with survivors and witnesses of the Holocaust, and the “Forced Labor” collection, which contains 583 interviews with survivors of Nazi forced labor. Rico explained the curation processes that facilitate fine-grained access to these collections, and we all discussed the tension between software for editing and publishing these collections and software for performing qualitative analyses with them.
  • René van Horik, from DANS, the Dutch institute for permanent access to digital research resources, guided us through the existing certifications for data repositories. He highlighted the importance of the QDA exchange standard, since it facilitates the creation of data management plans for researchers.
  • Steve McEachern, from the Australian Data Archive and the ANU Centre for Social Research and Methods, which collects and preserves Australian social science data, including 5000 datasets and 1500 studies (among them a small set of qualitative research datasets, e.g. election studies, public opinion polls, censuses, and administrative data), talked about Dataverse and future directions in processing qualitative data. He also discussed the difficulty of separating what is data from what is analysis, and their efforts to come up with a process model of qualitative research.
  • Julian Hocker, PhD student in information science at the Leibniz Institute for Research and Information in Education (DIPF) in Germany, presented his research on a metadata model for qualitative research, which will encourage researchers to share qualitative data, mostly their coding schemes.

Discussion and next steps

At the launching event, the implications of the exchange formats were discussed, at this stage focusing mostly on making the format compatible with the requirements for data deposit at repositories. The participants actively listed the elements required to make the standard more suitable for this aim. A second version of the exchange format, as well as dissemination activities among the involved communities and the users of QDA packages, were listed as the main actions for the REFI group in the near future.

In the proposal for CLARIAH-PLUS (p. 8) it is argued that: “The CLARIAH infrastructure will increase our empirical base, options for analysing […] data, and the efficiency of research by orders of magnitude (data-intensive science).”
Ok, but is it true?


Geert Wilders, leader of the Dutch populist party PVV, known for standing up for ordinary people, recently published a tweet (see below) in which he used the completely unknown word 'difficulteren' (making things difficult). Remarkable, because his party is known for its straightforward use of language that even 'ordinary' people can understand.

The reason for these blogs: the use of the particular word 'difficulteren' by populist party leader Geert Wilders. The translation of this tweet in English is: "The President of Parliament Arib seemed okay yesterday when I spoke to her about awarding Muhammad cartoon prizes in Dutch Parliament during “party day”. Now she is going to difficulteren (make things difficult). Suddenly everything must be done via commission, praesidium, etc."

Linguist Marc van Oostendorp, professor of Dutch Language and Literature at Radboud University in Nijmegen and a passionate blogger, wrote a nice blog about this tweet and formulated a conjecture about the use of this word. Marten van der Meulen, PhD student and writer, responded to this blog by conducting corpus searches in data that have been made accessible in the CLARIAH infrastructure, in order to test Marc’s conjecture. Marten tried to find out when this unknown word ‘difficulteren’ was used for the first time, how often it has been used in recent years, and in which contexts it mainly occurred.

The research

‘increase our empirical base'

Marten searched in six corpora: Staten-Generaal Digitaal, the Corpus Gesproken Nederlands, the Corpus Hedendaags Nederlands, the Brieven als Buit corpus, SoNaR, and the corpora of Nederlab (where the word mainly occurs in Early Dutch Books Online). A prominent feature of CLARIAH is that it allows every humanities scholar to search these resources: you don't have to be a corpus linguist, you don't have to be able to code, and you don't have to download corpora or software. CLARIAH offers web applications with user-friendly interfaces that make searching those corpora easy. See below for links.

'increase options for analysing … data'

These resources make it possible to search by lemma rather than by word form, which makes both the search and the analysis of the results a lot easier, and yields a larger number of relevant hits. Moreover, many of the sources contain metadata such as genre, time and place, so it can also be quickly determined where, when and in which genres this word occurs more or less frequently.
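
The difference between word-form and lemma search can be illustrated with a toy example. This is not the API of any CLARIAH tool, and the lemma table below is hand-made for this one verb:

```python
# A toy illustration of why lemma search beats surface-form search:
# one lemma query covers every inflected form at once.
import re
from collections import Counter

# Minimal hand-made lemma lookup for the verb 'difficulteren' (invented here;
# real corpus tools use full lemmatizers, not tables like this).
LEMMAS = {
    "difficulteren": "difficulteren",
    "difficulteert": "difficulteren",
    "difficulteerde": "difficulteren",
    "gedifficulteerd": "difficulteren",
}

def lemma_hits(text: str, lemma: str) -> Counter:
    """Count all tokens whose lemma matches, grouped by surface form."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if LEMMAS.get(t) == lemma)

sample = "Nu gaat ze difficulteren; gisteren difficulteerde ze ook al."
print(lemma_hits(sample, "difficulteren"))
```

A plain word search for "difficulteren" would miss the past-tense hit here; the lemma query finds both occurrences in one pass, which is exactly what makes the CLARIAH search interfaces efficient for this kind of question.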

'increase the efficiency of research'

Marten did this research within one day, something that was not possible before CLARIAH, except perhaps for a select group of corpus linguists.

Difficulteren: Oprechte Haerlemsche courant (08-11-1687). Found in the archives of the Library of the Netherlands by searching for ‘difficulteren’ in the search app of the NederLab project.

Of course, you can also search the internet, via Google or Twitter. This complements the search in specific corpora, especially since the empirical basis is then even larger. But then one has to look up all the word forms of this verb separately, and the analysis of the results requires more (manual) work, especially because there are hardly any relevant metadata. Marten also searched with Google, but he was not able to analyse those results within that one day. He also searched the Dutch part of the Corpus of the Web (COW), smaller than the whole internet but still quite large (7 billion words); there were fewer hits there, so these could be analysed further.

The search query in question concerns a one-word lemma, and that is a relatively simple task. But the CLARIAH infrastructure also allows much more complex searches, with combinations of words, word pairs with a grammatical dependency relationship, and complete grammatical constructions.


My conclusion is therefore that CLARIAH facilitates such research and already substantiates the above claim.

Do you want to know more, or take a course to make the best use of these tools? Please feel free to contact CLARIAH via: .

Jan Odijk



  • Corpus Hedendaags Nederlands
  • searching for word pairs with a grammatical dependency relationship
  • searching for grammatical constructions
  • overview of tools and services (still under development)

Last week, the 16th International Semantic Web Conference (ISWC 2017) took place in Vienna, Austria. Around 600 researchers from all over the world came together to exchange knowledge and ideas in 7 tutorials, 18 workshops, and 3 full days of keynotes, conference talks, and a big poster & demo session. Needless to say, I only saw a small part of it, but all the papers and many of the tutorial materials are available through the conference website.

First of all, kudos to the organising committee for putting together a fantastic programme and great overall surroundings. The WU Campus (which hosted the workshops, posters & demos, and jam session) is really gorgeous, with a marvellous spaceship-like library.

The main conference took place next door at the Messe, where the Wifi worked excellently (quite a feat at a CS conference where most participants carry more than one device). The bar for next year is set high! 

But back to the conference: 

On Sunday, I got to present the SERPENS CLARIAH research pilot during the Second Workshop on Humanities in the Semantic Web (WHISE II). There were about 30 participants in the workshop, and a variety of projects and topics was presented. I particularly liked the presentation by Mattia Egloff on his and Davide Picca's work on DHTK: The Digital Humanities ToolKit. They are working on a python module that supports analysis of books and they are developing and testing it for an undergraduate course for humanities students. I really think that by providing (humanities) students with tools to start doing their own analyses, we can get them enthusiastic about programming, as well as thinking about the limitations of such tools, which can lead to better projects in the long run. 

In the WHISE workshop, as well as in the main conference, there were several presentations on multimedia datasets for the Semantic Web. The multimedia domain is not new to the Semantic Web, but some of the work (such as Rick Meerwaldt, Albert Meroño-Peñuela and Stefan Schlobach, Mixing Music as Linked Data: SPARQL-based MIDI Mashups) doesn't just focus on the metadata but actually encodes the MIDI signal as RDF and then uses it for a mashup.
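
The idea of encoding the signal itself, rather than only the metadata, can be sketched with made-up triples. The predicate and URI names below are invented for illustration and are not the actual MIDI Linked Data vocabulary:

```python
# Illustrative sketch: representing MIDI note events as subject-predicate-
# object triples, then filtering them the way a SPARQL SELECT would.
from typing import List, Tuple

Triple = Tuple[str, str, str]

def note_event_triples(event_id: str, pitch: int, tick: int) -> List[Triple]:
    """Express one MIDI note-on event as RDF-style triples."""
    s = f"ex:event/{event_id}"
    return [
        (s, "rdf:type", "ex:NoteOnEvent"),
        (s, "ex:pitch", str(pitch)),
        (s, "ex:tick", str(tick)),
    ]

graph: List[Triple] = []
graph += note_event_triples("e1", pitch=60, tick=0)    # middle C
graph += note_event_triples("e2", pitch=64, tick=480)  # E above it

# A toy 'query': select all pitches in the graph.
pitches = [int(o) for (s, p, o) in graph if p == "ex:pitch"]
print(pitches)
```

Once the events are triples, they can be joined with ordinary metadata (composer, date, genre) in a single query, which is what makes SPARQL-based mashups over musical content possible.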

Another very interesting resource is IMGpedia, created by Sebastián Ferrada, Benjamin Bustos and Aidan Hogan, which was presented in a regular session (winner of the best student resource paper award) as well as during the poster session (winner of the best poster award). The interesting thing about this resource is that it doesn't only allow you to query on metadata elements, but also on visual characteristics.


Metadata and content features are also combined in The MIDI Linked Data Cloud by Albert Meroño-Peñuela, Rinke Hoekstra, Victor de Boer, Stefan Schlobach, Berit Janssen, Aldo Gangemi, Alo Allik, Reinier de Valk, Peter Bloem, Bas Stringer and Kevin Page, which would for example make studies in ethnomusicology possible. I think such combinations of modalities are super exciting for humanities research, where we work with extremely rich information sources and often need or want to combine sources to answer our research questions.

Enriching and making available cultural heritage data is also a topic that keeps popping up at ISWC; this year there was for example "Craig Knoblock, Pedro Szekely, Eleanor Fink, Duane Degler, David Newbury, Robert Sanderson, Kate Blanch, Sara Snyder, Nilay Chheda, Nimesh Jain, Ravi Raju Krishna, Nikhila Begur Sreekanth and Yixiang Yao: Lessons Learned in Building Linked Data for the American Art Collaborative". This project was a pretty big undertaking in terms of aligning and mapping museum collections. I really like that the first lesson learnt is to create reproducible workflows.


This doesn't only hold for the conversion of museum collections, but for all research. Still, it's nice to see it mentioned here. Reproducibility is also a motivator in "Tobias Kuhn, Egon Willighagen, Chris Evelo, Núria Queralt Rosinach, Emilio Centeno and Laura Furlong: Reliable Granular References to Changing Linked Data", which investigates the use of nanopublications to enable referring to items or subsets within data collections for fine-grained referencing of previous work.

My favourite keynote at this conference (and they had three excellent ones) was by Jamie Taylor, formerly of Freebase, now Google. He argued for more commonsense knowledge in our knowledge graphs. While I do think that is a great vision, since many of our resources lack such knowledge, leading to all sorts of weird outcomes in for instance named entity linking (you can ask Filip Ilievski for the funniest examples), it was unclear how to go about this and whether it would be possible at all. The examples he gave in the keynote for toasters and kettles would work out just fine (kettles heat up water, toasters heat up baked goods), but for complex concepts such as murders (Sherlock Holmes, anyone?) I'm not sure how this would work. But enough food for thought. See also Pascal Hitzler's take on this keynote.

For other highlights of the conference, check out these other trip reports by Juan Sequeda and Paul Groth.


See you in Monterey, California next year? 
