By Jauco Noordzij

CLaaS Kickoff

Recently we had a kickoff meeting for the CLARIAH technical committee and the CLaaS infrastructure.

While the goal of the Technical Committee is well-defined:
"To develop interoperating standards that allow the software of the separate workpackages to interoperate", the approach to reach that goal is, ahh, not so well defined. Last week we came together to discuss how to achieve this goal.

Now, it is important to note that our goal can be considered a "wicked" problem, i.e. a problem that resists being tamed. Three important characteristics of these problems are that they have no definitive formulation, that there is no definitive solution (just a better or worse outcome) and also no ultimate test of a solution to the problem (you can't determine definitively if a solution worked).

It is also important to note that the technical committee consists of 11 people, each representing a group and each chosen explicitly for having a different view of the problem space.

Let's just say that we figured that a round-table discussion was not the most effective method for formulating how to achieve our goal ;-)

Instead we used liberating structures to generate a common understanding of our goal and how to achieve it. On the surface these are work-forms to facilitate discussions. For example, you might generate ideas using 1-2-4-all. As is often the case, though, the surface is the least interesting part: liberating structures are mostly about finding ways to work together where everyone is engaged simultaneously, working in the same direction, and not getting "stuck" in writer's block or analysis paralysis. The community around these structures is continuously looking for micro-structures that enable us to interact in ways that are inclusive, easy to learn (aka expert-less), can operate on different scales, are fun without wasting your time (aka seriously fun) and get you to someplace that's "good enough" for now, allowing you to improve upon it when needed (aka failing forward).

In this meeting we answered a series of questions that bring you from a global purpose to concrete actionable steps without "locking" you into a specific solution (i.e. purpose to practice). It does so by posing 5 questions to guide your thoughts.

  1. Why is the work important to Researchers and Developers?
  2. What rules must we obey to succeed in our goal?
  3. What will we offer our clients/users/audience?
  4. Who can make or break our goal?

Each question is answered in a specific way. First, everyone takes a short time (2 minutes or so) to think about the question and formulate their own answers. Then you pair up with someone else and expand both your lists, again in a few minutes. Then two pairs form a foursome and exchange their items, after which the group filters them down to the top n items it considers most important (this is called 1-2-4-all). This method is similar to a distributed map-reduce job in efficiency, and every time I have used it, people have commented on how efficiently consensus was reached. You will usually notice both that (1) large parts of the individual lists overlap and (2) the other group has some options that wouldn't have occurred to you even if you had had an hour to think about it.
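
To make the map-reduce comparison a bit more tangible, here is a toy Python sketch of the merging rounds. The function name, the sample ideas and the filtering rule (keep the ideas that the most people wrote down) are assumptions made purely for illustration; in a real session the foursome filters by talking, not by counting.

```python
from collections import Counter


def one_two_four_all(individual_lists, top_n=3):
    """Toy model of 1-2-4-all: merge idea lists pairwise, then filter."""
    # "1": every participant's own list; each idea counts once per person.
    groups = [Counter(dict.fromkeys(ideas, 1)) for ideas in individual_lists]
    # "2" and "4": neighbours merge their lists, first into pairs, then foursomes.
    for _ in range(2):
        groups = [sum(groups[i:i + 2], Counter())
                  for i in range(0, len(groups), 2)]
    # Filtering: each foursome keeps the top_n ideas raised by the most members.
    # (Assumption: a real group filters by discussion, not by counting.)
    return [[idea for idea, _ in group.most_common(top_n)] for group in groups]


# Hypothetical answers from four participants.
participants = [
    ["interoperability", "efficiency"],
    ["interoperability", "sustainability"],
    ["efficiency", "interoperability"],
    ["best practices", "efficiency"],
]
shortlists = one_two_four_all(participants, top_n=2)
print(shortlists)
```

Each merge round halves the number of lists, so the number of rounds stays the same however many foursomes work in parallel, which is roughly where the efficiency comes from.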


While we have no definite answers to the questions yet, here's the current status:

  1. Why is the work important to Researchers and Developers?
    • To increase interoperability and sustainability (that last one is also about keeping it manageable)
    • To enable the "unix philosophy" in research tooling
    • To keep IT standards in sync with research methods
    • To increase efficiency
    • To incorporate lessons learned and best practices
  2. What rules must we obey to succeed in our goal?
    • Iterative (short experiments) co-design with both developers and researchers
    • Communication and documentation throughout the lifecycle
    • Keep a single (or very limited set of) communication channels
    • Use the MVP idea: set up the smallest possible standards
    • Be clear about the boundaries of our task. What are we not going to do?
  3. What will we offer our clients/users/audience?
    • "Raw data"
    • Flexible repositories
    • Interoperable co-designed data + software
    • Use cases
    • Maintenance guidelines
    • APIs
    • Libraries
    • Existing tools + usage guidelines for research
  4. Who can make or break our goal?
    Here, we noticed that we were unable to answer this question because it was not specific enough. We spent the rest of our workshop crafting a better question that we will answer next time:
    Who do we need to keep talking to in order to reach our goal?
    And for each person/group, rate them along two axes:
    1. To what extent can they make a decision that influences our success?
    2. To what extent can we learn from them?

Lessons learned

  1. The success of the meeting is very much dependent on the quality of the question that you ask
  2. But if your question is not good enough you can make "what would be a better question" the next question
  3. The goal of these (or maybe any?) meetings is to create a common understanding across all participants.
  4. That trick where you raise your hand to get silence is really needed.
  5. Be explicit that you don't want to get it right the first time
  6. While the success of a meeting is dependent on the quality of the question, the smoothness of the meeting is dependent on the order in which, and the clarity with which, you explain the workforms.
    • You can over-explain something
    • Always be clear when the explanation is done and people can start moving
    • Allow for questions before people start moving


We had a very positive kick-off session. I think it really set the stage for people to start working on the CLaaS platform independently while still in collaboration.

We're failing forward gloriously :)

By Marijn Koolen and Liliana Melgar

The second meeting of the video interoperability interest group (VAINT), supported by the Dutch digital infrastructure CLARIAH, took place on July 13, 2019 in Amsterdam. Read our blog post to find out what this group is working on to make it possible for researchers to exchange their annotations across video annotation tools.

Film and television scholars, oral historians, and other scholars who use audio-visual media in their research are increasingly integrating annotation tools in their analyses. Previous surveys of existing AV annotation tools have found more than fifty such tools. Each tool has its own strengths and limitations: one video analysis tool may be good at automatically segmenting a video by detecting boundaries between shots, which saves a lot of time compared to manual segmentation, and allows the user to describe what happens in each shot. This may be useful, for example, for a film scholar doing aesthetic analyses based on shots. Another tool might be good at facilitating the manual or semi-automatic transcription of the audio speech; another tool may allow the user to add metadata to their selected video segments based on properties that are relevant to their research, to build a database that can be queried in the synthesis phase when the scholar is writing an article or book. This may be useful, for example, for a historian studying television programs showing Dutch migrants moving to Australia, for a project on Dutch-Australian migration, or for an oral historian studying immigrants’ life stories, interested in analyzing what people in a video are saying and which emotions they are expressing when recounting their experiences on a certain topic.

Because using a single tool may influence or limit the analyses, the ideal situation is to be able to combine the affordances of different tools and exploit their combined strengths. To make this possible, those tools (and their data) have to be interoperable. That is why we created the VAINT (Video Annotation INTeroperability) interest group.

What is the VAINT initiative and who is involved?

VAINT is an international interest group of developers and users of video annotation tools which aims to find solutions for making scholarly annotations of time-based media interoperable. This initiative started in 2017, after a CLARIAH symposium on video annotation that took place in Amsterdam. The first meeting of the VAINT interest group was in July 2018. The initiative is supported and funded by the Dutch infrastructure project CLARIAH. More information about the VAINT group can be found in their GitHub repository. The current members of this initiative are listed at the bottom of this post.

The VAINT interest group, second meeting in Amsterdam, July 13, 2019. From left to right: Hugo Huurdeman, Han Sloetjes, Christian Olesen, Marijn Koolen, John Bell, Liliana Melgar, Joscha Jaeger, Gaudenz Halter, Jaap Blom


Which video annotation tools are involved?

The VAINT initiative includes developers and users of five video annotation tools: 

  • The CLARIAH Scholarly Web Annotation Tool is a browser-based tool that is used in e.g. the CLARIAH Media Suite and allows researchers to segment and annotate videos with a range of annotation types, including links, classification codes, tags, comments and metadata cards.
  • ELAN is a desktop tool used by researchers from many different disciplines, including linguists, conversation analysts, film scholars, communication scholars and sociologists and allows complex annotation of video and audio files using multiple layers of annotations.
  • FrameTrail is a browser-based tool to support annotation of, among others, parliamentary data. It can link fragments in parliamentary videos to relevant documents (texts, images, other videos) that FrameTrail can then show in the sidebar to contextualise these fragments. FrameTrail can show annotations in different layers, such as a layer of transcription and a layer linking different segments to different related documents.
  • Semantic Annotation Tool (SAT) is another browser-based annotation tool that is used in e.g. the Media Ecology Project and allows users to select segments of videos and annotate them with tags, comments and links.
  • VIAN is a visual film annotation system centered on the semantic aspects of film color analysis.

What was the starting point of the meeting?

This was the second meeting of the VAINT initiative. In the first meeting, the tool developers had explored the needs and requirements for exchanging annotations between tools.

In preparation for this second meeting, the tool developers shared examples of their output data. John Bell, the developer of SAT, made an example annotation based on a format that was drafted in the first meeting, for everyone to start investigating what is needed to build import and export functionality into their tools. Han Sloetjes, developer of ELAN, implemented a first version of the import and export for ELAN based on the example annotation, and listed what information from other tools is lost on import and what ELAN information is lost on export.

What happened in the meeting?

We first revisited and improved the scenarios and use cases for data exchange between the tools that would make sense for scholars, and what annotations we expected to exchange between the tools.

We decided on a flexible exchange format that stays as close to the W3C Web Annotation standard as possible, and that each tool needs to figure out for itself how to deal with aspects of the annotations that it cannot interpret. This raises the important question of whether, upon importing, each tool should store information that it cannot use, so that it can be added back in the export.
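
The store-and-restore idea can be illustrated with a small Python sketch. It is not the VAINT specification or any tool's actual implementation: the set of understood properties, the function names and the `ex:colourPalette` field are hypothetical. The annotation itself follows the general shape of a W3C Web Annotation whose target is a time fragment of a video.

```python
import json

# Top-level properties this hypothetical tool can interpret.
UNDERSTOOD = {"@context", "id", "type", "body", "target"}


def import_annotation(raw):
    """Split an incoming annotation into interpreted and opaque parts."""
    data = json.loads(raw)
    known = {k: v for k, v in data.items() if k in UNDERSTOOD}
    opaque = {k: v for k, v in data.items() if k not in UNDERSTOOD}
    return {"known": known, "opaque": opaque}


def export_annotation(record):
    """Re-attach the opaque properties so nothing is lost on export."""
    return json.dumps({**record["known"], **record["opaque"]}, indent=2)


incoming = json.dumps({
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "http://example.org/anno/1",
    "type": "Annotation",
    "body": {"type": "TextualBody", "value": "Close-up of the speaker"},
    "target": {
        "source": "http://example.org/video.mp4",
        "selector": {
            "type": "FragmentSelector",
            "conformsTo": "http://www.w3.org/TR/media-frags/",
            "value": "t=30,60",
        },
    },
    "ex:colourPalette": ["#203040", "#a0b0c0"],  # hypothetical tool-specific field
})

record = import_annotation(incoming)
round_tripped = json.loads(export_annotation(record))
```

In this sketch the tool works only with `record["known"]`, while `record["opaque"]` travels along untouched, so an export after an import yields the same annotation that came in.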

We decided to make a minimal extension of the Web Annotation standard to allow for the ways of grouping annotations that are implemented in the various annotation tools. For instance, in ELAN, a user can define different layers (so-called 'tiers') and add annotations to specific layers, e.g. a layer for shot boundary segments, a layer for transcription of spoken dialogue, a layer for transcribing movement of people in the video and a layer for coding which person is visible in which shot. Upon export, it needs to be clear in the exchange data which layer an individual annotation belongs to.
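
As a rough sketch of what such grouping could look like in exchange data, the Python snippet below tags each annotation with a layer name and groups on it. The `layer` property is a hypothetical stand-in, since the actual extension had not been defined at the time of writing, and the example URLs and layer names are made up.

```python
from collections import defaultdict

LAYER_KEY = "layer"  # hypothetical extension property, not part of the W3C standard

annotations = [
    {"type": "Annotation", "target": "http://example.org/video.mp4#t=0,4.2",
     "body": {"type": "TextualBody", "value": "shot 1"}, LAYER_KEY: "shots"},
    {"type": "Annotation", "target": "http://example.org/video.mp4#t=4.2,9",
     "body": {"type": "TextualBody", "value": "shot 2"}, LAYER_KEY: "shots"},
    {"type": "Annotation", "target": "http://example.org/video.mp4#t=1,3.5",
     "body": {"type": "TextualBody", "value": "Good evening."},
     LAYER_KEY: "transcription"},
    {"type": "Annotation", "target": "http://example.org/video.mp4#t=5,6",
     "body": {"type": "TextualBody", "value": "untiered note"}},
]


def group_by_layer(annos, default="(no layer)"):
    """Group annotations by layer; annotations without one share a default group."""
    tiers = defaultdict(list)
    for anno in annos:
        tiers[anno.get(LAYER_KEY, default)].append(anno)
    return dict(tiers)


tiers = group_by_layer(annotations)
for name, members in tiers.items():
    print(name, len(members))
```

A tool with tiers, such as ELAN, could rebuild its layers from the groups, while a tool without that concept can simply iterate over the flat list and leave the extra property in place.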

What will happen after the meeting?

The meeting closed with three main action points. First, each developer will list the most important elements in their tool output that should not be lost upon export/import.

Second, we will update and extend our draft specification for the exchange format and define elements that are new with respect to the W3C WA standard. For instance, we need a way of grouping annotations that belong to the same layer (made by e.g. ELAN or FrameTrail), so that other tools that don’t know the concept of a layer can ignore this information without throwing it away upon import or export.

Third, we will explore whether an existing standard for image viewer interoperability, the IIIF Presentation API, is suitable for our needs. We will discuss this with Tom Crane, one of the initiators of the Audiovisual extension of this standard, to see if this is a viable approach.


Marijn Koolen works at the KNAW Humanities Cluster as a researcher and developer. He is involved in several research projects and digital infrastructure projects within the fields of Digital Humanities, Information Retrieval and Recommender Systems. Part of his development work is related to the Dutch research infrastructure project CLARIAH.

Liliana Melgar-Estrada is a user researcher at Utrecht University and The Netherlands Institute for Sound and Vision, investigating how to support Scholarly Video Annotation in the Dutch infrastructure for the digital humanities CLARIAH, and conducting user testing and evaluation of the CLARIAH Media Suite.

VAINT members (2019):

  • Marijn Koolen (KNAW Humanities Cluster), developer of the CLARIAH scholarly annotation client and server.
  • Jaap Blom (The Netherlands Institute for Sound and Vision, CLARIAH), developer of the CLARIAH Media Suite, and of the scholarly annotation client and server.
  • Han Sloetjes (Max Planck Institute for Psycholinguistics), developer of ELAN.
  • John Bell (Dartmouth College), developer of SAT (Semantic Annotation Tool: Waldorf.js & Statler) for The Media Ecology Project (led by Mark Williams)
  • Joscha Jaeger (Filmic Web), developer of FrameTrail
  • Gaudenz Halter (University of Zurich), developer of VIAN, an adaptation of ELAN for the Film Colors project, led by Prof. Barbara Flueckiger
  • Hugo Huurdeman (Timeless Future), who worked on the ReVI (REsource viewer) project for the CLARIAH Media Suite, also co-founder of, an interactive video framework
  • Liliana Melgar (Utrecht University and The Netherlands Institute for Sound and Vision), an information scientist at the CLARIAH project, whose focus is on the user requirements for scholarly annotations
  • Christian Olesen (Utrecht University), film scholar who also participated in the first and second meeting, contributing to the scholarly use cases. 

By Jan Odijk

The eScience Centre organizes its eScience Symposium each year in collaboration with partners. CLARIAH co-organised the Humanities sessions at this year’s symposium, which was held on 21 November 2019 in the Johan Cruyff Arena in Amsterdam.

Kick-off by Prof. Rob van Nieuwpoort, acting director of the eScience Center

The theme of the symposium was ‘digital challenges in open science’. Our keynote speaker, Eero Hyvönen (University of Helsinki / Aalto University, Finland) gave a presentation titled "Using Linked Open Data in Digital Humanities: Shift from Data Publishing to Data Analysis and Artificial Intelligence".

The other speakers in the session were Marcel Broersma (University of Groningen) on "Transparency of machine learning for Humanities scholars", Roser Morante, Piek Vossen & Antske Fokkens (VU University Amsterdam) on "Transparent NLP for the Digital Humanities: Processing Historical Texts", and Daan Broeder (Humanities Cluster / CLARIN ERIC) on the SSHOC project.

The session was organized by Jan Odijk (Utrecht University) from CLARIAH and Carlos Martinez Ortiz from the eScience Centre, and was chaired by Jan Odijk.

In addition to the Humanities session, there were posters and demos of projects co-organised and co-financed by CLARIAH and the eScience Centre through the Accelerating Scientific Discovery in the Arts and Humanities (ADAH) Call. The following projects presented their results at the poster and demo session at the eScience Symposium:

  • TICCLAT (Text induced Corpus Correction and Lexical Assessment Tool): demo by Martin Reynaert
  • NewsGac (News Genres. Advancing Media History by Transparent Automatic Genre Classification): demo, poster and running PowerPoint by Marcel Broersma, Eric Tjong Kim Sang & Tom Klaver
  • EviDENce (Ego Documents Events Modelling: How individuals recall mass violence): demo by Susan Hoogervorst & Meiert Grootes.



By: Tom Slootweg

On November 18, 2019, the CLARIAH Audiovisual Data and Media Studies work package (WP5) organized a one-day meeting on computer vision and audiovisual data at Utrecht University. The main organizers, Jasmijn Van Gorp, Christian Olesen and Tom Slootweg, invited several Dutch media scholars, as well as computer & data scientists, to explore the potential of computer vision for the analysis of audiovisual data held by cultural heritage institutions (e.g. film, television, photography). The organizers furthermore invited two special guests, Taylor Arnold and Lauren Tilton (University of Richmond). Arnold and Tilton kindly agreed to participate in the expert meeting, before bringing the day to a close with an interesting public lecture on their Distant Viewing Lab.

Expert meeting

The expert meeting functioned as a first, exploratory step towards a more developed strategy for the upcoming computer vision processing pipeline in the CLARIAH Media Suite. Moreover, the organizers aimed to provide an informal platform on which the invited participants could share and discuss some of their preliminary expectations of and requirements for computer vision. To benefit from the expertise of Arnold and Tilton, but also from computer vision experts Nanne van Noord (NISV) and Melvin Wevers (DHLab KNAW), several short pitches were given by scholars who (plan to) work with audiovisual data, highlighting their expectations of computer vision with regard to their research interests.

Thomas Poell (University of Amsterdam) kicked off the meeting with a presentation entitled “Cross-Media Research & Computer Vision.” Poell and his team research reactions to the refugee crisis in Europe across several media: Facebook, Twitter, YouTube, newspapers and television broadcast. Besides distant reading strategies, grounded in textual analysis, he is also interested in the potential of computer vision for the analysis of the re-use of “symbolic events”, including their audiovisual “framing.” Christian Olesen and Nanne van Noord, in contrast, expounded on the insights gained by using computer vision algorithms as a basis for exploratory, serendipitous search in audiovisual archives. The foundations for this particular approach have been laid in the SEMIA project, and in the coming years Olesen and van Noord will further investigate the potential integration of the algorithms developed for this project into the CLARIAH Media Suite.

Susan Aasman and Rob Wegter (University of Groningen) continued with an overview of their first steps in ‘the land of computer vision.’ Their research project, “Intimate Histories: Finding Traces in the Early History of YouTube,” focuses on the early days of vlogging. Aasman and Wegter applied computer vision as a “(pre-)analytical method,” to ascertain whether the use of scene, frame and object recognition might raise new questions about continuity and change in the visual dimensions of their data. In conclusion, Ruben Ros, a recent master's graduate associated with the ReAct project, headed by Ann Rigney, elaborated on the deployment of computer vision on large historical protest photography datasets. The distant viewing strategies discussed by Ros are currently being investigated by postdoctoral researcher and historian Thomas Smits, who seeks to develop them as a means to reveal a “visual grammar” of activism.

Discussion and next steps

Throughout the expert meeting, the pitches provided ample ground for discussion. Melvin Wevers, for example, responded to them by underlining that scholars who plan to use computer vision should first and foremost adopt a data-driven approach. Moreover, he underscored the importance of indexing the dataset used, before developing any further research questions. This recommendation was endorsed by Nanne van Noord, who added that ‘any meaningful analysis requires indexing beforehand, because only then can one opt for a robust application of computer vision methods.’ A data-driven approach, as the experts furthermore explained, also entails an awareness of the distinction between systems and variables: what is the information required and how can this then be turned into relevant output for further analysis?

Other relevant topics for discussion were also raised. Taylor Arnold, for instance, emphasized that computer vision should not be used in and of itself. Lauren Tilton added to this remark by arguing that we should eventually be working towards modes of analysis in which computer vision is used in combination with other, complementary methods. The issues of documentation and openness were also flagged by Arnold and Tilton as relevant points for discussion. Many experiments with computer vision and audiovisual data are currently taking place. However, a more or less standardized model to report on important steps taken, or on how to avoid certain pitfalls, is currently lacking. As a response to this undesirable situation, Arnold and Tilton therefore endeavor to make their documentation, toolkits and code freely available on their website. They rightly encourage others to follow suit.

Jasmijn Van Gorp brought the expert meeting to a close with some additional remarks. Based on the issues raised during the meeting, Van Gorp stressed the importance of further debate on how to meaningfully integrate computer vision algorithms and processing pipelines into the Media Suite. Of particular relevance will also be the question of whether the implementation and development of computer vision should solely occur in a “closed system,” due to copyright protected archival materials. Or should we instead strive to make available our code and develop toolkits for those who will not have access to the CLARIAH research infrastructure? Many options are still open and in the coming years important decisions need to be made in this regard. Whatever the outcome may be, Van Gorp concluded the meeting by emphasizing the importance of maintaining the trading zone between humanities scholars and computer scientists we have now established.

Public lecture

The public lecture rounded off the day and gave a valuable insight into the pioneering work done by Arnold and Tilton at the intersection of media studies and digital humanities. Their earlier work on Friends (NBC 1994-2004), but also on the television series Bewitched (ABC, 1964-1972) and I Dream of Jeannie (NBC, 1965-1970), sought to explore how distant viewing can ‘help identify, at scale, the cultural assumptions expressed and reflected in these TV-series.’ The results of this project can be found here. Currently, however, their research also includes other media, such as photography and film. The broadening of scope is necessary, they explained in the lecture, in order to develop various new, complementary computational methods for the distant viewing of visual culture at large.

Lauren Tilton and Taylor Arnold (University of Richmond) during the public lecture about their Distant Viewing Lab (Utrecht, 2019).

This broader, more inclusive ambition not only necessitates the rapid prototyping of new toolkits, but also requires a formal framework in which ‘new ontologies can help in predicting features of media.’ A first step to that end is discussed at length in their excellent article “Distant Viewing: Analyzing Large Visual Corpora.” The next task is to delve deeper into the topics of visual style and formal complexity, which will be done in an upcoming paper. To wrap up the lecture, Taylor Arnold again stressed the importance of making tools reusable for other interested parties, also beyond one’s own circle. With this remark, the audience was left with the most important takeaways of this special computer vision day: cherish openness and create a fruitful trading zone for the exchange of ideas and tools.


By: Liliana Melgar (Utrecht University / The Netherlands Institute for Sound and Vision)

The largest Digital Humanities conference thus far (DH2019) took place in Utrecht from July 8 to 12, 2019. More than one thousand (digital) humanists from all over the world participated in the annual gathering of scholars, curators, librarians, information and computer scientists, publishers, among others, who are incorporating, experimenting, and innovating in digital methods for doing or supporting scholarly work.

The use of digital sources and methods in the humanities (called “digital humanities”) has clear foundations and development in textual domains (such as literary studies or historical research). But, for some years already, the increasing availability of audio-visual materials has started to draw the scholars’ attention to the potential of those sources within fields traditionally dominated by the text, and to see the possibilities of using computational methods in visually-based domains, such as (digital) art history.

In this two-part blog post, I summarize most of the AV-related workshops, panels, and presentations at DH2019, and share some impressions about the sessions that I attended, or whose papers I read after the conference. In Part 1 I introduce the central workshop (AVinDH). Then, in Part 2, I group the talks into different themes. The entire DH2019 program can be accessed here, and the book of abstracts here (I provide the links to each abstract).

AVinDH SIG Workshop

(see this part of the blog post also at The Netherlands Institute for Sound and Vision blog)

The awareness of the increasing relevance of AV sources for scholarly work led to the idea of founding a special interest group (AVinDH SIG) during the DH Conference in Lausanne in 2014. The group has the aim “to facilitate communication and interaction between researchers from various disciplines including domains such as media studies, history, oral history studies, visual culture studies, social signal processing, archeology, anthropology, linguistics.”

On Monday 8 July the fifth AVinDH workshop organized by this interest group took place at DH2019. The workshop, chaired this time by Lauren Tilton (University of Richmond) and Jasmijn van Gorp (Utrecht University), had around 20 participants from domains such as film, television and media studies, cultural history, stylometry, spatial humanities, information and computer science, arts, design and linguistics.

Image 1. AVinDH SIG workshop at DH2019, from left to right: John Bell, Dimakatso Mathe, Christian Olesen, Melvin Wevers, Liliana Melgar, Susan Aasman, Mark Williams, Lauren Tilton, Carol Chiodo, David Wrisley, Julia Noordegraaf, Taylor Arnold, Joanna Byszuk, Jasmijn van Gorp, Manuel Burghardt, Nanne van Noord, and Daniel Chavez Heras.

The AVinDH workshop included parallel tutorials and short “lightning” talks. The slides and materials are linked from the workshop’s web page. These were the tutorials:

  • A tutorial by Julia Noordegraaf, Jasmijn van Gorp and myself about using the CLARIAH Media Suite, showing the potential of using the automatic speech transcripts (ASR) that are progressively being added to The Netherlands Institute for Sound and Vision audiovisual collection, and made available to researchers via the Media Suite. The tutorial showed, from a source and tool criticism perspective, how the Media Suite makes it possible to search AV content using the ASR transcripts. Participants were invited to reflect (using the tools provided for metadata inspection) on the consequences of doing research with automatic annotations which are constantly growing (but often “incomplete”), and cannot be 100% accurate.
  • A tutorial by Bernhard Rieder and Thomas Poell (University of Amsterdam) exemplified how to do research about online social media activity based on material produced by public broadcasters. They explained how they extracted and used YouTube data (with one of the tools offered by the Digital Methods Initiative at the University of Amsterdam) in combination with broadcast programs from The Netherlands Institute for Sound and Vision’s archive, which is made available to researchers via the CLARIAH Media Suite. Their project on the European refugee crisis (from 2013) consisted of finding YouTube clips from public broadcasters by means of matching the automatic speech transcripts, and of analysing the related online social media activity.
  • A tutorial by Lauren Tilton and Taylor Arnold on using the “Distant Viewing” Toolkit for the analysis of images using deep learning. The tutorial offered the participants the opportunity to learn the basics of image processing in Python and the concepts of deep learning for images, and to apply the Distant Viewing toolkit to moving images.
  • An introduction to the Media Ecology Project (MEP) and a practical hands-on tutorial with the Semantic Annotation Tool (SAT) was given by John Bell and Mark Williams. The participants learned how to easily embed the SAT annotation client, “waldorf.js” plugin (“a drop-in module that facilitates the creation and sharing of time-based media annotations on the Web”) in any website that streams video content. These annotations can be stored and made collaboratively thanks to the SAT “statler” server.  


The AVinDH workshop also included “lightning talks” in which the participants presented their ongoing AV-related research:

  • Manuel Burghardt, from the Computational Humanities Group at Leipzig University, introduced the “Scalable MovieBarcodes,” an exploratory interface for the analysis of movies.
  • Nanne van Noord, from Amsterdam University, described the “Sensory Moving Image Archive” (SEMIA) project, and how they used computer vision to analyse non-verbal syntactic features in moving image material.
  • Joanna Byszuk, a stylometrist from the Institute of Polish Language, introduced her work on Distant reading television, “a stylometry of textual layer of television shows and a few problems related to its representativeness, features, and authorship.”
  • Susan Aasman, media scholar from Groningen University, presented work-in-progress of the research project "Intimate histories; a web-archaeological approach to the early history of YouTube."
  • Melvin Wevers, postdoctoral researcher in the Digital Humanities Lab at the KNAW Humanities Cluster, explained why “drawing boxes” is difficult, showing the challenges of using computer vision for the identification of people’s gender, to be used in the subsequent study of gender representation in newspaper advertisements.
  • Liliana Melgar, from Utrecht University and The Netherlands Institute for Sound and Vision (NISV), also on behalf of Mari Wigham, data scientist from NISV, both working for the CLARIAH project, argued that the CLARIAH Media Suite’s graphical user interface (GUI) should work in combination with Jupyter Notebooks, facilitating the analysis of audiovisual data in a flexible and transparent way.

Access to AV collections in digital research environments

Automatic speech transcripts (ASR) and the automatic identification of objects and faces are tangible, already applied benefits of computational methods for improving access to the AV collections provided by archives and broadcasters. The following presentations dealt with “portals,” “platforms,” or “research environments” that combine archival metadata with automatic annotations, plus other annotation facilities, to support researchers in exploring and working with AV collections.

I-Media-Cities: aggregated portal of European film archival collections

The I-Media-Cities platform offers access to film archival material from and about the nine participating cities (Athens, Barcelona, Bologna, Brussels, Copenhagen, Frankfurt, Stockholm, Turin, and Vienna). The platform is the result of an H2020 project which started in 2016 and finished this year. The presentation focused on explaining the project and the functionality of the platform, which gives researchers and the general public access to these collections. A positive aspect of the platform is that the copyright status of each media item is clearly specified and used for filtering. This is useful since it gives the creators of this aggregated platform the flexibility of keeping the platform open while still locking certain content for viewing (in those cases, the user has to request access to the content from the provider). The CLARIAH Media Suite, mentioned below, takes a different approach to providing access to AV collections that are heavily restricted by copyright.

Finding and discovering AV content via the I-Media-Cities platform is facilitated by a combination of manually generated metadata (provided by the archives) and automatically generated metadata (mostly using shot-boundary detection and object detection algorithms that were obtained in collaboration with the Fraunhofer Institute). User annotations can be added to an individual shot or, if the user chooses to annotate a fragment, to the group of shots that compose the fragment (see image below). Those annotations can be tags, geotags, notes, links, or references. The automatic annotations (e.g., the label “single person”) are added to each shot together with the user-created annotations.
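To make the per-shot model concrete, here is a minimal sketch of how a fragment annotation could end up attached to every shot it touches. This is not the platform's actual implementation: the `Shot` class, the function names, and the example timings and tags are all hypothetical and only illustrate the idea described above.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    """Hypothetical shot record: a time span plus the annotations attached to it."""
    index: int
    start: float  # seconds, inclusive
    end: float    # seconds, exclusive
    annotations: list = field(default_factory=list)


def shots_in_fragment(shots, frag_start, frag_end):
    """Return the shots that overlap the half-open interval [frag_start, frag_end)."""
    return [s for s in shots if s.start < frag_end and s.end > frag_start]


def annotate_fragment(shots, frag_start, frag_end, tag):
    """Attach one tag to every shot the fragment touches; return their indices."""
    hit = shots_in_fragment(shots, frag_start, frag_end)
    for shot in hit:
        shot.annotations.append(tag)
    return [s.index for s in hit]


# Three made-up shots, the first already carrying an automatic label.
shots = [Shot(0, 0.0, 4.2), Shot(1, 4.2, 9.0), Shot(2, 9.0, 15.5)]
shots[0].annotations.append("single person")

# A user annotates a fragment from 3.0s to 10.0s, which spans all three shots.
touched = annotate_fragment(shots, 3.0, 10.0, "market square")
print(touched)
```

In this sketch the fragment itself is not stored; only the shots keep the tag, so the annotation's original start and end times are lost unless they are recorded separately.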

The fact that the annotations are added and displayed per shot invites a high level of granularity (see the purple and orange dots in the image below). However, one wonders whether this approach will suit the needs of researchers (and the general public) who need to annotate fragments that unfold temporally, and not the “static phenomena deduced from individual frames,” as was argued during the panel “Between Data Mining and Human Experience – Digital Approaches to Film, Television and Video Game Analysis” organized by the CinePoetics Center for advanced Film Studies. I hope that future collaborations between I-Media-Cities and other groups of researchers investigating scholarly annotations of moving images will facilitate the sharing of expertise, to the benefit of the user community of this valuable audiovisual heritage.

Image 2. Screenshot from the “I-Media-Cities” platform (July 2019).

The project ended this year, but the consortium will take care of the sustainability of the platform and will invite other institutions to become content providers. It was also mentioned during the presentation that user participation via crowdsourcing is envisioned, but it was not yet clear which approach will be used for user engagement and for keeping the workflows for user and automatic annotations connected. The code of this platform will be made available as open source, as announced on the project’s website.

The CLARIAH Media Suite: transparent research support

The CLARIAH Media Suite, a research environment built to facilitate access to and research on important audio-visual archives in The Netherlands, was presented at the AVinDH workshop (synopsis above), and in a paper at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software.” In the panel’s paper, Jasmijn van Gorp (also on behalf of her co-authors) introduced the project, focusing on the participation of “users” in a “co-development” approach for building a sustainable infrastructure. The paper shows examples that we used in this project to involve scholars in the process of incorporating collections and building functionalities to work with them. The Media Suite has, for the first time, provided online access to the entire audiovisual collection of The Netherlands Institute for Sound and Vision. In the context of the CLARIAH Core project, workflows are in place to progressively generate automatic transcripts from the audio signal. Within the CLARIAH Plus project, which is just starting, other automatic annotations and other audiovisual collections will be made available via the Media Suite. To incorporate these collections in a way that makes them usable for researchers, the researchers’ constant and active participation is required, which also demands more innovative user-centered design approaches. Our presentation is available here.

Computer vision to improve access to the BBC and BFI collections

In the paper “Computer Vision Software for AV Search and Discovery” presented at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software,” Giles Bergel, from the Visual Geometry Group (VGG) at Oxford University, introduced ongoing collaborative efforts for integrating automatic annotations in the collection management of the BBC and the British Film Institute, mostly focused on providing access to and experimenting with facial recognition across large datasets.

In the same vein as the Media Ecology Project, the presenter highlighted the need for creating an “ecosystem” (in terms of integrative data flows and collaboration between institutions) in which archival metadata and automatic annotations can improve reciprocally. The presenter also proposed a “multimodal approach to search,” which benefits from machine learning applied to exploiting the correspondences between the audio and visual content of videos. This research has resulted in a live demo for visual search of BBC news based on objects, persons, text, and query by example. More live demos of image search in important collections are available on the VGG website!

Image 3. Screenshot from “BBC News Search” powered by Oxford’s Visual Geometry Group (July 2019).

Boldly computational

  • The presentation “Seen by Machine: Computational Spectatorship in the BBC television archive” by Daniel Martinez Heras (also on behalf of his co-authors) showed and reflected upon the BBC’s “Made by machine” project, which used machine learning to produce a new program based on footage extracted from the BBC archives. Both the selection of the clips and their processing used machine learning approaches. The automatically generated program, which the author sees as a promising connection with contemporary aleatory music, was broadcast on BBC Four in September 2018. The presenter turned his attention to the comments added by the spectators of this machine-made program. Daniel showed the negative, confused, but also sometimes enthusiastic and enlightening feedback, all of which he constructively uses to build the concept of computational spectatorship: “a way to understand how our visual regimes are increasingly mediated by machine-seers.”
  • Lauren Tilton and Taylor Arnold presented their “Distant viewing” project and toolkit for using deep learning in the analysis of images. This is a software library that enables research into moving images via the automatic detection of stylistic elements (colour, lighting), structural elements (shot boundaries), content elements (object and face identification), sound elements, and transcripts of spoken word. Lauren and Taylor gave a tutorial at the AVinDH workshop, and a paper presentation at the panel “Exploring Audiovisual Corpora in the Humanities: Methods, Infrastructure, and Software.”
  • The paper “Early Modern Computer Vision” by Leonardo Laurence Impett shows an intriguing and exciting alternative angle on the way computer vision is currently understood in artificial intelligence. He revisits historical theories of vision and early experiments. He also experiments with those theories by presenting a prototype based on Italian theories of optics, vision and visual art of the 16th century. In my view, this is a great example of how critical experimentation can be applied in humanities scholarship, by testing the interaction between foundational premises and the way systems work. This idea is aligned with Geoffrey Rockwell’s call, during his talk “Zombies as tools” at the DLS workshop, for more replication-like experiments with early tools.
  • “Deep Watching: Towards New Methods of Analyzing Visual Media in Cultural Studies”, by Bernhard Bermeitinger and others, discusses two examples of using computational methods: for the identification of objects, symbols and “persons, and their mimics and postures” in YouTube videos, and for the analysis of trading cards of the actress Marlene Dietrich.

Assisted AV manual and semi-automatic annotation

  • Film scholar Mark Williams and his team named their project “Media Ecology” (MEP). This name conveys the idea of how the need to train computer vision algorithms, both for curatorial and scholarly use, has created “collaborative synergies” between archives and researchers, in which annotation is a central activity. The MEP tools that facilitate this “ecology” of creation and exchange between scholarly annotations and the AV media archives are: the Semantic Annotation Tool (SAT) for semantic annotation; a tool for vocabulary sharing; and the “Machine Vision” prototype for searching automatic annotations.
  • The panel “Between Data Mining and Human Experience – Digital Approaches to Film, Television and Video Game Analysis” organized by the CinePoetics Center for advanced Film Studies at the Freie Universität Berlin gave prominent attention to the methodological aspects of annotating “temporal arts and media” (film, television, web videos, and video games) for scholarly purposes, and to the practical implementations they have been working on. The presenters challenge existing qualitative methodologies in the humanities, which may not be suitable for the study of aesthetic and compositional patterns in temporally-based media, since they focus on the “isolation of features,” while aesthetic analyses have a more integrative perspective. One of the papers (“Researching and Annotating Audiovisual Patterns – Methodological Considerations”) digs into the requirements for annotation tools that would facilitate annotation practices rooted in both film theory and qualitative research, showing their experience with the video annotation tool ADVENE. To tackle the issue of the high time investment that this fine-grained annotation of films requires (“involving several hours of work per minute of film”), the CinePoetics team has worked on developing and integrating a controlled vocabulary and ontology of film analytical concepts into the annotation tool (see the ADA ontology project, ADA public Github repository; and ontology in Ontoviz), and on developing semi-automatic classification of audiovisual patterns (for shot detection, colour range detection, automatic speech recognition, visual concept detection, and image search support). This systematic approach to annotation, in combination with a tool that supports layered annotation, a scholarly-based ontology, and the combination of automatic and manual annotations, results in very impressive visualizations of what the presenters call “film scores” (image below). These annotations are published as linked open data via a SPARQL endpoint. The interactive querying and visualization of the semantic annotations, plus the viewing of the semantically annotated videos, can be done via the ADA Annotation Explorer, developed with the web-based, open source hypervideo software FrameTrail.
  • The presentation by Michael Bourgatte (Institut Catholique de Paris), “Prendre en Compte le Contexte d’Usage et le Contexte Technique dans le Développement du Service d’Annotation Vidéo Celluloid,” showed the manual video annotation tool (Celluloid) that was developed within a socio-constructivist approach to support annotation as an essential activity for “active engagement” with audiovisual sources in education and research contexts. This work was inspired by Lignes de Temps and other relevant annotation tools used in film or performative analyses, but it proposes a different approach to the display of the annotations (not as separate from the video, but integrated within it).

Image 4. “Film scores”: different levels of annotations, screenshot from ADVENE (taken from the presenters’ published paper, July 2019).

Doing history with recorded interviews

Oral historians have relied in the past, to a great extent, on the analysis of manual transcriptions of the audio or video recordings of the interviews they conduct as part of their investigations. But AV technologies are bringing new opportunities for doing history with digital AV sources. At DH2019 oral historians were well represented, with two workshops and a presentation:

  • The workshop “A transcription portal for oral history research and beyond”, organized by Henk van den Heuvel and coauthors, introduced the participants to the recently launched transcription chain prototype, the “T-Chain portal.” This portal, built together with oral historians, linguists and researchers from other disciplines interested in doing research with and about automatic speech recognition, allows researchers to upload their audio-recorded interviews, use the open source automatic speech recognition software available for the language of the interview (English, Dutch, German, Italian), and correct and export the automatically generated transcript.
  • The workshop “Oral history: A Multidisciplinary Approach To The Use Of Technology In Research: The Case Of Interview Data” organized by Arjan van Hessen and coauthors, presented the work done in a previous series of workshops on this topic, supported by CLARIN. These workshops had a focus on the “multidisciplinary potential of interview data (history, oral and written language, audio-visual communication),” and on seeking synergy between the different methods and tools to work with AV data used in different disciplines. Along those lines, the DH workshop focused on sharing experiences about the organization and conclusions of this series of workshops, and on developing the skills of participants working with digital tools to study interviews.
  • The presentation by Norah Karrouche, “Still Waters Run Deep. Including Minority Voices in the Oral History Archive Through Digital Practice,” critically reflects upon three aspects: 1) the content/focus of the oral history projects conducted in the past two decades in The Netherlands, which have given priority to WWII memories, excluding other underrepresented topics and groups; 2) the lack of integration of digital methods in oral history, due to neglect, by other disciplines but also within the discipline itself, of the validity and usefulness of oral history and digital methods, which are only marginally incorporated at universities in The Netherlands; 3) the difficulties and lack of awareness among oral historians regarding the different regulations that could make more oral history collections open. Norah combines these reflections into a proposal that seeks cooperation between CLARIAH (Common Lab Research Infrastructure for the Arts and Humanities), a large-scale digital humanities research infrastructure project in the Netherlands, and the small community archive Verhalenhuis Belvédère in Rotterdam. The aim is to analyze and improve the workflows that could make community archives more open to researchers, and to explore how these archives can benefit from existing tools, and tools in development, that facilitate research with AV data. There will be a workshop after DH where Norah will put all these ideas into practice.

Other related presentations

I have listed and commented on some of the most explicitly AV-related contributions at DH2019, but there were several other papers which dealt with the topic indirectly.

Concluding remarks

Image 5. Tweet by Martijn Kleppe (July 12, 2019).

This post presented a comprehensive (but certainly incomplete!) summary of the main AV-related contributions at the DH2019 conference. The number and breadth of the projects listed here, many of which have adopted computational methods in either a basic or more adventurous way, are an explicit sign of “a turn toward the visual in digital humanities research,” about which historians Melvin Wevers and Thomas Smits, innovators in computationally-based historical research on images from digitized Dutch newspapers, wrote a significant contribution (Wevers and Smits, 2019).

An obvious conclusion from this overview is the great potential for “collaborative synergies” (as film scholar Mark Williams likes to emphasize) between the discussed AV-based projects in scholarly research and digital curation. As most of these projects have the annotation scholarly primitive as a basis, more tools will be developed, with different “ergonomics” adapted to their user groups (e.g., to the needs of media, television, and performing arts scholars, oral historians, linguists, or curators). To achieve stronger synergies, sharing these annotations will become more urgent, which calls for an interoperability framework for sharing and reusing scholarly annotations. This framework doesn't necessarily have to be based only on sharing the vocabularies used for annotation (since these are discipline-dependent, as we saw in the ADA ontology project described here). A more discipline-independent approach is to use the W3C annotation model, which aims to “enable annotations to be shared and reused across different hardware and software platforms.” With the aim of exploring this further, we have promoted, within CLARIAH, the creation of an expert interest group of developers of AV-annotation tools, called VAINT, which stands for Video (time-based media) Annotation Interoperability iNterest Group. This group investigates how to adapt the generic W3C annotation model for sharing scholarly annotations, with a focus on time-based media, and also looks for synergies with the IIIF interoperability framework.
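As a rough illustration of what a shareable annotation on time-based media can look like, the sketch below builds a W3C Web Annotation in JSON-LD that targets a time range of a video through a media-fragment selector. It is not a VAINT recommendation or any tool's export format: the function name, the example.org URLs, and the annotation text are hypothetical, and the choice of a plain textual body is an assumption made for brevity.

```python
import json


def format_seconds(value):
    """Render seconds without trailing zeros, e.g. 30 -> '30', 62.5 -> '62.5'."""
    return format(value, ".3f").rstrip("0").rstrip(".")


def make_time_annotation(anno_id, video_url, start_s, end_s, text):
    """Build a W3C Web Annotation (JSON-LD) targeting a time range of a video."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": text, "purpose": "commenting"},
        "target": {
            "source": video_url,
            "selector": {
                # Temporal fragment expressed with the Media Fragments URI syntax.
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"t={format_seconds(start_s)},{format_seconds(end_s)}",
            },
        },
    }


# Hypothetical identifiers, used only for illustration.
annotation = make_time_annotation(
    "https://example.org/annotations/1",
    "https://example.org/videos/interview.mp4",
    30,
    62.5,
    "Interviewee describes the harbour",
)
print(json.dumps(annotation, indent=2))
```

Because the target is expressed as a source plus a selector, a tool that does not know the annotating discipline's vocabulary can still locate the annotated fragment; discipline-specific concepts would then go into the body.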

Looking forward to seeing more humanities research using the wealth of increasingly digitally available sound and audio-visual archives at DH2020 in Ottawa!