Submitted by Karolina Badzmierowska on 23 October 2017

Tour de CLARIN

“Tour de CLARIN” is a new CLARIN ERIC initiative that aims to periodically highlight prominent User Involvement (UI) activities of a particular CLARIN national consortium. The highlights include an interview with one or more prominent researchers who use the national consortium’s infrastructure and can tell us more about their experience with CLARIN in general; one or more use cases that the consortium is particularly proud of; and any relevant user involvement activities carried out. “Tour de CLARIN” helps to increase the visibility of the national consortia, reveal the richness of the CLARIN landscape, and display the full range of activities throughout the network. The content is disseminated via the CLARIN Newsflash and blog posts, and linked to on our social media: Twitter and Facebook.

The Netherlands

CLARIAH-NL is a project in the Netherlands that is setting up a distributed research infrastructure that provides humanities researchers with access to large collections of digital data and user-friendly processing tools. The Netherlands is a member of both CLARIN ERIC and DARIAH ERIC, so CLARIAH-NL contributes not only to CLARIN but also to DARIAH. CLARIAH-NL covers not only humanities disciplines that work with natural language (the defining characteristic of CLARIN) but also disciplines that work with structured quantitative data. Though CLARIAH aims to cover the humanities as a whole in the long run, it currently focusses on three core disciplines: linguistics, social and economic history, and media studies.

CLARIAH-NL is a partnership that involves around 50 partners from universities, knowledge institutions, cultural heritage organizations and several SAB-companies, the full list of which can be found here. Currently, the data and applications of CLARIAH-NL are managed and sustained at eight centres in the Netherlands: Huygens ING, the Meertens Institute, DANS, the International Institute for Social History, the Max Planck Institute for Psycholinguistics, the Netherlands Institute for Sound and Vision, the National Library of the Netherlands, and the Institute of Dutch Language. Huygens ING, the Meertens Institute, the Max Planck Institute for Psycholinguistics, and the Institute of Dutch Language are certified CLARIN Type B centres. The consortium is led by an eight-member board, and its director and national coordinator for CLARIN ERIC is Jan Odijk.

The research, development and outreach activities at CLARIAH-NL are distributed among five work packages: Dissemination and Education (WP1) and Technology (WP2) deal respectively with User Involvement and the technical design and construction of the infrastructure, whereas the remaining three work packages focus on three selected research areas: Linguistics (WP3), Social and Economic History (WP4) and Media Studies (WP5).


The full blog can be read here:



17 October 2017, Christian Olesen

In early September, Liliana Melgar and I (Christian Olesen) received an invitation from Barbara Flückiger, Professor in Film Studies at the University of Zürich, to participate in the “Colloquium Visualization Strategies for the Digital Humanities”. The aim of the day was to bring together experts to discuss film data visualization opportunities in relation to Professor Flückiger’s current research projects on the history of film colors. Currently, Flückiger leads two large-scale projects on this topic: the ERC Advanced Grant FilmColors (2015-2020) and the Filmfarben project funded by the Swiss National Science Foundation (2016-2020). A presentation of the projects’ team members can be found here.

As a scholar, Barbara Flückiger has in-depth expertise on the interrelation between film technology, aesthetics and culture covering especially aspects of film sound, special effects, film digitization and film colors in her research. In recent years, her research has increasingly focussed on film colors, especially since the launch of the online database of film colors Timeline of Historical Film Colors in 2012 after a successful crowdfunding campaign. The Timeline of Historical Film Colors has since grown to become one of the leading authoritative resources on the history and aesthetics of film colors – it is presented as “a comprehensive resource for the investigation of film color technology and aesthetics, analysis and restoration”. It is now consolidating this position as it is being followed up by the two large-scale research projects mentioned above which merge perspectives from film digitization, restoration, aesthetic and cultural history.

These projects are entering a phase in which the involved researchers are beginning to conceive ways of visualizing the data they have created so far and need to consider the potential value which data visualization may have for historical research on film color aesthetics, technology and reception.

The full report, with many impressions from the visit, can be read here.

On Friday, October 6th 2017, an enthusiastic group of engineers and digital humanities scholars gathered for the third annual CLARIAH Tech Day. There was an activist mood: this time we would do things differently!

Many developers in the project wanted a meeting in which building stuff would be the focus instead of listening to presentations on how other people had built stuff. The weeks before had seen a flurry of emails on the contents of such a day and the agenda, but also on doubts and concerns. And the truth was: none of us actually had the foggiest idea of how to do this.

No one had the foggiest idea of how to do this Tech Day!

I was asked to take the lead, and together with Roeland Ordelman, Richard Zijdeman and Marieke van Erp I sat down during the CLARIN Meeting in Budapest to kick around some ideas. We settled on a hackathon/unconference-style format. The agenda would be open to suggestions from the community and not be set until the meeting itself. And I’ll confess - I had some prior hesitations about this open format: what if nobody came up with anything? Wouldn’t people want to know what the meeting was about before making time in busy schedules? But this was what the community itself had repeatedly asked for, so damn the torpedoes - full steam ahead.

And we were not disappointed! The ideas, suggestions and questions poured in and were eventually gathered into four main topics:

  1. Integration and modelling of shared data between the various domains and the generic CLARIAH infrastructure;
  2. Continued development of GRLC;
  3. A discussion on workflows, and how tool selection based on data mime-type can provide guidance for users;
  4. TEI/exist-db/TEIPublisher and Oxygen as the basis for digital editions and linguistic querying.

The enthusiastic response continued into the event itself. It became immediately obvious that the restyled Tech Day would also be a lot of fun. The smiles, enthusiasm and flexibility were fantastic. The many developers who had come from all over CLARIAH had also brought many guests, turning this into a truly international day that generated a very positive vibe of its own.

After a five-minute pitch for each topic, the community basically took over the pantry, restaurant and meeting rooms at the IISH building. You could find groups of engineers working, discussing and building stuff everywhere. And these groups were extremely varied: people from Media Studies discussing GRLC with engineers working in the field of Social and Economic History, and linguists and lexicographers getting stuff done with developers working on generic infrastructure. Many new ideas were born that day.

Lunch at the IISH

A lot of progress was made on the four main topics. Both Open Dutch Wordnet and the first version of the diachronic lexical corpus Diamant (INT, Kathrien Depuydt and Jesse de Does) were connected to the generic infrastructure, as were catalogues provided by NISV, and the Gemeente Geschiedenis dataset on Dutch municipalities (by Hic Sunt Leones). Carlos Martinez and a group of engineers added to GRLC the automatic inclusion of SPARQL queries stored on GitHub. And there were plenty of discussions on planned and unplanned subjects. Jan Odijk and Jesse de Does ran a very interesting meeting on workflow systems, and Eduard Drenth (Fryske Akademy) presented his ideas on digital editions, followed by a very detailed open discussion on the pros and cons of the software stack he proposed.

Completely spontaneously, Richard Zijdeman showed us a new way of implementing HDMI for the improvement of health in CLARIAH, and Roeland Ordelman and Liliana Melgar came up with very interesting ideas on a user workspace that may eventually become part of the generic infrastructure. Although interest in the former was quite short-lived, the latter we are definitely going to test.

In short: the CLARIAH tech community rallied around the open format! During the final meeting I was happy to announce that, given the excitement and energy, the board had decided right then and there that we could run another Tech meeting in late winter or early spring 2018. And, illustrating that enthusiasm, the first ideas for this meeting are already coming in.


Gertjan Filarski


The 2017 CLARIN Annual Conference was held from September 18 through September 21 in Budapest.

The pre-conference part of the first day was dedicated to committee and task force meetings such as the national coordinators’ forum (Jan Odijk participating on behalf of the Netherlands), the user involvement group (NL represented by Patricia Alkhoven), the Standards committee (with Daan Broeder and Jan Odijk on behalf of NL), and others.

There were about 170 participants from the 19 CLARIN ERIC members and 2 observer states, as well as from organisations and countries with whom cooperation discussions are ongoing. The Netherlands delegation was relatively small in comparison to other years (only 7 delegates this year), but both keynote speakers were from the Netherlands: Karina van Dalen-Oskam (Huygens ING) spoke about her work on stylometric research, and Piek Vossen (Free University Amsterdam) spoke about the principles and research questions behind extracting information from natural language texts and representing it as linked data.

I found a number of things noteworthy: first, many data from Europeana have now been included in the Virtual Language Observatory, bringing the number of metadata records from some 900K to over 1.6 million.
Second, improvements and extensions of the CLARIN Language Resource Switchboard (CLRS) created in the CLARIN-PLUS project were reported on. CLRS makes it possible to automatically associate data with applications that apply to them: such an application can then be applied to the data by a user through a single click. This significantly lowers the barrier for using these applications, and it is worthwhile to investigate whether more Dutch applications can be included in the CLRS (currently only some of the Nijmegen applications are included). The concept could be applied in the other CLARIAH core disciplines (social economic history and media studies) as well.
Third, Poland has a very active community and is providing an increasing number of data sets, applications and web services. And finally, all member and observer countries are now connected through federated login, connecting more than 20 countries and thousands of organisations, an impressive achievement indeed!

This year, Paul Meurer from Uni Research Computing, Norway, was awarded the Steven Krauwer Award for CLARIN Achievements. As in past years, the Bazaar was again an informal and very lively event to share the latest ideas and developments. One very interesting example was Ramble On by the Italian DH Group from Trento. Ramble On allows you to analyse, for example, the mobility of famous individuals from the past by applying Natural Language Processing modules to unstructured texts.

The social programme was very attractive as well. On the first day we had a reception at the Academy Building, which is quite impressive, and the trip there offered views of parts of Budapest along the Danube. On Tuesday there was a dinner on a boat that travelled up and down the Danube, with spectacular views of the beauty of Budapest.

Budapest, by the way, is considered the birthplace of CLARIN: it was pointed out by Tamás Váradi, the local organizer, and confirmed by Steven Krauwer that CLARIN originated here in a workshop in 2006.

I enjoyed the conference very much and I am looking forward to the 2018 CLARIN Conference, for which the dates and location are not yet known.

Jan Odijk

(by Marieke van Erp)

On 19 and 20 June, the First International Conference on Language, Data and Knowledge (LDK2017) took place in Galway, Ireland. The conference wasn’t too big (~80 participants) and featured a broad and interesting single track programme. It had been a while since I had attended a single track conference, and I had kind of forgotten how much I liked it, so I hope the organisers keep that for the next edition (Leipzig 2019).

CLARIAH collaborator Antal van den Bosch kicked off the conference with the first keynote titled “Processing Text as Socio-Economic and Cultural Data” in which he featured several social sciences and digital humanities text analytics use cases. I really liked his call for a holistic approach to language (which I interpreted as lying at the heart of the conference theme) namely combining whatever information and approaches are available to answer the deeper questions:

From: Processing Text as Socio-Economic and Cultural Data

After his talk, an audience member mentioned that he found that approaches presented at digital humanities conferences are often still fairly coarse-grained, which may be a result of researchers expecting 100% accuracy. This is something that I have noticed before and which was also a big theme in the second keynote of the conference, by Zoltán Slávik (IBM), who argued that technology developers have a huge responsibility to manage expectations. I think Antal’s answer to the audience question reflected this, and he included a remark on keeping the human in the loop, which is also the direction IBM seems to be taking with porting Watson to the medical domain.



Of particular interest

Most of the talks were really interesting; for the full programme and proceedings, see the conference website. Here are a few papers that I think are particularly interesting to the CLARIAH community.

On the creation of resources: There was an interesting paper on the creation of an ontology for linguistic terminology (OnLiT: An Ontology for Linguistic Terminology, Bettina Klimek, John P. McCrae, Christian Chiarcos and Sebastian Hellmann), which aims to provide an interoperable model and dataset for linguistic terminology. One of the things we have run into in WP3 is that there are different glossaries and similar resources around for describing different linguistic concepts; OnLiT may be an interesting option to look at to start integrating them. One issue that may arise came up in Maria Keet’s presentation on “Representing and aligning similar relations: parts and wholes in isiZulu vs English”: certain concepts present in one language may not exist in another, or only partly. I am not sure yet whether OnLiT can represent all of this, but it is still a work in progress.

Another issue in resource creation is the fact that the resource will always be a snapshot of the language at the time the resource was created. One of our most commonly used resources in language technology is WordNet, but it hasn’t been updated for 10 years. “To tweet” then was a verb that applied to birds; now it refers to posting on a microblog. John P. McCrae and Ian Wood presented a paper they wrote together with Amanda Hicks in which they aimed to extend WordNet with neologisms by gathering terms from Twitter and Reddit and filtering them through various sieves.

Poster session

During the very nice poster session, some interesting digital humanities use cases were presented. The first two are by the group of Hyvönen in Finland: “Named Entity Linking in a Complex Domain: Case Second World War History” by Erkki Heino, Minna Tamper, Eetu Mäkelä, Petri Leskinen, Esko Ikkala, Jouni Tuominen, Mikko Koho and Eero Hyvönen, and “Reassembling and Enriching the Life Stories in Printed Biographical Registers: High School Alumni on the Semantic Web” by Eero Hyvönen, Petri Leskinen, Erkki Heino, Jouni Tuominen and Laura Sirola.

What I liked about these is that they deal with real dirty data, and provide interesting examples of the things we can do with data from, for example, NIOD and biographical resources.

Another highlight of the poster session for me was “Exploring the Role of Gender in 19th Century Fiction Through the Lens of Word Embeddings” by Siobhán Grayson, Maria Mulvany, Karen Wade, Gerardine Meaney and Derek Greene. One reason for my interest is that I’m supervising an MSc thesis that deals with automatic analysis of novels; the other is that I liked how they visualised their results, which I think is very important, especially when working in interdisciplinary settings.

Exploring the Role of Gender in 19th Century Fiction Through the Lens of Word Embeddings by Siobhán Grayson, Maria Mulvany, Karen Wade, Gerardine Meaney and Derek Greene



  • Gaelic is super interesting, but also super complex (as Graham Isaac’s keynote made clear)
  • The crowd was still quite technical; more humanities researchers in attendance may spark even more interesting cross-disciplinary conversations
  • The weather in Ireland is really not that bad (but don’t forget your waterproof jacket)
  • Kathleen McKeown (Columbia University, the third keynote speaker) is definitely someone whose work is worth looking into, as I already mentioned in this blog post
  • Why aren’t more conferences doing BBQs?