Search written and spoken Dutch with OpenSoNaR
On Tuesday 9 April, the Institute for the Dutch Language launched a new version of the OpenSoNaR web application, which lets you search large quantities of written and spoken Dutch. The application provides access to data from the SoNaR Corpus, a collection of written texts totalling more than 500 million words, and the Corpus Gesproken Nederlands (CGN), a collection of 900 hours of Dutch speech.
The new web application makes it possible to search the data of the two collections (corpora). The texts are annotated with additional linguistic information such as part of speech and lemma, and the sound fragments of the Corpus Gesproken Nederlands can be played. In the application you can search for a single word, or run a more complex search by selecting a specific annotation or by using regular expressions. It is also possible to save the search results, consult the search history and view frequency lists.
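To make the idea of searching by annotation or regular expression concrete, here is a minimal Python sketch. It is not OpenSoNaR's implementation or query syntax: the `Token` type, the `search` function, the sample sentence and the simplified part-of-speech tags are all assumptions made up for illustration. It only shows how a token annotated with word form, lemma and part of speech can be matched on any of those layers, and how a frequency list can be derived from the same data.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """One annotated token (hypothetical structure, for illustration only)."""
    word: str
    lemma: str
    pos: str


# Hand-made sample: "De kinderen liepen naar huis de hond liep snel"
# The tags are simplified and illustrative, not taken from the corpora.
TOKENS = [
    Token("De", "de", "LID"),
    Token("kinderen", "kind", "N"),
    Token("liepen", "lopen", "WW"),
    Token("naar", "naar", "VZ"),
    Token("huis", "huis", "N"),
    Token("de", "de", "LID"),
    Token("hond", "hond", "N"),
    Token("liep", "lopen", "WW"),
    Token("snel", "snel", "ADJ"),
]


def search(tokens: List[Token],
           word: Optional[str] = None,
           lemma: Optional[str] = None,
           pos: Optional[str] = None) -> List[Token]:
    """Return tokens whose layers fully match the given regular expressions.

    A layer set to None is not constrained. Matching is case-insensitive.
    """
    def matches(pattern: Optional[str], value: str) -> bool:
        if pattern is None:
            return True
        return re.fullmatch(pattern, value, flags=re.IGNORECASE) is not None

    return [t for t in tokens
            if matches(word, t.word)
            and matches(lemma, t.lemma)
            and matches(pos, t.pos)]


# Search by lemma: finds every inflected form of "lopen".
by_lemma = search(TOKENS, lemma="lopen")

# Combine a regular expression on the word form with a part-of-speech filter.
nouns_with_h = search(TOKENS, word="h.*", pos="N")

# Frequency list of lemmas, most frequent first.
lemma_frequencies = Counter(t.lemma for t in TOKENS).most_common()
```

In this sketch, searching on the lemma `lopen` returns both `liepen` and `liep`, which a plain word search would miss; that is the practical benefit of the lemma annotation described above.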
CGN and SoNaR
The Corpus Gesproken Nederlands (CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech from Flemish and Dutch speakers. The speech fragments (spontaneous and prepared) come with various transcriptions (e.g. orthographic, phonetic) and annotations (syntactic, POS tags). The SoNaR Corpus contains more than 500 million words of text from various domains and genres.
All texts were automatically tokenized, POS-tagged and lemmatized. The named entities were also labelled.
OpenSoNaR can be accessed free of charge with a university user account, or with a CLARIN account. The application was developed by a team from the Institute for the Dutch Language, Tilburg University and Radboud University, within the CLARIN-NL and CLARIAH projects.