NAMES: Dutch corpus of person name variants

Spelling variation, variants and digitization errors in person names are serious obstacles for search operations in historical documents. The NAMES project aimed to standardize 564,000 different surnames and 190,113 different given names with the help of the Clariah tool TICCL.

Principal investigator
  • Gerrit Bloothooft
Figure: Variant cloud of the name Elisabeth, where edges denote proven variant pairs. The size of a node is proportional to name frequency; all names in this set have a frequency above 9.

Spelling variation, variants and digitization errors in person names are serious obstacles for search operations in historical documents. One solution is to standardize the spelling of surnames and given names, but ambiguities and alternative interpretations make this a non-trivial task that requires expert evaluation assisted by automatic analysis.

The NAMES project aimed to standardize 564,000 different surnames and 190,113 different given names from 19th-century sources containing 52.5 million tokens, with the help of the Clariah tool TICCL. A subset of these names had already been related to a standard automatically, because they could be identified as having been used for the same individual. Experts reviewed this subset, which resulted in 127,154 surnames associated with 11,278 standards and 49,804 given names associated with 782 gender-independent standards. Unfortunately, TICCL did not succeed in supporting the extension of this set. Instead, brute-force comparison of the remaining names with names that already had a standard, combined with an increase in the number of standards, raised the coverage of standardized tokens to 99.43% for given names and 98.51% for surnames.
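To make the brute-force step concrete, the following is a minimal sketch of how an unstandardized name could be compared with every name that already has a standard, and how token coverage could then be computed. It is an illustration under assumptions, not the project's actual procedure: the project description does not state which similarity measure or threshold was used, so plain Levenshtein distance with a hypothetical `max_distance` of 1 stands in here, and the function names `closest_standard` and `token_coverage` are invented for this example.

```python
from typing import Dict, Optional, Set, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_standard(name: str,
                     known: Dict[str, str],
                     max_distance: int = 1) -> Optional[str]:
    """Compare `name` with every known variant (brute force) and return the
    standard of the nearest one, or None if nothing is within `max_distance`.

    `known` maps an already standardized variant to its standard.
    The threshold is an assumption made for this sketch.
    """
    best: Optional[Tuple[int, str]] = None
    for variant, standard in known.items():
        distance = levenshtein(name.lower(), variant.lower())
        if distance <= max_distance and (best is None or distance < best[0]):
            best = (distance, standard)
    return best[1] if best else None


def token_coverage(frequencies: Dict[str, int], standardized: Set[str]) -> float:
    """Share of all name tokens whose name form has been given a standard."""
    total = sum(frequencies.values())
    covered = sum(n for name, n in frequencies.items() if name in standardized)
    return covered / total if total else 0.0
```

In this sketch, a call such as `closest_standard("Elisabet", {"Elisabeth": "Elisabeth"})` would propose `Elisabeth` as the standard. Because of the ambiguities and alternative interpretations mentioned above, a proposal of this kind would still be a candidate for expert review rather than a final decision.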

Data will be made available in RDF format for linked open data and as a Lexicon service. In addition, digital versions of name dictionaries will be made accessible.

Researchers

Gerrit Bloothooft
PI
Assistant professor, Utrecht University