[2506.13792] ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Summary
ICE-ID is a benchmark dataset of 984,028 records from 16 Icelandic census waves (1703-1920), with 226,864 expert-curated person identifiers, built for longitudinal identity resolution.
Why It Matters
Standard entity-resolution benchmarks focus on product matching or citation records and do not capture hierarchical geography, patronymic naming, sparse kinship links, or multi-decadal temporal drift. ICE-ID gives researchers and practitioners a benchmark for identifying the same person across census waves under exactly those conditions.
Key Takeaways
- ICE-ID includes 984,028 records from 16 Icelandic census waves spanning roughly 220 years (1703-1920).
- The dataset addresses unique challenges like hierarchical geography and patronymic naming conventions.
- It provides tools for interactive exploration and analysis of identity resolution.
- Baseline model comparisons are included to benchmark performance against classical datasets.
- The dataset is publicly available for research and development purposes.
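To make the patronymic-naming challenge concrete, here is a minimal sketch of blocking-based candidate generation. It is not the authors' method, and the record fields (`id`, `wave`, `first`, `patronym`, `birth_year`) are hypothetical stand-ins, not the actual ICE-ID column names. The idea it illustrates: an Icelandic patronymic is built from the father's given name plus a gendered suffix, so siblings share a stem but not a surname, and a blocking key has to include more than the "last name" to separate them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; the real ICE-ID schema may differ.
RECORDS = [
    {"id": 1, "wave": 1801, "first": "Jón", "patronym": "Einarsson", "birth_year": 1770},
    {"id": 2, "wave": 1816, "first": "Jón", "patronym": "Einarsson", "birth_year": 1771},
    {"id": 3, "wave": 1816, "first": "Guðrún", "patronym": "Einarsdóttir", "birth_year": 1772},
    {"id": 4, "wave": 1816, "first": "Jón", "patronym": "Ólafsson", "birth_year": 1770},
]


def patronym_stem(patronym):
    """Strip the gendered suffix so 'Einarsson' and 'Einarsdóttir' share a stem."""
    lowered = patronym.lower()
    for suffix in ("dóttir", "dottir", "son"):
        if lowered.endswith(suffix):
            return lowered[: -len(suffix)]
    return lowered


def blocking_key(rec, bucket=5):
    """Assumed key: given name, patronymic stem, and a coarse birth-year bucket."""
    return (rec["first"].lower(), patronym_stem(rec["patronym"]), rec["birth_year"] // bucket)


def candidate_pairs(records):
    """Return sorted id pairs that share a block and come from different census waves."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if a["wave"] != b["wave"]:  # a person appears at most once per wave
                pairs.add((min(a["id"], b["id"]), max(a["id"], b["id"])))
    return sorted(pairs)


if __name__ == "__main__":
    print(candidate_pairs(RECORDS))
```

On this toy data the six possible record pairs shrink to a single candidate, records 1 and 2. Fixed birth-year buckets are a simplification: two records whose reported birth years straddle a bucket boundary would be missed, so a real pipeline would likely use overlapping windows or several complementary keys.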
Computer Science > Artificial Intelligence
arXiv:2506.13792 (cs)
[Submitted on 11 Jun 2025 (v1), last revised 23 Feb 2026 (this version, v2)]
Title: ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Authors: Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson, Tangrui Li, Pétur Húni Björnsson, Eiríkur Smári Sigurðarson, Jilles S. Dibangoye
Abstract: We introduce ICE-ID, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703-1920), with 226,864 expert-curated person identifiers. ICE-ID combines hierarchical geography (farm → parish → district → county), patronymic naming conventions, sparse kinship links (partner, father, mother), and multi-decadal temporal drift, challenges not captured by standard product-matching or citation datasets. This paper presents an artifact-backed analysis of temporal coverage, missingness, identifier ambiguity, candidate-generation efficiency, and cluster distributions, and situates ICE-ID against classical ER benchmarks (Abt-Buy, Amazon-Google, DBLP-ACM, DBLP-Scholar, Walmart-Amazon, iTunes-Amazon, Beer, Fodors-Zagats). We also define a deployment-faithful temporal OOD protocol and release the dataset, splits, reg...