WWW2025
MixedSAND: Semantic Annotation of Mixed-unit Numeric Data
Amir Behrad Khorram Nazari, Davood Rafiei, Mario A. Nascimento
Cited by: 1
Abstract
Quantitative information about entities constitutes a significant portion of tabular data in open sources and data lakes. Such tables often lack consistent labeling and proper schema, posing significant challenges for querying and integration. This paper studies the problem of numerical column annotation in scenarios where quantitative data may be gathered from different sources and unit consistency is a concern. For instance, weight measurements may vary between entities, expressed in kilograms for some and pounds for others, with no accompanying unit information. We investigate the conditions for effectively annotating mixed-unit numeric data, introduce a benchmark for such an annotation task, and propose an algorithm that reliably detects semantic types (e.g., height) and links them to the corresponding types present in a knowledge graph. Our evaluation on a diverse set of columns with mixed units and varying levels of annotation difficulty shows that our method significantly outperforms strong baselines such as GPT-4o-mini and SAND in terms of accuracy, excelling in both detecting mixed units and annotating them with appropriate semantic labels. (All our code and data will be publicly released upon acceptance of the paper.)
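
To make the mixed-unit problem concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm proposed in the paper. It assumes a single column of positive weights recorded in either kilograms or pounds, and a hypothetical reference value (`reference_kg`, hard-coded here) for what a typical weight looks like in kilograms. The function names `guess_weight_units` and `to_kg` are invented for this illustration.

```python
import math

LB_PER_KG = 2.20462  # pounds in one kilogram


def guess_weight_units(values, reference_kg):
    """Label each positive value 'kg' or 'lb', whichever reading lands closer to reference_kg.

    reference_kg is an assumed typical weight in kilograms (hypothetical input).
    """
    labels = []
    for v in values:
        # Compare on a log scale so that distance reflects a multiplicative factor.
        dist_if_kg = abs(math.log(v / reference_kg))
        dist_if_lb = abs(math.log((v / LB_PER_KG) / reference_kg))
        labels.append("kg" if dist_if_kg <= dist_if_lb else "lb")
    return labels


def to_kg(values, labels):
    """Convert every value to kilograms according to its guessed unit label."""
    return [v if u == "kg" else v / LB_PER_KG for v, u in zip(values, labels)]


# A column of adult body weights with no unit information (made-up values).
weights = [70, 165, 82, 180, 68, 150]
labels = guess_weight_units(weights, reference_kg=75.0)
normalized = to_kg(weights, labels)

print(labels)  # ['kg', 'lb', 'kg', 'lb', 'kg', 'lb']
print([round(x, 1) for x in normalized])
```

This sketch only works because the semantic type (weight) and a plausible reference value are given in advance. In the setting the abstract describes, neither the semantic type nor the units are known, which is what makes the annotation task difficult.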