EMNLP 2025

Lemmatization of Polish Multi-word Expressions

Magdalena Król, Aleksander Smywinski-Pohl, Zbigniew Kaleta, Pawel Lewkowicz

Abstract

This paper explores the lemmatization of multi-word expressions (MWEs) and proper names in Polish, tasks complicated by linguistic irregularities and historical factors. Instead of using rule-based methods, we apply a machine learning approach based on fine-tuned plT5 and mT5 models. We trained and validated the models on enhanced gold-standard data from the 2019 PolEval task and evaluated the impact of additional fine-tuning on a silver-standard dataset derived from Wikipedia. Two setups were tested: one without context and one using the left-side context of the target MWE. Our best model achieved 86.23% AccCS (case-sensitive accuracy), 89.43% AccCI (case-insensitive accuracy), and a combined score of 88.79%, setting a new state of the art for Polish MWE and named entity lemmatization, as confirmed by the PolEval maintainers. We also evaluated optimization and quantization techniques that reduce model size and inference time with modest quality loss.
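To make the three reported numbers concrete, the sketch below shows one way the metrics could be computed. It is a minimal illustration, not the official PolEval scorer: the function names and the toy lemma lists are hypothetical, and the weights 0.2 (AccCS) and 0.8 (AccCI) are an assumption, chosen because they reproduce the combined score of 88.79% from the two accuracies given above.

```python
from typing import Sequence


def acc_cs(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Case-sensitive accuracy: share of predictions identical to the gold lemma."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def acc_ci(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Case-insensitive accuracy: same comparison after lowercasing both sides."""
    return sum(p.lower() == g.lower() for p, g in zip(preds, golds)) / len(golds)


def combined_score(cs: float, ci: float, w_cs: float = 0.2, w_ci: float = 0.8) -> float:
    """Weighted combination; the default weights are an assumption (see lead-in)."""
    return w_cs * cs + w_ci * ci


# Hypothetical toy data: one exact match, one case-only mismatch, one wrong lemma.
preds = ["Jan Kowalski", "urząd miasta", "Sąd Najwyższy"]
golds = ["Jan Kowalski", "Urząd Miasta", "Sąd Okręgowy"]

toy_cs = acc_cs(preds, golds)  # 1 of 3 correct
toy_ci = acc_ci(preds, golds)  # 2 of 3 correct

# Plugging in the accuracies reported in the abstract.
reported = combined_score(86.23, 89.43)
print(f"{reported:.2f}")
```

Under these assumed weights, case-insensitive correctness dominates the combined score, so a model that recovers the right lemma but miscapitalizes it loses comparatively little.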