NeurIPS 2022

OccGen: Selection of Real-world Multilingual Parallel Data Balanced in Gender within Occupations

Marta R. Costa-jussà, Christine Basta, Oriol Domingo, André Rubungo

Cited by 6

Abstract

This paper describes the OccGen toolkit, which extracts multilingual parallel data balanced in gender within occupations. OccGen can extract datasets that reflect society's gender diversity (beyond binary) more fairly, and these datasets can then be used to explicitly mitigate occupational gender stereotypes. We propose two use cases that extract evaluation datasets for machine translation in four high-resource languages from different linguistic families and in a low-resource African language. Our analysis of these use cases shows that translation quality in high-resource languages tends to be worse on feminine subsets than on masculine ones, especially in the directions containing English. This is confirmed by the human evaluation. We hypothesize that sound language generation may contribute to paying less attention to the source sentence and to overgeneralizing to the most frequent gender forms.