ACL 2021
Better Chinese Sentence Segmentation with Reinforcement Learning
Srivatsan Srinivasan, Chris Dyer
Abstract
A long-standing challenge in Chinese-English machine translation is that sentence boundaries are ambiguous in Chinese orthography, yet inferring good splits is necessary for obtaining high-quality translations. To address this, we use reinforcement learning to train a segmentation policy that splits Chinese texts into segments that can be translated independently so as to maximise the overall translation quality. We compare our approach against a variety of segmentation strategies and find that it improves the baseline BLEU score on the WMT2020 Chinese-English news translation task by +0.3 BLEU overall, and by +3 BLEU on input segments that contain more than 60 words.
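The idea of learning a segmentation policy from a quality signal can be illustrated with a minimal REINFORCE sketch. This is not the authors' implementation: the candidate split marks, the single length feature, the logistic policy, and `toy_reward` (a hypothetical stand-in for the translation quality that a real system would obtain by translating each segment and scoring the result) are all assumptions made for illustration.

```python
import math
import random

# Assumption: splits are only considered right after these punctuation marks.
SPLIT_MARKS = ",。;!?"


def candidate_positions(text):
    """Indices just after a punctuation mark, excluding the end of the text."""
    return [i + 1 for i, ch in enumerate(text[:-1]) if ch in SPLIT_MARKS]


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def sample_segmentation(text, weights, rng):
    """Sample split decisions from a logistic policy.

    Returns the segments and the gradient of the log-probability of the
    sampled decisions with respect to the two policy weights.
    """
    w, b = weights
    grad = [0.0, 0.0]
    segments, start = [], 0
    for pos in candidate_positions(text):
        feat = (pos - start) / 10.0  # scaled length of the open segment
        p = sigmoid(w * feat + b)
        split = 1 if rng.random() < p else 0
        # d/dz log Bernoulli(split; sigmoid(z)) = split - p
        grad[0] += (split - p) * feat
        grad[1] += split - p
        if split:
            segments.append(text[start:pos])
            start = pos
    segments.append(text[start:])
    return segments, grad


def toy_reward(segments, target_len=12):
    """Hypothetical reward: prefers segments close to target_len characters.

    A real system would translate the segments and score the output instead.
    """
    return -sum(abs(len(s) - target_len) for s in segments) / len(segments)


def train(texts, steps=200, lr=0.05, seed=0):
    """REINFORCE with a moving-average baseline to reduce variance."""
    rng = random.Random(seed)
    weights = [0.0, 0.0]
    baseline = 0.0
    for _ in range(steps):
        text = rng.choice(texts)
        segments, grad = sample_segmentation(text, weights, rng)
        reward = toy_reward(segments)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        weights = [wi + lr * advantage * gi for wi, gi in zip(weights, grad)]
    return weights
```

Whatever decisions are sampled, the segments always concatenate back to the original text, so the policy only chooses where to cut. Replacing `toy_reward` with a score computed from actual translations would be the step that ties the policy to translation quality.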