ACL 2021

Better Chinese Sentence Segmentation with Reinforcement Learning

Srivatsan Srinivasan, Chris Dyer

Abstract

A long-standing challenge in Chinese-English machine translation is that sentence boundaries are ambiguous in Chinese orthography, yet inferring good splits is necessary for obtaining high-quality translations. To address this, we use reinforcement learning to train a segmentation policy that splits Chinese text into segments that can be translated independently, so as to maximise overall translation quality. We compare against a variety of segmentation strategies and find that our approach improves on the baseline for the WMT2020 Chinese-English news translation task by +0.3 BLEU overall, and by +3 BLEU on input segments that contain more than 60 words.
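
To make the problem framing concrete, the following is a minimal sketch of how a sequence of binary split decisions could be turned into independently translatable segments. It is an illustration only and not the method of the paper: the choice of punctuation marks as candidate boundaries, and the names `CANDIDATE_MARKS`, `candidate_positions` and `apply_splits`, are assumptions made here. In a reinforcement learning setting, the decisions would come from a learned policy and the reward would be derived from the quality of the resulting translations; neither is implemented below.

```python
from typing import List, Sequence

# Assumption: candidate split points sit directly after these punctuation marks.
CANDIDATE_MARKS = "，。；！？"


def candidate_positions(text: str) -> List[int]:
    """Return the indices just after each candidate mark, excluding the end of the text."""
    return [
        i + 1
        for i, ch in enumerate(text)
        if ch in CANDIDATE_MARKS and i + 1 < len(text)
    ]


def apply_splits(text: str, decisions: Sequence[bool]) -> List[str]:
    """Cut `text` at every candidate position whose decision is True.

    `decisions` holds one boolean per candidate position, in order.
    Concatenating the returned segments always reproduces `text`.
    """
    positions = candidate_positions(text)
    if len(decisions) != len(positions):
        raise ValueError("expected one decision per candidate position")

    segments: List[str] = []
    start = 0
    for pos, split in zip(positions, decisions):
        if split:
            segments.append(text[start:pos])
            start = pos
    segments.append(text[start:])  # remainder after the last chosen split
    return segments


if __name__ == "__main__":
    text = "今天下雨，我没出门。明天会晴，我们去公园。"
    # Split only at the second candidate (after the first full stop).
    for segment in apply_splits(text, [False, True, False]):
        print(segment)
```

Under this framing, each segment returned by `apply_splits` would be passed to the translation system separately, and the concatenated output would be scored against the reference.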