EMNLP2021
Fast WordPiece Tokenization
Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
Abstract
Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the Word-Piece tokenization used in BERT, from singleword tokenization to general text (e.g., sentence) tokenization. When tokenizing a single word, WordPiece uses a longest-matchfirst strategy, known as maximum matching. The best known algorithms so far are ( 2 ) (where is the input length) or ( ) (where is the maximum vocabulary token length). We propose a novel algorithm whose tokenization complexity is strictly ( ). Our method is inspired by the Aho-Corasick algorithm. We introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. For general text, we further propose an algorithm that combines pre-tokenization (splitting the text into words) and our linear-time Word-Piece method into a single pass. Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers and 5.1x faster than TensorFlow Text on average for general text tokenization. * Research conducted while working at Google.