ACL2022

More Than Words: Collocation Retokenization for Latent Dirichlet Allocation Models

Jin Cheevaprawatdomrong, Alexandra Schofield, Attapol Rutherford

Abstract

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. Previous studies show that representing bigram collocations in the input can improve topic coherence in English. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of retokenization based on chi-squared measures, t-statistics, and raw frequency to merge frequent token ngrams into collocations when preparing input to the LDA model. Based on the goodness of fit and the coherence metric, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.

1 Introduction

Latent Dirichlet allocation (LDA) models provide useful insights into themes and trends in a large text collection through the unsupervised inference of topics, or probability distributions over unigram word types in the corpus (Blei et al., 2003). Topics from these models are often interpreted based on their highest-probability words, with documents expressed as vectors of proportions of each topic. Unfortunately, the context in which these tokens arise can be obscured in the bag-of-words rendering of text as unigram counts in documents. For instance, a topic with high probabilities of both "coffee" and "table" is tempting to interpret as focusing on the furniture item "coffee table", but both words could be frequent in a discussion of cafes containing no coffee tables.
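The "coffee table" ambiguity can be made concrete with a small sketch. The two toy documents and the helper names `unigram_counts` and `bigram_counts` below are hypothetical illustrations, not material from our corpora or experiments; the sketch only shows that unigram counts alone cannot separate the two readings, while counts of adjacent token pairs can.

```python
from collections import Counter


def unigram_counts(tokens):
    """Bag-of-words view: token order is discarded."""
    return Counter(tokens)


def bigram_counts(tokens):
    """Counts of adjacent token pairs, which retain local word order."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical toy documents, whitespace-tokenized for simplicity.
furniture_doc = "the coffee table sits beside the sofa".split()
cafe_doc = "the cafe serves coffee at every table".split()

for name, doc in [("furniture", furniture_doc), ("cafe", cafe_doc)]:
    uni = unigram_counts(doc)
    bi = bigram_counts(doc)
    # Both documents contain "coffee" and "table" once each,
    # but only the furniture document contains them as an adjacent pair.
    print(name, uni["coffee"], uni["table"], bi[("coffee", "table")])
```

In the unigram view both documents look identical with respect to "coffee" and "table"; only the adjacent-pair counts tell them apart.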
This problem is amplified in languages without marked word boundaries, such as Chinese and Thai: while existing tokenizers in these languages can segment characters into words, there is always a question of the extent to which the tokenizers should group words together.

[...] approach to capture fit while accounting for changes in vocabulary (Schofield and Mimno, 2016).

3 Collocations as LDA Tokens

Collocations consist of two or more words that express conventional meaning, which can convey information about multi-word entities, context, and word usage. We hypothesize that the introduction of multi-word tokens, which capture collocations as bigrams or trigrams by way of concatenation of adjacent tokens, can help achieve more useful and coherent topic models. For languages without clear word boundaries, there is a possible additional benefit to multi-word tokens: it can be hard to intuit whether inferred word boundaries will have a large impact on the final results. Merging adjacent words into "multi-word" tokens may help remedy the potential problem of a segmentation that is not optimal for topic modeling purposes.

Many methods can be used to select collocations to merge from tokenized text (Manning and Schütze, 1999). In this paper, we evaluate the chi-squared statistic ($\chi^2$), the t-statistic, and raw frequency as approaches to develop a threshold for merging collocations into multi-word tokens prior to topic model training. The chi-squared measure $\chi^2(w_1, w_2)$ and the t-statistic $t(w_1, w_2)$ for two adjacent tokens $w_1$ and $w_2$ are defined as: