ACL 2022

Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

Mark Chu, Bhargav Srinivasa Desikan, Ethan O. Nadler, Donald Ruggiero Lo Sardo, Elise Darragh-Ford, Douglas Guilbeault

Abstract

Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with semantic meaning. We propose that n-grams composed of random character sequences, or garble, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly-generated character n-grams lack semantic meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of n-grams. Furthermore, we show that this axis relates to structure within extant language, including word part of speech, morphology, and concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that semantic meaning and primitive information are intrinsically linked.

[...] explore this by studying the embeddings of randomly-generated character n-grams (referred to as garble), which contain primitive communicative information but are devoid of semantic meaning, using the CharacterBERT model (El Boukkouri et al., 2020). Such randomly-generated character n-grams are textual analogues of paralinguistic vocalizations. Our analyses contribute to the growing understanding of BERTology (Rogers et al., 2020) by identifying a dimension, which we refer to as the information axis, that separates extant and garble n-grams. This finding is supported by a Markov model that produces a probabilistic information measure for character n-grams based on their statistical properties.
Strikingly, this information dimension correlates with properties of extant language; for example, parts of speech separate along the information axis, and word concreteness varies along a roughly orthogonal dimension in our projection of CharacterBERT embedding space. Although the information axis we identify separates extant and randomly-generated n-grams very effectively, we demonstrate that these classes of n-grams mix into each other in detail, and that pseudowords (i.e., phonologically coherent character n-grams without lexical meaning) lie between the two in our CharacterBERT embeddings.

This paper is organized as follows. We first discuss concepts from computational linguistics, information theory, and linguistics relevant to our study. We then analyse CharacterBERT representations of extant and randomly-generated character sequences and how the relation between the two informs the structure of extant language, including morphology, part-of-speech, and word concreteness. Finally, we ground our information axis in a predictive Markov language model.
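
To make the idea of a probabilistic information measure for character n-grams concrete, the following is a minimal sketch rather than the model used in this paper. It assumes a first-order (bigram) character Markov model with add-one smoothing over lowercase ASCII letters; the function names, the smoothing scheme, and the choice of the mean log-probability per transition as the score are all illustrative assumptions.

```python
import math
import random
import string
from collections import Counter

# Assumed alphabet and boundary markers (illustrative, not from the paper).
ALPHABET = string.ascii_lowercase
START, END = "^", "$"


def make_garble(length, rng):
    """Sample a random character n-gram uniformly from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def train_bigram_model(words):
    """Count character transitions, including word-boundary transitions."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        padded = START + word + END
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts


def mean_log_prob(ngram, model):
    """Mean log-probability per transition under the smoothed bigram model.

    Higher (less negative) values mean the n-gram looks more like the
    training words; uniformly random garble tends to score lower.
    """
    pair_counts, context_counts = model
    # Possible next symbols: every letter plus the END marker.
    vocab_size = len(ALPHABET) + 1
    padded = START + ngram + END
    total = 0.0
    for prev, nxt in zip(padded, padded[1:]):
        # Add-one smoothing keeps unseen transitions at nonzero probability.
        prob = (pair_counts[(prev, nxt)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log(prob)
    return total / (len(padded) - 1)
```

Trained on a list of extant words, `mean_log_prob` can then be compared across extant words, pseudowords, and strings produced by `make_garble`. Normalizing by the number of transitions keeps scores comparable across n-grams of different lengths, which is a design choice of this sketch.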