ACL 2020

Interpreting Pretrained Contextualized Representations via Reductions to Static Embeddings

Rishi Bommasani, Kelly Davis, Claire Cardie

137 citations

Abstract

Contextualized representations (e.g. ELMo, BERT) have become the default pretrained representations for downstream NLP applications. In some settings, this transition has rendered their static embedding predecessors (e.g. Word2Vec, GloVe) obsolete. As a side effect, we observe that older interpretability methods for static embeddings, while more mature than those available for their dynamic counterparts, are underutilized in studying newer contextualized representations. Consequently, we introduce simple and fully general methods for converting from contextualized representations to static lookup-table embeddings, which we apply to 5 popular pretrained models and 9 sets of pretrained weights. Our analysis of the resulting static embeddings notably reveals that pooling over many contexts significantly improves representational quality under intrinsic evaluation. Complementary to analyzing representational quality, we consider social biases encoded in pretrained representations with respect to gender, race/ethnicity, and religion, and find that bias is encoded disparately across pretrained models and internal layers, even for models that share the same training data. Concerningly, we find dramatic inconsistencies between social bias estimators for word embeddings.

Converting a contextualized model into static embeddings involves two operations. The first is subword pooling: the application of a pooling mechanism over the $k$ subword representations generated for a word $w$ in context $c$ in order to compute a single representation for $w$ in $c$, i.e. $\mathbf{w}_c^1, \ldots, \mathbf{w}_c^k \to \mathbf{w}_c$. Beyond this, we define context combination to be the mapping from representations $\mathbf{w}_{c_1}, \ldots, \mathbf{w}_{c_n}$ of $w$ in different contexts $c_1, \ldots, c_n$ to a single static embedding $\mathbf{w}$ that is agnostic of context.

Subword Pooling. The tokenization procedure for BERT can be decomposed into two steps: performing a simple word-level tokenization and then potentially deconstructing a word into multiple subwords, yielding $w^1, \ldots, w^k$ such that $\mathrm{cat}(w^1, \ldots, w^k) = w$, where $\mathrm{cat}(\cdot)$ indicates concatenation. Then, every layer of the model computes vectors $\mathbf{w}_c^1, \ldots, \mathbf{w}_c^k$. Given these vectors, we consider four pooling mechanisms to compute $\mathbf{w}_c$:

$$\mathbf{w}_c = f(\mathbf{w}_c^1, \ldots, \mathbf{w}_c^k), \quad f \in \{\min, \max, \mathrm{mean}, \mathrm{last}\}$$

Here $\min(\cdot)$ and $\max(\cdot)$ are element-wise min/max pooling, $\mathrm{mean}(\cdot)$ is the arithmetic mean, and $\mathrm{last}(\cdot)$ indicates selecting the last vector, $\mathbf{w}_c^k$.

Context Combination. Next, we describe two approaches for specifying contexts $c_1, \ldots, c_n$ and combining the associated representations $\mathbf{w}_{c_1}, \ldots, \mathbf{w}_{c_n}$.
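To make the two operations concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It implements the four subword pooling mechanisms named above (min, max, mean, last) on toy vectors. The function names (`subword_pool`, `combine_contexts`) and the toy values are hypothetical, and the context combination step assumes that per-context vectors are pooled with the same family of functions (mean by default); the text above does not specify how contexts are combined.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def pool_min(vectors: Sequence[Vector]) -> Vector:
    """Element-wise minimum over equal-length vectors."""
    return [min(dims) for dims in zip(*vectors)]


def pool_max(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum over equal-length vectors."""
    return [max(dims) for dims in zip(*vectors)]


def pool_mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise arithmetic mean over equal-length vectors."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]


def pool_last(vectors: Sequence[Vector]) -> Vector:
    """Select the last vector, i.e. the one for the final subword."""
    return list(vectors[-1])


POOLERS: Dict[str, Callable[[Sequence[Vector]], Vector]] = {
    "min": pool_min,
    "max": pool_max,
    "mean": pool_mean,
    "last": pool_last,
}


def subword_pool(subword_vectors: Sequence[Vector], f: str = "mean") -> Vector:
    """Map the k subword vectors of a word in one context to a single vector."""
    if not subword_vectors:
        raise ValueError("need at least one subword vector")
    return POOLERS[f](subword_vectors)


def combine_contexts(context_vectors: Sequence[Vector], g: str = "mean") -> Vector:
    """Map the per-context vectors of a word to one static embedding.

    Assumption: contexts are combined by pooling; this is illustrative only.
    """
    if not context_vectors:
        raise ValueError("need at least one context vector")
    return POOLERS[g](context_vectors)


# Toy data: one word, split into two subwords, observed in two contexts.
contexts = [
    [[1.0, 4.0], [3.0, 2.0]],  # context c_1: subword vectors w^1, w^2
    [[5.0, 0.0], [7.0, 6.0]],  # context c_2: subword vectors w^1, w^2
]

per_context = [subword_pool(vectors, "mean") for vectors in contexts]
static_embedding = combine_contexts(per_context, "mean")
print(per_context)       # [[2.0, 3.0], [6.0, 3.0]]
print(static_embedding)  # [4.0, 3.0]
```

In practice the subword vectors would come from a chosen layer of the pretrained model; the sketch only shows the pooling arithmetic. Note that `last` depends on the order of its inputs, so it is natural for subwords but less obviously meaningful across an unordered set of contexts.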