ACL 2021

Substructure Substitution: Structured Data Augmentation for NLP

Haoyue Shi, Karen Livescu, Kevin Gimpel

Abstract

We study a family of data augmentation methods, substructure substitution (SUB²), that generalizes prior methods. SUB² generates new examples by substituting substructures (e.g., subtrees or subsequences) with others that have the same label. This idea can be applied to many structured NLP tasks, such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) that do not have explicitly annotated substructures, we present variations of SUB² based on text spans or parse trees, introducing structure-aware data augmentation methods to general NLP tasks. In most cases, training with a dataset augmented by SUB² achieves better performance than training with the original training set. Further experiments show that SUB² has more consistent performance than the other investigated augmentation methods across different tasks and sizes of the seed dataset.
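To make the core idea concrete, the following is a minimal sketch of substructure substitution for part-of-speech tagging. It is an illustration under stated assumptions, not the implementation used in this work: it assumes that a substructure is a contiguous token span, that the span's "label" is its tag sequence, and that replacements are sampled uniformly. The names `build_span_index`, `substitute_span`, and `max_len`, as well as the toy dataset, are hypothetical.

```python
import random
from collections import defaultdict


def build_span_index(dataset, max_len=3):
    """Map each tag sequence (assumed span label) to the token spans carrying it."""
    index = defaultdict(set)
    for tokens, tags in dataset:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                index[tuple(tags[i:i + n])].add(tuple(tokens[i:i + n]))
    return index


def substitute_span(tokens, tags, index, rng, max_len=3):
    """Return a new (tokens, tags) pair with one span swapped for a same-label span."""
    if not tokens:
        return [], []
    # Pick a random span inside the sentence (assumption: uniform sampling).
    n = rng.randint(1, min(max_len, len(tokens)))
    i = rng.randrange(len(tokens) - n + 1)
    label = tuple(tags[i:i + n])
    original = tuple(tokens[i:i + n])
    # Candidate replacements share the label but differ in surface form.
    candidates = sorted(index.get(label, set()) - {original})
    if not candidates:
        return list(tokens), list(tags)  # nothing to swap in; leave unchanged
    replacement = rng.choice(candidates)
    new_tokens = list(tokens[:i]) + list(replacement) + list(tokens[i + n:])
    # The tag sequence is unchanged because the replacement has the same label.
    return new_tokens, list(tags)


# Toy seed dataset (hypothetical), with one tag per token.
dataset = [
    (["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
    (["a", "dog", "barks"], ["DET", "NOUN", "VERB"]),
]
index = build_span_index(dataset)
rng = random.Random(0)
new_tokens, new_tags = substitute_span(dataset[0][0], dataset[0][1], index, rng)
print(new_tokens, new_tags)
```

Because the replacement span carries the same tag sequence as the span it replaces, the augmented sentence keeps a valid tag for every token, so it can be added to the training set without further annotation. For tasks without annotated substructures, the index key would have to come from another source, for example the example-level class label paired with a span length or a parse-tree category.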