EMNLP 2025

Tracing L1 Interference in English Learner Writing: A Longitudinal Corpus with Error Annotations

Poorvi Acharya, J. Elizabeth Liebl, Dhiman Goswami, Kai North, Marcos Zampieri, Antonios Anastasopoulos

Abstract

Language transfer is an important topic of research in second language acquisition (SLA) and computational linguistics. Suitable learner corpora are essential for the study of SLA and language transfer. However, curating learner corpora is challenging because high-quality learner data is rarely publicly available, so only a few such corpora are available to the community. To address this gap, we present LENS, a novel English learner corpus with longitudinal data that enables researchers to investigate language learning over time. LENS contains 687 instances written by speakers of 15 different L1s. We use LENS to perform two tasks at the intersection of SLA and computational linguistics: (1) Native Language Identification (NLI); and (2) an evaluation of large language models as a tool for high-precision, semi-automated annotation of L1 interference features.

  - Definition: Errors involving incorrect word choice due to misunderstanding of meaning.
  - Examples:
    * among → below
    * borrow → lend

• Phonological Confusion
  - Definition: Errors where words are confused due to phonological similarities, often involving metathesis, substitution of similar phonemes, or confusion between near-homophones.
  - Examples:
    * aboard → abroad (Metathesis: reversed phonemes)
    * form → from (Transposition of adjacent sounds)
    * claps → class (Substitution of "p" for "s")

3. Morphological Subcategories

• Morphemic Errors with Affixes
  - Definition: Incorrect handling of prefixes or suffixes.
  - Examples:
    * beautifull → beautiful
    * hoping → hopping

• Overgeneralization of Spelling Rules
  - Definition: Applying English morphological or spelling rules too broadly.
  - Examples:
    * buyed → bought
    * goed → went

4. L1 Interference Subcategories

• Orthographic Interference
  - Definition: Applying L1 spelling conventions to English.
  - Examples:
    * esplendid → splendid (Spanish: adding "e" before "s" clusters)
    * colur → colour (British vs. American orthography confusion)

• Lexical Interference
  - Definition: Using L1-based lexical forms or cognates in English.
  - Examples:
    * telefon → telephone (Spanish or German influence)
    * faciliter → facilitate (French influence)

• Grammatical Interference
  - Definition: Applying L1 grammatical patterns to English.
  - Examples:
    * She has 24 years → She is 24 years old (Spanish: "Ella tiene 24 años")
    * He doesn't know nothing → He doesn't know anything (Negative concord in some L1s)

• Syntactic Interference
  - Definition: Applying L1 syntactic structures to English.
  - Examples:
    * He to the store goes → He goes to the store (German word order influence)
    * Beautiful is she → She is beautiful (Japanese syntax influence)

5. Grammatical Subcategories

• Grammatical Errors
  - Definition: Errors in grammar, syntax, word order, or agreement.
  - Examples:
    * She go yesterday → She went yesterday
    * He like apples → He likes apples

Categories and Subcategories: We define a hierarchical categorization system using Python enums for clarity and consistency:
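As an illustration only, an enum-based hierarchy of this kind could be sketched as follows. The sketch covers only the categories and subcategories whose names appear in the listing above; Phonological Confusion is left out because its parent category is not shown. The class names `Category` and `Subcategory`, the member names, and the helper `subcategories_of` are assumptions made for this example, not the authors' definitions.

```python
from enum import Enum


class Category(Enum):
    """Top-level error categories (only those named in the listing above)."""
    MORPHOLOGICAL = "Morphological"
    L1_INTERFERENCE = "L1 Interference"
    GRAMMATICAL = "Grammatical"


class Subcategory(Enum):
    """Subcategories; each member's value is (display label, parent category)."""
    MORPHEMIC_AFFIX = ("Morphemic Errors with Affixes", Category.MORPHOLOGICAL)
    OVERGENERALIZATION = ("Overgeneralization of Spelling Rules", Category.MORPHOLOGICAL)
    ORTHOGRAPHIC_INTERFERENCE = ("Orthographic Interference", Category.L1_INTERFERENCE)
    LEXICAL_INTERFERENCE = ("Lexical Interference", Category.L1_INTERFERENCE)
    GRAMMATICAL_INTERFERENCE = ("Grammatical Interference", Category.L1_INTERFERENCE)
    SYNTACTIC_INTERFERENCE = ("Syntactic Interference", Category.L1_INTERFERENCE)
    GRAMMATICAL_ERRORS = ("Grammatical Errors", Category.GRAMMATICAL)

    def __init__(self, label, parent):
        # Unpack the tuple value into named attributes for convenient access.
        self.label = label
        self.parent = parent


def subcategories_of(category):
    """Return all subcategories under a category, in definition order."""
    return [sub for sub in Subcategory if sub.parent is category]


if __name__ == "__main__":
    for cat in Category:
        labels = [sub.label for sub in subcategories_of(cat)]
        print(f"{cat.value}: {labels}")
```

Storing the parent on each subcategory member keeps the hierarchy in one place, so an annotation label can be validated against a closed set of names and mapped back to its top-level category without a separate lookup table.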