ICLR 2021

Disambiguating Symbolic Expressions in Informal Documents

Dennis Müller, Cezary Kaliszyk

Abstract

We propose the task of disambiguating symbolic expressions in informal STEM documents in the form of LaTeX files (that is, determining their precise semantics and abstract syntax tree) as a neural machine translation task. We discuss the distinct challenges involved and present a dataset with roughly 33,000 entries. We evaluated several baseline models on this dataset, which failed to yield even syntactically valid LaTeX before overfitting. Consequently, we describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org, which yields promising results despite the small size of the dataset. We evaluate our model using several dedicated techniques that take the syntax and semantics of symbolic expressions into account.
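
As a rough illustration of what disambiguation means here, consider the sketch below. The macro names \addition and \multiplication are hypothetical and chosen only for this example; they are not taken from the dataset. The informal input leaves both the meaning of the operators and their nesting implicit, while a disambiguated target would make both explicit.

```latex
% Informal input: which operations are meant, and how they nest,
% is left to the reader.
$a \cdot b + c$

% Hypothetical disambiguated output: each macro names one specific
% operation, and the nesting of arguments spells out the abstract
% syntax tree. \addition and \multiplication are illustrative names
% (assumptions for this sketch), not the dataset's actual macros.
$\addition{\multiplication{a}{b}}{c}$
```

Seen this way, the translation model has to map the first form to the second, so an output can be checked both for syntactic validity and for whether it names the intended operations.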