ICML 2025
Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages
Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, Jie Chen
Abstract
Recent data-efficient molecular generation approaches exploit graph grammars to introduce interpretability into the generative models. However, grammar learning therein relies on expert annotation or unreliable heuristics for algorithmic inference. We propose Foundation Molecular Grammar (FMG), which leverages multi-modal foundation models (MMFMs) to induce an interpretable molecular language. By exploiting the chemical knowledge of an MMFM, FMG renders molecules as images, describes them as text, and aligns information across modalities using prompt learning. FMG can be used as a drop-in replacement for the prior grammar learning approaches in molecular generation and property prediction. We show that FMG not only excels in synthesizability, diversity, and data efficiency but also offers built-in chemical interpretability for automated molecular discovery workflows. Code is available at https://github.com/shiningsunnyday/induction.

Prompt
I want you to think like a chemist performing a detailed analysis of the chemical composition of a toxic compound through its constituent motifs. I will highlight for you "24" of the substructures of a molecule. They are numbered one-by-one from Motif 0 to Motif "23", inclusive. I want you to explain, concisely, what each numbered motif is. Make sure to start from Motif 0 and go in order of the numbering. MAKE SURE you describe EVERY MOTIF!

Response
Certainly! Here is a concise description of each motif:
Motif 0: A carbonyl group (C=O) attached to a carbon chain.
Motif 1: A nitrile group (C≡N) attached to a tertiary carbon.
Motif 2: A di-substituted carbon chain with two adjacent nitrile groups (N=C-C=C-N).
Motif 3: Similar to Motif 2 but with an additional carbon on one end.
Motif 4: An imine group (C=N) connected to a vinylic carbon chain (C=C).
Motif 5: Benzene ring (C6H6) - a six-membered aromatic carbon ring.
Motif 6: A singly unsaturated carbon chain.
Motif 7: A doubly unsaturated carbon chain.
Motif 8: Di-substituted carbon chain with a nitrile group (N=C) on one end and another nitrile (N≡C) in the middle.
Motif 9: Similar to Motif 8 but with different placement of carbons.
Motif 10: An amino group (NH2) attached to a C=N group (imino).
Motif 11: A ketone (C=O) with adjacent carbons.
Motif 12: A cyano-ketone (C≡N and C=O) attached to adjacent carbons.
Motif 13: An alkyl cyanide (R-C≡N) where R is an alkyl chain.
Motif 14: An amino group (NH2) connected to a nitrile (C≡N).
Motif 15: A benzene ring (C6H6) - slightly different representation of Motif 5.
Motif 16: Another representation of a benzene ring (C6H6).
Motif 17: A nitrile group (C≡N) attached to a doubly unsaturated carbon chain.
Motif 18: An alkyl halide (carbon chain with a chlorine, C-Cl).
Motif 19: Similar to Motif 18 with a different number of carbons.
Motif 20: A doubly canned ketone (C=O) group attached to an imine (C=N) group.
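
The prompt/response exchange above suggests a simple pattern: fill the motif count into a fixed template, then split the free-text reply back into one description per motif index. The following is a minimal Python sketch of that pattern, not the authors' implementation. The names `build_motif_prompt`, `parse_motif_descriptions`, and `missing_motifs` are hypothetical, the template is an abbreviated copy of the prompt wording shown above, and the "Motif k:" reply layout is assumed from the example response.

```python
import re
from typing import Dict, List

# Abbreviated copy of the prompt wording shown above; {n} and {last} are
# hypothetical placeholders for the motif count and the final motif index.
PROMPT_TEMPLATE = (
    'I will highlight for you "{n}" of the substructures of a molecule. '
    'They are numbered one-by-one from Motif 0 to Motif "{last}", inclusive. '
    "I want you to explain, concisely, what each numbered motif is."
)

# Matches "Motif k: <text>" up to the next "Motif j:" header or end of input.
# Requiring the colon keeps cross-references such as "Similar to Motif 2 but"
# inside the description instead of starting a new entry.
MOTIF_ENTRY = re.compile(r"Motif\s+(\d+):\s*(.*?)(?=\s*Motif\s+\d+:|\Z)", re.S)


def build_motif_prompt(num_motifs: int) -> str:
    """Fill the motif count and last index into the prompt template."""
    if num_motifs < 1:
        raise ValueError("num_motifs must be at least 1")
    return PROMPT_TEMPLATE.format(n=num_motifs, last=num_motifs - 1)


def parse_motif_descriptions(response: str) -> Dict[int, str]:
    """Map each motif index in the reply to its whitespace-normalized text."""
    return {
        int(index): " ".join(text.split())
        for index, text in MOTIF_ENTRY.findall(response)
    }


def missing_motifs(descriptions: Dict[int, str], num_motifs: int) -> List[int]:
    """List motif indices in 0..num_motifs-1 that the reply did not describe."""
    return [i for i in range(num_motifs) if i not in descriptions]
```

A check like `missing_motifs` is one way to enforce the prompt's "describe EVERY MOTIF" requirement programmatically, for example by re-querying the model when the returned list is non-empty.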