ICLR2026
Take Note: Your Molecular Dataset Is Probably Aligned
Peter Lippmann, Roman Remme, Manuel V. Klockow, Fred A. Hamprecht
Abstract
Massive training datasets are fueling the astounding progress in molecular machine learning. Since these datasets are typically generated with computational chemistry codes that do not randomize pose, the resulting molecular geometries are usually not randomly oriented. While cheminformaticians are well aware of this fact, it can be a real pitfall for machine learners entering the burgeoning field of molecular machine learning. We demonstrate that molecular poses in the popular datasets QM9, QMugs, and OMol25 are indeed biased. First, while the bias can easily be missed by visual inspection alone, we show that a simple classifier can separate original data samples from randomly rotated ones with high accuracy. Second, we empirically validate that neural networks can and do exploit the orientation bias in these datasets by successfully training a model on chemical property prediction using molecular orientation as sole input. Third, we present visualizations of all molecular orientations and confirm that chemically similar molecules tend to have similar canonical poses. In summary, we recall and document an orientation bias in prevalent datasets that machine learners should be aware of.

In this paper, we make the following contributions. Using QM9, QMugs, and OMol25 as prominent examples, we demonstrate that molecules in many popular ML datasets are not randomly oriented: a simple classifier distinguishes randomly rotated from unrotated samples with very high accuracy. The accuracy remains high even when the default atom positions are perturbed with substantial noise and random rotations of up to 90°.
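The rotated-versus-original test described above starts from a labelled dataset of original and randomly rotated geometries. The following is a minimal sketch of how such a dataset could be built with NumPy; it is not the authors' implementation. The function names (`random_rotation`, `bounded_rotation`, `make_rotation_detection_set`) are hypothetical, rotations are applied about the coordinate origin, and the angle distribution used for the bounded perturbation (uniform in [0, max angle]) is an assumption, since the text does not specify it. No classifier is included; any standard classifier could be trained on the resulting samples and labels.

```python
import numpy as np


def random_rotation(rng):
    """Draw a rotation matrix uniformly from SO(3) via a random unit quaternion."""
    q = rng.normal(size=4)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])


def bounded_rotation(rng, max_angle_deg):
    """Rotate about a uniformly random axis by an angle in [0, max_angle_deg].

    Assumption: the angle is drawn uniformly; the paper may sample differently.
    """
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.uniform(0.0, max_angle_deg))
    kx, ky, kz = axis
    K = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    # Rodrigues' rotation formula.
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def make_rotation_detection_set(geometries, rng):
    """Pair each (n_atoms, 3) geometry with a randomly rotated copy.

    Label 0 marks the original pose, label 1 the rotated copy.
    """
    samples, labels = [], []
    for positions in geometries:
        samples.append(positions)
        labels.append(0)
        # Row vectors are rotated by right-multiplying with R transposed.
        samples.append(positions @ random_rotation(rng).T)
        labels.append(1)
    return samples, np.array(labels)


# Toy stand-in for dataset geometries (random points, not real molecules).
rng = np.random.default_rng(0)
toy = [rng.normal(size=(5, 3)) for _ in range(4)]
samples, labels = make_rotation_detection_set(toy, rng)
```

If a classifier trained on such pairs performs clearly above chance on held-out molecules, the original poses cannot be uniformly distributed, because a uniformly random rotation of uniformly oriented data would be statistically indistinguishable from the original.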
Further, we demonstrate that neural networks can leverage the canonical orientation to achieve artificially high accuracy in an extreme scenario: using only the normalized principal components of atom positions as input, we regress molecular properties and observe performance on the three standard datasets that exceeds the best possible accuracy expected for randomly oriented data. Lastly, we visualize the orientations of all molecules in these datasets and show that chemically similar molecules tend to be oriented similarly (see Fig. 1). We make our code publicly available as a toolbox to visualize and quantify orientation bias in molecular datasets at https://github.com/sciai-lab/are-my-molecules-aligned.

an additional loss. The MLIPs are trained on OMol25 and the MD17 dataset, both of which exhibit strong orientation bias (cf. Fig. 5b and Fig. 11) that may influence the training and evaluation of these models. Motivated by the "bitter lesson" (Sutton, 2019), the conformer generation model presented by Wang et al. (2023b) is based on an efficient and scalable diffusion model that operates directly on 3D atomic positions without enforcing rotational equivariance. The authors conduct experiments on QM9 and the strongly aligned GEOM dataset (cf. Fig. 7a). Notably, they observe that randomly rotating the training set prior to training degrades performance, and they hypothesize that the reason may be that "DFT simulations used to generate the data might be implicitly encoding a canonical coordinate system, which affects generalization if broken" (Wang et al., 2023b, p. 8).

INVESTIGATING ORIENTATIONS IN MOLECULAR DATASETS: METHODS, RESULTS, AND IMPLICATIONS

The natural first step when investigating the orientations of molecular geometries is to visually inspect the 3D geometries for obvious alignment. Figure 2 shows 100 randomly sampled molecular geometries from each dataset.
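Visual inspection can be complemented by a numerical summary of each pose. The sketch below computes unit-length principal axes of the atom positions, in the spirit of the orientation-only input (normalized principal components) mentioned earlier. It is an illustration under stated assumptions, not the authors' feature construction: the function names `principal_axes` and `orientation_features` are hypothetical, the axes are unweighted (no atomic masses), and the sign convention (largest-magnitude component made positive) is an arbitrary choice made here to remove the sign ambiguity of the SVD.

```python
import numpy as np


def principal_axes(positions):
    """Return unit-length principal axes as rows, ordered by decreasing variance.

    positions: (n_atoms, 3) array; at least 3 atoms are needed for 3 axes.
    """
    centered = positions - positions.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Assumed sign convention: largest-magnitude component of each axis is positive.
    rows = np.arange(vt.shape[0])
    largest = np.argmax(np.abs(vt), axis=1)
    signs = np.sign(vt[rows, largest])
    return vt * signs[:, None]


def orientation_features(positions):
    """Flatten the principal axes into one orientation-only feature vector."""
    return principal_axes(positions).reshape(-1)


# Toy geometry elongated along x, then y, then z (not a real molecule).
toy = np.array([
    [3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 0.3], [0.0, 0.0, -0.3],
])

# Rotate the toy geometry by 30 degrees about the z-axis.
angle = np.deg2rad(30.0)
c, s = np.cos(angle), np.sin(angle)
rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

axes_original = principal_axes(toy)
axes_rotated = principal_axes(toy @ rot_z.T)
```

These features depend only on how the geometry sits in the coordinate frame: rotating the molecule rotates the axes with it. For uniformly oriented data they would therefore be uninformative about chemistry, which is why predictive power obtained from them points to an orientation bias.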
For QM9, a clear structure is visible. Most strikingly, the first bond (adjacent to the origin) almost perfectly aligns with the Cartesian y-axis. The original QM9 paper (Ramakrishnan et al., 2014) used the cheminformatics tool Corina (version 3.491, 2013; Sadowski & Gasteiger, 1993) to generate 3D structures from SMILES strings. The geometries were then relaxed using Kohn-Sham DFT calculations at the B3LYP/6-31G(2df,p) level. The closed-source Corina algorithm is likely responsible for the alignment with the y-axis, while the subsequent geometry relaxation softens the strict alignment. For QMugs and OMol25, no similarly