NeurIPS2025
Zero-shot World Models via Search in Memory
Federico Malato, Ville Hautamäki
摘要
World Models have vastly permeated the field of Reinforcement Learning. Their ability to model the transition dynamics of an environment have greatly improved sample efficiency in online RL. Among them, the most notorious example is Dreamer, a model that learns to act in a diverse set of image-based environments. In this paper, we leverage similarity search and stochastic representations to approximate a world model without a training procedure. We establish a comparison with PlaNet, a well-established world model of the Dreamer family. We evaluate the models on the quality of latent reconstruction and on the perceived similarity of the reconstructed image, on both next-step and long horizon dynamics prediction. The results of our study demonstrate that a search-based world model is comparable to a training based one in both cases. Notably, our model show stronger performance in long-horizon prediction with respect to the baseline on a range of visually different environments. Our study draws its main inspirations from PlaNet [8] and its evolution Dreamer [9, 10] . Moreover, we base our study on previous work on similarity search, namely [13, 15, 12, 1]. In this Section, we briefly revise and introduce the main concepts of each work. In [8] , authors propose PlaNet, a model-based RL agents that learns to plan from pixels. In their study, authors model the state of an environment as composed by a deterministic and a stochastic part. To successfully predict future states, they define a recurrent state-space model (SSM) composed of a variational autoencoder (VAE) [11] , an observation model P(o t |s t ), a transition model P(s t |s t-1 , a t-1 ), a reward model P(r t |s t ) and a recurrent, deterministic state model h t = f (h t-1 , s t-1 , a t-1 ). While leaving the general structure of the model substantially unaltered, evolutions of PlaNet introduce, respectively, a discrete underlying distribution of the latent space [9] and an improved loss for more stable predictions [10] . In [1] , authors define locally weighted learning (LWL), a framework to train a model for continuous control by using an ensemble of local models. In particular, the dataset is projected into a metric state space and divided into neighborhoods, hence producing subsets of closely related data. Then, a local model is trained on each specialized dataset. Finally, an agent selects actions by querying each local model and performing a weighted average over their answers.