ICLR 2026

MIRACLE: Model-free Imitation and Reinforcement Learning for Adaptive Cut-Selection

Arjun M., Rijul Tandon, Agam Gupta, Hariprasad Kodamana, Manojkumar Ramteke

Abstract

Mixed-Integer Programming (MIP) solvers rely heavily on cutting planes to tighten LP relaxations, but traditional approaches generate thousands of cuts that consume gigabytes of memory while providing minimal benefit. We present an intelligent cut selection framework that achieves a 98.1% reduction in memory usage while maintaining competitive solving performance, with an objective gap of approximately 0.08%. We frame cut selection as a reinforcement learning (RL) problem and use Proximal Policy Optimization (PPO) to learn a behavioral model that imitates the expert solver's decisions. The adversarially imitated behavioral model drives an agent built on two key innovations: (i) a cut-selection policy trained via curriculum learning; and (ii) adaptive inference that dynamically adjusts computational budgets. Through comprehensive evaluation across SetCover and diverse MIPLIB problems, we demonstrate consistent speedups (3.78× average on MIPLIB) and achieve a 100% success rate on instances where traditional SCIP fails 47-53% of the time. Our method also reduces peak memory consumption from 3.03 GB to 46 MB, enabling optimization in previously inaccessible, resource-constrained environments where traditional solvers face fundamental limitations.

Published as a conference paper at ICLR 2026

2024) and Paulus et al. (2022), which often require heavy architectures or look-ahead rollouts, our approach utilizes a lightweight, budget-constrained policy optimized via PPO and GAIL to achieve an order-of-magnitude memory reduction while maintaining solution reliability in resource-constrained environments.

While these methods have demonstrated performance gains, they are built on a paradigm that suffers from three fundamental limitations:

1. The Black-Box Fallacy: Existing approaches treat the MIP solver as a black box. They learn to copy expert decisions or interact in a model-free fashion, but they fail to model the underlying dynamics of the optimization process (Deza & Khalil, 2023; Zhang et al., 2024).
They do not learn how adding a cut will change the subsequent state of the LP relaxation, a process instead handled by an external, non-differentiable solver (Huang et al., 2022).
2. Myopic Planning: A direct consequence of the black-box approach is that learned policies are restricted to myopic decisions. Lacking a model of the environment, they cannot plan ahead or reason about the long-term consequences of their actions, which prevents the discovery of more sophisticated, non-local strategies.
3. Resource Inefficiency as an Afterthought: Prior work has predominantly focused on improving solution time, largely ignoring memory overhead as a critical performance metric. This makes such methods ill-suited for the very resource-constrained scenarios where learned, efficient heuristics are most needed.

This work addresses a critical gap: Can we learn intelligent cut selection policies that achieve significant memory reductions while maintaining competitive solving performance? Our approach reframes cut selection as an RL problem where we learn a behavioral model of expert cut selection rather than attempting to model the complex LP dynamics directly.

KEY CONTRIBUTIONS

Our work makes the following contributions:

• Memory-First Optimization Paradigm: We demonstrate that intelligent cut selection can achieve significant memory reductions (97.7-98.1% on SetCover benchmarks, 68.1-69.1% on diverse MIPLIB problems) while maintaining or improving solution quality.
• Robust Behavioral Modeling Framework: Our PPO-based approach learns a behavioral model of expert cut selection [...] learning and adaptive inference, we eliminate manual parameter tuning and provide a deployment-ready system with inference complexity independent of problem size.
• Comprehensive Empirical Validation: We provide a systematic evaluation across 300 instances spanning SetCover training problems and diverse MIPLIB test cases (Huang et al., 2024). Our analysis includes statistical significance testing, confidence intervals, and effect size measurements, demonstrating both memory efficiency (an average reduction of 86.3%) and reliability improvements (100% vs. 53% success rate on challenging instances).
• Ablation Studies and Robustness: We demonstrate that our framework's performance remains stable across different hyperparameter configurations (cut budgets 10-50, iteration limits 1-10, various early stopping criteria), indicating the reliable real-world deployment characteristics essential for industrial adoption.

To this end, our approach reframes cut selection through the lens of RL. Rather than attempting to model the prohibitively complex LP transition function, we learn to imitate and ultimately improve upon SCIP's implicit selection policy. SCIP's cut-selection module serves as the expert policy because it encapsulates decades of solver engineering and remains the strongest publicly available heuristic ba