ASE2025

Automated Evolutionary Hyperparameter Tuning for NLP-Based Test Case Generation

Ivan P. Malashin, Igor S. Masich, Sergei Kurashkin, Andrei P. Gantimurov, Aleksey S. Borodulin, Vladimir A. Neluyb, Vadim Tynchenko

Abstract

Automated generation of executable test suites from natural-language requirements remains challenging because of linguistic ambiguity and the sensitivity of generative models to decoding and training hyperparameters. This paper introduces a hierarchical, multi-level evolutionary framework that treats model hyperparameters and decoding strategies as upper-level decision variables and employs lower-level fitness functions that directly measure test-quality objectives (structural coverage, semantic diversity, redundancy, and runtime efficiency). The approach integrates retrieval-augmented grounding, surrogate-assisted preselection, lightweight LoRA adaptation, and optional HIL evaluation. Empirical evaluation on the PURE, PROMISE_exp, and FR_NFR benchmarks (repeated runs, $n = 10$; paired two-sided t-tests, $\alpha = 0.05$) shows consistent gains: on PURE, mean code coverage reaches 82.4% (vs. 75.1% for Bayesian optimisation and 68.9% for random search) with 145 unique scenarios and a modest runtime overhead ($\approx 58.3$ s, $\approx 6\%$ above Bayesian optimisation). Ablations confirm component effects (e.g., removing the diversity component reduces unique scenarios by $\approx 18\%$; disabling the surrogate increases wall-clock time by $\approx 42\%$; disabling RAG drops grounded consistency by $\approx 12\%$). The results indicate that co-optimising hyperparameters for explicit test-quality metrics, together with grounding and realistic execution, yields more useful, executable test suites. Future work will explore adaptive objective weighting, transfer warm-starts, and probabilistic surrogates.
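To make the idea of surrogate-assisted preselection concrete, the following is a minimal, single-level sketch and not the authors' hierarchical framework. Everything in it is an assumption introduced for illustration: the two decoding hyperparameters and their bounds, the `toy_fitness` function (a stand-in for a real test-quality score that would require generating and executing tests), and the 1-nearest-neighbour surrogate. It only shows the general pattern: many candidate configurations are ranked by a cheap surrogate, and only the most promising ones receive the expensive evaluation.

```python
import random

# Hypothetical search space for decoding hyperparameters (not from the paper).
BOUNDS = {"temperature": (0.1, 1.5), "top_p": (0.5, 1.0)}


def toy_fitness(cfg):
    """Stand-in for an expensive test-quality score (assumption).

    A real system would generate tests with these settings, execute them,
    and combine coverage, diversity, redundancy, and runtime.
    """
    return -((cfg["temperature"] - 0.7) ** 2) - ((cfg["top_p"] - 0.9) ** 2)


def random_config(rng):
    """Sample one configuration uniformly within BOUNDS."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def mutate(cfg, rng, sigma=0.1):
    """Gaussian mutation scaled to each range, clipped to the bounds."""
    child = {}
    for k, (lo, hi) in BOUNDS.items():
        step = rng.gauss(0.0, sigma * (hi - lo))
        child[k] = min(hi, max(lo, cfg[k] + step))
    return child


def surrogate_score(cfg, archive):
    """Cheap surrogate: fitness of the nearest already-evaluated config."""
    def dist(a, b):
        return sum(((a[k] - b[k]) / (hi - lo)) ** 2
                   for k, (lo, hi) in BOUNDS.items())
    nearest = min(archive, key=lambda item: dist(cfg, item[0]))
    return nearest[1]


def evolve(generations=20, pop_size=8, candidates=24, seed=0):
    """Return the best (config, fitness) pair found."""
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    # The archive stores every configuration that was truly evaluated.
    archive = [(c, toy_fitness(c)) for c in population]
    for _ in range(generations):
        # Parents are the best truly evaluated configurations so far.
        parents = sorted(archive, key=lambda item: item[1],
                         reverse=True)[:pop_size]
        offspring = [mutate(rng.choice(parents)[0], rng)
                     for _ in range(candidates)]
        # Preselection: rank all candidates by the surrogate and spend
        # the expensive evaluation only on the top few.
        offspring.sort(key=lambda c: surrogate_score(c, archive),
                       reverse=True)
        for child in offspring[: pop_size // 2]:
            archive.append((child, toy_fitness(child)))
    return max(archive, key=lambda item: item[1])
```

Calling `evolve()` returns the best configuration and its toy score. With the default settings, each generation proposes 24 candidates but performs only 4 true evaluations, which is the kind of saving a surrogate is meant to provide. A multi-objective or probabilistic surrogate would replace `surrogate_score` without changing the loop.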