ASE 2025

On the Importance of Context Filtering in Retrieval-Augmented Code Completion

Sergey Sedov, Vsevolod Savinskiy, Andrei Arzhantsev

Cited by 1

Abstract

We present a retrieval-augmented pipeline for the code completion task, developed by our team NoMoreActimel during the JetBrains & Mistral AI Context Collection Competition. Our approach separates offline index preprocessing from online query processing. We highlight the importance of an asymmetric approach to RAG in code completion, with separate index-specific and query-specific heuristics. We argue that stronger embedding models should increasingly outperform BM25 baselines as the database grows, which leads us to retrieve across all repositories rather than a single one. However, our experiments show that less context is sometimes better, so retrieval over larger codebases makes proper context filtering more important. We therefore identify the main challenges of model-based RAG in code completion as poor context relevance and, in particular, overly general chunk embeddings. We experiment with different chunking strategies and introduce hole-centered query chunking as our first modification for controlling query relevance. To increase the relevance of in-context chunks, we propose several reweighting penalties for similarity scores, penalizing by length and by distance to the completion hole. Filtering by simple similarity-score thresholds also improves final model performance. In addition, we find that generating short textual descriptions of the completion target significantly improves metrics. These descriptions can be generated with a much smaller model (1.5B) and token budget (1-2 sentences), and they cope with context overfitting better than candidate code completions provided in-context. The described approach achieved top results: 1st place in the Python and 2nd place in the Kotlin private phases using the lightweight Qwen 0.6B embedding model. The heavier Nomic 7B model gave a substantial lead of 13% over the second-best solution on the Python public-phase leaderboard.
The code is available on GitHub.
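
To make the filtering ideas in the abstract concrete, the following is a minimal sketch of hole-centered query chunking, similarity reweighting by length and distance to the completion hole, and threshold filtering. It is not the authors' implementation: the `Chunk` fields, the logarithmic penalty forms, and all weights, thresholds, and function names are hypothetical choices made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    similarity: float     # raw embedding similarity to the query (assumed given)
    lines_from_hole: int  # hypothetical: distance in lines to the completion hole


def hole_centered_query(prefix: str, suffix: str, radius: int = 10) -> str:
    """Build the query from up to `radius` lines on each side of the hole."""
    # Guard radius == 0, because list[-0:] would return the whole list.
    before = prefix.splitlines()[-radius:] if radius > 0 else []
    after = suffix.splitlines()[:radius]
    return "\n".join(before + after)


def reweighted_score(chunk: Chunk,
                     length_weight: float = 0.05,
                     distance_weight: float = 0.02) -> float:
    """Subtract length and distance penalties from the raw similarity.

    The log-shaped penalties and default weights are assumptions,
    not values taken from the paper.
    """
    n_lines = chunk.text.count("\n") + 1
    length_penalty = length_weight * math.log1p(n_lines)
    distance_penalty = distance_weight * math.log1p(chunk.lines_from_hole)
    return chunk.similarity - length_penalty - distance_penalty


def select_context(chunks: list[Chunk],
                   threshold: float = 0.5,
                   top_k: int = 3) -> list[Chunk]:
    """Keep chunks whose reweighted score passes the threshold, best first."""
    scored = [(reweighted_score(c), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:top_k]]
```

With these assumed defaults, a short, nearby chunk with high similarity survives, while a long, distant chunk with moderate similarity falls below the threshold and is dropped, which reflects the observation that less context is sometimes better.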