EMNLP 2025
SLlama: Parameter-Efficient Language Model Architecture for Enhanced Linguistic Competence Under Strict Data Constraints
Victor Adelakun Omolaoye, Babajide Alamu Owoyele, Gerard de Melo
Abstract
Scaling data and model size has driven recent advances in language modeling, but this strategy falters in settings with strict data constraints, such as the BabyLM Challenge. Insights from training compute-optimal large language models show that smaller models trained on more data outperform larger, undertrained counterparts, which underscores the need for compact architectures. Embedding weight tying is a common parameter-reduction technique, yet we find that it significantly diminishes linguistic competence in compact models. In response, we explore alternative architectural strategies that preserve the parameter efficiency of tied models without sacrificing the representational benefits of untied embeddings. We introduce SLlama, a Llama-3 architecture variant that compresses Transformer components through four targeted modifications: Repeated Reduced Hidden Size and Projection (RRHP), Permutated Weight Attention (PWA), Shared Projection Multi-Layer Perceptron (SPMLP), and Layer Weight Sharing. Without relying on distillation, SLlama achieves a 31.72% improvement in linguistic knowledge acquisition over the Baby Llama baseline, with a comparable GLUE score and a significantly lower parameter count. These results demonstrate that well-designed, compact models can rival larger ones under strict data constraints.
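To make the parameter trade-off behind embedding weight tying concrete, the following minimal sketch counts the parameters in the input embedding and the output projection with and without tying. The vocabulary and hidden sizes are hypothetical values chosen for illustration and are not taken from SLlama; the function name `embedding_params` is likewise an assumption, and the output projection is assumed to have no bias term.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Count parameters in the input embedding plus the output projection.

    With tying, the output projection reuses the input embedding matrix,
    so it contributes no additional parameters (no bias is assumed).
    """
    input_embedding = vocab_size * hidden_size
    output_projection = 0 if tied else vocab_size * hidden_size
    return input_embedding + output_projection


# Hypothetical sizes for illustration only (not from the paper).
VOCAB_SIZE = 16_000
HIDDEN_SIZE = 512

untied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied=False)
tied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied=True)

print(f"untied: {untied:,} parameters")
print(f"tied:   {tied:,} parameters")
print(f"saved by tying: {untied - tied:,} parameters")
```

Under these assumed sizes, tying removes one full `VOCAB_SIZE * HIDDEN_SIZE` matrix. This is the saving that an untied compact model has to recover elsewhere in the architecture to stay within the same parameter budget.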