ACL2024

LoRA Meets Dropout under a Unified Framework

Sheng Wang, Liheng Chen, Jiyue Jiang, Boyang Xue, Lingpeng Kong, Chuan Wu

Abstract

With their remarkable capabilities, large language models (LLMs) have emerged as essential elements in numerous NLP applications, while parameter-efficient finetuning, especially LoRA, has gained popularity as a lightweight approach for model customization. Meanwhile, various dropout methods, initially designed for full finetuning with all the parameters updated, alleviate overfitting associated with excessive parameter redundancy. Hence, a possible contradiction arises between the negligible trainable parameters of LoRA and the effectiveness of previous dropout methods, which has been largely overlooked. To fill this gap, we first confirm that parameter-efficient LoRA is also overfitting-prone. We then revisit transformer-specific dropout methods, and establish their equivalence and distinctions mathematically and empirically. Building upon this comparative analysis, we introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern and compensation measure. Through this framework, we reveal their new preferences and performance comparisons when only limited trainable parameters are involved. This framework also allows us to amalgamate the most favorable aspects into a novel dropout method named HiddenKey. Extensive experiments verify the remarkable superiority and sufficiency of HiddenKey across multiple models and tasks, which highlights it as the preferred approach for high-performance and parameter-efficient finetuning of LLMs.

[...] has been widely adopted as a lightweight method, which generally freezes the majority of parameters while only updating or adding negligible trainable parameters. Among these methods, LoRA (Hu et al., 2021) has gained the most popularity due to its high effectiveness, robustness and generality.
In parallel with this, dropout (Hinton et al., 2012) has been widely adopted to mitigate overfitting, which is generally caused by excessive parameter redundancy. Its variants, including DropKey (Li et al., 2023), DropAttention (Zehui et al., 2019) and HiddenCut (Chen et al., 2021), have also demonstrated superiority for transformers. With a specified probability, they randomly deactivate attention logits, weights and hidden representations, respectively. However, the effectiveness of these methods has only been verified in full finetuning scenarios, where all the parameters are updated, which easily leads to excessive redundancy. When it comes to LoRA-based PEFT scenarios, a potential contradiction arises. Specifically, since overfitting primarily stems from excessive parameter redundancy, dropout may prove ineffective in LoRA-based finetuning because of the extremely limited trainable parameters. Besides, all the above methods were proposed independently, without a clear guideline to unify them systematically, which hinders comprehensive comparative analysis and the development of more effective dropout methods.

In this study, we first conduct extensive experiments and confirm that LoRA also suffers from overfitting easily, which serves as a prerequisite for our following analysis. As shown in Figure 5, as the rank and the number of trainable parameters increase, the model's performance initially improves but gradually deteriorates due to intensifying overfitting. Many more experiments in Sec. 4 provide further evidence and affirm that this susceptibility to overfitting can be mitigated with dropout methods. Besides, we compare the above transformer-specific dropout [...]

where $p$, $l$, $w_j$, $\bar{w}_j$, and $w'_j$ denote the dropout rate, sequence length, and the original, masked, and rescaled attention weights, respectively.
NoGrad() and Bernoulli() represent the gradient stopping operator and sampling from the Bernoulli distribution, respectively.

DropKey. As a dropout-before-softmax scheme, DropKey (Li et al., 2023) takes attention logits $g_j$ instead of weights as the basic units, as formulated in Eq. 3. Since the subsequent softmax() ensures that the weights sum to one, rescaling is no longer necessary.

$$g'_j = m + g_j, \qquad m = \begin{cases} 0, & \text{with probability } 1 - p \\ -\infty, & \text{with probability } p \end{cases} \tag{3}$$

HiddenCut. In contrast, HiddenCut (Chen et al., 2021) focuses on preventing the co-adaptation of hidden representations in the feed-forward module. The core idea is to cut a single contiguous span, which may contain more semantic information and be more difficult to restore. Besides, a Jensen-Shannon (JS) loss is applied to encourage the perturbed representations to be as close as possible to those at inference.
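
To make the two masking patterns above concrete, the following is a minimal sketch in plain Python rather than the authors' implementation. It applies the DropKey mask of Eq. 3 to one row of attention logits before softmax, and builds a keep/drop mask with a single contiguous span in the spirit of HiddenCut. The function names, the use of `random.Random`, and the choice of span length `int(p * seq_len)` are illustrative assumptions not taken from the paper; the JS loss is omitted.

```python
import math
import random


def dropkey_mask(logits, p, rng):
    """Eq. 3: g'_j = m + g_j, with m = -inf w.p. p and m = 0 w.p. 1 - p."""
    return [g + (-math.inf if rng.random() < p else 0.0) for g in logits]


def softmax(logits):
    """Stable softmax; -inf logits receive exactly zero weight."""
    finite = [g for g in logits if g != -math.inf]
    if not finite:
        # Assumption of this sketch: a fully masked row is reported as an error.
        raise ValueError("all logits were masked")
    peak = max(finite)
    exps = [math.exp(g - peak) for g in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hiddencut_span_mask(seq_len, p, rng):
    """Keep flags (1 = keep, 0 = cut) with one contiguous cut span.

    Assumption: the span covers int(p * seq_len) positions and its start
    is sampled uniformly; the source only states that a single contiguous
    span is cut.
    """
    span = int(p * seq_len)
    start = rng.randint(0, seq_len - span)  # randint is inclusive on both ends
    return [0 if start <= i < start + span else 1 for i in range(seq_len)]


if __name__ == "__main__":
    rng = random.Random(0)
    masked = dropkey_mask([0.5, 1.0, -0.3, 2.0], p=0.5, rng=rng)
    print(masked)
    if any(g != -math.inf for g in masked):
        # The surviving weights still sum to one, so no rescaling is needed.
        print(softmax(masked))
    print(hiddencut_span_mask(seq_len=8, p=0.25, rng=rng))
```

Because masking happens on the logits, the softmax output is already normalized over the surviving keys, which is why DropKey needs no separate rescaling step. The span mask would be multiplied element-wise with the hidden representations along the sequence dimension.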