ICLR2025
When narrower is better: the narrow width limit of Bayesian parallel branching neural networks
Zechen Zhang, Haim Sompolinsky
Abstract
The infinite width limit of random neural networks is known to result in Neural Networks as Gaussian Process (NNGP) (Lee et al. (2018)), characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization (Park et al. (2019)). This work challenges that notion by investigating the narrow width limit of the Bayesian Parallel Branching Neural Network (BPB-NN), an architecture that resembles neural networks with residual blocks. We demonstrate that when the width of a BPB-NN is significantly smaller than the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, in bias-limited scenarios the performance of a BPB-NN in the narrow width limit is generally superior or comparable to that achieved in the wide width limit. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters and instead generally reflect the nature of the data. We demonstrate this phenomenon primarily for branching graph neural networks, where each branch represents a different order of convolutions of the graph; we also extend the results to more general architectures such as the residual-MLP and show that the narrow width effect is a general feature of branching networks. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.

2. We show that in the Bayesian setting the bias decreases and saturates at a narrow hidden layer width, a surprising phenomenon due to kernel renormalization. We demonstrate that this can be understood as a robust learning effect of each branch in the student-teacher task, where each student branch learns the corresponding teacher branch.
3. We demonstrate this narrow-width limit on the real-world dataset Cora and interpret each branch's importance as a property of the dataset.
4. We further show that this narrow-width effect is a general feature of Bayesian parallel branching neural networks (BPB-NNs), with the residual-MLP architecture as an example.

RELATED WORKS

Infinitely wide neural networks: Our work follows a long tradition of mathematical analysis of infinitely wide neural networks (Neal (2012); Jacot et al. (2018); Lee et al. (2018); Bahri et al. (2024)), resulting in NTK or NNGP kernels. Recently, such analysis has been extended to structured neural networks, including GCNs (Du et al. (2019); Walker & Glocker (2019); Huang et al. (2021)). However, these works do not provide an analysis of feature learning, in which the kernel depends on the task.

Kernel renormalization and feature learning: There has been progress in understanding simple MLPs in the feature-learning regime, where the shape of the kernel changes with task or time (Li & Sompolinsky (2021); Atanasov et al. (2021); Avidan et al. (2023); Wang & Jacot (2023)). We develop such an understanding for graph-based networks.

Theoretical analysis of GCN: There is a long line of works that theoretically analyze the expressiveness (Xu et al. (2018); Geerts & Reutter (2022)) and generalization performance (Tang & Liu (2023); Garg et al. (2020); Aminian et al. (2024)) of GCNs. However, it is challenging to calculate the dependence of generalization errors on tasks. In particular, the PAC-Bayes approach (Liao et al. (2020); Ju et al. (2023)) results in generalization bounds that are too large and that can only be computed with the norms of learned weights. To our knowledge, our work is the first to decompose the generalization error into bias and variance a priori (not dependent on learned weights) for linear GCNs with residual-like structures. The architecture closest to our linear BPB-GCN is the linearly decoupled GCN proposed by Cong et al. (2021); however, in that architecture the overall readout vector is shared across all branches, which does not result in kernel renormalization for different branches.
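The distinction between per-branch readouts and a single shared readout can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the paper's implementation: it assumes a linear branching GCN in which branch l applies l graph convolutions (powers of an adjacency-like matrix `adj`) followed by its own hidden weight matrix, and the branch outputs are summed. The function names (`branch_features`, `init_params`, `forward`), the Gaussian initialization, and the omission of any width or variance normalization are all hypothetical choices made for brevity.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def branch_features(adj, x, num_branches):
    """Return [X, A X, A^2 X, ...]: branch l sees l graph convolutions of X."""
    feats, current = [], x
    for _ in range(num_branches):
        feats.append(current)
        current = matmul(adj, current)
    return feats


def init_params(in_dim, width, num_branches, shared_readout, rng):
    """Draw Gaussian weights (an assumed prior; normalization is omitted).

    Each branch gets its own hidden matrix of shape (in_dim, width).
    With shared_readout=False each branch also gets its own readout vector;
    with shared_readout=True one readout vector is reused by every branch.
    """
    hidden = [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(in_dim)]
        for _ in range(num_branches)
    ]
    if shared_readout:
        shared = [rng.gauss(0.0, 1.0) for _ in range(width)]
        readouts = [shared] * num_branches
    else:
        readouts = [
            [rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(num_branches)
        ]
    return hidden, readouts


def forward(adj, x, hidden, readouts):
    """Sum the scalar outputs of all branches; returns one value per node."""
    out = [0.0] * len(x)
    feats = branch_features(adj, x, len(hidden))
    for feat, w, a in zip(feats, hidden, readouts):
        h = matmul(feat, w)  # hidden activations of this branch, (nodes, width)
        for i, row in enumerate(h):
            out[i] += sum(hv * av for hv, av in zip(row, a))
    return out


# Hand-set example: 2 nodes, 1 input feature, width 1, 2 branches.
adj = [[0, 1], [1, 0]]
x = [[1.0], [2.0]]
hidden = [[[1.0]], [[1.0]]]
readouts = [[1.0], [10.0]]  # separate readout per branch
print(forward(adj, x, hidden, readouts))  # [21.0, 12.0]
```

In the hand-set example the two readouts weight the zeroth-order and first-order branches differently (1 versus 10), which is the per-branch freedom that a shared readout removes; calling `init_params(..., shared_readout=True, ...)` reuses one vector for all branches instead.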