ICLR 2026
Scaling Direct Feedback Learning with Jacobian Alignment Guarantees
Paul Caillon, Erwan Fagnou, Blaise Delattre, Alexandre Allauzen
Abstract
Deep neural networks rely on backpropagation (BP) for optimization, but its strictly sequential backward pass hinders parallelism and scalability. Direct Feedback Alignment (DFA) is a promising approach to parallel learning of deep neural networks: it relies on fixed random projections to enable layer-wise parallel updates. However, it fails on deep convolutional networks and performs poorly on modern transformer architectures. We introduce GrAPE (Gradient-Aligned Projected Error), a hybrid feedback-alignment method that (i) estimates rank-1 Jacobians via forward-mode JVPs and (ii) aligns each layer's feedback matrix by minimizing a local cosine-alignment loss. To curb drift in very deep models, GrAPE performs infrequent BP anchor steps on a single mini-batch, so that updates remain mostly parallel. We show that the forward-gradient estimator has strictly positive expected cosine with the true Jacobian. We relate this estimator-level guarantee to a standard stochastic-approximation result under a positive expected-cosine condition on the update direction, providing theoretical support for GrAPE's alignment objective. Empirically, GrAPE consistently outperforms prior alternatives to BP and enables the training of modern architectures, closing a large fraction of the gap to BP while retaining layer-parallel updates for the vast majority of steps.
1. Gradient-guided feedback. We introduce GrAPE (Gradient-Aligned Projected Error), which computes a local cosine-alignment loss with forward-gradient estimates. This realigns each layer's feedback matrix toward Jacobian-aligned directions prior to the parallel DFA update.
2. Leveraging forward-mode gradients, we derive a positive expected alignment bound for our rank-1 Jacobian estimator. We also recall a standard conditional convergence-in-expectation result under a positive expected-cosine assumption on the update direction, which provides theoretical motivation for GrAPE's alignment objective.
3. Occasional BP calibration. To further mitigate drift in very deep or highly structured networks, we apply a true BP step to a single mini-batch every $T$ epochs, using its exact gradient to realign the weights. This yields a hybrid two-timescale scheme in which most updates are layer-parallel GrAPE steps, interleaved with sparse BP synchronizations.
4. Scalability. We show for the first time that a DFA-style method can train VGG-16, ResNet-20/56 and Transformer models, narrowing the performance gap with full BP.
The paper is organized as follows. Section 2 briefly recalls the necessary background and notation (a more detailed survey can be found in the Appendix). Section 3 describes the GrAPE algorithm and the occasional BP calibration strategy. Section 4 reports empirical results.
BACKGROUND AND RELATED WORKS
Let $f(x; \theta)$ be a feed-forward neural network with $L$ layers, where $x \equiv h_0$ is the input and $\theta = \{W_l\}_{l=1}^{L}$ is the set of parameters. Each layer computes $a_l = W_l h_{l-1}$ followed by a non-linearity $h_l = \sigma_l(a_l)$, a formulation that covers both linear and convolutional operations. The output is $\hat{y} = h_L$. Given a loss function $\mathcal{L}(\hat{y}, y)$, the goal of backpropagation (BP) is to compute the gradients $\nabla \mathcal{L}_l = \partial \mathcal{L} / \partial a_l$ recursively, starting from the output layer. The corresponding weight update is:
$$\delta W_l = -\eta \, \nabla \mathcal{L}_l \, h_{l-1}^\top, \qquad \nabla \mathcal{L}_l = \left( W_{l+1}^\top \nabla \mathcal{L}_{l+1} \right) \odot \sigma_l'(a_l),$$
where $\eta$ is the learning rate and $\odot$ denotes the element-wise product. This algorithm is sequential by construction: the update at layer $l$ depends on the backpropagation of errors through all subsequent layers. The reliance on the transposed forward weights (weight symmetry) and on stepwise computation hinders parallelism. As architectures grow in size and depth, alternative methods that allow non-symmetric error transmission and enable parallelized training have emerged (see Figure 1).
LEARNING WITH RANDOM FEEDBACK
Feedback Alignment (FA) proposes a biologically inspired alternative to backpropagation by replacing transposed weights with fixed random feedback matrices $B_l$ (Lillicrap et al., 2016).
The error is still propagated sequentially, but independently of the forward weights $W_l$:
$$\delta a_l = \left( B_l \, \delta a_{l+1} \right) \odot \sigma_l'(a_l),$$
where $\delta a_l$ is the error signal used at layer $l$ in place of the true gradient $\partial \mathcal{L} / \partial a_l$ and $\odot$ denotes the element-wise product. This removes the weight-symmetry constraint and is more consistent with biological learning (Lillicrap et al., 2020), but it fails to scale to convolutional networks (Bartunov et al., 2018; Moskovitz et al., 2018). Adaptive variants using weight mirroring (Akrout et al., 2019) can approach BP performance, but they remain sequential and thus offer limited practical advantage.
Direct Feedback Alignment (DFA) (Nøkland, 2016) removes the need for sequential error propagation by projecting the output error $e$ directly to each hidden layer:
$$\delta a_l = \left( B_l \, e \right) \odot \sigma_l'(a_l).$$
This enables parallel updates, but DFA remains limited on complex architectures such as CNNs and Transformers. Attempts to mitigate this include adaptive feedback (e.g., weight mirroring (Akrout et al., 2019)) and architectural variants such as DRTP (Frenkel et al., 2021), but these still fall short of BP on large-scale tasks. Launay et al. (2020) applied DFA to Transformers using either block-wise ('macro') or layer-wise ('micro') feedback, yet BP remains necessary within attention layers.
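To make the difference between the three error-transport rules concrete, the following NumPy sketch computes per-layer error signals on a toy network. BP sends the error back through the transposed forward weights, FA sends it layer by layer through fixed random matrices, and DFA projects the output error straight to each hidden layer. The layer sizes, the tanh hidden units, the linear output with squared-error loss, and all function and variable names are assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: widths chosen only for illustration.
sizes = [4, 8, 8, 3]      # input, two hidden layers, output
L = len(sizes) - 1        # number of weight layers

# Forward weights; W[l] maps activations of width sizes[l] to sizes[l + 1].
W = [rng.standard_normal((sizes[l + 1], sizes[l])) / np.sqrt(sizes[l])
     for l in range(L)]
# FA: B_fa[l] has the shape of W[l + 1].T and replaces it in the recursion.
B_fa = [rng.standard_normal((sizes[l + 1], sizes[l + 2])) for l in range(L - 1)]
# DFA: B_dfa[l] maps the output error directly to hidden layer l.
B_dfa = [rng.standard_normal((sizes[l + 1], sizes[-1])) for l in range(L - 1)]


def forward(x):
    """Return activations [h_0, ..., h_L] and pre-activations [a_1, ..., a_L]."""
    acts, pre = [x], []
    for l in range(L):
        a = W[l] @ acts[-1]
        pre.append(a)
        acts.append(np.tanh(a) if l < L - 1 else a)  # linear output layer
    return acts, pre


def error_signals(pre, e, mode):
    """Per-layer error signals; the output layer always receives e itself."""
    deltas = [None] * L
    deltas[-1] = e
    for l in range(L - 2, -1, -1):
        dsigma = 1.0 - np.tanh(pre[l]) ** 2          # derivative of tanh
        if mode == "bp":
            back = W[l + 1].T @ deltas[l + 1]        # needs the layer above
        elif mode == "fa":
            back = B_fa[l] @ deltas[l + 1]           # still sequential
        elif mode == "dfa":
            back = B_dfa[l] @ e                      # independent of other layers
        else:
            raise ValueError(f"unknown mode: {mode}")
        deltas[l] = back * dsigma
    return deltas


def cosine(u, v):
    """Cosine similarity between two arrays of the same shape."""
    return float(np.sum(u * v) / (np.linalg.norm(u) * np.linalg.norm(v)))


x = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])
acts, pre = forward(x)
e = acts[-1] - y          # gradient of 0.5 * ||y_hat - y||^2 w.r.t. the output

# Weight-gradient estimates: outer product of error signal and layer input.
grads = {
    mode: [np.outer(d, h) for d, h in zip(error_signals(pre, e, mode), acts[:-1])]
    for mode in ("bp", "fa", "dfa")
}

for l in range(L - 1):
    print(f"hidden layer {l}: "
          f"cos(FA, BP) = {cosine(grads['fa'][l], grads['bp'][l]):+.3f}, "
          f"cos(DFA, BP) = {cosine(grads['dfa'][l], grads['bp'][l]):+.3f}")
```

In the `"dfa"` branch no error signal depends on another layer's, so the hidden-layer updates could in principle be computed concurrently, whereas the `"bp"` and `"fa"` branches must run from the top layer down. With untrained random feedback matrices the printed cosines can take either sign; a positive cosine with the BP gradient is the alignment property that these methods rely on.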