Transformer-Evolution-Paper

Ctrlk

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

论文地址：

https://arxiv.org/abs/2002.10444

整体思路以及计算方式

思路和ReZero类似，用加权残差取代Normalize，唯一的区别是这篇讨论的是Batch Normalize，而ReZero讨论的是Layer Normalize：

\mathbf {x}_{i+1}=\mathbf {x}_{i}+\alpha_{i} F\left(\mathbf {x}_{i}\right)

后续部分从略。

PreviousReZero is All You Need Fast Convergence at Large Depth NextImproving Deep Transformer with Depth-Scaled Initialization and Merged Attention

Last updated 3 years ago