Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks