Transformer-Evolution-Paper

Normalize_And_Residual

  • ReZero is All You Need: Fast Convergence at Large Depth
  • Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks
  • Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
  • RealFormer: Transformer Likes Residual Attention
  • On Layer Normalizations and Residual Connections in Transformers
  • Transformers without Tears: Improving the Normalization of Self-Attention
  • Query-Key Normalization for Transformers
  • Understanding the Difficulty of Training Transformers
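For quick reference, a minimal sketch of the three residual-update variants the papers above revolve around: Post-LN (the original Transformer), Pre-LN, and ReZero. The notation is my own, not taken from any single paper: $x_l$ is the input to layer $l$, $F$ is the sublayer (self-attention or FFN), and $\alpha_l$ is the learnable scalar that ReZero initializes to 0.

$$\text{Post-LN:}\quad x_{l+1} = \mathrm{LN}\big(x_l + F(x_l)\big)$$

$$\text{Pre-LN:}\quad x_{l+1} = x_l + F\big(\mathrm{LN}(x_l)\big)$$

$$\text{ReZero:}\quad x_{l+1} = x_l + \alpha_l\, F(x_l), \qquad \alpha_l \text{ initialized to } 0$$

Post-LN normalizes after the residual addition, Pre-LN normalizes only the sublayer input so the skip path stays identity, and ReZero drops LayerNorm entirely, gating each sublayer so every block starts as the identity function.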

Last updated 2 years ago