Transformer-Evolution-Paper

FFN

• Large Memory Layers with Product Keys
• Transformer Feed-Forward Layers Are Key-Value Memories
• GLU Variants Improve Transformer
• Simple Recurrence Improves Masked Language Models
• Pay Attention to MLPs
• S2-MLP: Spatial-Shift MLP Architecture for Vision
• S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
• HyperMixer: An MLP-based Green AI Alternative to Transformers
• DeFINE: DEep Factorized INput Token Embeddings for Neural Sequence Modeling & DeLighT: Deep and Light-weight Transformer
• When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism
• Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?
