
Combiner: Full Attention Transformer with Sparse Computation Cost



Paper: https://arxiv.org/abs/2107.05768

Overall idea and computation

Overall idea:

  • In vanilla attention, every token interacts with all tokens;

  • Combiner splits the full set of tokens into several groups (one of the groups contains only a single token, the others contain multiple tokens); each token interacts with only one element per group, which reduces the amount of computation. It is a sparse-attention method;

The idea itself is not hard, but the implementation is fairly involved; the details are as follows.

The paper first interprets the attention computation as a conditional expectation:

$$A\left(x_{i}\right)=\mathbb{E}_{p(j \mid i)}\left[v_{j}\right], \qquad p(j \mid i)=\frac{1}{Z\left(x_{i}\right)} \exp \left(\frac{q_{i}}{\sqrt{d}} k_{j}^{\top}\right) \tag{1}$$
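
As a quick sanity check, here is a minimal NumPy sketch (my own illustration, not the paper's code) of one row of Eq. (1): the attention output is simply a softmax-weighted average of the values.

```python
import numpy as np

def attention_row(q_i, K, V):
    """One row of Eq. (1): A(x_i) = E_{p(j|i)}[v_j] with p(j|i) given by a softmax.

    q_i: (d,) query vector; K, V: (L, d) key and value matrices.
    """
    d = q_i.shape[-1]
    scores = K @ q_i / np.sqrt(d)        # q_i k_j^T / sqrt(d) for every j
    p = np.exp(scores - scores.max())    # unnormalized p(j | i), numerically stabilized
    p /= p.sum()                         # divide by Z(x_i)
    return p @ V                         # conditional expectation of v_j
```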

The conditional probability is then decomposed via the law of total probability:

$$p(j \mid i)=\sum_{r=0}^{n_{i}} p\left(j, \Omega_{i}^{r} \mid i\right)=\sum_{r=0}^{n_{i}} p\left(j \mid \Omega_{i}^{r}, i\right) p\left(\Omega_{i}^{r} \mid i\right)=p\left(j \mid \Omega_{i}^{r_{j}}, i\right) p\left(\Omega_{i}^{r_{j}} \mid i\right) \tag{2}$$

where $\Omega_i$ denotes the full set of positions that $i$ can attend to, and the $\Omega_i^r$ form a partition of that set:

$$\bigcup_{r=0}^{n_{i}} \Omega_{i}^{r}=\Omega_{i}, \qquad \Omega_{i}^{r} \cap \Omega_{i}^{s}=\varnothing \quad \forall\, r \neq s.$$

Since both $i$ and $j$ belong to

$$[L]=\{k \mid 1\le k \le L,\ k\in \mathbb{Z}\},$$

the partition above guarantees that there is exactly one $r_j$ such that

$$p\left(j \mid \Omega_{i}^{r_{j}}, i\right) \neq 0.$$

Substituting Eq. (2) into Eq. (1) gives:

$$\begin{aligned}
A\left(x_{i}\right) &=\mathbb{E}_{p(j \mid i)}\left[v_{j}\right]\\
&=\sum_{r=0}^{n_{i}} \sum_{j \in \Omega_{i}^{r}} p\left(j, \Omega_{i}^{r} \mid i\right) v_{j} \\
&=\sum_{j \in \Omega_{i}^{0}} p\left(j, \Omega_{i}^{0} \mid i\right) v_{j}+\sum_{r=1}^{n_{i}} \sum_{j \in \Omega_{i}^{r}} p\left(j, \Omega_{i}^{r} \mid i\right) v_{j}\\
&=\underbrace{\sum_{j \in \Omega_{i}^{0}} \tilde{p}(j \mid i) v_{j}}_{\text{direct expectation}}+\sum_{r=1}^{n_{i}} p\left(\Omega_{i}^{r} \mid i\right) \underbrace{\left(\sum_{j \in \Omega_{i}^{r}} p\left(j \mid \Omega_{i}^{r}\right) v_{j}\right)}_{\text{local expectation}}\\
&=\sum_{j \in \Omega_{i}} \underbrace{\left[\mathbb{I}\left(j \in \Omega_{i}^{0}\right) \tilde{p}(j \mid i)+\sum_{r=1}^{n_{i}} \mathbb{I}\left(j \in \Omega_{i}^{r}\right) p\left(j \mid \Omega_{i}^{r}\right) p\left(\Omega_{i}^{r} \mid i\right)\right]}_{\text{the new effective conditional probability } q(j \mid i)} v_{j}
\end{aligned}$$

The bracketed expression involves three kinds of probabilities:

  • $\tilde{p}(j \mid i) \propto \exp \left(\frac{q_{i}}{\sqrt{d}} k_{j}^{\top}\right)$

  • $p\left(\Omega_{i}^{r} \mid i\right) \propto \exp \left(\frac{q_{i}}{\sqrt{d}} k_{\Omega_{i}^{r}}^{\top}\right)$

  • $p\left(j \mid \Omega_{i}^{r}\right) \propto \exp \left(\frac{q_{\Omega_{i}^{r}}}{\sqrt{d}} k_{j}^{\top}\right)$

Here $k_{\Omega_i^r}$ and $q_{\Omega_i^r}$ are single summary key/query vectors for the group $\Omega_i^r$; see the paper for how the index sets are partitioned and how these group-level vectors are constructed.
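
As a concrete illustration (my own example, assuming the fixed-block causal variant; the block size and indices are arbitrary): with sequence length $L = 16$, block size $4$, and query position $i = 11$, one causal partition is

$$\Omega_i^0 = \{9, 10, 11\}, \qquad \Omega_i^1 = \{1, \dots, 4\}, \qquad \Omega_i^2 = \{5, \dots, 8\},$$

so token $11$ attends to the tokens of its own block directly, and to each earlier block only through a single group-level term $p(\Omega_i^r \mid i)\, p(j \mid \Omega_i^r)$.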

Time complexity

$O(n\sqrt{n})$ or $O(n\log n)$, depending on how the index sets are partitioned.

Training and loss

Unchanged.

Code

Official implementation: https://github.com/google-research/google-research/tree/master/combiner
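
For intuition, below is a minimal single-query NumPy sketch of the computation from the sections above, assuming a fixed-block causal partition and max-pooled group keys/queries; the helper `combiner_row` and the pooling choice are my own illustrative assumptions, not the official implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def combiner_row(i, Q, K, V, block=4):
    """Combiner-style output for query position i (0-indexed, causal sketch).

    Direct term: tokens in i's own block (Omega_i^0).
    Group terms: each earlier block Omega_i^r is represented by pooled key/query
    vectors and contributes its local expectation, weighted by p(Omega_i^r | i).
    Max pooling is an assumption for illustration only.
    """
    d = Q.shape[-1]
    q_i = Q[i]
    start = (i // block) * block
    direct_idx = np.arange(start, i + 1)                                 # Omega_i^0
    groups = [np.arange(s, s + block) for s in range(0, start, block)]   # Omega_i^r, r >= 1

    # Unnormalized scores for the direct tokens and for each group summary.
    direct_scores = K[direct_idx] @ q_i / np.sqrt(d)                     # ~ log p~(j | i)
    group_keys = [K[g].max(axis=0) for g in groups]                      # pooled k_{Omega_i^r}
    group_scores = np.array([gk @ q_i / np.sqrt(d) for gk in group_keys])

    # One softmax over {direct tokens} U {group summaries} gives q(j | i).
    weights = softmax(np.concatenate([direct_scores, group_scores]))
    w_direct, w_groups = weights[:len(direct_idx)], weights[len(direct_idx):]

    out = w_direct @ V[direct_idx]                                       # direct expectation
    for w_r, g in zip(w_groups, groups):
        q_g = Q[g].max(axis=0)                                           # pooled q_{Omega_i^r}
        local = softmax(K[g] @ q_g / np.sqrt(d))                         # p(j | Omega_i^r)
        out = out + w_r * (local @ V[g])                                 # weighted local expectation
    return out
```

Each query interacts with roughly $\sqrt{L}$ direct tokens plus $\sqrt{L}$ group summaries when the block size is $\sqrt{L}$, which matches the $O(n\sqrt{n})$ cost stated above.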

Experiments and applicable scenarios

Overall the results are decent: it beats the competing efficient-attention methods, but it does not fully surpass the vanilla Transformer.

Details

None for now.

Brief comments

The takeaway is similar to other sparse Transformers: only part of the attention computation is actually necessary. The method is, however, somewhat complicated to implement.
