Linear Complexity Randomized Self-attention Mechanism



Paper link: https://arxiv.org/abs/2204.04667

Overall Idea and Computation

Earlier methods such as RFA and Performer (collectively referred to as RFA below) give unbiased estimates of $\exp(q^{\top} k)$, but not of the softmax ratio $\exp(q^{\top} k)/\big(\sum_{k'} \exp(q^{\top} k')\big)$. Fixing this gap is the main motivation of the paper, whose overall line of argument is:

  • review RFA and importance sampling;

  • point out that RFA is not an unbiased estimator of softmax attention, and use importance sampling to derive RA (Randomized Attention);

  • point out that RA is too expensive to compute, and introduce LARA (Linear Randomized Attention) as a compromise.

RFA and Importance Sampling

RFA:

If we have:

$$\exp\left(x^{\top} y\right)=\mathbb{E}_{\omega \sim \mathcal{N}(\omega ; 0, I)}\left[\xi(x, \omega)^{\top} \xi(y, \omega)\right] \tag{1}$$

then:

$$\begin{aligned}
\frac{\sum_{m=1}^{M} \exp\left(q_{n}^{\top} k_{m}\right) v_{m}^{\top}}{\sum_{m'=1}^{M} \exp\left(q_{n}^{\top} k_{m'}\right)}
&\approx \frac{\sum_{m=1}^{M} \sum_{s=1}^{S} \xi\left(q_{n}, \omega_{s}\right)^{\top} \xi\left(k_{m}, \omega_{s}\right) v_{m}^{\top}}{\sum_{m'=1}^{M} \sum_{s=1}^{S} \xi\left(q_{n}, \omega_{s}\right)^{\top} \xi\left(k_{m'}, \omega_{s}\right)} \\
&= \frac{\sum_{s=1}^{S} \xi\left(q_{n}, \omega_{s}\right)^{\top} \sum_{m=1}^{M} \xi\left(k_{m}, \omega_{s}\right) v_{m}^{\top}}{\sum_{s=1}^{S} \xi\left(q_{n}, \omega_{s}\right)^{\top} \sum_{m'=1}^{M} \xi\left(k_{m'}, \omega_{s}\right)} \\
&:= \operatorname{RFA}\left(q_{n}, K, V\right)
\end{aligned}$$

From this it is not hard to see that although (1) is an unbiased estimator of $\exp(\cdot)$, RFA is not an unbiased estimator of the attention output. The reason is the following fact:

$$\mathbb{E}[x_i] = x,\ \mathbb{E}[y_i] = y \;\not\Rightarrow\; \mathbb{E}\!\left[\frac{x_i}{y_i}\right] = \frac{x}{y} \tag{2}$$
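For reference, here is a minimal NumPy sketch of the RFA estimator above (not the paper's code), assuming the standard Performer-style positive feature map $\xi(x,\omega)=\exp(\omega^{\top}x-\|x\|^2/2)$, which satisfies (1); function names and shapes are illustrative.

```python
import numpy as np

def positive_features(X, omegas):
    """xi(x, w) = exp(w^T x - ||x||^2 / 2).  X: (n, d), omegas: (S, d) -> (n, S)."""
    return np.exp(X @ omegas.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True))

def rfa_attention(Q, K, V, S=64, seed=0):
    """RFA(q_n, K, V): each exp(q^T k) is estimated without bias, but the ratio is not."""
    rng = np.random.default_rng(seed)
    omegas = rng.standard_normal((S, Q.shape[-1]))        # omega_s ~ N(0, I)
    phi_q = positive_features(Q, omegas)                  # (N, S)
    phi_k = positive_features(K, omegas)                  # (M, S)
    num = phi_q @ (phi_k.T @ V)                           # right-product trick: O((N + M) S d)
    den = phi_q @ phi_k.sum(axis=0)                       # (N,)
    return num / den[:, None]

def softmax_attention(Q, K, V):
    logits = Q @ K.T
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Q, K, V = rng.normal(size=(4, 8)) * 0.5, rng.normal(size=(6, 8)) * 0.5, rng.normal(size=(6, 5))
    print(np.abs(rfa_attention(Q, K, V, S=10000) - softmax_attention(Q, K, V)).max())
```

With enough samples the output gets close to softmax attention, but for any finite $S$ the ratio estimator is biased, which is exactly the issue raised in (2).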

This is the paper's main point of departure. The difficulty in (2) comes from the ratio's denominator, which is hard to handle directly, so the authors bring in importance sampling:

$$\mathbb{E}_{p(\omega)}[f(\omega)]=\mathbb{E}_{g(\omega)}\left[\frac{p(\omega)}{g(\omega)} f(\omega)\right] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{p\left(\omega_{s}\right)}{g\left(\omega_{s}\right)} f\left(\omega_{s}\right) \tag{3}$$

Note that the density $p(\omega)$ can generally be written as

$$p(\omega)=\tilde{p}(\omega) / Z$$

where the normalizer $Z$ is usually intractable, so (3) cannot be used directly. To eliminate $Z$, take $f=1$ in (3):

$$1=\mathbb{E}_{p(\omega)}[1]=\mathbb{E}_{g(\omega)}\left[\frac{p(\omega)}{g(\omega)}\right] \approx \frac{1}{S} \sum_{s=1}^{S} \frac{p\left(\omega_{s}\right)}{g\left(\omega_{s}\right)}$$

Then:

$$\begin{aligned}
\mathbb{E}_{p(\omega)}[f(\omega)] &= \frac{\mathbb{E}_{g(\omega)}\left[\frac{p(\omega)}{g(\omega)} f(\omega)\right]}{\mathbb{E}_{g(\omega)}\left[\frac{p(\omega)}{g(\omega)}\right]} \\
&\approx \frac{\frac{1}{S} \sum_{s=1}^{S} \frac{1}{Z} \frac{\tilde{p}\left(\omega_{s}\right)}{g\left(\omega_{s}\right)} f\left(\omega_{s}\right)}{\frac{1}{S} \sum_{s=1}^{S} \frac{1}{Z} \frac{\tilde{p}\left(\omega_{s}\right)}{g\left(\omega_{s}\right)}}
= \frac{\sum_{s=1}^{S} \frac{\tilde{p}\left(\omega_{s}\right)}{g\left(\omega_{s}\right)} f\left(\omega_{s}\right)}{\sum_{s=1}^{S} \frac{\tilde{p}\left(\omega_{s}\right)}{g\left(\omega_{s}\right)}}
\end{aligned}$$

This cancels the intractable normalizer $Z$ and makes the importance-sampling estimate computable; this is the standard self-normalized importance sampling estimator.
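A tiny self-contained example of this self-normalized estimator, unrelated to attention, just to show that only the unnormalized $\tilde p$ is needed (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(w) ∝ p_tilde(w) = exp(-(w - 2)^2 / 2), i.e. N(2, 1); pretend Z = sqrt(2*pi) is unknown.
p_tilde = lambda w: np.exp(-0.5 * (w - 2.0) ** 2)
f = lambda w: w ** 2                       # estimate E_p[w^2] = 2^2 + 1 = 5

# Proposal g = N(0, 3^2): easy to sample from and to evaluate exactly.
S = 200_000
w = rng.normal(0.0, 3.0, size=S)
g = np.exp(-0.5 * (w / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

r = p_tilde(w) / g                         # unnormalized weights p_tilde / g
print(np.sum(r * f(w)) / np.sum(r))        # self-normalized estimate, ≈ 5.0; Z is never needed
```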

RA

Combining the pieces above, the authors arrive at the following result:

$$\operatorname{SoftmaxAttn}\left(q_{n}, K, V\right)=\mathbb{E}_{p_{n}(\omega)}\left[f_{n}(\omega)\right] \tag{4}$$

where:

$$\begin{aligned}
p_{n}(\omega)&=\sum_{m=1}^{M} \pi_{m} \mathcal{N}\left(\omega ; q_{n}+k_{m}, I\right) \\
f_{n}(\omega)&=\frac{\xi\left(q_{n}, \omega\right)^{\top} \sum_{m=1}^{M} \xi\left(k_{m}, \omega\right) v_{m}^{\top}}{\xi\left(q_{n}, \omega\right)^{\top} \sum_{m'=1}^{M} \xi\left(k_{m'}, \omega\right)}
\end{aligned}$$

Note that across all $N$ queries this involves $MN$ Gaussian components in the query-specific distributions $p_n(\omega)$, so the time complexity is still $O(MNd)$: RA by itself brings no speedup, and this is what has to be fixed next.
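To make (4) concrete, below is a minimal sketch of the RA estimate for a single query. It assumes the positive feature map above, under which the unnormalized density $\mathcal{N}(\omega;0,I)\,\xi(q_n,\omega)^{\top}\sum_m \xi(k_m,\omega)$ reduces to a Gaussian mixture with components $\mathcal{N}(\omega; q_n+k_m, I)$ and weights $\pi_m=\exp(q_n^{\top}k_m)/\sum_{m'}\exp(q_n^{\top}k_{m'})$; this is my reading of the construction, not the paper's pseudocode.

```python
import numpy as np

def ra_single_query(q_n, K, V, S=128, seed=0):
    """RA(q_n, K, V) ≈ (1/S) sum_s f_n(omega_s), with omega_s ~ p_n = sum_m pi_m N(q_n + k_m, I)."""
    rng = np.random.default_rng(seed)
    M, d = K.shape

    # Mixture weights pi_m = softmax_m(q_n^T k_m)   (assumed form, see lead-in)
    logits = K @ q_n
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()

    # Draw omega_s from the Gaussian mixture: pick a component, then add N(0, I) noise.
    comp = rng.choice(M, size=S, p=pi)
    omegas = q_n + K[comp] + rng.standard_normal((S, d))

    # f_n(omega) = [xi(q_n,w)^T sum_m xi(k_m,w) v_m] / [xi(q_n,w)^T sum_m xi(k_m,w)]
    phi_q = np.exp(omegas @ q_n - 0.5 * q_n @ q_n)                # (S,)   xi(q_n, omega_s)
    phi_k = np.exp(omegas @ K.T - 0.5 * np.sum(K ** 2, axis=-1))  # (S, M) xi(k_m, omega_s)
    num = phi_q[:, None] * (phi_k @ V)                            # (S, d_v)
    den = phi_q * phi_k.sum(axis=-1)                              # (S,)  (xi(q_n, .) actually cancels)
    return np.mean(num / den[:, None], axis=0)                    # Monte-Carlo average of f_n(omega_s)
```

Note that this has to be repeated for every query with its own mixture, which is where the $O(MNd)$ cost comes from.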

LARA

First, recall MIS (multiple importance sampling):

$$\begin{gathered}
\mathbb{E}_{p_{n}(\omega)}\left[f_{n}(\omega)\right] \approx \sum_{c=1}^{C} \alpha_{n c}\left(\omega_{c}\right) \frac{p_{n}\left(\omega_{c}\right)}{g_{c}\left(\omega_{c}\right)} f_{n}\left(\omega_{c}\right) \\
\omega_{c} \sim g_{c}(\omega), \quad c=1, \ldots, C \\
\sum_{c=1}^{C} \alpha_{n c} = 1
\end{gathered}$$

The benefit is that the number of proposal distributions drops to $C \ll MN$, with the $C$ samples shared across all queries. After a fairly involved derivation, the authors arrive at:

$$\begin{aligned}
\alpha_{n c}\left(\omega_{c}\right)&=\frac{g_{c}\left(\omega_{c}\right)}{\sum_{c'=1}^{C} g_{c'}\left(\omega_{c}\right)}+r_{n c}^{\prime}-\frac{1}{C} \sum_{c'=1}^{C} r_{n c'}^{\prime} \\
r_{n c}^{\prime}&=\frac{\exp\left(q_{n}^{\top} \tilde{q}_{c}\right)}{\sum_{c'=1}^{C} \exp\left(q_{n}^{\top} \tilde{q}_{c'}\right)}
\end{aligned}$$

How $\tilde{q}_{c}$ is computed has not been fully worked out here; to be filled in later.

The final estimator:

$$\begin{aligned}
\mathbb{E}_{p_{n}(\omega)}\left[f_{n}(\omega)\right] & \approx \frac{\sum_{c=1}^{C} \alpha_{n c}\left(\omega_{c}\right) \frac{\tilde{p}_{n}\left(\omega_{c}\right)}{g_{c}\left(\omega_{c}\right)} f_{n}\left(\omega_{c}\right)}{\sum_{c=1}^{C} \alpha_{n c}\left(\omega_{c}\right) \frac{\tilde{p}_{n}\left(\omega_{c}\right)}{g_{c}\left(\omega_{c}\right)}} := \operatorname{LARA}\left(q_{n}, K, V\right) \\
\tilde{p}_{n}(\omega) &= \mathcal{N}(\omega ; 0, I)\, \xi\left(q_{n}, \omega\right)^{\top} \sum_{m=1}^{M} \xi\left(k_{m}, \omega\right)
\end{aligned}$$

Time Complexity

It is not hard to see that the complexity is $O(NCd^2)$.

Training and Loss

Unchanged.

Code

No reference implementation here yet; detailed pseudocode can be found in the original paper. A rough sketch is given below.
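The sketch is heavily simplified and fills the gaps in the notes above with choices of my own: the proposals are $g_c=\mathcal{N}(\tilde q_c+\tilde k_c, I)$ with $\tilde q_c, \tilde k_c$ mean-pooled chunk landmarks, one shared sample per proposal, and only the balance-heuristic part of $\alpha_{nc}$ is used (the $r'_{nc}$ correction is dropped). Treat it as an illustration of the self-normalized MIS computation, not as the paper's algorithm.

```python
import numpy as np

def log_xi(X, omegas):
    """log xi(x, w) = w^T x - ||x||^2 / 2.  X: (n, d), omegas: (C, d) -> (n, C)."""
    return X @ omegas.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)

def log_gauss(omegas, means):
    """log N(omega_c; mean_j, I) up to the constant -d/2 * log(2*pi), which cancels below.
    omegas: (C, d), means: (J, d) -> (C, J)."""
    diff = omegas[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2, axis=-1)

def lara_attention(Q, K, V, C=8, seed=0):
    """Simplified LARA-style estimate; assumes C <= min(len(Q), len(K))."""
    d = Q.shape[1]
    rng = np.random.default_rng(seed)

    # Landmarks: mean-pool C contiguous chunks of queries / keys (one simple choice).
    q_lm = np.stack([c.mean(0) for c in np.array_split(Q, C)])   # (C, d), plays the role of q_tilde_c
    k_lm = np.stack([c.mean(0) for c in np.array_split(K, C)])   # (C, d)
    mu = q_lm + k_lm                                             # proposal means: g_c = N(mu_c, I)
    omegas = mu + rng.standard_normal((C, d))                    # one shared sample omega_c ~ g_c

    # f_n(omega_c) and log p_tilde_n(omega_c) for all n, c at once (linear in N and M).
    lq = log_xi(Q, omegas)                                       # (N, C)
    phi_k = np.exp(log_xi(K, omegas))                            # (M, C)
    num = np.einsum('nc,mc,mv->ncv', np.exp(lq), phi_k, V)       # xi(q_n) * sum_m xi(k_m) v_m
    den = np.exp(lq) * phi_k.sum(0)[None, :]                     # xi(q_n) * sum_m xi(k_m)
    f = num / den[..., None]                                     # f_n(omega_c): (N, C, d_v)
    log_p_tilde = log_gauss(omegas, np.zeros((1, d)))[:, 0][None, :] + lq + np.log(phi_k.sum(0))[None, :]

    # Balance-heuristic part of alpha: g_c(omega_c) / sum_c' g_c'(omega_c)   (r'_{nc} correction dropped).
    log_g = log_gauss(omegas, mu)                                # (C, C): [c, j] = log g_j(omega_c)
    log_gc = np.diag(log_g)                                      # log g_c(omega_c)
    alpha = np.exp(log_gc) / np.exp(log_g).sum(1)                # (C,)

    # Self-normalized MIS estimate per query (the normalizer of p_n cancels).
    w = alpha[None, :] * np.exp(log_p_tilde - log_gc[None, :])   # alpha_c * p_tilde_n / g_c : (N, C)
    return (w[..., None] * f).sum(1) / w.sum(1, keepdims=True)   # (N, d_v)
```

Because the $C$ samples are shared by all queries, everything above is computed with tensors of size proportional to $(N+M)\times C$, which is where the linear complexity comes from.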

Experiments and Applicable Scenarios

The paper mainly evaluates the encoder setting, with fairly good results; the decoder setting has not been tested yet.

Details

There are probably quite a few implementation tricks involved; these will be discussed in a future reproduction.

Brief Remarks

A theoretically heavy paper that is nevertheless easy to follow, with a clear motivation. Personally I find it more worth attention than Performer.
