
Large Memory Layers with Product Keys



Paper link: https://arxiv.org/abs/1907.05242

References:

  • https://zhuanlan.zhihu.com/p/76501184
  • https://jishuin.proginn.com/p/763bfbd60db9
  • https://www.pragmatic.ml/large-memory-layers-with-product-keys/
  • https://github.com/facebookresearch/XLM/blob/main/PKM-layer.ipynb
  • https://github.com/lucidrains/product-key-memory

Overall idea and computation

This paper proposes a modification of the Transformer $\mathrm{FFN}$, called $\mathrm{PKM}$ (Product Key Memory). Note that it also has $\mathrm q, \mathrm k, \mathrm v$, and the idea is worth borrowing.

First, some notation:

  • $\mathcal T_m$ denotes $\mathrm{Top}\text{-}m$;

  • $\mathrm k_i\in \mathcal K$, $\mathrm k_i\in \mathbb R^{1\times d}$, $|\mathcal K|= n$;

  • $\mathrm q(\mathrm x),\mathrm v_i \in \mathbb R^{1\times d}$.

The core computation is the following:

$$
\begin{aligned}
\mathcal{I} &=\mathcal{T}_{m}\left(\mathrm q(\mathrm x)^{\top}\mathrm k_{i}\right) \\
\mathrm w &=\operatorname{Softmax}\left(\left(\mathrm q(\mathrm x)^{\top}\mathrm k_{i}\right)_{i \in \mathcal{I}}\right) \\
m(\mathrm x) &=\sum_{i \in \mathcal{I}} \mathrm w_{i} \mathrm v_{i}
\end{aligned}
$$

Note that this module is still a map $\mathbb R^d\to \mathbb R^d$, so it can be viewed as a drop-in analogue of the $\mathrm{FFN}$.
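
Below is a minimal sketch of the three steps with a flat (non-product) key set and a single query vector; the function and variable names are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def memory_lookup(q, K, V, m):
    """Naive key-value memory lookup: q in R^d, K and V in R^{n x d}."""
    scores = K @ q                     # all n inner products q k_i^T, shape (n,)
    top_scores, idx = scores.topk(m)   # I = Top-m indices and their scores
    w = F.softmax(top_scores, dim=-1)  # softmax over the selected scores only
    return w @ V[idx]                  # m(x) = sum_{i in I} w_i v_i, shape (d,)

d, n, m = 64, 1024, 8
q = torch.randn(d)
K, V = torch.randn(n, d), torch.randn(n, d)
out = memory_lookup(q, K, V, m)        # same dimension as the input, like an FFN
```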

Analysis:

  • The first step computes $\mathcal{T}_{m}(\mathrm q\mathrm k^{\top})\in \mathbb R^{1\times m}$, $\mathrm k\in \mathcal K$:

    • since all $n$ scores are needed and each costs $O(d)$, the total cost is $O(nd)$;

    • the $\mathcal T_m$ selection costs $O(m\log n)$;

  • the second step costs $O(m)$;

  • the third step costs $O(md)$.

Since the first step is the main cost, the paper speeds it up with the following structural assumption:

  • $\mathbf k\in \mathcal K=\{(\mathbf c, \mathbf c')\mid \mathbf c\in \mathcal C, \mathbf c'\in \mathcal C'\}$,

    • where $(\mathbf c,\mathbf c')$ denotes vector concatenation, $\mathbf c,\mathbf c'\in \mathbb R^{1\times d/2}$,

    • and $|\mathcal C|=|\mathcal C'|= \sqrt{n}$.

Note that

$$
\begin{aligned}
\mathbf q\mathbf k^{\top} &= \mathbf q[:d/2]\,\mathbf k[:d/2]^{\top} +\mathbf q[d/2:]\, \mathbf k[d/2:]^{\top} \\
&\triangleq \mathbf q^{(1)} (\mathbf k^{(1)})^{\top} + \mathbf q^{(2)} (\mathbf k^{(2)})^{\top}.
\end{aligned}
$$

Combined with the assumption, it suffices to compute the partial scores

$$
\mathbf q^{(1)} (\mathbf k^{(1)}_i)^{\top},\quad \mathbf q^{(2)} (\mathbf k^{(2)}_j)^{\top},\qquad \mathbf k^{(1)}_i\in \mathcal C,\ \mathbf k^{(2)}_j\in \mathcal C'.
$$

Therefore:

  • only $2\sqrt n$ partial scores are needed, each costing $O(d/2)$, so the total cost of this step is $O(\sqrt n\, d)$.

Next, the Top-$m$ of $\mathbf q\mathbf k^{\top}$ has to be recovered from these $2\sqrt n$ partial scores. Every full score is the sum of two partial scores:

$$
\mathbf q\mathbf k^{\top}\in\left\{\mathbf q^{(1)} (\mathbf k^{(1)}_i)^{\top} + \mathbf q^{(2)} (\mathbf k^{(2)}_j)^{\top}\,\middle|\,\mathbf k^{(1)}_i\in {\mathcal C},\ \mathbf k^{(2)}_j\in \mathcal C' \right\},
$$

which gives $\sqrt n\times \sqrt n = n$ candidates in total. Instead of materializing all $n$ sums, the paper first applies $\mathcal{T}_{m}$ to each of the two lists of $\sqrt n$ partial scores and then runs $\mathcal{T}_{m}$ over the $m\times m$ sums of the selected sub-keys; this candidate set is guaranteed to contain the exact Top-$m$ over all $n$ keys. The total time complexity is therefore

$$
O(\sqrt n\, d +m\log n + md +d ) = O((\sqrt n+m)d).
$$

Remark: this assumes $\log n < d$.
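
To make the selection concrete, here is a minimal sketch of the product-key Top-$m$ search for a single query, assuming the two sub-key codebooks are stored as matrices `C1` and `C2` of shape $(\sqrt n, d/2)$; all names are illustrative.

```python
import torch

def product_key_topm(q, C1, C2, m):
    """q in R^d; C1, C2 in R^{sqrt(n) x d/2}. Returns Top-m scores and sub-key indices."""
    d = q.shape[0]
    s1 = C1 @ q[: d // 2]               # partial scores q^(1) (k^(1)_i)^T, shape (sqrt(n),)
    s2 = C2 @ q[d // 2 :]               # partial scores q^(2) (k^(2)_j)^T, shape (sqrt(n),)
    v1, i1 = s1.topk(m)                 # Top-m on each half
    v2, i2 = s2.topk(m)
    cand = v1[:, None] + v2[None, :]    # (m, m) candidate sums; contains the exact Top-m
    top, flat = cand.flatten().topk(m)  # Top-m over the m*m candidates
    rows, cols = flat // m, flat % m
    return top, i1[rows], i2[cols]      # scores and the (i, j) indices of the full keys

sqrt_n, d, m = 32, 64, 8
q = torch.randn(d)
C1, C2 = torch.randn(sqrt_n, d // 2), torch.randn(sqrt_n, d // 2)
scores, i, j = product_key_topm(q, C1, C2, m)
```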

Time complexity

Assume $\mathbf x\in \mathbb R^{L\times d}$; the total time complexity is then

$$
O(L (\sqrt n+m)d).
$$

For comparison, the $\mathrm{FFN}$ costs

$$
O(4Ld^2),
$$

so as long as $\sqrt n + m < 4d$, which holds even for very large memories since only $\sqrt n$ enters, PKM is faster than the standard $\mathrm{FFN}$.
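
As an illustrative back-of-the-envelope comparison (the values $d=512$, $n=1024^2$, $m=32$ are chosen here purely for illustration):

$$
L(\sqrt n+m)d = L\,(1024+32)\cdot 512 \approx 5.4\times 10^{5}\,L,
\qquad
4Ld^2 = 4L\cdot 512^2 \approx 1.0\times 10^{6}\,L,
$$

i.e. roughly half the operations of the FFN per token, even though the memory holds about a million key-value slots.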

Training and loss

Unchanged from the standard setup.

Code
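
Official and third-party implementations are linked above (the XLM PKM-layer notebook and lucidrains/product-key-memory). As a reference point, here is a minimal, self-contained single-head sketch that puts the pieces together; it omits details the paper uses (e.g. multiple query heads and query batch normalization), and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    """Minimal single-head PKM: a drop-in R^d -> R^d module (simplified vs. the paper)."""

    def __init__(self, d, sqrt_n, m):
        super().__init__()
        self.d, self.sqrt_n, self.m = d, sqrt_n, m
        self.query = nn.Linear(d, d)                          # q(x)
        self.c1 = nn.Parameter(torch.randn(sqrt_n, d // 2))   # sub-keys C
        self.c2 = nn.Parameter(torch.randn(sqrt_n, d // 2))   # sub-keys C'
        self.values = nn.Embedding(sqrt_n * sqrt_n, d)        # one value per product key

    def forward(self, x):                                     # x: (..., d)
        q = self.query(x)
        s1 = q[..., : self.d // 2] @ self.c1.t()              # (..., sqrt_n) partial scores
        s2 = q[..., self.d // 2 :] @ self.c2.t()
        v1, i1 = s1.topk(self.m, dim=-1)                      # Top-m on each half
        v2, i2 = s2.topk(self.m, dim=-1)
        cand = v1.unsqueeze(-1) + v2.unsqueeze(-2)            # (..., m, m) candidate sums
        scores, flat = cand.flatten(-2).topk(self.m, dim=-1)  # exact Top-m over all n keys
        rows, cols = flat // self.m, flat % self.m
        idx = torch.gather(i1, -1, rows) * self.sqrt_n + torch.gather(i2, -1, cols)
        w = F.softmax(scores, dim=-1)                         # weights over the selected slots
        return (w.unsqueeze(-1) * self.values(idx)).sum(dim=-2)

layer = ProductKeyMemory(d=64, sqrt_n=32, m=8)
y = layer(torch.randn(4, 10, 64))                             # output shape: (4, 10, 64)
```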

Experiments and applicable scenarios

Since it is a drop-in replacement for the FFN, it applies to essentially any scenario; the motivation for replacing the FFN this way is not entirely clear, but the experimental results are very good.

Details

For implementation details, refer to the official code.

Brief review

Summary:

  • the idea is quite distinctive, and the results are surprisingly good;

  • worth reproducing.
