RWKV

Project repository: https://github.com/BlinkDL/RWKV-LM

Training

RWKV is an interesting algorithm that is parallelizable at training time and recurrent at inference time. There are four versions in total, introduced one by one below. Assume the input is $\mathbf X\in \mathbb R^{n\times d}$, the activation function is $f$, $n$ is the sequence length, and $d$ is the feature dimension.

Time mixing

Time mixing can be understood as a forced bi-gram: each token keeps part of its own information and part of the previous token's information:

```python
x = torch.cat([self.time_shift(x)[:,:T,:C//2], x[:,:T,C//2:]], dim=2)
```
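
For context, a minimal sketch of the whole operation, assuming `self.time_shift` is a one-step zero-padded shift along the sequence dimension (which is how the reference code typically defines it); this is a sketch, not the exact reference implementation:

```python
import torch
import torch.nn as nn

class TimeMix(nn.Module):
    # Forced bi-gram: half of the channels come from the previous token.
    def __init__(self):
        super().__init__()
        # shift the sequence down by one step, zero-filling the first position
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))

    def forward(self, x):                    # x: (B, T, C)
        B, T, C = x.shape
        # first C//2 channels from the previous token, the rest from the current token
        return torch.cat([self.time_shift(x)[:, :T, :C // 2],
                          x[:, :T, C // 2:]], dim=2)
```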

Feature mixing

First, note that the feature mixing is nearly identical across all versions (a minimal sketch follows the list):

  1. Time mixing gives $\mathbf X_1\in \mathbb R^{n\times d}$;

  2. $\mathbf K =\mathbf X_1 \mathbf W_k\in \mathbb R^{n\times e}$, $\mathbf V =\mathbf X_1 \mathbf W_v\in \mathbb R^{n\times e}$, $\mathbf R =\mathbf X_1 \mathbf W_r \in \mathbb R^{n\times d}$;

  3. $\mathbf{WKV}= (f(\mathbf K) \odot \mathbf V)\,\mathbf W_w \in \mathbb R^{n\times d}$;

  4. $\mathbf{RWKV} =\mathrm{Sigmoid}(\mathbf R)\odot \mathbf{WKV} \in \mathbb R^{n\times d}$.
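
A minimal PyTorch sketch of these four steps, assuming $f$ is a generic elementwise activation supplied by the caller; class and weight names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class FeatureMix(nn.Module):
    # Feature mixing, steps 1-4 above; x1 is the output of time mixing.
    def __init__(self, d, e, f=torch.relu):
        super().__init__()
        self.f = f                            # elementwise activation f
        self.Wk = nn.Linear(d, e, bias=False)
        self.Wv = nn.Linear(d, e, bias=False)
        self.Wr = nn.Linear(d, d, bias=False)
        self.Ww = nn.Linear(e, d, bias=False)

    def forward(self, x1):                    # x1: (n, d)
        k, v, r = self.Wk(x1), self.Wv(x1), self.Wr(x1)
        wkv = self.Ww(self.f(k) * v)          # (f(K) ⊙ V) W_w ∈ R^{n×d}
        return torch.sigmoid(r) * wkv         # Sigmoid(R) ⊙ WKV
```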

V1

  1. Token mixing (time mixing) gives $\mathbf X_1\in \mathbb R^{n\times d}$;

  2. $\mathbf K =\exp(\mathbf X_1 \mathbf W_k)\in \mathbb R^{n\times e}$, $\mathbf V =\mathbf X_1 \mathbf W_v\in \mathbb R^{n\times e}$, $\mathbf R =\mathbf X_1 \mathbf W_r \in \mathbb R^{n\times e}$;

  3. $\mathbf K_{1}=\mathrm{cumsum}(\mathbf K, \mathrm{dim}=0)\in \mathbb R^{n\times e}$;

  4. $\mathbf W_w \in \mathbb R^{n\times n}$, with $\mathbf W_w[i,j]=\alpha_i \lambda^{i-j} b_j$;

  5. $\mathbf{KV}= \mathbf K \odot \mathbf V \in \mathbb R^{n\times e}$;

  6. $\mathbf{WKV}=\mathbf W_w\, \mathbf{KV} \in \mathbb R^{n\times e}$;

  7. $\mathbf{RWKV} =\mathrm{Sigmoid}(\mathbf R)\odot \mathbf{WKV} / \mathbf K_1 \in \mathbb R^{n\times e}$ (the elementwise division by $\mathbf K_1$ normalizes the weighted sum of $\exp$ terms);

  8. $\mathbf O= \mathbf{RWKV}\, \mathbf W_o \in \mathbb R^{n\times d}$;

  9. $\mathbf O= \mathbf O \times \gamma \in \mathbb R^{n\times d}$, where $\gamma \in \mathbb R^{n}$ is a per-position scale (a dense sketch of steps 3-9 follows the list).
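
A naive dense-form sketch of steps 3-9, assuming the decay matrix is restricted to the causal region $i \ge j$ (V2's piecewise definition makes this explicit); parameter names are illustrative:

```python
import torch

def rwkv_v1(K, V, R, alpha, b, lam, Wo, gamma):
    # K, V, R: (n, e), with K = exp(X1 @ Wk); alpha, b: (n,); lam: scalar decay; Wo: (e, d)
    n, e = K.shape
    i = torch.arange(n).unsqueeze(1)              # (n, 1)
    j = torch.arange(n).unsqueeze(0)              # (1, n)
    Ww = alpha.unsqueeze(1) * lam ** (i - j).clamp(min=0) * b.unsqueeze(0)
    Ww = Ww * (i >= j)                            # causal mask (assumption)
    K1 = torch.cumsum(K, dim=0)                   # step 3
    KV = K * V                                    # step 5
    WKV = Ww @ KV                                 # step 6, (n, e)
    RWKV = torch.sigmoid(R) * WKV / K1            # step 7
    O = RWKV @ Wo                                 # step 8, (n, d)
    return gamma.unsqueeze(1) * O                 # step 9, per-position scale
```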

V2

Step 4 is modified to:

$$
\mathbf W_w \in \mathbb R^{n\times n\times e}, \qquad
\mathbf W_w[i,j,k]=\begin{cases}
0, & i-j < 0,\\
c_k, & i-j = 0,\\
\lambda_k^{\,i-j}, & i-j \ge 1.
\end{cases}
$$

Step 3 is modified to:

$$
\begin{aligned}
\mathbf{WK}[:,k] &= \mathbf W_w[:,:,k]\,\mathbf K[:,k] \in \mathbb R^{n\times 1},\\
\mathbf K_{1} &= \mathrm{cumsum}(\mathbf{WK}, \mathrm{dim}=0)\in \mathbb R^{n\times e}.
\end{aligned}
$$

Step 6 is modified to:

$$
\mathbf{WKV}[:,k]=\mathbf W_w[:,:,k]\,\mathbf{KV}[:,k] \in \mathbb R^{n\times 1}.
$$

Step 9 is removed.
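
A sketch of the per-channel weighting used in the modified steps 3 and 6, treating $c$ and $\lambda$ as per-channel parameter vectors; function and variable names are assumptions:

```python
import torch

def rwkv_v2_weighting(K, V, c, lam):
    # K, V: (n, e); c, lam: (e,) per-channel parameters
    n, e = K.shape
    i = torch.arange(n).view(n, 1, 1)
    j = torch.arange(n).view(1, n, 1)
    dist = i - j                                       # (n, n, 1)
    decay = lam.view(1, 1, e) ** dist.clamp(min=1)     # lambda_k^{i-j} for i-j >= 1
    Ww = torch.where(dist == 0, c.view(1, 1, e), decay)
    Ww = Ww * (dist >= 0)                              # zero out the i - j < 0 region
    # per-channel matmul: out[:, k] = Ww[:, :, k] @ x[:, k]
    WKV = torch.einsum('ijk,jk->ik', Ww, K * V)        # modified step 6
    WK  = torch.einsum('ijk,jk->ik', Ww, K)            # modified step 3, before the cumsum
    return WKV, WK
```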

V3

Step 2 is modified to the following (a sketch follows these bullets):

  • $\mathbf X_{p}=\lambda_p \mathbf X +(1-\lambda_p)\, \mathbf X_1\in \mathbb R^{n\times d}$, for $p \in \{k, v, r\}$;

  • $\mathbf K =\exp(\mathbf X_k \mathbf W_k)\in \mathbb R^{n\times e}$, $\mathbf V =\mathbf X_v \mathbf W_v\in \mathbb R^{n\times e}$, $\mathbf R =\mathbf X_r \mathbf W_r \in \mathbb R^{n\times e}$.
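
In code, this is a learned interpolation between the raw input $\mathbf X$ and the time-mixed $\mathbf X_1$, one $\lambda_p$ per projection. The document does not fix the shape of $\lambda_p$, so it is taken per-channel here; all names are assumptions:

```python
import torch
import torch.nn as nn

class V3Projections(nn.Module):
    # Modified step 2: blend X and X1 with a learned lambda_p per projection.
    def __init__(self, d, e):
        super().__init__()
        self.lam = nn.ParameterDict({p: nn.Parameter(torch.rand(d)) for p in 'kvr'})
        self.Wk = nn.Linear(d, e, bias=False)
        self.Wv = nn.Linear(d, e, bias=False)
        self.Wr = nn.Linear(d, e, bias=False)

    def forward(self, x, x1):                          # x, x1: (n, d)
        xk = self.lam['k'] * x + (1 - self.lam['k']) * x1
        xv = self.lam['v'] * x + (1 - self.lam['v']) * x1
        xr = self.lam['r'] * x + (1 - self.lam['r']) * x1
        return torch.exp(self.Wk(xk)), self.Wv(xv), self.Wr(xr)
```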

V4

V4 exploits the identity $a/b = (am)/(bm)$ to keep the magnitude of the $\mathbf K$ (i.e. $\exp$) terms under control and avoid numerical problems.
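
A toy single-channel illustration of the trick, assuming the common choice $m = \exp(-\max_i k_i)$; the decay weights are ignored here for brevity:

```python
import torch

def weighted_average_stable(k, v):
    # a / b with a = sum_i exp(k_i) * v_i and b = sum_i exp(k_i).
    # Multiplying both by m = exp(-max(k)) leaves the ratio unchanged
    # while keeping every exp() bounded by 1.
    m = k.max()
    a = torch.sum(torch.exp(k - m) * v)    # a * m
    b = torch.sum(torch.exp(k - m))        # b * m
    return a / b
```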

Inference

For the V2, V3 and V4 versions, $\mathbf{RWKV}$ can be computed recurrently at inference time; define (a single-step sketch follows the recursion):

$$
\begin{aligned}
\mathbf a_{t+1} &= \lambda_k\,\mathbf a_{t} + \mathbf{KV}[t+1,:]\in \mathbb R^{1\times e}, \\
\mathbf b_{t+1} &= \mathbf a_t + (1-c_k)\,\mathbf{KV}[t+1,:]\in \mathbb R^{1\times e}, \\
\mathbf c_{t+1} &= \lambda_k\,\mathbf c_{t} + \mathbf K[t+1,:] \in \mathbb R^{1\times e}, \\
\mathbf d_{t+1} &= \mathbf c_{t} + (1-c_k)\,\mathbf K[t+1,:] \in \mathbb R^{1\times e}, \\
\mathbf{RWKV}_{t+1} &= \mathrm{Sigmoid}(\mathbf R[t+1,:]) \odot (\mathbf b_{t+1} / \mathbf d_{t+1}) \in \mathbb R^{1\times e}.
\end{aligned}
$$
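
A sketch of a single recurrent inference step following the recursion above; the state $(\mathbf a_t, \mathbf c_t)$ is carried across tokens, `lam` and `c` stand for the per-channel $\lambda_k$ and $c_k$, and the V4 rescaling is omitted for brevity (names are assumptions):

```python
import torch

def rwkv_recurrent_step(state, k_t, v_t, r_t, lam, c):
    # state = (a, c_acc), each of shape (e,); k_t, v_t, r_t: (e,) for the new token
    a, c_acc = state
    kv_t = k_t * v_t
    b = a + (1 - c) * kv_t                  # numerator including the current token
    d = c_acc + (1 - c) * k_t               # denominator including the current token
    out = torch.sigmoid(r_t) * (b / d)      # RWKV_{t+1}
    a = lam * a + kv_t                      # decayed numerator state a_{t+1}
    c_acc = lam * c_acc + k_t               # decayed denominator state c_{t+1}
    return out, (a, c_acc)
```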