
XLNet Generalized Autoregressive Pretraining for Language Understanding



Paper: https://arxiv.org/abs/1906.08237

References:

  • https://peijun.rocks/2021/12/18/fa646ceb.html/

  • https://zhuanlan.zhihu.com/p/89712347

  • https://bbs.dian.org.cn/topic/975/%E8%AF%A6%E8%A7%A3xlnet-generalized-autoregressive-pretraining-for-language-understanding%E8%AE%BA%E6%96%87%E7%AC%94%E8%AE%B0

Overall Approach and Computation

XLNet proposes a new pretraining objective that combines the strengths of autoregressive (AR, as in GPT) and autoencoding (AE, as in BERT) language modeling.

Given a sentence $\mathbf{x}=\left[x_{1}, \cdots, x_{T}\right]$, the AR language modeling objective is:

$$
\max_{\theta}\ \log p_{\theta}(\mathbf{x})=\sum_{t=1}^{T} \log p_{\theta}\left(x_{t} \mid \mathbf{x}_{<t}\right)=\sum_{t=1}^{T} \log \frac{\exp \left(h_{\theta}\left(\mathbf{x}_{1:t-1}\right)^{\top} e\left(x_{t}\right)\right)}{\sum_{x^{\prime}} \exp \left(h_{\theta}\left(\mathbf{x}_{1:t-1}\right)^{\top} e\left(x^{\prime}\right)\right)}
$$

The AE language modeling objective is:

$$
\max_{\theta}\ \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_{t} \log p_{\theta}\left(x_{t} \mid \hat{\mathbf{x}}\right)=\sum_{t=1}^{T} m_{t} \log \frac{\exp \left(H_{\theta}(\hat{\mathbf{x}})_{t}^{\top} e\left(x_{t}\right)\right)}{\sum_{x^{\prime}} \exp \left(H_{\theta}(\hat{\mathbf{x}})_{t}^{\top} e\left(x^{\prime}\right)\right)}
$$

where $\hat{\mathbf{x}}$ is the corrupted input, $\bar{\mathbf{x}}$ denotes the masked tokens, and $m_t=1$ indicates that $x_t$ is masked.

The drawbacks of the two approaches are:

  • the AE model assumes the masked tokens are conditionally independent of each other;

  • the AE model sees artificial [MASK] noise during training but not at test time;

  • the AR model can only condition on one-sided context.

To address these issues, the paper proposes Permutation Language Modeling: for a sentence of length $T$, consider all $T!$ factorization orders:

$$
\max_{\theta}\ \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_{T}}\left[\sum_{t=1}^{T} \log p_{\theta}\left(x_{z_{t}} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
$$

where $\mathcal{Z}_T$ denotes the set of all permutations of length $T$.
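
In practice the input sequence keeps its natural order, and a sampled order $\mathbf{z}$ is realized only through attention masks, so the positional encodings stay consistent with the original sequence. A minimal sketch (my own illustration, not the paper's code) of turning a sampled order into such a mask:

```python
# Minimal sketch (illustrative, not the official implementation): a sampled
# factorization order z is realized purely as an attention mask; the token
# sequence itself keeps its original order.
import torch

def permutation_mask(T: int) -> tuple[torch.Tensor, torch.Tensor]:
    z = torch.randperm(T)                      # sampled factorization order
    rank = torch.empty(T, dtype=torch.long)
    rank[z] = torch.arange(T)                  # rank[i] = position of token i within z
    # mask[i, j] = True  <=>  token i may attend to token j,
    # i.e. j appears no later than i in the factorization order (content stream).
    mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
    return z, mask

z, mask = permutation_mask(5)
print(z)       # e.g. tensor([3, 1, 4, 0, 2])
print(mask)    # (5, 5) boolean mask, indexed in the original token order
```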

The next step is to compute $p_{\theta}\left(x_{z_{t}} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)$; the straightforward parameterization would be:

$$
p_{\theta}\left(X_{z_{t}}=x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)=\frac{\exp \left(e(x)^{\top} h_{\theta}\left(\mathbf{x}_{\mathbf{z}_{<t}}\right)\right)}{\sum_{x^{\prime}} \exp \left(e\left(x^{\prime}\right)^{\top} h_{\theta}\left(\mathbf{x}_{\mathbf{z}_{<t}}\right)\right)}
$$

However, this parameterization is flawed: it does not depend on the target position $z_t$, so different target positions sharing the same context would receive the same distribution. The authors therefore propose the following target-position-aware form:

$$
p_{\theta}\left(X_{z_{t}}=x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)=\frac{\exp \left(e(x)^{\top} g_{\theta}\left(\mathbf{x}_{\mathbf{z}_{<t}}, z_{t}\right)\right)}{\sum_{x^{\prime}} \exp \left(e\left(x^{\prime}\right)^{\top} g_{\theta}\left(\mathbf{x}_{\mathbf{z}_{<t}}, z_{t}\right)\right)}
$$

The authors call $h_\theta$ the content representation and $g_\theta$ the query representation; they are computed with two attention streams:

$$
\begin{aligned}
g_{z_{t}}^{(m)} &\leftarrow \operatorname{Attention}\left(\mathbf{Q}=g_{z_{t}}^{(m-1)},\ \mathbf{KV}=\mathbf{h}_{\mathbf{z}_{<t}}^{(m-1)};\ \theta\right) && \text{(query stream: uses } z_t \text{ but cannot see } x_{z_t}\text{),}\\
h_{z_{t}}^{(m)} &\leftarrow \operatorname{Attention}\left(\mathbf{Q}=h_{z_{t}}^{(m-1)},\ \mathbf{KV}=\mathbf{h}_{\mathbf{z}_{\leq t}}^{(m-1)};\ \theta\right) && \text{(content stream: uses both } z_t \text{ and } x_{z_t}\text{).}
\end{aligned}
$$
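
To make the masking difference between the two streams explicit, here is a toy single-head sketch (my own simplification: no projections, no multiple heads, no relative positional encoding):

```python
# Toy sketch of two-stream attention: the query stream uses a strict "<" mask
# in z-order (it must not see x_{z_t} itself), the content stream uses "<="
# (it does see its own token).
import torch
import torch.nn.functional as F

def masked_attention(q, kv, mask):
    # q, kv: (T, d); mask: (T, T) boolean, True = may attend
    scores = (q @ kv.t()) / kv.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, -1e9)
    probs = F.softmax(scores, dim=-1)
    # the first token in z has an empty context in the query stream;
    # zero its output instead of spreading probability over forbidden slots
    probs = probs * mask.any(dim=-1, keepdim=True)
    return probs @ kv

def two_stream_layer(h, g, rank):
    # rank[i] = position of token i in the factorization order z
    content_mask = rank.unsqueeze(1) >= rank.unsqueeze(0)   # attends to h_{z_<=t}
    query_mask   = rank.unsqueeze(1) >  rank.unsqueeze(0)   # attends to h_{z_<t}
    g_new = masked_attention(g, h, query_mask)   # Q from g, KV from h
    h_new = masked_attention(h, h, content_mask) # Q and KV from h
    return h_new, g_new

T, d = 5, 8
rank = torch.randperm(T)       # ranks of a sampled factorization order
h = torch.randn(T, d)          # content stream, initialized from token embeddings
g = torch.randn(T, d)          # query stream, initialized from a shared learnable vector
h, g = two_stream_layer(h, g, rank)
```

The key point the sketch preserves is that only the masks and the queries differ; the keys and values of both streams come from the content representation $h$, and the two streams share the parameters $\theta$.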

The authors also borrow the segment-level recurrence idea from Transformer-XL and modify the computation of $h$ to additionally attend over the cached memory $\tilde{\mathbf{h}}^{(m-1)}$ from the previous segment:

$$
h_{z_{t}}^{(m)} \leftarrow \operatorname{Attention}\left(\mathbf{Q}=h_{z_{t}}^{(m-1)},\ \mathbf{KV}=\left[\tilde{\mathbf{h}}^{(m-1)},\ \mathbf{h}_{\mathbf{z}_{\leq t}}^{(m-1)}\right];\ \theta\right)
$$
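
As a rough illustration (a toy sketch, not the paper's relative-attention implementation), the recurrence simply prepends the detached hidden states of the previous segment to the current keys/values:

```python
# Toy sketch of segment-level recurrence: hidden states from the previous
# segment are cached (detached) and concatenated to the current keys/values.
import torch

def extend_kv_with_memory(mems: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    # mems: (M, d) cached states, h: (T, d) current segment
    return torch.cat([mems.detach(), h], dim=0)   # (M + T, d); queries stay (T, d)

mems = torch.randn(4, 8)
h = torch.randn(5, 8)
print(extend_kv_with_memory(mems, h).shape)       # torch.Size([9, 8])
```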

Objective function:

Note that exhaustively enumerating all permutations is clearly infeasible; in practice one factorization order is sampled per sentence, and to ease optimization only the tokens after a cutting point $c$ of that order are predicted (partial prediction). The objective is therefore defined as:

$$
\max_{\theta}\ \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_{T}}\left[\log p_{\theta}\left(\mathbf{x}_{\mathbf{z}_{>c}} \mid \mathbf{x}_{\mathbf{z}_{\leq c}}\right)\right]=\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_{T}}\left[\sum_{t=c+1}^{|\mathbf{z}|} \log p_{\theta}\left(x_{z_{t}} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
$$
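
A small sketch of this partial-prediction selection (the hyperparameter `K` controls the predicted fraction, roughly $1/K$ of the tokens; the default of 6 below is illustrative, not necessarily the paper's exact setting):

```python
# Sketch of partial prediction: only the last ~1/K tokens of the sampled
# order z become prediction targets (the query stream is only built for them).
import torch

def select_targets(z: torch.Tensor, K: int = 6) -> torch.Tensor:
    T = z.numel()
    c = T - max(1, T // K)              # cutting point c: predict z_{c+1..T}
    target_mask = torch.zeros(T, dtype=torch.bool)
    target_mask[z[c:]] = True           # target flags in the original token order
    return target_mask

z = torch.randperm(10)
print(select_targets(z))                # roughly 1/K of the positions are True
```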

Time Complexity

Since this is a pretraining method, time complexity is not considered here.

Training and Loss

Discussed above.

Code

Official implementation: https://github.com/zihangdai/xlnet
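
For quick experimentation with a pretrained checkpoint, a minimal sketch using the Hugging Face `transformers` XLNet classes (assuming the `xlnet-base-cased` checkpoint and the `perm_mask` / `target_mapping` arguments; treat this as a starting point, not a reference implementation):

```python
# Sketch only: predicting one position with a pretrained XLNet via Hugging Face
# transformers (assumed interface: XLNetLMHeadModel with perm_mask / target_mapping).
import torch
from transformers import XLNetLMHeadModel, XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-base-cased")

input_ids = tokenizer("XLNet combines autoregressive and autoencoding pretraining.",
                      return_tensors="pt").input_ids           # (1, T)
T = input_ids.shape[1]

# perm_mask[0, i, j] = 1.0 means token i may NOT attend to token j's content.
# Hiding the last token from everyone makes it the prediction target.
perm_mask = torch.zeros(1, T, T)
perm_mask[:, :, -1] = 1.0

# target_mapping selects the positions predicted by the query stream (here: the last one).
target_mapping = torch.zeros(1, 1, T)
target_mapping[0, 0, -1] = 1.0

with torch.no_grad():
    logits = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping).logits
print(logits.shape)   # (1, 1, vocab_size)
```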

Experiments and Applicable Scenarios

The experiments show that the method brings very large improvements.

Details

None for now; the details would only become clear after a reimplementation.

Brief Review

A very interesting idea; although the paper is somewhat old by now, I think it is well worth reproducing.
