Transformer-Evolution-Paper

CtrlK

DeBERTa Decoding-enhanced BERT with Disentangled Attention

论文地址：

https://arxiv.org/abs/2006.03654

整体思路以及计算方式

传统的Attention计算， $\mathbf Q,\mathbf K$ 可以拆成context和pos部分：

\begin{aligned} \mathbf Q_{c}&=\mathbf H \mathbf W_{q, c}\\ \mathbf K_{c}&=\mathbf H\mathbf W_{k, c}\\ \mathbf Q_{r}&=\mathbf P\mathbf W_{q, r}\\ \mathbf K_{r}&=\mathbf P\mathbf W_{k, r} \end{aligned}

所以Attention Score的计算可以拆成4项：

\begin{aligned} \tilde{\mathbf A}_{i, j}=\mathbf Q_{i}^{c}\mathbf K_{j}^{c \top}+\mathbf Q_{i}^{c} \mathbf K_{j}^{r\top}+\mathbf K_{j}^{c}\mathbf Q_{j}^{r \top} +\mathbf K_{i}^{r}\mathbf Q_{i}^{r \top} \end{aligned}

DeBERTa的计算方式是将上式修改为：

\tilde{\mathbf A}_{i, j}=\underbrace{\mathbf Q_{i}^{c} \mathbf K_{j}^{c \top}}_{\text {(a) content-to-content }}+\underbrace{\mathbf Q_{i}^{c}\mathbf K_{\delta(i, j)}^{r{\top}}}_{\text {(b) content-to-position }}+\underbrace{\mathbf K_{j}^{c}\mathbf Q_{\delta(j, i)}^{r{\top}}}_{\text {(c) position-to-content }}

其中：

\delta(i, j)=\left\{\begin{array}{rcl} 0 & \text { for } & i-j \le-k \\ 2 k-1 & \text { for } & i-j \ge k \\ i-j+k & \text { others. } & \end{array}\right.

即在一定范围内由相对位置确定，该范围外为固定值。

时间复杂度

Attention Matrix的时间复杂度由 $n^2d$ 增加为 $3n^2d$ ，其余部分不变。

训练以及loss

不变。

代码

https://github.com/microsoft/DeBERTa

实验以及适用场景

适用于所有场景，论文主要测试了在BERT中的效果。

细节

暂无。

简评

性能很好，但是无法适用于Linear Attention，所以暂时不考虑复现。

PreviousA Simple and Effective Positional Encoding for Transformers NextDecBERT Enhancing the Language Understanding of BERT with Causal Attention Masks

Last updated 3 years ago