Transformer-Evolution-Paper

CtrlK

Query-Key Normalization for Transformers

论文地址：

https://aclanthology.org/2020.findings-emnlp.379/

整体思路以及计算方式

Attention中计算Softmax之前需要先除以 $\sqrt d$ ，其原因是为了缩小极值的影响（避免出现One-hot情形），这篇文章是对这点改进：

\begin{aligned} \hat{\mathbf q}_i& =\frac{\mathbf q_i}{\|\mathbf q_i \|}\\ \hat{\mathbf k}_i& =\frac{\mathbf k_i}{\|\mathbf k_i \|} \end{aligned}

最后的计算方式为：

\mathrm{Softmax}(g\times (\hat{\mathbf q}_i^{\top} \hat{\mathbf k}_j))

其中 $g$ 为可学习的参数，初始化为：

g_0=\log_2(L^2-L)

其中 $L$ 和序列长度有关（序列长度的97.5分位数）， $g_0$ 的含义为Attention Matrix独立元素的信息熵。

时间复杂度

不变。

训练以及loss

不变。

代码

https://github.com/CyndxAI/QKNorm

实验以及适用场景

适用于所有场景，在NMT中性能有提升。

细节

暂无。

简评

很合理的一个思路，可以减少Attention的工程部分，Swin-V2中也使用了这个思路。

PreviousTransformers without Tears: Improving the Normalization of Self-Attention NextUnderstanding the difficulty of training transformers

Last updated 2 years ago