Why self-attention is Natural for Sequence-to-Sequence Problems? A Perspective from Symmetries

论文地址：

简评

文章先把Self Attention的置换不变性（ $f(\mathbf P\mathbf X)=\mathbf P f(\mathbf X)$ ）推广为正交不变性（ $\mathbf P$ 从置换矩阵推广为正交矩阵），然后证明其形式为：

f(\mathbf X)=\mathbf X g\left(\mathbf X^{\top} \mathbf X\right).

接着推广到一般的Attention，即 $f(\mathbf X, \mathbf Z)$ ，作者证明其形式可以表达为：

f(\mathbf X, \mathbf Z)=X g_1\left(\mathbf X^{\top}\mathbf X, \mathbf Z^{\top}\mathbf X,\mathbf Z^{\top} \mathbf Z\right)+\mathbf Z g_2\left(\mathbf X^{\top}\mathbf X, \mathbf Z^{\top} \mathbf X, \mathbf Z^{\top} \mathbf Z\right)

那么就和Attention的形式非常类似。

Last updated 2 years ago