Why self-attention is Natural for Sequence-to-Sequence Problems? A Perspective from Symmetries

Paper link:

Brief review

The paper first generalizes the permutation equivariance of self-attention, $f(\mathbf P\mathbf X)=\mathbf P f(\mathbf X)$, to orthogonal equivariance ($\mathbf P$ generalized from a permutation matrix to an orthogonal matrix), and proves that any map with this symmetry takes the form:

$$f(\mathbf X)=\mathbf X\, g\left(\mathbf X^{\top} \mathbf X\right).$$
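One direction of this statement is easy to verify numerically: any map of the form $\mathbf X\, g(\mathbf X^{\top}\mathbf X)$ is automatically orthogonally equivariant, since $(\mathbf P\mathbf X)^{\top}(\mathbf P\mathbf X)=\mathbf X^{\top}\mathbf X$. The minimal NumPy sketch below (not from the paper) checks this, assuming tokens are stored as columns of $\mathbf X$ and choosing $g$ to be a column-wise softmax of the Gram matrix; the paper's actual theorem is the converse, that every orthogonally equivariant map must have this form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5  # embedding dimension, sequence length (arbitrary toy sizes)

def softmax(M, axis=0):
    """Numerically stable softmax along the given axis."""
    M = M - M.max(axis=axis, keepdims=True)
    E = np.exp(M)
    return E / E.sum(axis=axis, keepdims=True)

def f(X):
    # A map of the stated form f(X) = X g(X^T X); here g is a
    # column-wise softmax of the Gram matrix, but any g would do.
    return X @ softmax(X.T @ X, axis=0)

X = rng.standard_normal((d, n))                    # tokens as columns
P, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix

# Orthogonal equivariance: (PX)^T (PX) = X^T X, hence f(PX) = P f(X).
print("f(PX) == P f(X):", np.allclose(f(P @ X), P @ f(X)))
```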

The analysis is then extended to general attention, i.e. a two-argument map $f(\mathbf X, \mathbf Z)$, whose form the authors prove can be expressed as:

$$f(\mathbf X, \mathbf Z)=\mathbf X\, g_1\left(\mathbf X^{\top}\mathbf X, \mathbf Z^{\top}\mathbf X, \mathbf Z^{\top}\mathbf Z\right)+\mathbf Z\, g_2\left(\mathbf X^{\top}\mathbf X, \mathbf Z^{\top}\mathbf X, \mathbf Z^{\top}\mathbf Z\right).$$

This is already very close to the form of the attention mechanism.
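As an illustration only (again, not the paper's construction), the sketch below instantiates $g_1$ and $g_2$ as column-wise softmaxes of the invariants; the $\mathbf Z\, g_2$ term then reads like cross-attention without learned projections, with the columns of $\mathbf X$ acting as queries and those of $\mathbf Z$ as keys and values, and the joint equivariance $f(\mathbf P\mathbf X, \mathbf P\mathbf Z)=\mathbf P f(\mathbf X, \mathbf Z)$ is checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 8, 5, 7  # embedding dim, length of X, length of Z (arbitrary toy sizes)

def softmax(M, axis=0):
    """Numerically stable softmax along the given axis."""
    M = M - M.max(axis=axis, keepdims=True)
    E = np.exp(M)
    return E / E.sum(axis=axis, keepdims=True)

def f(X, Z):
    # f(X, Z) = X g1(X^T X, Z^T X, Z^T Z) + Z g2(X^T X, Z^T X, Z^T Z).
    # g1: a self-attention-like mixing of the columns of X;
    # g2: a cross-attention-like mixing from the columns of X to those of Z.
    g1 = softmax(X.T @ X, axis=0)   # (n, n), depends only on X^T X
    g2 = softmax(Z.T @ X, axis=0)   # (m, n), depends only on Z^T X
    return X @ g1 + Z @ g2          # (d, n)

X = rng.standard_normal((d, n))
Z = rng.standard_normal((d, m))
P, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix

# All three invariants are unchanged when X and Z are rotated by the same P,
# so f(PX, PZ) = P f(X, Z).
print("f(PX, PZ) == P f(X, Z):", np.allclose(f(P @ X, P @ Z), P @ f(X, Z)))
```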
