Paper:
https://arxiv.org/abs/1805.00631
References:
https://blog.csdn.net/wwx123521/article/details/83238989
http://www.xuwei.io/2019/08/09/accelerating-neural-transformer-via-an-average-attention-network/
https://zhuanlan.zhihu.com/p/77434191
Replace the self-attention in the decoder with AAN (Average Attention Network), computed as follows:
$$\mathbf{y}_j \in \mathbb{R}^{d_1}$$

$$\mathbf{g}_j = \operatorname{FFN}\left(\frac{1}{j} \sum_{k=1}^{j} \mathbf{y}_k\right) \in \mathbb{R}^{d_2}$$

$${i}_j, {f}_j = \sigma\left(W\left[\mathbf{y}_j ; \mathbf{g}_j\right]\right) \in \mathbb{R}^{d_1}$$

$$\tilde{\mathbf{h}}_j = {i}_j \odot \mathbf{y}_j + {f}_j \odot \mathbf{g}_j \in \mathbb{R}^{d_1}$$

$$\mathbf{h}_j = \operatorname{LayerNorm}\left(\mathbf{y}_j + \tilde{\mathbf{h}}_j\right) \in \mathbb{R}^{d_1}$$
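A minimal PyTorch sketch of this sub-layer (not the linked THUMT code; the names `AverageAttention`, `d_model`, `d_ff` are my own assumptions, and the FFN is taken to map back to the model dimension as in the paper):

```python
import torch
import torch.nn as nn

class AverageAttention(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # FFN applied to the cumulative average (position-wise feed-forward)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        # Single projection producing both gates i_j and f_j from [y_j; g_j]
        self.gate = nn.Linear(2 * d_model, 2 * d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, n, d_model)
        n = y.size(1)
        # Cumulative average: (1/j) * sum_{k<=j} y_k, computed in parallel via cumsum
        positions = torch.arange(1, n + 1, device=y.device, dtype=y.dtype).view(1, n, 1)
        g = self.ffn(torch.cumsum(y, dim=1) / positions)
        # Gating: i_j, f_j = sigmoid(W [y_j; g_j])
        i_gate, f_gate = torch.sigmoid(self.gate(torch.cat([y, g], dim=-1))).chunk(2, dim=-1)
        h_tilde = i_gate * y + f_gate * g
        # Residual connection + layer normalization
        return self.norm(y + h_tilde)
```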
The recurrent implementation has time complexity $O(n d_1 d_2)$, while the parallel implementation has time complexity $O(n^2 d_1 + n d_1 d_2)$.
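The recurrent cost comes from updating the average incrementally at decoding time, so each step only pays for the FFN and gate projections. A hedged sketch of one decode step, reusing the hypothetical `AverageAttention` module above:

```python
import torch

@torch.no_grad()
def aan_decode_step(layer: AverageAttention, y_j: torch.Tensor,
                    avg_prev: torch.Tensor, j: int):
    # y_j, avg_prev: (batch, d_model); j is the 1-based target position
    # Running average: no attention over the full history is needed
    avg_j = (avg_prev * (j - 1) + y_j) / j
    g_j = layer.ffn(avg_j)
    i_gate, f_gate = torch.sigmoid(layer.gate(torch.cat([y_j, g_j], dim=-1))).chunk(2, dim=-1)
    h_j = layer.norm(y_j + i_gate * y_j + f_gate * g_j)
    return h_j, avg_j
```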
No change.
Reference implementation:
https://github.com/bzhangGo/transformer-aan/blob/master/code/thumt/models/transformer.py
Applies to causal attention, so it can replace the attention in a language model; the paper evaluates it on NMT and obtains comparable quality, but with no speed improvement.
None for now.
Essentially similar to attention, except that the weights are assumed to be equal; it does not speed up training, but it does speed up decoding;
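As a quick illustration of the "equal weights" view (a sketch, not from the paper's code): the cumulative average is exactly causal attention with a fixed uniform weight $1/j$ over the prefix, which is also the $O(n^2 d_1)$ parallel form mentioned above.

```python
import torch

n, d = 5, 8
y = torch.randn(1, n, d)
# Fixed lower-triangular "attention" matrix: row j (0-based) has weight 1/(j+1) on positions k <= j
mask = torch.tril(torch.ones(n, n)) / torch.arange(1, n + 1).view(n, 1)
avg_attention = mask @ y                                            # attention with equal weights
avg_cumsum = torch.cumsum(y, dim=1) / torch.arange(1, n + 1).view(1, n, 1)
assert torch.allclose(avg_attention, avg_cumsum, atol=1e-6)
```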
Could be tested on an LM;