LaMemo Language Modeling with Look-Ahead Memory

论文地址：

整体思路以及计算方式

之前Transformer中使用memory的方式都是当前token和memory中token交互，但是memory中token无法和当前token交互，本文就是对这点进行改进。

符号：

当前token： $\mathbf {X}_{\tau}=\left[\mathbf {x}_{\tau+1}, \cdots, \mathbf {x}_{\tau+N}\right] \in \mathbb{R}^{N \times d}$
memory： $\mathbf {X}_{\tau-1}=\left[\mathbf {x}_{\tau-M+1}, \cdots, \mathbf {x}_{\tau}\right] \in \mathbb{R}^{M \times d}$
$\tilde{\mathbf {X}}_{\tau-1}=\left[\mathbf {x}_{\tau-N+2}, \cdots, \mathbf {x}_{\tau+1}\right] \in \mathbb{R}^{N \times d}$

计算：

$\mathbf{Q}_{\tau}=\mathbf{X}_{\tau} \mathbf{W}_{q}, \mathbf{K}_{\tau}=\mathbf{X}_{\tau} \mathbf{W}_{k}, \mathbf{V}_{\tau}=\mathbf{X}_{\tau} \mathbf{W}_{v}$
$\tilde{\mathbf{K}}_{\tau-1}=\tilde{\mathbf{X}}_{\tau-1} \mathbf{W}_{k}, \tilde{\mathbf{V}}_{\tau-1}=\tilde{\mathbf{X}}_{\tau-1} \mathbf{W}_{v}$
$\mathbf{C}_{\tau}^{\leftarrow}=\operatorname{Softmax}_{\text{lower-triangle}} \left(\frac{\mathbf{Q}_{\tau} \tilde{\mathbf{K}}_{\tau}^{\top}}{\sqrt{d}}\right) \tilde{\mathbf{V}}_{\tau}$
$\mathbf{C}_{\tau}^{\rightarrow}=\operatorname{Softmax}_{\text{upper-triangle}} \left(\frac{\mathbf{Q}_{\tau} \tilde{\mathbf{K}}_{\tau}^{\top}}{\sqrt{d}}\right) \tilde{\mathbf{V}}_{\tau}$
$\mathbf{C}_{\tau-1}^{\leftrightarrow}=\mathbf{\alpha}_{\tau} \operatorname{sg}\left(\mathbf{C}_{\tau}^{\rightarrow}\right)+\left(1-\mathbf{\alpha}_{\tau}\right) \mathbf{C}_{\tau}^{\leftarrow}$
- $\mathrm{sg}$ 表示不计算梯度；
- $\mathbf{\alpha}_{\tau}=\frac{\operatorname{sg}\left(\mathbf{s}_{\tau}^{\rightarrow}\right)}{\operatorname{sg}\left(\mathbf{s}_{\tau}^{\rightarrow}\right)+\mathbf{s}_{\tau}^{{\leftarrow}}+\varepsilon}$ ；
- 其中 $\mathbf{s}_{\tau}^{\rightarrow}$ 表示 $\mathbf{C}_{\tau}^{\rightarrow}$ 中Softmax矩阵归一化之前的元素和；

时间复杂度为 $O(N(N+M)d)$ 。

不变。

适用于encoder和decoder；论文只测试了lm(decoder)场景，获得了一定的提升。

暂无。

提供了一种新的memory交互方式。

Last updated 3 years ago