# CoLT5: Faster Long-Range Transformers with Conditional Computation

Paper:

* <https://arxiv.org/abs/2303.09752>

## Overall idea and computation

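The method has two parts:

* The attention part uses sparse attention, similar to window attention plus a small number of global patterns; it is denoted $$\mathrm{SMHA}$$ below;
* The attention and FFN parts each use a Heavy and a Light module; the former has more parameters and the latter fewer.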
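To make the "window attention plus a few global patterns" idea concrete, here is a minimal sketch of such a sparsity mask in plain Python. It is only an illustration: the function name `sparse_attention_mask`, the symmetric window of radius `window`, and the choice of the first `num_global` positions as global tokens are all assumptions, not the exact pattern used in the paper.

```python
def sparse_attention_mask(n, window, num_global):
    """Return an n x n boolean mask; mask[i][j] is True if query i may attend to key j.

    Assumptions (hypothetical, for illustration only): a symmetric local
    window of radius `window`, and the first `num_global` positions acting
    as global tokens that attend everywhere and are attended by everyone.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= window
            is_global = i < num_global or j < num_global
            mask[i][j] = local or is_global
    return mask


mask = sparse_attention_mask(n=6, window=1, num_global=1)
for row in mask:
    # "x" marks an allowed query/key pair, "." a masked one
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed `window` and a small `num_global`, the number of allowed pairs grows linearly in `n` rather than quadratically, which is the point of using a sparse pattern.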

The computation proceeds as follows:

* Input: $$\mathbf X\in \mathbb R^{n\times d}$$;
* Routing function: $$s_{\mathbf u}(\mathbf X) = \mathrm{Softmax}(\mathrm{Topk}(\mathbf X \mathbf u^{\top})), \mathbf u \in \mathbb R^d$$;
  * Top-k function: $$\mathrm{Topk}(\mathbf s)\in \mathbb R^n$$ keeps the $$k$$ largest entries of $$\mathbf s$$ and sets the rest to $$-\infty$$, so after the softmax only those $$k$$ tokens receive non-zero weight;
* Attention part:
  * $$\mathbf X= \mathrm{SMHA}_{\mathrm{light}}(\mathbf X, \mathbf X) + s_{\mathbf u_1}(\mathbf X)\mathrm{SMHA}_{\mathrm{heavy}}(\mathbf X, s_{\mathbf u_2}(\mathbf X))$$;
* FFN part:
  * $$\mathbf X= \mathrm{FFN}_{\mathrm{light}}(\mathbf X) + s_{\mathbf u_3}(\mathbf X)\mathrm{FFN}_{\mathrm{heavy}}(\mathbf X)$$.
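The routing function and the FFN update above can be sketched in plain Python as follows. This is a minimal illustration of the formulas as written in this note, not the paper's or the linked repository's implementation. The helper names (`topk_mask`, `softmax`, `route`, `conditional_ffn`) and the toy `ffn_light` / `ffn_heavy` functions are hypothetical stand-ins, and the sketch assumes $$k \ge 1$$.

```python
import math


def topk_mask(scores, k):
    """Keep the k largest scores and set the rest to -inf."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else -math.inf for i, s in enumerate(scores)]


def softmax(scores):
    """Softmax that tolerates -inf entries (they get weight 0). Assumes one finite entry."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def route(X, u, k):
    """s_u(X) = Softmax(Topk(X u^T)): one weight per token, zero outside the top k."""
    scores = [sum(x_j * u_j for x_j, u_j in zip(x, u)) for x in X]
    return softmax(topk_mask(scores, k))


def conditional_ffn(X, u, k, ffn_light, ffn_heavy):
    """X <- FFN_light(X) + s_u(X) * FFN_heavy(X), running the heavy branch only on routed tokens."""
    weights = route(X, u, k)
    out = []
    for x, w in zip(X, weights):
        y = ffn_light(x)
        if w > 0.0:  # unrouted tokens have weight 0, so the heavy branch is skipped
            y = [a + w * b for a, b in zip(y, ffn_heavy(x))]
        out.append(y)
    return out


# Toy example: 4 tokens of dimension 2, route the top 2 tokens to the heavy branch.
X = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [0.5, 0.0]]
u = [1.0, 1.0]                                # hypothetical router vector
ffn_light = lambda x: [0.5 * v for v in x]    # hypothetical cheap branch
ffn_heavy = lambda x: [2.0 * v for v in x]    # hypothetical expensive branch

weights = route(X, u, k=2)
out = conditional_ffn(X, u, 2, ffn_light, ffn_heavy)
print(weights)
print(out)
```

The saving comes from the `if w > 0.0` branch: the light module runs on all $$n$$ tokens, while the heavy module runs on only $$k$$ of them. The attention update has the same shape, with one router selecting the queries and another selecting the keys and values passed to the heavy attention.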

## Time complexity

See the paper.

## Code

* <https://github.com/lucidrains/CoLT5-attention>

## Brief comments

A very engineering-driven idea; it feels fairly unremarkable.

