Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel