Hierarchical Transformers Are More Efficient Language Models