Berkeley | Blockwise Parallel Transformer for Long Context Large Models

Large Models · Published 3 years ago (2023) · BAAI Community (智源社区)

Blockwise Parallel Transformer for Long Context Large Models

Hao Liu, Pieter Abbeel
[UC Berkeley]

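Key points:

Motivation: Self-attention and the large feedforward network make Transformers memory-hungry. The work aims to cut that memory demand so that models can handle long sequences and tasks with long-range dependencies.
Method: The authors propose the Blockwise Parallel Transformer (BPT), which computes self-attention block by block and fuses the feedforward network into that blockwise computation to minimize memory cost.
Advantages: BPT can train on sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate its effectiveness in reducing memory requirements and improving performance.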

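In short, the paper proposes the Blockwise Parallel Transformer (BPT), which uses blockwise computation of self-attention fused with the feedforward network to lower memory requirements and handle long sequences and long-range dependencies.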
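The blockwise idea can be illustrated with a small sketch. The code below is not the authors' implementation. It is a minimal pure-Python illustration that assumes a single attention head, no causal mask, and no residual connections or layer normalization. The names `blockwise_attention_ffn`, `full_attention`, `ffn`, and `dot` are hypothetical helpers introduced here. For each block of queries, it streams over key/value blocks while keeping a running softmax (running maximum, running normalizer, running weighted sum), then applies the feedforward network to that query block before moving on, so only one block of scores exists at a time.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ffn(x, W1, W2):
    """Two-layer feedforward network with ReLU for one token vector."""
    hidden = [max(0.0, dot(row, x)) for row in W1]
    return [dot(row, hidden) for row in W2]

def full_attention(Q, K, V):
    """Reference softmax attention that scores each query against all keys at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def blockwise_attention_ffn(Q, K, V, W1, W2, block_size):
    """Attention followed by FFN, computed one query block at a time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    outputs = []
    for q_start in range(0, len(Q), block_size):
        q_block = Q[q_start:q_start + block_size]
        # Running softmax statistics, one entry per query in the block.
        run_max = [-math.inf] * len(q_block)
        run_sum = [0.0] * len(q_block)
        run_acc = [[0.0] * d_v for _ in q_block]
        for k_start in range(0, len(K), block_size):
            k_block = K[k_start:k_start + block_size]
            v_block = V[k_start:k_start + block_size]
            for i, q in enumerate(q_block):
                scores = [dot(q, k) * scale for k in k_block]
                new_max = max(run_max[i], max(scores))
                # Rescale earlier partial results to the new maximum.
                rescale = math.exp(run_max[i] - new_max)
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[i] = run_sum[i] * rescale + sum(weights)
                run_acc[i] = [
                    acc * rescale + sum(w * v[j] for w, v in zip(weights, v_block))
                    for j, acc in enumerate(run_acc[i])
                ]
                run_max[i] = new_max
        # Fused step: apply the FFN to this query block before the next one.
        for i in range(len(q_block)):
            attended = [acc / run_sum[i] for acc in run_acc[i]]
            outputs.append(ffn(attended, W1, W2))
    return outputs
```

Under these assumptions the blockwise result should match `ffn` applied to the output of `full_attention` up to floating-point error, for any `block_size`. The difference is only in how much intermediate data is alive at once.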

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
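To give a rough sense of why blockwise processing saves memory, the sketch below counts the elements in the two largest intermediates of one layer: the attention score matrix (for one head) and the feedforward hidden activations. It is a back-of-the-envelope illustration, not a reproduction of the paper's measurements. The function name `peak_intermediate_elements` and the example sizes are assumptions made here, and the count ignores outputs, running softmax statistics, and anything stored for the backward pass.

```python
def peak_intermediate_elements(seq_len, d_ff, block_size=None):
    """Count elements held at once in attention scores and FFN hidden activations.

    block_size=None models processing the whole sequence in one piece.
    """
    rows = seq_len if block_size is None else min(block_size, seq_len)
    return {
        # Query rows times key columns scored at the same time (one head).
        "attention_scores": rows * rows,
        # Tokens processed at once times the feedforward hidden width.
        "ffn_hidden": rows * d_ff,
    }

# Illustrative sizes only; these are not taken from the paper.
full = peak_intermediate_elements(seq_len=8192, d_ff=16384)
blocked = peak_intermediate_elements(seq_len=8192, d_ff=16384, block_size=512)
print(full)     # {'attention_scores': 67108864, 'ffn_hidden': 134217728}
print(blocked)  # {'attention_scores': 262144, 'ffn_hidden': 8388608}
```

In this simplified count, both intermediates depend on the block size rather than on the full sequence length, which is the property the abstract attributes to blockwise computation with feedforward fusion.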

https://arxiv.org/abs/2305.19370

