Transformer Block Breakdown

Basic Structure

[Figure: overall structure of a transformer block]

Basic parameters

  • n_layers or L: total number of transformer blocks

  • d_model or H: number of units in each bottleneck layer, which is also the number of units of each Q/K/V input

  • n_heads or A: number of attention heads in each transformer block

  • n_ctx or seq_len: input sequence length

Derived parameters

  • d_head: dimension of each attention head, d_head = d_model / n_heads

  • d_ff: number of units in the intermediate layer of the feed-forward sub-module, d_ff = 4 × d_model (see the sketch below)
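
A minimal sketch in Python of how the derived parameters follow from the basic ones (values are GPT-3's published hyperparameters; variable names mirror the notation above):

    # Basic parameters (GPT-3)
    n_layers = 96     # total number of transformer blocks
    d_model  = 12288  # units in each bottleneck layer
    n_heads  = 96     # attention heads per block
    n_ctx    = 2048   # input sequence length

    # Derived parameters
    d_head = d_model // n_heads  # dimension of each attention head
    d_ff   = 4 * d_model         # feed-forward intermediate size

    print(d_head, d_ff)  # 128 49152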

A detailed diagram of where each parameter sits in the transformer block is shown below:

[Figure: transformer block annotated with n_layers, d_model, n_heads, d_head, and d_ff]

Zooming in on the Feed Forward sub-module:

[Figure: feed-forward sub-module, expanding d_model to d_ff and projecting back to d_model]
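
To make the structure concrete, here is a minimal sketch of one transformer block in PyTorch (an illustrative assumption, not the exact code from the references; a pre-LN layout is used):

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """Multi-head self-attention + feed-forward, each wrapped in a
        residual connection with layer normalization (pre-LN layout)."""
        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            assert d_model % n_heads == 0   # d_head = d_model / n_heads
            d_ff = 4 * d_model              # feed-forward intermediate size
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(       # d_model -> d_ff -> d_model
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h)  # Q, K, V all take the same d_model-dim input
            x = x + attn_out                  # residual connection
            x = x + self.ffn(self.ln2(x))     # residual connection
            return x

    # Stack n_layers blocks, here at BERT_Base scale: 12 blocks, d_model=768, 12 heads
    blocks = nn.Sequential(*[TransformerBlock(768, 12) for _ in range(12)])
    x = torch.randn(2, 128, 768)              # (batch, seq_len, d_model)
    print(blocks(x).shape)                    # torch.Size([2, 128, 768])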

Basic parameters of typical models

Application   Model        n_layers   d_model     n_heads   seq_len
NLP           GPT-3        96         12288       96        2048
NLP           BERT_Base    12         768         12        128/512
NLP           BERT_Large   24         1024        16        128/512
RecSys        BST          1          128 (max)   8         20
  • BST: Behavior Sequence Transformer
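
As a quick cross-check of the table (a sketch; the config dictionary below is an assumption for illustration), the derived per-head dimension for each model:

    # Derived d_head for each model in the table above
    configs = {
        "GPT-3":      dict(n_layers=96, d_model=12288, n_heads=96, seq_len=2048),
        "BERT_Base":  dict(n_layers=12, d_model=768,   n_heads=12, seq_len=512),
        "BERT_Large": dict(n_layers=24, d_model=1024,  n_heads=16, seq_len=512),
        "BST":        dict(n_layers=1,  d_model=128,   n_heads=8,  seq_len=20),
    }
    for name, c in configs.items():
        print(name, "d_head =", c["d_model"] // c["n_heads"])
    # GPT-3 d_head = 128, BERT_Base = 64, BERT_Large = 64, BST = 16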

References

  1. The GPT-3 Architecture, on a Napkin

  2. GPT-3 An Overview

  3. Language Models are Few-Shot Learners

  4. Improving Language Understanding by Generative Pre-Training

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Attention Is All You Need

  7. BERT transformer block code

  8. Deep Learning Recommendation Model for Personalization and Recommendation Systems

  9. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
