Transformer Architecture: A Detailed Note

The Transformer is the core architecture behind modern NLP and large language model systems. This note covers its structure, key equations, training objective, and common engineering optimizations.

1. Why Transformer

RNN and LSTM models have two inherent bottlenecks, one for modeling long-range dependencies and one for parallel training:

Long gradient propagation paths
Strong step-by-step sequential dependency in computation

The Transformer addresses both: attention lets every token interact directly with every other token in a single step, and the computation across positions parallelizes well on modern hardware.

2. Macro Architecture

Common forms:

Encoder-only (e.g., BERT) for understanding tasks
Decoder-only (e.g., GPT) for autoregressive generation
Encoder-Decoder (e.g., original Transformer, T5) for conditional generation

For modern autoregressive LLMs, the decoder-only form is the dominant choice.

flowchart LR
  A[Token IDs] --> B[Embedding + Positional Info]
  B --> C[Transformer Block x N]
  C --> D[LayerNorm]
  D --> E[Linear Head]
  E --> F[Next-token logits]

A typical block contains:

Multi-Head Self-Attention
Feed-Forward Network (FFN)
Residual connections
Normalization (LayerNorm or RMSNorm)

3. Input Representation: Embedding + Position

Token embeddings map each token id to a vector, so a sequence of $T$ tokens becomes a matrix:

$$X \in \mathbb{R}^{T \times d_{\text{model}}}$$

Self-attention treats its input as an unordered set, so without positional information the sequence order is lost. Common positional mechanisms:

Sinusoidal absolute position encoding
Learnable position embedding
RoPE (widely used in modern LLMs)

Sinusoidal form:

$$PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right), \quad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$
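The sinusoidal formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `sinusoidal_positions` and the `base` parameter are assumptions introduced here, and the sketch assumes an even `d_model` so that each sine dimension has a matching cosine dimension.

```python
import math

def sinusoidal_positions(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos dimensions pair up")
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            # Same angle feeds dimension 2i (sin) and dimension 2i + 1 (cos).
            angle = pos / (base ** (2 * i / d_model))
            row[2 * i] = math.sin(angle)
            row[2 * i + 1] = math.cos(angle)
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=4, d_model=8)
```

Row `pe[pos]` would then be added to the token embedding at position `pos`. Low dimensions oscillate quickly with position and high dimensions slowly, which gives each position a distinct pattern.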

4. Self-Attention

From the input $X$, compute the query, key, and value projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

Scaled dot-product attention:

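$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V$$

where $M$ is an additive mask, with $0$ at allowed positions and $-\infty$ (in practice, a large negative value) at blocked ones: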

Padding mask ignores padded positions
Causal mask blocks future tokens in autoregressive decoding
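A small pure-Python sketch may make the masked attention formula concrete. It assumes matrices are plain lists of rows and that $M$ is additive, as above. The helper names `softmax`, `causal_mask`, and `attention` are illustrative choices, not taken from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(T):
    """Additive mask M: 0 where key j <= query i, -inf where j > i (a future token)."""
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]

def attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V for list-of-rows matrices."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M[i][j]
                  for j, k in enumerate(K)]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# For brevity the example uses Q = K = V = X, i.e. identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(X, X, X, causal_mask(3))
```

Under the causal mask the first token can attend only to itself, so `out[0]` equals `X[0]`. The last token sees the whole sequence, so its row matches what unmasked attention would give.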

5. Multi-Head Attention

Multiple heads capture different relational subspaces:

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \quad \mathrm{MHA} = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O$$

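In practice, different heads often specialize to some degree, for example in syntactic relations, entity references, local context, or long-range semantic links, although the division is rarely clean.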
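The split, attend, concatenate pattern can be sketched as below. One assumption to flag: the per-head $Q_i, K_i, V_i$ are obtained here by slicing the columns of a single full-width projection, which is a common layout but only one way to realize the formula. All function names are illustrative, and masking is omitted to keep the sketch short.

```python
import math

def matmul(A, B):
    """Plain list-of-rows matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Unmasked scaled dot-product attention on list-of-rows matrices."""
    scale = math.sqrt(len(K[0]))
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def split_heads(X, h):
    """Slice the columns of a T x d_model matrix into h matrices of width d_model // h."""
    d_head = len(X[0]) // h
    return [[row[i * d_head:(i + 1) * d_head] for row in X] for i in range(h)]

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, h), split_heads(K, h), split_heads(V, h))]
    # Concat(head_1, ..., head_h): join the per-head outputs row by row.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    return matmul(concat, W_O)
```

Each head attends within its own slice of width `d_model // h`, so heads cannot see each other's dimensions until the final multiplication by `W_O` mixes them.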

6. Feed-Forward Network

Attention mixes information across tokens; FFN applies non-linear transformation per position:

$$\mathrm{FFN}(x) = W_2\, \sigma(W_1 x + b_1) + b_2$$

Modern variants usually replace ReLU with GELU or the gated SwiGLU form, which tends to give better quality at a comparable compute cost.
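As a rough sketch of the per-position FFN, the following applies the formula to a single token vector. It assumes `W1` is stored as `d_ff` rows of length `d_model` and `W2` as `d_model` rows of length `d_ff`. The names `ffn`, `gelu`, and `relu` are illustrative, and the gated SwiGLU variant is not shown.

```python
import math

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu(z):
    return max(0.0, z)

def ffn(x, W1, b1, W2, b2, act=gelu):
    """FFN(x) = W2 act(W1 x + b1) + b2 for one position vector x."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hv for w, hv in zip(row, hidden)) + b for row, b in zip(W2, b2)]

# Toy weights with d_model = 2 and d_ff = 4, chosen only for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
b2 = [0.5, -0.5]

sequence = [[1.0, 2.0], [0.0, -1.0]]
# The same weights are applied to every position independently.
outputs = [ffn(x, W1, b1, W2, b2, act=relu) for x in sequence]
```

Because no information crosses positions inside `ffn`, the list comprehension over `sequence` is the whole story. Token mixing happens only in the attention sublayer.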

7. Residual and Normalization

Pre-Norm block form:

$$h' = h + \mathrm{MHA}(\mathrm{Norm}(h))$$

$$h'' = h' + \mathrm{FFN}(\mathrm{Norm}(h'))$$

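Pre-Norm is typically more stable than Post-Norm in very deep stacks: the residual path carries $h$ through unchanged, so gradients reach early layers without passing through a normalization at every block.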
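The two Pre-Norm equations can be sketched with RMSNorm standing in for `Norm`. This is an illustration under assumptions: `mha` and `ffn` are passed in as callables (hypothetical stand-ins for the real sublayers), `mha` takes the whole sequence, and `ffn` takes one position at a time.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(h, mha, ffn, gain_attn, gain_ffn):
    """h' = h + MHA(Norm(h)); h'' = h' + FFN(Norm(h')), over a list of position vectors."""
    normed = [rms_norm(x, gain_attn) for x in h]
    # Residual add: the un-normalized h is carried forward unchanged.
    h1 = [[a + b for a, b in zip(x, dx)] for x, dx in zip(h, mha(normed))]
    normed1 = [rms_norm(x, gain_ffn) for x in h1]
    h2 = [[a + b for a, b in zip(x, ffn(nx))] for x, nx in zip(h1, normed1)]
    return h2
```

If both sublayers output zeros, the block returns its input exactly. That is the identity path that makes deep Pre-Norm stacks easier to optimize.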

8. Training and Inference

8.1 Autoregressive Objective

For decoder-only LMs:

$$L = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})$$
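To make the objective concrete, the sketch below computes the summed negative log-likelihood from raw logits. It assumes `logits[t]` already holds the model's vocabulary scores for position $t$, conditioned on the preceding tokens. The function names are illustrative.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw scores, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def autoregressive_nll(logits, targets):
    """Sum over t of -log p(x_t | x_<t).

    logits[t]  : scores over the vocabulary for position t
    targets[t] : id of the token x_t that actually occurred
    """
    return -sum(log_softmax(row)[tok] for row, tok in zip(logits, targets))

# A model that is uniform over a 4-token vocabulary pays log(4) per token.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = autoregressive_nll(uniform_logits, targets=[2, 0, 3])
```

In training code this sum is usually averaged over tokens, which changes the scale but not the minimizer.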

8.2 KV Cache in Decoding

At generation time, each step appends one token. Caching the K/V of earlier tokens means only the new token's projections are computed per step, instead of recomputing all previous states:

Reduces repeated compute
Improves long-context throughput
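A minimal single-head sketch of the idea follows. It assumes `q`, `k`, and `v` are the already-projected vectors for the newest token. The `KVCache` class and `attend` helper are hypothetical names for illustration, not an API from any framework.

```python
import math

def attend(q, keys, values):
    """One query vector attending over a list of key and value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Append-only store of per-token keys and values for one attention head."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the new token's k and v are added; earlier entries are reused as-is.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy projected vectors for three decoding steps.
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
ks = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
```

No causal mask is needed during cached decoding, since the cache only ever contains the current and earlier tokens. The price is memory that grows linearly with the generated length.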

9. Common Engineering Upgrades

Position handling: RoPE / ALiBi
Normalization: LayerNorm -> RMSNorm
FFN activation: ReLU/GELU -> SwiGLU
Attention kernel: FlashAttention
KV efficiency: MQA / GQA
Stability: warmup, clipping, weight decay
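For the KV efficiency item, a little arithmetic shows why sharing K/V heads matters. The configuration below is hypothetical and chosen only to give round numbers. The function names are illustrative, and the size formula counts one sequence with K and V stored at every layer.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Hypothetical model: 32 layers, 32 query heads of width 128,
# a 4096-token context, and 16-bit cached values.
cfg = dict(n_layers=32, d_head=128, seq_len=4096)
mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one K/V head
```

With these numbers the cache is 2 GiB for standard multi-head attention, 512 MiB for GQA with 8 K/V heads, and 64 MiB for MQA. The size scales linearly with the number of K/V heads, while the number of query heads stays the same.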
