Introduction
Large Language Models (LLMs) are built on the Transformer architecture introduced in the Attention Is All You Need (AIAYN) paper in 2017.
Transformers are based solely on attention mechanisms and dispense with recurrence (as used in Recurrent Neural Networks, or RNNs) and convolutions (as used in Convolutional Neural Networks, or CNNs). At the time, RNNs were the typical choice for Natural Language Processing (NLP), while CNNs were used for Computer Vision.
Before the Transformer architecture, the dominant machine learning (ML) approach to sequence-to-sequence modeling (also referred to as sequence transduction) was built on recurrent encoder-decoder networks. Sequence-to-sequence modeling transforms an input sequence into an output sequence. Examples include:
Text Translation - Converting text between languages
Text Summarisation - Condensing long text into shorter versions
Conversation - Generating responses to questions
The Transformer described in the AIAYN paper is based on an encoder-decoder architecture, as shown in the architecture diagram in the paper.
Transformers operate on tokens, which are units of data like words, sub-words, or characters. For example:
The string "tokenisation" is decomposed as "token" and "isation."
A short and common word like "the" is represented as a single token.
As a rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text.
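To make the rule of thumb concrete, here is a minimal Python sketch that estimates a token count from character length. The estimate_tokens helper is purely illustrative (my own assumption, not a real tokeniser); actual tokenisers split text into learned sub-word units and will give different counts.

```python
# Rough sketch of the "1 token ≈ 4 characters ≈ 0.75 words" rule of thumb for English.
# Real tokenisers use learned sub-word vocabularies, so treat this only as an estimate.

def estimate_tokens(text: str) -> int:
    """Estimate the number of tokens in an English string from its character length."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("the"))           # ~1 token (short, common word)
print(estimate_tokens("tokenisation"))  # ~3 by this estimate; a real tokeniser might
                                        # split it into "token" + "isation"
```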
The three main variants of Transformers are:
BART - Bidirectional and Auto-Regressive Transformer
Uses both encoder and decoder components
Handles both understanding and generation tasks
BERT - Bidirectional Encoder Representations from Transformers
Uses encoder-only architecture
Specializes in understanding input text
GPT - Generative Pre-Trained Transformers
Uses decoder-only architecture
Excels at text generation tasks
Key Architectural Concepts
Bidirectional means that the transformer attends to a single token by looking at tokens to the left (before) and to the right (after) to fully understand the sequence. Bidirectionality corresponds to the encoder stack and the multi-head attention layer in the Transformer architecture.
Encoders that look bidirectionally are good at understanding input.
Auto-Regressive means that the value at a particular time (or position in a sequence) depends on its own previous values. Auto-Regressive predictions correspond to the decoder stack (and the Masked Multi-Head Attention layer).
Decoders that mask all tokens after the current position are good at generation, producing one word (or, more accurately, one token) at a time.
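To make the distinction concrete, below is a minimal NumPy sketch (my own illustration, not code from the paper) contrasting bidirectional attention weights with causally masked ones over a toy sequence; the scores are random placeholders rather than a real attention computation.

```python
import numpy as np

# Toy 4-token sequence; scores[i, j] is how strongly token i attends to token j.
seq_len = 4
rng = np.random.default_rng(0)
scores = rng.random((seq_len, seq_len))

# Encoder-style (bidirectional): every token may attend to every position,
# both before and after it.
bidirectional_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Decoder-style (auto-regressive): mask out every position after the current one,
# so token i can only attend to tokens 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked_scores = np.where(causal_mask, scores, -np.inf)
causal_weights = np.exp(masked_scores) / np.exp(masked_scores).sum(axis=-1, keepdims=True)

print(np.round(causal_weights, 2))  # upper triangle is 0: no attention to future tokens
```

In the masked version each row still sums to 1, but all the probability mass sits on the current and earlier tokens, which is exactly what lets a decoder generate text one token at a time.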
Comparison of Transformer Variants:

| Variant | Architecture | Strength |
| --- | --- | --- |
| BART | Encoder + decoder | Understanding and generation |
| BERT | Encoder-only | Understanding input text |
| GPT | Decoder-only | Text generation |
Modern LLMs are based on the Decoder-only GPT architecture. Examples include:
OpenAI's GPT-series models
Anthropic's Claude-series models
Meta's Llama-series models
These GPT-based LLMs power many current Generative AI (GenAI) applications.
Original Transformer Processing Pipeline
What happens when you put a sequence of words into a transformer, as described in the AIAYN paper:
Tokenization: Splits input text into token units.
Embedding: Transforms tokens into a vector (or list) of numbers by creating an embedding representation.
Positional Encoding: Adds sequence position information to each token, i.e., keeps track of word positions.
Residual Connection: Adds each sub-layer's input back to its output, i.e., maintains information flow through the layers.
Layer Normalization: Normalizes activations within each layer to stabilize and speed up training.
Multi-Headed Attention: Processes input from multiple perspectives.
Feed Forward Neural Network: Applies the same position-wise transformation to each token's representation, further processing what attention has gathered.
Encoder Block: Processes input bidirectionally for understanding.
Decoder Block: Generates tokens based on the previous sequence (auto-regressive).
Linear Projection: Calculates raw scores (logits) for vocabulary tokens.
Softmax: Converts logits into probability distributions for token selection.
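Putting these steps together, here is a heavily simplified, single-layer NumPy sketch of the decoder-side pipeline. It is my own toy illustration: the weights are random, the vocabulary is made up, and layer normalization is omitted for brevity, so it only shows the order of operations rather than a faithful implementation of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a tiny vocabulary and a short, already-tokenised input sequence.
vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]
token_ids = np.array([1, 2, 3])                      # "the cat sat"
vocab_size, d_model, seq_len = len(vocab), 8, len(token_ids)

# 1. Embedding: map each token id to a vector of numbers.
embedding_table = rng.normal(size=(vocab_size, d_model))
x = embedding_table[token_ids]                       # shape (seq_len, d_model)

# 2. Positional encoding: add position information (random here; sinusoidal in the paper).
x = x + rng.normal(size=(seq_len, d_model)) * 0.1

# 3. Masked self-attention (a single head for simplicity) with a residual connection.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d_model)
scores = np.where(np.tril(np.ones((seq_len, seq_len), dtype=bool)), scores, -np.inf)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
x = x + weights @ v                                  # residual: add input back to output

# 4. Position-wise feed-forward network, again with a residual connection.
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))
x = x + np.maximum(0, x @ W1) @ W2

# 5. Linear projection to vocabulary logits, then softmax over the last position
#    to get a probability distribution for the next token.
logits = x[-1] @ rng.normal(size=(d_model, vocab_size))
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
print(vocab[int(np.argmax(probs))], np.round(probs, 3))
```

At generation time, a token is chosen from that final probability distribution (by sampling or taking the most likely one), appended to the sequence, and the whole pipeline runs again, which is the auto-regressive loop described above.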
Transformers have revolutionised machine learning, laying the foundation for modern Generative AI and reshaping the future of AI-driven innovation. If you want to learn more, I’d highly recommend the following resources:
Attention in Transformers, Visually Explained by 3Blue1Brown. I’d recommend all videos in the ML series.
The Hands-On Large Language Models book by Jay Alammar (I’m currently working my way through this book).