What is a Transformer?
The Transformer is a deep learning architecture introduced in the landmark 2017 paper 'Attention Is All You Need' by researchers at Google. It revolutionized natural language processing by replacing recurrent neural networks with a self-attention mechanism that processes all positions of a sequence in parallel and captures long-range dependencies more effectively.
Quick Facts
| Created | 2017 by Google (Vaswani et al.) |
|---|---|
| Specification | "Attention Is All You Need" (Vaswani et al., 2017) |
How It Works
The Transformer architecture fundamentally changed how neural networks process sequential data by eliminating the need for recurrence and convolutions. At its core, the model uses multi-head self-attention mechanisms that allow each position in a sequence to attend to all other positions simultaneously, enabling the model to capture complex relationships regardless of distance.

The original architecture consists of an encoder-decoder structure, where the encoder processes the input sequence and the decoder generates the output sequence. Positional encodings are added to the input embeddings to provide information about token positions, since the model has no inherent notion of sequence order.

Transformers have become the foundation for most state-of-the-art models in NLP, including BERT, GPT, and T5, and have been successfully adapted to computer vision, audio processing, and multimodal applications. Recent innovations include Flash Attention for memory-efficient attention computation, Mixture of Experts (MoE) for scaling model capacity without a proportional increase in compute, and architectural variants like Mamba (state space models) that challenge the attention-based paradigm.
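The attention mechanism described above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation of scaled dot-product attention with toy dimensions, not production code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # pairwise relevance (seq, seq)
    weights = softmax(scores)                        # each row sums to 1
    return weights @ V, weights

# Toy example: 4 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

Every output position is a weighted mix of all value vectors, which is what lets any token attend to any other token in a single step.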
Key Characteristics
- Self-attention mechanism enabling parallel computation across all sequence positions
- Multi-head attention for capturing different types of relationships simultaneously
- Positional encoding to inject sequence order information into the model
- Encoder-decoder architecture for sequence-to-sequence tasks
- Layer normalization and residual connections for stable deep network training
- Highly parallelizable compared to recurrent architectures like RNNs and LSTMs
Common Use Cases
- Large language models (GPT, BERT, LLaMA) for text generation and understanding
- Machine translation and multilingual natural language processing
- Vision Transformers (ViT) for image classification and object detection
- Multimodal models combining text, image, and audio understanding
- Speech recognition and text-to-speech synthesis systems
Example
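As a minimal sketch of how the pieces above fit together, here is a single encoder layer in NumPy: self-attention, a residual connection and layer norm, then a position-wise feed-forward network with another residual and norm. All weights and dimensions are illustrative toy values:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, Wq, Wk, Wv, Wo, W1, W2):
    # 1) Self-attention sub-layer with residual connection + layer norm.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores) @ V
    x = layer_norm(x + attn @ Wo)
    # 2) Position-wise feed-forward sub-layer, again with residual + norm.
    ff = np.maximum(0, x @ W1) @ W2  # two-layer ReLU MLP applied per token
    return layer_norm(x + ff)

d, d_ff, seq = 8, 32, 5
rng = np.random.default_rng(1)
shapes = [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
y = encoder_layer(rng.normal(size=(seq, d)), *params)
print(y.shape)  # (5, 8)
```

A full encoder stacks several such layers; the residual connections and normalization are what keep the deep stack trainable.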
Frequently Asked Questions
What is the self-attention mechanism in Transformers?
Self-attention allows each position in a sequence to attend to all other positions, computing weighted representations based on relevance. It uses Query, Key, and Value vectors: attention scores are computed by comparing queries with keys, then used to weight values. This enables capturing long-range dependencies without recurrence.
What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers?
Encoder-only models (like BERT) process input bidirectionally for understanding tasks. Decoder-only models (like GPT) generate text autoregressively, seeing only previous tokens. Encoder-decoder models (like T5) combine both for sequence-to-sequence tasks like translation, where the encoder processes input and decoder generates output.
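The "seeing only previous tokens" behavior of decoder-only models is implemented with a causal mask applied to the attention scores. A small illustrative sketch:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks future positions a token must NOT attend to.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -1e9, scores)  # large negative -> ~0 after softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform relevance, purely for illustration
w = masked_softmax(scores, causal_mask(4))
print(np.round(w, 2))
# Row i spreads its weight only over tokens 0..i, e.g. row 0 is [1, 0, 0, 0].
```

Encoder-only models skip this mask, which is exactly what makes them bidirectional.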
Why are positional encodings necessary in Transformers?
Unlike RNNs that process sequences in order, Transformers process all positions simultaneously and have no inherent notion of sequence order. Positional encodings add information about token positions, either through learned embeddings or sinusoidal functions, enabling the model to understand word order and relative positions.
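The sinusoidal variant from the original paper can be sketched directly: even feature indices get a sine, odd indices a cosine, at wavelengths that grow geometrically with the feature index.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]          # (seq, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d/2) even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(50, 16)
# The encoding is simply added elementwise to the token embeddings.
print(pe.shape)  # (50, 16)
```

Because each position gets a unique pattern of phases, the model can recover both absolute and relative order from the summed embeddings.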
What is multi-head attention and why is it used?
Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows the model to jointly attend to information from different representation subspaces at different positions. For example, one head might focus on syntax while another captures semantic relationships.
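The "different learned projections" amount to splitting the feature dimension into heads, attending within each head independently, then concatenating. A minimal NumPy sketch with toy sizes (all weight names here are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq, d = x.shape
    d_head = d // n_heads
    # Project once, then split features into heads: (heads, seq, d_head).
    def split(W):
        return (x @ W).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention
    heads = softmax(scores) @ V                           # (heads, seq, d_head)
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq, d)
    return concat @ Wo

d, seq, n_heads = 8, 4, 2
rng = np.random.default_rng(2)
Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))
y = multi_head_attention(rng.normal(size=(seq, d)), Wq, Wk, Wv, Wo, n_heads)
print(y.shape)  # (4, 8)
```

Each head computes its own attention pattern over the same tokens, which is what lets different heads specialize in different relationships.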
How do Vision Transformers (ViT) adapt the architecture for images?
Vision Transformers divide images into fixed-size patches (e.g., 16x16 pixels), flatten them into sequences, and add positional embeddings. These patch embeddings are processed like word tokens in NLP Transformers. A classification token aggregates information for image-level predictions. ViT has achieved state-of-the-art results on image classification.
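The patching step itself is a pure reshape. A small sketch of turning an image into a sequence of flattened patch tokens (ViT would then project each token linearly and prepend a classification token, omitted here):

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an (H, W, C) image into (num_patches, patch*patch*C) flat tokens."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)  # one row per patch token

img = np.arange(32 * 32 * 3, dtype=float).reshape(32, 32, 3)
tokens = image_to_patches(img, 16)  # 16x16 patches, as in the original ViT
print(tokens.shape)  # (4, 768): 4 patches of 16 * 16 * 3 = 768 values each
```

From this point on, the patch tokens are processed exactly like word tokens in an NLP Transformer.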