What is Attention Mechanism?
Attention Mechanism is a neural network technique that enables models to dynamically focus on relevant parts of the input data by computing weighted importance scores, allowing the network to selectively attend to the most pertinent information when making predictions or generating outputs. The three primary variants are Self-Attention (each position attends to all positions within the same sequence), Cross-Attention (one sequence attends to another, e.g., decoder attending to encoder outputs), and Multi-Head Attention (multiple parallel attention operations with independent learned projections that jointly capture different types of relationships). Attention is the core building block of the Transformer architecture and underpins virtually all modern large language models (GPT, Claude, Gemini, LLaMA), vision transformers (ViT, DINO), and multimodal models.
Quick Facts
| | |
|---|---|
| Created | 2014 by Bahdanau et al., popularized in 2017 by Vaswani et al. |
| Specification | Official Specification |
How It Works
The Attention Mechanism has become one of the most influential innovations in deep learning, fundamentally changing how neural networks process sequential and structured data. Originally introduced for machine translation, attention allows models to look at all input positions and determine which ones are most relevant for each output position.

Self-Attention (or intra-attention) enables each position in a sequence to attend to all other positions within the same sequence, capturing internal dependencies. Multi-Head Attention extends this by running multiple attention operations in parallel with different learned projections, allowing the model to jointly attend to information from different representation subspaces. Cross-Attention enables interaction between two different sequences, such as between encoder outputs and decoder states in sequence-to-sequence models.

The attention mechanism computes three vectors for each input: Query (Q), Key (K), and Value (V). Attention weights are derived from the compatibility between queries and keys, then applied to the values to produce the output.

Recent innovations include Flash Attention (a memory-efficient implementation), Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) for faster inference, sparse attention patterns for handling longer sequences, and linear attention variants that reduce computational complexity from O(n²) to O(n).
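The Query-Key-Value computation described above can be sketched in a few lines of NumPy (a minimal illustration rather than a production implementation; the toy shapes and random inputs are assumptions for demonstration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core attention computation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # toy sequence: 4 tokens, d_model = 8
output, attn_weights = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
```

Cross-attention uses the same function, only with Q taken from one sequence and K, V from another.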
Key Characteristics
- Dynamic weight assignment based on input relevance rather than fixed patterns
- Global dependency modeling that captures relationships regardless of distance
- Interpretability through attention weight visualization showing model focus
- Parallelizable computation enabling efficient training on modern hardware
- Scalable architecture supporting variable-length input sequences
- Query-Key-Value (QKV) formulation providing flexible attention computation
Common Use Cases
- Large language models (GPT, Claude, LLaMA, Gemini): self-attention layers form the core of autoregressive text generation, enabling coherent long-form writing, reasoning, and instruction following
- Machine translation: cross-attention aligns source and target language representations, allowing the decoder to focus on relevant source words when generating each target word
- Vision Transformers (ViT, DINOv2, Swin Transformer): self-attention over image patches captures spatial relationships for image classification, object detection, and segmentation
- Speech recognition and audio processing (Whisper, Wav2Vec2): attention enables temporal alignment and captures long-range dependencies in audio spectrograms for accurate transcription
- Text summarization and question answering: attention weights identify the most relevant passages in source documents for generating concise summaries or extracting precise answers
- Multimodal models (CLIP, GPT-4V, Flamingo): cross-attention fuses information across vision and language modalities for image captioning, visual question answering, and image generation
- Protein structure prediction (AlphaFold): attention mechanisms model pairwise residue interactions to predict 3D protein folding from amino acid sequences
Example
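A compact multi-head self-attention implementation in NumPy illustrates the mechanism end to end (a didactic sketch: the projection matrices are randomly initialized here, whereas in practice they are learned, and the toy dimensions are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention: per-head QKV projections, parallel
    scaled dot-product attention, concatenation, output projection."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # split the model dimension into (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = (split_heads(X @ W) for W in (W_q, W_k, W_v))
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    heads = softmax(scores) @ Vh                            # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # re-join heads
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 16, 4, 6
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each head attends over the full sequence in its own learned subspace; concatenating the heads and applying the output projection lets the model combine the different relationship types they capture.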
Frequently Asked Questions
What is attention mechanism in deep learning?
Attention mechanism is a technique that allows neural networks to focus on the most relevant parts of input data when making predictions. It computes weighted importance scores for different input elements, enabling the model to selectively attend to pertinent information rather than treating all inputs equally.
What is the difference between self-attention and cross-attention?
Self-attention (intra-attention) allows each position in a sequence to attend to all positions within the same sequence, capturing internal dependencies. Cross-attention enables interaction between two different sequences, such as between encoder outputs and decoder states in translation models.
How does attention mechanism work in transformers?
In transformers, attention uses Query (Q), Key (K), and Value (V) vectors. Attention scores are computed by taking the dot product of queries with keys, scaling, and applying softmax. These scores weight the values to produce the output, allowing the model to focus on relevant context.
What are the advantages of attention mechanism?
Key advantages include: capturing long-range dependencies regardless of distance, parallel computation for efficient training, interpretability through attention weight visualization, and flexibility to handle variable-length sequences without recurrence.
How to implement attention mechanism in Python?
Implement scaled dot-product attention by: computing Q, K, V projections from input, calculating attention scores as softmax(QK^T / sqrt(d_k)), and multiplying scores with V. Libraries like PyTorch and TensorFlow provide built-in MultiheadAttention modules.
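Following those steps, a causal (decoder-style) variant can be sketched in NumPy (an illustrative sketch; the mask construction and toy shapes are assumptions, and in practice PyTorch's `torch.nn.MultiheadAttention` handles masking via its `attn_mask` argument):

```python
import numpy as np

def causal_self_attention(X):
    """Scaled dot-product self-attention with a causal mask, so each
    position attends only to itself and earlier positions."""
    n, d_k = X.shape
    scores = X @ X.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                          # masked positions become exactly 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
out, weights = causal_self_attention(X)
```

This is the form used in autoregressive language models, where a token must not see the tokens it is about to predict.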
What is Flash Attention and why does it matter?
Flash Attention is a memory-efficient, IO-aware implementation of the attention mechanism developed by Tri Dao et al. Instead of materializing the full N×N attention matrix in GPU high-bandwidth memory (HBM), it computes attention in tiles using on-chip SRAM, reducing memory usage from O(N²) to O(N) and achieving 2-4x wall-clock speedup. Flash Attention 2 and 3 further improve throughput, and it is now the default attention implementation in most training and inference frameworks.
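The tiling idea can be illustrated with an online-softmax sketch in NumPy (a didactic model of the algorithm's arithmetic, not the fused GPU kernel; the tile size and toy shapes are assumptions):

```python
import numpy as np

def attention_tiled(Q, K, V, tile=2):
    """Attention via online softmax: process K/V in tiles so the full
    N x N score matrix is never materialized (the Flash Attention idea)."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)  # running max of scores per query row
    row_sum = np.zeros(n)          # running softmax normalizer per row
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Q @ Kt.T / np.sqrt(d)                 # partial scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        scale = np.exp(row_max - new_max)         # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        out = out * scale[:, None] + p @ Vt
        row_sum = row_sum * scale + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
out = attention_tiled(Q, K, V, tile=2)
```

The real kernel additionally fuses these steps on-chip and recomputes tiles in the backward pass; the sketch only shows why the O(N²) score matrix never needs to exist in memory at once.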
What are Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)?
MQA shares a single key-value head across all query heads, dramatically reducing the KV cache memory during autoregressive inference (important for LLM serving). GQA is a middle ground where key-value heads are shared among groups of query heads rather than all of them, providing a better trade-off between inference speed and model quality. GQA is used in LLaMA 2 70B, Mistral, and many modern LLMs.
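The head-sharing scheme can be sketched in NumPy (a simplified illustration; the head counts and cache shapes are assumptions chosen for the example):

```python
import numpy as np

def expand_kv_heads(kv, n_query_heads):
    """Broadcast grouped KV heads across query heads.
    kv: (n_kv_heads, seq_len, d_head).
      n_kv_heads == n_query_heads     -> standard multi-head attention
      n_kv_heads == 1                 -> Multi-Query Attention (MQA)
      1 < n_kv_heads < n_query_heads  -> Grouped-Query Attention (GQA)"""
    n_kv_heads = kv.shape[0]
    group_size = n_query_heads // n_kv_heads
    return np.repeat(kv, group_size, axis=0)  # (n_query_heads, seq_len, d_head)

# Example: 8 query heads share 2 stored KV heads (GQA) -> 4x smaller KV cache
rng = np.random.default_rng(0)
K_cache = rng.standard_normal((2, 10, 64))  # only 2 KV heads are stored
K_full = expand_kv_heads(K_cache, n_query_heads=8)
```

Only the small `K_cache` is kept during autoregressive decoding; the expansion to one KV head per query head is a cheap broadcast at attention time, which is where the serving-memory savings come from.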