What is Distillation?

Distillation (Knowledge Distillation) is a model compression technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, transferring knowledge to create efficient models that retain much of the original performance.

Quick Facts

Full Name: Knowledge Distillation
Created: 2015 by Hinton et al.

How It Works

Knowledge distillation enables deploying AI capabilities in resource-constrained environments. The technique works by training the student model on the teacher's soft probability distributions (a temperature-scaled softmax over the teacher's logits), rather than just on hard labels. These soft targets transfer 'dark knowledge' about the relationships between classes, such as which incorrect classes the teacher considers plausible. Modern applications include distilling GPT-4 into smaller models, creating efficient inference models, and building specialized models from general-purpose teachers. Notable examples include DistilBERT and various distilled LLMs.
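To illustrate the idea of soft targets, here is a minimal sketch of a temperature-scaled softmax applied to hypothetical teacher logits (the class names and logit values are invented for illustration):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Convert raw logits to probabilities, softened by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, truck]
teacher_logits = [5.0, 4.0, 1.0]

hard = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)
# At T=1 the distribution is sharply peaked on 'cat'; at T=4 the
# 'dog' class receives noticeably more mass, encoding the teacher's
# view that cats resemble dogs far more than trucks.
```

Training the student to match the softened distribution, rather than only the one-hot label, is what transfers this relational information.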

Key Characteristics

  • Transfers knowledge from large teacher to small student
  • Uses soft labels (probability distributions) for training
  • Preserves more knowledge than training on hard labels alone
  • Enables deployment on edge devices and mobile
  • Can combine multiple teachers for ensemble distillation
  • Temperature parameter controls softness of distributions

Common Use Cases

  1. Creating efficient models for mobile deployment
  2. Reducing inference costs while maintaining quality
  3. Building specialized models from general teachers
  4. Compressing large language models
  5. Edge AI and IoT applications

Example

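The sketch below shows a Hinton-style distillation loss in plain Python: a KL-divergence term on temperature-softened distributions blended with a standard cross-entropy term on the ground-truth label. The logit values, `T`, and `alpha` are illustrative choices, not fixed by the technique:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (T=1 recovers the standard softmax)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """Blend a soft-target KL term with a hard-label cross-entropy term.

    The soft term is scaled by T^2 so its gradient magnitude stays
    comparable as T changes, as suggested in the original paper.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) over the softened distributions
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    # Standard cross-entropy against the one-hot ground-truth label
    ce = -math.log(softmax(student_logits)[true_label])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

# Toy example: a 3-class problem where class 0 is the ground truth
loss = distillation_loss([2.0, 1.0, 0.1], [4.0, 2.5, 0.5], true_label=0)
```

In a real training loop the same loss would be computed batch-wise with an autodiff framework and minimized with respect to the student's parameters.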

Frequently Asked Questions

What is the difference between knowledge distillation and model pruning?

Knowledge distillation trains a smaller student model to mimic a larger teacher model's outputs, transferring learned knowledge. Model pruning removes unnecessary weights from an existing model. Distillation creates a new, smaller architecture while pruning reduces an existing one.

Why is temperature important in knowledge distillation?

Temperature softens the probability distribution from the teacher model, revealing more information about relationships between classes. Higher temperatures produce softer distributions that transfer more 'dark knowledge' about class similarities, helping the student learn better representations.
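The flattening effect can be checked numerically: the entropy of the softened distribution grows with temperature. The logit values below are arbitrary placeholders:

```python
import math

def softmax(logits, T):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

logits = [6.0, 2.0, 1.0, 0.5]  # hypothetical teacher logits
# Entropy increases monotonically with T: the distribution flattens,
# exposing the relative similarities among the non-top classes.
ents = [entropy(softmax(logits, T)) for T in (1.0, 2.0, 5.0)]
```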

Can knowledge distillation be used with large language models?

Yes, knowledge distillation is widely used for LLMs. Examples include DistilBERT (distilled from BERT) and various distilled versions of GPT models. It enables deploying powerful language capabilities on resource-constrained devices.

What is the typical performance loss when using knowledge distillation?

Student models typically retain 90-99% of the teacher's performance while being 2-10x smaller. The exact performance depends on the compression ratio, student architecture, and quality of the distillation process.

What is ensemble distillation?

Ensemble distillation combines knowledge from multiple teacher models into a single student. This can produce students that outperform any individual teacher by capturing complementary knowledge from different models.
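One simple way to form ensemble targets, sketched below, is to average the teachers' soft predictions class-wise before distilling; the three teacher distributions are made up for illustration, and real systems may instead use weighted or logit-space averaging:

```python
def ensemble_soft_targets(teacher_probs):
    """Average several teachers' soft predictions class-wise."""
    n = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[i] for p in teacher_probs) / n for i in range(n_classes)]

# Hypothetical soft predictions from three teachers over 3 classes
t1 = [0.7, 0.2, 0.1]
t2 = [0.6, 0.3, 0.1]
t3 = [0.8, 0.1, 0.1]
targets = ensemble_soft_targets([t1, t2, t3])  # ≈ [0.7, 0.2, 0.1]
```

The averaged targets remain a valid probability distribution, so the student can be trained against them with the same distillation loss used for a single teacher.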
