Bigger is not always better
For years, progress in AI has been defined by scale: more parameters, more data, and more compute. From GPT-3's 175 billion parameters in 2020, to today's frontier models rumored to approach the trillion-parameter range, each leap forward has demanded exponentially more GPUs, data, and energy.
But not every task needs a large general-purpose model. Workloads like routing support tickets, extracting invoice fields, or classifying emails reward speed, predictability, and cost efficiency far more than general intelligence. In practice, most applications benefit from models that are targeted, predictable, and efficient: orders of magnitude smaller, far cheaper to run, and often better on the specific tasks they're designed for.
One of the key techniques enabling this is model distillation: transferring task behavior from a large model (the teacher) into a smaller one (the student).
The core idea is simple: train the student to match what the teacher would do.
Distillation shows up in two common forms. The first is the workhorse use case: you have a bounded task (support routing, extraction, classification) and you want a smaller, faster model to match the quality of a larger one on that narrow scope.
The second is more like product-line compression: you take a general purpose model and distill it into a smaller "everyday" model that mimics the larger model's behavior (think something like a "Pro" to "Flash" tier). In this setting, distillation often beats training the smaller model from scratch on the same dataset, because the teacher provides a cleaner learning signal than raw labels alone.
The history
Distillation started as a compression trick. In 2006, Buciluă et al. showed you could train a small neural network to match the output probabilities of a large ensemble, instead of learning only from hard labels.
An ensemble of models voting on whether an email is spam.
It works because those "soft" probabilities contain extra signal: relative confidence, which classes are similar, and where the decision boundary lives. That's how a small model can mimic a much larger system with far fewer parameters.
This idea was formalized and popularized in the paper "Distilling the Knowledge in a Neural Network" by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean (2015). The paper introduced the now-standard teacher–student framework, where a large, high-capacity model (the teacher) trains a smaller model (the student) by exposing it to softened probability distributions, often controlled by a temperature parameter to reveal more structure in the teacher's predictions.
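The temperature trick is easy to see in a few lines of code. This is a minimal sketch, not any particular library's API; the logits and class names are made up for illustration. At temperature 1 the distribution is sharply peaked; at a higher temperature the runner-up classes become visible, which is exactly the extra signal the student learns from.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution.

    Higher temperatures flatten the distribution, revealing the
    teacher's relative confidence in the non-top classes.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [spam, promo, not_spam]
logits = [4.0, 2.5, -1.0]

hard = softmax(logits, temperature=1.0)  # peaked: ~[0.81, 0.18, 0.01]
soft = softmax(logits, temperature=4.0)  # softened: runner-ups now visible
```

Training the student against `soft` rather than a one-hot "spam" label is what lets it absorb the teacher's sense of which classes are similar.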
When distillation actually makes sense (and when it doesn't)
Distillation makes sense when the task is narrow, stable, and well-defined, when you care about cost, latency, and predictability more than general intelligence (support, classification, routing, extraction, structured output).
It makes less sense for open-ended work or shifting objectives. And if you don't control the teacher (no logits, no temperature, inconsistent behavior), you're not really distilling. You're copying outputs and hoping it holds up.
So how does it work?
Step 1: Choose your teacher
It starts with choosing the right teacher. You're not just copying answers; you're copying how the teacher behaves on real inputs, especially edge cases. So a good teacher is reliable, not just top of a leaderboard.
You might assume the strongest teacher always produces the strongest student, but recent work shows this is not consistently true. Research such as In Good GRACEs demonstrates that student performance depends on teacher–student compatibility, not just teacher size, introducing GRACE: a score that predicts how effective a teacher will be for a specific student architecture without expensive trial-and-error.
Step 2: Create prompts
Next, collect real inputs. If you're distilling a support model, use real tickets, not polished demos.
Include hard ones:
- Vague or underspecified: "My account is locked and I need this fixed ASAP."
- Emotionally charged: "I've been charged twice and support hasn't replied for days. This is unacceptable."
- Ambiguous: "I can't log in after changing my email."
These are where the teacher reveals what you actually care about: how it handles ambiguity, prioritization, and trade-offs.
A shallow or overly clean prompt set produces a shallow student.
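One way to keep yourself honest here is to tag prompts as you collect them and check the mix. A hypothetical sketch (the tags and threshold are illustrative, not a standard):

```python
# A small prompt set mixing routine and hard cases. In a real
# project these would be thousands of real tickets, not four strings.
prompts = [
    {"text": "How do I update my billing address?", "tag": "routine"},
    {"text": "My account is locked and I need this fixed ASAP.", "tag": "vague"},
    {"text": "I've been charged twice and support hasn't replied.", "tag": "charged"},
    {"text": "I can't log in after changing my email.", "tag": "ambiguous"},
]

# Sanity check: hard cases should not be a rounding error in the set.
hard = [p for p in prompts if p["tag"] != "routine"]
hard_ratio = len(hard) / len(prompts)
```

If `hard_ratio` is near zero, you're building the "overly clean" prompt set that produces a shallow student.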
Step 3: Run the teacher (and record how it thinks)
Once you've picked your teacher and assembled a good prompt set, run every prompt through the teacher and save its responses.
Now you have a dataset of teacher behavior: what it prioritized, what it ignored, and how it responded under uncertainty.
Here's what that looks like in practice:
System: You are a helpful tech support agent. (In practice, this would typically be a much longer, detailed prompt)
User: I can't log into my account after resetting my password.
Teacher output:
Let's troubleshoot this step by step. First, make sure you're using the new password you just set.
If you're still having issues, try clearing your browser cache or using an incognito window. If the problem persists, I can help you reset it again.
You store this as an input–output pair.
Do this at scale, and you end up with a dataset that captures how the teacher behaves.
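Concretely, each pair can be stored as one JSON object per line (JSONL is a common choice because it streams well during training). The field names below are illustrative, not a standard schema:

```python
import json

# One hypothetical distillation record: the prompt we sent
# and the response the teacher produced.
record = {
    "system": "You are a helpful tech support agent.",
    "input": "I can't log into my account after resetting my password.",
    "teacher_output": (
        "Let's troubleshoot this step by step. First, make sure "
        "you're using the new password you just set."
    ),
}

# Serialize as a single JSONL line, ready to append to the dataset file.
line = json.dumps(record)
```

Repeated over the whole prompt set, this file becomes the training data for the student.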
Step 4: Train the student on the teacher's behavior
This is the part people mean by true distillation.
Here's the key difference between "copying outputs" and distillation: distillation trains on the full distribution, not just the final text.
For each input, the teacher does not produce a single answer. It produces a probability distribution across all possible next tokens. These probabilities encode what Geoffrey Hinton called dark knowledge:
- Which alternatives were almost chosen
- Which options were clearly wrong
- How confident the model was at each decision point
To transfer this knowledge, the student is trained to minimize the difference between its output distribution and the teacher's distribution, typically using KL divergence as the loss function.
A temperature parameter softens the teacher's distribution so the student can see more than the top-1 token, making more of the teacher's structure learnable.
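The loss itself is straightforward. Below is a minimal sketch of KL divergence over a tiny made-up vocabulary; the distributions are invented for illustration, and a real training loop would compute this per token position over the full vocabulary, inside an autograd framework.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the student's distribution q is from the
    teacher's distribution p, summed over the vocabulary. Zero when
    they match exactly; always non-negative otherwise."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-token distributions over a 4-token vocabulary.
teacher = [0.70, 0.20, 0.07, 0.03]  # softened teacher probabilities
student = [0.55, 0.30, 0.10, 0.05]  # student's current prediction

loss = kl_divergence(teacher, student)
# Gradient descent on this loss pulls the student's distribution
# toward the teacher's, driving it toward zero.
```

Note that the loss compares whole distributions, not final answers: the student is penalized for assigning the wrong relative mass to near-miss tokens, which is exactly the dark knowledge described above.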
This is what separates proper distillation from simple imitation:
- Behavioral cloning trains on the teacher's final text only
- Distillation trains on the shape of the teacher's decision space
Because of this, a distilled student often generalizes better than a model trained on hard labels, despite having far fewer parameters. It learns smoother decision boundaries, better uncertainty handling, and more consistent behavior on edge cases.
The downside is that this requires white-box access to the teacher. You must be able to extract logits, apply temperature scaling, and store large volumes of probability data. This is why proper distillation is expensive, compute-heavy, and usually only done by teams that own the teacher model itself.
Conclusion
Distillation flips the default assumption. A lot of what makes big models useful can be transferred into systems that are cheaper, faster, and more predictable.
Bigger models will continue to exist. They will explore, generalize, and discover. But most real-world systems will not run them directly. They will distill them.