Large Language Models (LLMs) are powerful tools transforming natural language processing. However, deploying these models in real-world applications is often impractical due to their massive computational costs. Knowledge Distillation (KD) (Hinton et al., 2015) offers a solution by compressing the knowledge from large models (teachers) into smaller, more efficient ones (students). But KD methods traditionally require teacher and student models to share the same tokenizer, limiting flexibility.

In our latest paper, Towards Cross-Tokenizer Distillation: The Universal Logit Distillation Loss (Boizard et al., 2024), we introduce a novel approach enabling effective cross-tokenizer distillation. In this blog post, we will explain the key component of the ULD Loss and see together why this approach overcomes the traditional same-tokenizer restriction.

Kullback–Leibler divergence (KLd):

 


The KL divergence is a way to measure how one probability distribution, let’s call it \( Q(x) \), differs from another probability distribution, \( P(x) \).

In the context of knowledge distillation for models, \( P(x) \) represents the token probabilities predicted by the teacher model, while \( Q(x) \) represents the token probabilities predicted by the student model.

Why is it important?

The KL divergence helps the student model learn by comparing its predictions to the teacher’s predictions. It tells us how “close” the student’s predictions are to the teacher’s, and the goal during training is to make the KL divergence as small as possible.

How is it calculated?

Given a shared vocabulary \( \Omega \) (of size \( |\Omega| \)), the KL divergence is defined as:

\[
D_{KL}(P \parallel Q) = \sum_{x=1}^{|\Omega|} P(x) \log \left( \frac{P(x)}{Q(x)} \right)
\]

Here’s what this means:

  • \( P(x) \): The probability assigned to token x by the teacher.

  • \( Q(x) \): The probability assigned to the same token by the student.

  • The ratio \( \frac{P(x)}{Q(x)} \) tells us how much the teacher’s probability differs from the student’s for token x.

  • Finally, we take \( P(x) \log \left( \frac{P(x)}{Q(x)} \right) \): multiplying by \( P(x) \) weights the difference between \( P(x) \) and \( Q(x) \) according to how important token x is to the teacher model.

The sum of these terms over all tokens in the vocabulary gives the KL divergence. If the student predicts exactly like the teacher (\( P(x) = Q(x) \) for all x), the KL divergence becomes 0, meaning perfect alignment.
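To make this concrete, here is a minimal PyTorch sketch of this KL term; the tensor names and shapes are illustrative assumptions, not taken from any specific training code.

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(teacher_logits: torch.Tensor,
                         student_logits: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) between teacher (P) and student (Q) next-token distributions.

    Both tensors are assumed to have shape (batch, seq_len, |vocab|),
    which is only possible when the two models share the same vocabulary.
    """
    p = F.softmax(teacher_logits, dim=-1)          # teacher probabilities P(x)
    log_p = F.log_softmax(teacher_logits, dim=-1)  # log P(x)
    log_q = F.log_softmax(student_logits, dim=-1)  # log Q(x)
    # sum over the vocabulary of P(x) * (log P(x) - log Q(x)),
    # then average over batch and sequence positions
    return (p * (log_p - log_q)).sum(dim=-1).mean()
```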

Why must the vocabularies be the same?

If the vocabularies differ, say the student vocabulary is {dog, cat, sun} and the teacher’s is {dog, cat, moon}, then the probability \( Q(\text{moon}) \) does not exist. In effect, \( Q(\text{moon}) = 0 \), and the ratio \( \frac{P(\text{moon})}{Q(\text{moon})} \) becomes undefined (division by zero). This breaks the computation and makes the comparison inconsistent.
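The toy vocabularies above translate directly into a quick numeric check; the probability values below are made up purely for illustration.

```python
import math

# Teacher over {dog, cat, moon}, student over {dog, cat, sun}; numbers are made up.
P = {"dog": 0.5, "cat": 0.3, "moon": 0.2}   # teacher probabilities
Q = {"dog": 0.6, "cat": 0.4, "sun": 0.0}    # student probabilities (no "moon" token)

kl = 0.0
for token, p in P.items():
    q = Q.get(token, 0.0)                    # Q("moon") does not exist -> treated as 0
    kl += p * math.log(p / q) if q > 0 else float("inf")

print(kl)  # inf: the divergence blows up as soon as the vocabularies differ
```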

ULD Loss – Wasserstein distance

At Diabolocom, in partnership with CentraleSupélec and our PhD student Nicolas Boizard, we developed a new method called ULD Loss, based on Optimal Transport rather than the classical Kullback-Leibler divergence for knowledge distillation. For the full details and results, you can access the paper here (accepted to TMLR), but today we will focus on understanding why the Wasserstein distance overcomes the requirement for identical vocabularies between the student and teacher models.


Overview:

The Wasserstein distance is another way to measure the difference between two probability distributions. It is also known as the Earth Mover’s Distance (EMD), and it provides a more geometrically intuitive measure of how much “work” is needed to transform one distribution into another.

How is it calculated?

The Wasserstein distance is defined as:

\[
W_p(P, Q) = \min_{T \in \Pi(P, Q)} \sum_{i=1}^{|\Omega_s|} \sum_{j=1}^{|\Omega_t|} T_{ij} C_{ij}^p
\]

Here’s what this means:

  • \( P \) and \( Q \) are the probability distributions of the teacher and student, respectively.

  • \( \Pi(P, Q) \) is the set of all couplings (joint distributions) between \( P \) and \( Q \), and \( T \) is one such coupling: it pairs the probability mass of each token in the teacher vocabulary with tokens in the student vocabulary.

  • \( C_{ij} \) is the transport cost; in our case, we can see it as the cost of moving probability mass from student token i to teacher token j, i.e. the cost of transforming \( Q \) into \( P \).

  • Finally, \(\displaystyle \min_{T \in \Pi(P, Q)} \sum_{i=1}^{|\Omega_s|} \sum_{j=1}^{|\Omega_t|} T_{ij} C_{ij}^{p}\) means that we look for the coupling that minimizes the total cost of transforming \( Q \) into \( P \).

Unlike the KL divergence, the Wasserstein distance does not require the vocabularies of the two models, P (teacher) and Q (student), to match. This is because we compare the distributions in a more flexible way: each token from P can be mapped to any token in Q, regardless of whether a specific token appears in both vocabularies. However, like the KL divergence, the Wasserstein distance is minimal when \( P(x) = Q(x) \) for all x. Training the student model to reduce this value therefore forces it to reproduce the behavior of the teacher, or in other words, distills the teacher’s knowledge into the student model.
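As a rough illustration, here is a minimal PyTorch sketch of a sorted-probabilities simplification of this transport problem, assuming a uniform ground cost between tokens so that the optimal coupling reduces to matching sorted probability mass. The function name and tensor shapes are illustrative assumptions; refer to the paper for the exact ULD loss formulation.

```python
import torch
import torch.nn.functional as F

def sorted_wasserstein(teacher_logits: torch.Tensor,
                       student_logits: torch.Tensor) -> torch.Tensor:
    """Toy Wasserstein-style distance between distributions over different vocabularies.

    teacher_logits: (batch, seq_len, |Omega_t|)
    student_logits: (batch, seq_len, |Omega_s|)
    Assumes a uniform ground cost, so the optimal coupling reduces to matching
    the sorted probabilities; the shorter probability vector is zero-padded.
    """
    p = F.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True).values
    q = F.softmax(student_logits, dim=-1).sort(dim=-1, descending=True).values
    size = max(p.shape[-1], q.shape[-1])
    p = F.pad(p, (0, size - p.shape[-1]))    # pad teacher probabilities with zeros
    q = F.pad(q, (0, size - q.shape[-1]))    # pad student probabilities with zeros
    return (p - q).abs().sum(dim=-1).mean()

# Different vocabulary sizes are no longer a problem:
loss = sorted_wasserstein(torch.randn(2, 4, 32000), torch.randn(2, 4, 50257))
```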

3 – Final thoughts

The ULD loss efficiently distills knowledge between models with different vocabularies (cf. our paper). This provides greater flexibility in selecting models for a teacher-student setup, particularly for models trained on different data distributions (different languages for example).

Written by Nada Nachit