2024-06-24

Vocabulary expansion for non-SentencePiece-based BPE tokenisers

Introduction

Since the advent of LLaMA2, additional training on target-language data, i.e. continued pre-training, has been actively used to build LLMs for specific languages. Vocabulary expansion, i.e. adding target-language tokens to the tokeniser, is often performed alongside such training; its main effect is to reduce over-fragmentation [1], resulting in better inference efficiency.

The idea of vocabulary expansion itself is quite simple, but how it is implemented depends on how the target tokeniser is built. In this post, I share the procedure for expanding the vocabulary of a non-SentencePiece-based BPE tokeniser [2].

Background

The main difference between SentencePiece-based (e.g. LLaMA2, Mistral) and non-SentencePiece-based (e.g. LLaMA3, OLMo) BPE tokenisers is whether or not they operate at the byte level.

The former, SentencePiece-based BPE, often enables the byte-fallback option, which breaks characters missing from the vocabulary into raw byte tokens and thus prevents the occurrence of UNK tokens.

On the other hand, recent non-SentencePiece-based BPE tokenisers are typically byte-level BPE, which converts the input to a UTF-8 encoded byte sequence before tokenisation. Since tokenisation is performed on byte sequences, no UNK tokens are generated [3].
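To make the contrast concrete, here is a small, illustrative check of how the two families handle a character that is unlikely to be a single vocabulary entry. The exact token strings depend on the model, so treat the printed outputs as indicative only.

from transformers import AutoTokenizer

# SentencePiece-based BPE with byte fallback: characters missing from the
# vocabulary are broken into raw byte tokens such as <0xF0>.
sp_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(sp_tokenizer.tokenize("🦙"))
# e.g. ['▁', '<0xF0>', '<0x9F>', '<0xA6>', '<0x99>']

# Byte-level BPE: the input is first mapped to a byte-level alphabet, so an
# UNK token can never occur; the pieces are surface forms of the UTF-8 bytes.
bl_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(bl_tokenizer.tokenize("🦙"))
# e.g. ['ðŁ¦', 'Ļ']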

Therefore, even though the two families use the same underlying BPE algorithm, the pre- and post-processing of strings differs slightly. This difference can also be seen in the metadata of the transformers tokenisers, specifically in the pre_tokenizer and decoder fields.

import json
from transformers import AutoTokenizer

# LLaMA2
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
print(tokenizer_json["pre_tokenizer"])
# None
print(tokenizer_json["decoder"])
# {'type': 'Sequence', 'decoders': [{'type': 'Replace', 'pattern': {'String': '▁'}, 'content': ' '}, {'type': 'ByteFallback'}, {'type': 'Fuse'}, {'type': 'Strip', 'content': ' ', 'start': 1, 'stop': 0}]}

# LLaMA3
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
print(tokenizer_json["pre_tokenizer"])
# {'type': 'Sequence', 'pretokenizers': [{'type': 'Split', 'pattern': {'Regex': "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"}, 'behavior': 'Isolated', 'invert': False}, {'type': 'ByteLevel', 'add_prefix_space': False, 'trim_offsets': True, 'use_regex': False}]}
print(tokenizer_json["decoder"])
# {'type': 'ByteLevel', 'add_prefix_space': True, 'trim_offsets': True, 'use_regex': True}

Implementation

Here, we expand the vocabulary of a source tokeniser (tokenizer) with the help of an auxiliary tokeniser (aux_tokenizer) trained on the target language.

We use Greek as an example, where the effect of vocabulary expansion is relatively easy to see.

1. Load tokenisers and their metadata

First, we load the source and auxiliary tokenisers, and extract the source vocabulary together with the merge rules of both.

In the example below, we use LLaMA3 as the source tokeniser and an auxiliary tokeniser trained on the Greek CC-100 subcorpus ($2^{20}$ sentences randomly sampled) with a vocabulary size of 50k tokens. The rest of the training settings are the same as for LLaMA3 [4].

import json
import copy

from transformers import AutoTokenizer
from tokenizers.models import BPE

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
vocab = tokenizer.get_vocab()
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
merges = tokenizer_json["model"]["merges"]

aux_tokenizer = AutoTokenizer.from_pretrained("atsuki-yamaguchi/cc100-el-50k")
aux_tokenizer_json = json.loads(aux_tokenizer._tokenizer.to_str())
aux_merges = aux_tokenizer_json["model"]["merges"]
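
For reference, the auxiliary tokeniser above can be trained with tokenizer.train_new_from_iterator(), as noted in footnote [4]. The following is a minimal sketch under the assumption that the sampled Greek sentences are stored one per line in a local file; cc100_el_sample.txt is a placeholder name.

from transformers import AutoTokenizer

def line_iterator(path, batch_size=1000):
    # yield batches of lines from a text file (one sentence per line)
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

base_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
aux_tokenizer = base_tokenizer.train_new_from_iterator(
    line_iterator("cc100_el_sample.txt"),  # placeholder corpus path
    vocab_size=50000,
)
aux_tokenizer.save_pretrained("cc100-el-50k")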

2. Expand vocabulary and merge rules

We then add new target tokens and the corresponding merge rules to the source tokeniser's vocabulary and merge rule list. Here, we add up to 10k new tokens [5].

# merge the tokenizers
num_new_token = 0
max_new_token = 10000
ret_vocab = copy.copy(vocab)
ret_merges = []
old_merges = copy.copy(merges)
for merge in aux_merges:
    # vocab
    token_1, token_2 = merge.split(" ")
    token = token_1 + token_2
    if num_new_token < max_new_token:
        if token_1 not in ret_vocab and token_2 not in ret_vocab: # both are new
            ret_vocab[token_1] = len(vocab) + num_new_token
            ret_vocab[token_2] = len(vocab) + num_new_token + 1
            num_new_token += 2
        elif token_1 not in ret_vocab and token_2 in ret_vocab: # new + existing
            ret_vocab[token_1] = len(vocab) + num_new_token
            num_new_token += 1
        elif token_1 in ret_vocab and token_2 not in ret_vocab: # existing + new
            ret_vocab[token_2] = len(vocab) + num_new_token
            num_new_token += 1
        else: # both are existing tokens
            pass
        if token not in ret_vocab:
            ret_vocab[token] = len(vocab) + num_new_token
            num_new_token += 1
    # merge
    if merge in merges:
        old_merges.remove(merge)
        ret_merges.append(merge)
    elif token in ret_vocab and token_1 in ret_vocab and token_2 in ret_vocab:
        ret_merges.append(merge)
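
Before rebuilding the tokeniser, it can be worth checking how much the vocabulary and the merge rule list actually grew; a small, illustrative sanity check:

# sanity check (illustrative): how much the vocabulary and merge list grew
print(f"new tokens added: {num_new_token}")
print(f"vocabulary size: {len(vocab)} -> {len(ret_vocab)}")
print(f"merge rules: {len(merges)} -> {len(ret_merges) + len(old_merges)}")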

3. Retrain BPE tokeniser

We create a new BPE model instance with the expanded vocabulary and merge rules, and use it to overwrite the model of the source tokeniser's backend.

# retrain tokenizer
merges = ret_merges + old_merges
vocab = ret_vocab
tokenizer.backend_tokenizer.model = BPE(
    vocab=vocab,
    merges=[tuple(merge.split(" ")) for merge in merges],
    fuse_unk=False,
)
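
As a quick, illustrative check that the new model is active, the vocabulary reported by the tokeniser should now include the added tokens:

# illustrative: the reported vocabulary now includes the newly added tokens
print(len(tokenizer.get_vocab()))  # original vocabulary size + num_new_token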

4. Save the tokeniser

Finally, we save the tokeniser to an output directory.

# save
tokenizer.save_pretrained("/path/to/output/dir")

Efficacy of vocabulary expansion

We measure the number of tokens before and after vocabulary expansion with the following example.

Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο

(Roughly: "I was told that I should call a man at the end so that we could meet. Question: The guy showed up a bit late. True, False, or Neither? Answer: Neither")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
modified_tokenizer = AutoTokenizer.from_pretrained("/path/to/output/dir")

text = "Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο"

print(len(tokenizer.encode(text)))
# 81

print(len(modified_tokenizer.encode(text)))
# 46

As a result, the number of tokens was reduced from 81 to 46 (a reduction of 35 tokens, roughly 43%) by adding 10k target tokens to the vocabulary of the LLaMA3 tokeniser.

Summary

There are many examples of vocabulary expansion using SentencePiece-based BPE tokenisers, but I have not come across any practical examples of using non-SentencePiece-based tokenisers for vocabulary expansion. That’s why I decided to write a post about it. I hope this article is of some help.

  1. Most LLMs are trained on English-centric data, so when encoding non-English text, the total number of tokens is likely to increase. For more information, see Ahia et al. (2023) and other references.

  2. For how to expand the vocabulary of a SentencePiece-based BPE tokeniser, see the explanation.

  3. For more details, see minbpe.

  4. It is convenient to use tokenizer.train_new_from_iterator() for training.

  5. The merge rules are sorted by token frequency (higher to lower), so you can add new tokens in order of frequency by processing the list sequentially. For more information, see the issue.