Introduction
Since the advent of LLaMA2, additional training on target-language data, i.e. continued pre-training, has been actively used to build LLMs for specific languages, often in combination with vocabulary expansion, i.e. adding target-language tokens to the tokeniser. The main effect of vocabulary expansion is to reduce overfragmentation1, resulting in better inference efficiency.
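As a quick illustration of overfragmentation (a sketch only; the exact pieces depend on the model), a common Greek word is typically split into several byte-level fragments by an English-centric tokeniser, whereas a comparable English word is often a single token.
from transformers import AutoTokenizer
# Illustrative only: outputs will vary by model. The Greek word "καλημέρα"
# ("good morning") tends to fragment into several pieces, while "hello"
# is usually a single token.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(tokenizer.tokenize("hello"))
print(tokenizer.tokenize("καλημέρα"))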
The idea of vocabulary expansion itself is quite simple, but how to implement it depends on how the target tokeniser is built. In this post, I share a procedure for expanding the vocabulary of a non-SentencePiece-based BPE tokeniser.2
Background
The main difference between SentencePiece-based (e.g. LLaMA2, Mistral) and non-SentencePiece-based (e.g. LLaMA3, OLMo) BPE tokenisers is whether or not they operate at the byte level.
The former, SentencePiece-based BPE, often uses the byte-fallback option, which prevents UNK tokens from occurring.
Recent non-SentencePiece-based BPE tokenisers, on the other hand, are typically byte-level BPE, which converts the input to a UTF-8 encoded byte sequence before tokenisation. Since tokenisation is performed on byte sequences, no UNK tokens are generated here either.3
Therefore, even though both use the same underlying BPE algorithm, the pre- and post-processing of strings differs slightly. This difference can also be seen in the metadata of the transformers tokenisers, specifically in the pre_tokenizer and decoder fields.
import json
from transformers import AutoTokenizer
# LLaMA2
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
print(tokenizer_json["pre_tokenizer"])
# None
print(tokenizer_json["decoder"])
# {'type': 'Sequence', 'decoders': [{'type': 'Replace', 'pattern': {'String': '▁'}, 'content': ' '}, {'type': 'ByteFallback'}, {'type': 'Fuse'}, {'type': 'Strip', 'content': ' ', 'start': 1, 'stop': 0}]}
# LLaMA3
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
print(tokenizer_json["pre_tokenizer"])
# {'type': 'Sequence', 'pretokenizers': [{'type': 'Split', 'pattern': {'Regex': "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"}, 'behavior': 'Isolated', 'invert': False}, {'type': 'ByteLevel', 'add_prefix_space': False, 'trim_offsets': True, 'use_regex': False}]}
print(tokenizer_json["decoder"])
# {'type': 'ByteLevel', 'add_prefix_space': True, 'trim_offsets': True, 'use_regex': True}
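The behavioural difference can also be observed directly. As a small sketch (outputs are indicative, not exact), a character that is not in either vocabulary is emitted as <0xNN> byte-fallback tokens by the SentencePiece-based tokeniser and as byte-level pieces by the byte-level BPE tokeniser; neither produces an UNK token.
from transformers import AutoTokenizer

# Sketch: how each tokeniser handles a character outside its vocabulary.
rare_char = "𓂀"  # an Egyptian hieroglyph; unlikely to be a single token in either vocab

llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(llama2_tok.tokenize(rare_char))  # byte-fallback tokens such as '<0xF0>', '<0x93>', ...
print(llama3_tok.tokenize(rare_char))  # byte-level pieces (bytes mapped to printable characters)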
Implementation
Here, we expand the vocabulary of a source tokeniser tokenizer with the help of an auxiliary tokeniser aux_tokenizer trained on a target language.
We use Greek as an example, where the effect of vocabulary expansion is relatively easy to see.
1. Load tokenisers and their metadata
First, we load the source tokeniser and the auxiliary tokeniser, and extract the source vocabulary together with the merge rules of both tokenisers.
In the example below, we use LLaMA3 as the source tokeniser and an auxiliary tokeniser trained on the Greek CC-100 subcorpus ($2^{20}$ sentences randomly sampled) with a vocabulary size of 50k tokens. The rest of the training settings are the same as LLaMA3.4
import json
import copy
from transformers import AutoTokenizer
from tokenizers.models import BPE

# Source tokeniser (LLaMA3): vocabulary and merge rules
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
vocab = tokenizer.get_vocab()
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
merges = tokenizer_json["model"]["merges"]

# Auxiliary tokeniser (trained on Greek CC-100): merge rules
aux_tokenizer = AutoTokenizer.from_pretrained("atsuki-yamaguchi/cc100-el-50k")
aux_tokenizer_json = json.loads(aux_tokenizer._tokenizer.to_str())
aux_merges = aux_tokenizer_json["model"]["merges"]
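As an aside, if you do not already have an auxiliary tokeniser, you can obtain one by retraining the source tokeniser on target-language text with tokenizer.train_new_from_iterator(), which keeps the source tokeniser's configuration (see the footnotes). The sketch below is hypothetical: greek_sentences stands in for your own list of raw Greek sentences, and the output path is a placeholder.
# Optional sketch: train an auxiliary tokeniser on your own target-language corpus.
# `greek_sentences` is a placeholder for an iterable of raw text strings.
def batch_iterator(sentences, batch_size=1000):
    for i in range(0, len(sentences), batch_size):
        yield sentences[i:i + batch_size]

aux_tokenizer = tokenizer.train_new_from_iterator(
    batch_iterator(greek_sentences),
    vocab_size=50_000,  # matches the 50k vocabulary used in this post
)
aux_tokenizer.save_pretrained("/path/to/aux/tokenizer/dir")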
2. Expand vocabulary and merge rules
We then add new target tokens and the corresponding merge rules to the source tokeniser’s vocabulary and merge rule list. Here, we add up to 10k new tokens.5
# merge the tokenizers
num_new_token = 0
max_new_token = 10000
ret_vocab = copy.copy(vocab)
ret_merges = []
old_merges = copy.copy(merges)
for merge in aux_merges:
    # vocab
    token_1, token_2 = merge.split(" ")
    token = token_1 + token_2
    if num_new_token < max_new_token:
        if token_1 not in ret_vocab and token_2 not in ret_vocab:  # both are new
            ret_vocab[token_1] = len(vocab) + num_new_token
            ret_vocab[token_2] = len(vocab) + num_new_token + 1
            num_new_token += 2
        elif token_1 not in ret_vocab and token_2 in ret_vocab:  # new + existing
            ret_vocab[token_1] = len(vocab) + num_new_token
            num_new_token += 1
        elif token_1 in ret_vocab and token_2 not in ret_vocab:  # existing + new
            ret_vocab[token_2] = len(vocab) + num_new_token
            num_new_token += 1
        else:  # both are existing tokens
            pass
        if token not in ret_vocab:
            ret_vocab[token] = len(vocab) + num_new_token
            num_new_token += 1
    # merge
    if merge in merges:  # the merge rule already exists in the source tokeniser
        old_merges.remove(merge)
        ret_merges.append(merge)
    elif token in ret_vocab and token_1 in ret_vocab and token_2 in ret_vocab:  # all parts are known
        ret_merges.append(merge)
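Before retraining the tokeniser, a quick sanity check of the result can be helpful (illustrative; the exact numbers depend on the corpus and max_new_token):
# Sanity check: how much was actually added.
print(num_new_token)                      # new tokens added, capped at max_new_token
print(len(vocab), "->", len(ret_vocab))   # original vs. expanded vocabulary size
print(len(ret_merges) + len(old_merges))  # total number of merge rules after the update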
3. Retrain BPE tokeniser
We then create a BPE model instance with the expanded vocabulary and merge rules, and use it to overwrite the model of the source tokeniser.
# retrain tokenizer
merges = ret_merges + old_merges
vocab = ret_vocab
tokenizer.backend_tokenizer.model = BPE(
    vocab=vocab,
    merges=[(merge.split(' ')[0], merge.split(' ')[1]) for merge in merges],
    fuse_unk=False,
)
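You can confirm that the backend model has actually been replaced, for example by checking the reported vocabulary size (an illustrative check; the exact number depends on how many tokens were added):
# Illustrative check: the tokeniser should now report the expanded vocabulary size.
print(len(tokenizer))                                # vocabulary size including added special tokens
print(tokenizer.backend_tokenizer.get_vocab_size())  # size reported by the underlying BPE model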
4. Save the tokeniser
Finally, we save the tokeniser to an output directory.
# save
tokenizer.save_pretrained("/path/to/output/dir")
Efficacy of vocabulary expansion
We measure the number of tokens before and after vocabulary expansion with the following example.
Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο
(English: "I was told that I should call a man at the end so that we could meet. Question: The guy showed up a little late. True, False, or Neither? Answer: Neither")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
modified_tokenizer = AutoTokenizer.from_pretrained("/path/to/output/dir")
text = "Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο"
print(len(tokenizer.encode(text)))
# 81
print(len(modified_tokenizer.encode(text)))
# 46
As a result, adding 10k target-language tokens to the vocabulary of the LLaMA3 tokeniser reduced the number of tokens from 81 to 46, i.e. by 35 tokens (roughly 43%).
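To see where the savings come from, you can also compare the token strings themselves (output omitted here; the expanded tokeniser should cover frequent Greek words and subwords with far fewer pieces):
# Compare the actual token strings produced by the two tokenisers.
print(tokenizer.tokenize(text))
print(modified_tokenizer.tokenize(text))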
Summary
There are many examples of vocabulary expansion for SentencePiece-based BPE tokenisers, but I had not come across any practical examples for non-SentencePiece-based tokenisers, which is why I decided to write this post. I hope it is of some help.
- Most LLMs are trained on English-centric data. Therefore, when encoding non-English language texts, the total number of tokens is likely to increase. For more information, see Ahia et al. (2023) and other references.
- For how to expand the vocabulary of a SentencePiece-based BPE tokeniser, see the explanation.
- It is convenient to use tokenizer.train_new_from_iterator() for training.
- The merge rules are sorted by token frequency (higher to lower), so you can add a new token in order of token frequency by processing the list sequentially. For more information, see the issue.