2024-12-26 2024-12-27

Extracting SentencePiece tokeniser training settings for vocabulary expansion

Introduction

In vocabulary expansion for target-language adaptation, we usually first train an auxiliary tokeniser on target-language data. For the expansion to be effective, the auxiliary tokeniser should be trained with the same settings as the source tokeniser. However, the SentencePiece documentation and papers do not always make it clear how to obtain those settings. In this post, I briefly describe how to extract the training settings of the source tokeniser and apply them to the auxiliary tokeniser.

This post is based on my script for vocabulary expansion with Gemma 2.

Extracting SentencePiece training settings

To extract the training settings of the source tokeniser, we first load the tokeniser model file as follows:

import sentencepiece as spm
# Load the source tokeniser model file into a SentencePieceProcessor.
sp_model = spm.SentencePieceProcessor()
sp_model.load(
    "/path/to/models--google--gemma-2-9b/snapshots/33c193028431c2fde6c6e51f29e6f17b60cbfac6/tokenizer.model"
)

Here, we assume that the source tokeniser is from Gemma 2 and we use its tokenizer.model.
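As a quick sanity check, we can print the number of pieces in the loaded tokeniser; for Gemma 2 this should match the vocabSize of 256,000 that we extract below.

print(sp_model.get_piece_size())  # expected: 256000 for Gemma 2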

Next, we load the SentencePiece model proto, which contains the training parameters.1

from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
# Parse the serialised model proto, which holds the vocabulary pieces and the trainer spec.
model_proto = sp_pb2_model.ModelProto()
model_proto.ParseFromString(sp_model.serialized_model_proto())
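At this point, model_proto holds both the vocabulary pieces and the trainer spec, and individual fields can also be read directly from the proto, for example:

print(len(model_proto.pieces))              # number of pieces in the source vocabulary
print(model_proto.trainer_spec.vocab_size)  # 256000 for Gemma 2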

Finally, we can extract the training settings as follows:

from google.protobuf.json_format import MessageToDict
# Convert the trainer spec of the proto into a plain Python dictionary.
training_configs = MessageToDict(model_proto)['trainerSpec']

The training_configs variable is a dictionary that contains the training settings of the source tokeniser (as shown below). Note that MessageToDict converts the proto field names to camelCase, so for example model_type appears as modelType.

{'modelPrefix': '',
 'modelType': 'BPE',
 'vocabSize': 256000,
 'selfTestSampleSize': 0,
 'inputFormat': '',
 'characterCoverage': 0.0,
 'inputSentenceSize': '0',
 'seedSentencepieceSize': 0,
 'shrinkingFactor': 0.0,
 'numThreads': 0,
 'numSubIterations': 0,
 'maxSentenceLength': 0,
 'shuffleInputSentence': True,
 'maxSentencepieceLength': 16,
 'splitByUnicodeScript': True,
 'splitByWhitespace': True,
 'splitByNumber': True,
 'treatWhitespaceAsSuffix': False,
 'splitDigits': True,
 'allowWhitespaceOnlyPieces': True,
 'userDefinedSymbols': ['<mask>',
  '<2mass>',
  '[@BOS@]',
  ...
  '</sup>',
  '</code>'],
 'vocabularyOutputPieceScore': True,
 'hardVocabLimit': True,
 'useAllVocab': False,
 'byteFallback': True,
 'requiredChars': '',
 'unkId': 3,
 'bosId': 2,
 'eosId': 1,
 'padId': 0,
 'unkSurface': ' ⁇ ',
 'unkPiece': '<unk>',
 'bosPiece': '<bos>',
 'eosPiece': '<eos>',
 'padPiece': '<pad>',
 'trainExtremelyLargeCorpus': True,
 'enableDifferentialPrivacy': False,
 'differentialPrivacyNoiseLevel': 0.0,
 'differentialPrivacyClippingThreshold': '0'}

We can now use these settings to train the auxiliary tokeniser as in my script for vocabulary expansion with Gemma 2.
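As an illustration, here is a minimal sketch of how the extracted settings could be passed to SentencePieceTrainer when training the auxiliary tokeniser. The corpus path, output prefix, and vocabulary size are hypothetical placeholders, and only a subset of settings that are meaningful to carry over is shown; the camelCase keys returned by MessageToDict have to be converted back to the snake_case argument names that the trainer expects.

import re
import sentencepiece as spm

def to_snake_case(name: str) -> str:
    # MessageToDict returns camelCase keys; the trainer expects snake_case arguments.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Settings worth carrying over from the source tokeniser; per-run options such as
# input, model_prefix, and vocab_size are set explicitly for the auxiliary tokeniser.
carried_over_keys = {
    "modelType", "maxSentencepieceLength", "splitByUnicodeScript",
    "splitByWhitespace", "splitByNumber", "splitDigits",
    "allowWhitespaceOnlyPieces", "byteFallback",
    "treatWhitespaceAsSuffix", "shuffleInputSentence",
}
carried_over = {
    to_snake_case(k): v for k, v in training_configs.items() if k in carried_over_keys
}

spm.SentencePieceTrainer.train(
    input="target_corpus.txt",     # hypothetical target-language corpus
    model_prefix="aux_tokeniser",  # hypothetical output prefix
    vocab_size=20000,              # example size for the auxiliary vocabulary
    **carried_over,
)

Settings such as characterCoverage, which were exported as 0 (i.e. unset), are deliberately left out so that SentencePiece falls back to its defaults.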

Conclusion

In this post, I briefly described how to extract the training settings of the source tokeniser and apply them to the auxiliary tokeniser for vocabulary expansion. I hope this helps!

  1. For more details, see the model proto definition (sentencepiece_model.proto) in the SentencePiece repository.