<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://gucci-j.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://gucci-j.github.io/" rel="alternate" type="text/html" /><updated>2026-04-18T20:55:44+00:00</updated><id>https://gucci-j.github.io/feed.xml</id><title type="html">きままにNLP</title><subtitle>A Technical Blog about NLP and ML</subtitle><entry xml:lang="ja_JP"><title type="html">ApptainerでPyTorch NGCコンテナを使う</title><link href="https://gucci-j.github.io/post/apptainer-pytorch/" rel="alternate" type="text/html" title="ApptainerでPyTorch NGCコンテナを使う" /><published>2025-07-19T00:00:00+00:00</published><updated>2025-07-19T15:00:00+00:00</updated><id>https://gucci-j.github.io/post/apptainer-pytorch</id><content type="html" xml:base="https://gucci-j.github.io/post/apptainer-pytorch/"><![CDATA[<h2 id="はじめに">はじめに</h2>

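<p>This post provides tips for using the <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch">PyTorch NGC containers</a> provided by NVIDIA with Apptainer.</p>

<p>PyTorch NGC containers come with NVIDIA libraries such as CUDA and cuDNN pre-installed, which makes GPU-based model training easy to set up. However, because they are distributed as Docker containers, they must be converted for Apptainer before they can be used in HPC environments. The procedure for installing extra libraries also differs slightly from the Docker workflow.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>This post focuses on these differences and explains how to use PyTorch NGC containers with Apptainer.</p>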

<h2 id="前提">Prerequisites</h2>
<p>This guide assumes Apptainer is installed in your HPC environment. If it is not, please ask your system administrator to install it.</p>

<h2 id="手順">Steps</h2>
<h3 id="1-apptainerをロードする">1. Load Apptainer</h3>
<p>Apptainer may be loaded automatically in some cases. If you need to load it manually, run the following command. The exact command depends on your HPC environment, so adjust it as necessary.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>module load Apptainer/&lt;version&gt;
</code></pre></div></div>

<h3 id="2-pytorch-ngcコンテナをダウンロードする">2. Download the PyTorch NGC Container</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer pull docker://nvcr.io/nvidia/pytorch:&lt;version&gt;
</code></pre></div></div>

<p>Here, <code class="language-plaintext highlighter-rouge">&lt;version&gt;</code> specifies the PyTorch NGC container tag you want to use. You can check the available tags on the <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags">NGC Catalog page</a>. For example, to use the 25.04-py3 tag, run the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer pull docker://nvcr.io/nvidia/pytorch:25.04-py3
</code></pre></div></div>

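<p>The converted image is saved as <code class="language-plaintext highlighter-rouge">pytorch_&lt;version&gt;.sif</code>, where <code class="language-plaintext highlighter-rouge">&lt;version&gt;</code> is the tag you pulled. For example, running the command above generates a file named <code class="language-plaintext highlighter-rouge">pytorch_25.04-py3.sif</code>.</p>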

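<h3 id="3-コンテナを適切なディレクトリに移動する">3. Move the Container to an Appropriate Directory</h3>
<p>By default, the downloaded container file is saved in the current directory. I recommend moving it to your project directory or another folder of your choice. In my case, I move it to a <code class="language-plaintext highlighter-rouge">containers</code> folder within my project directory as follows:</p>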

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$PROJ_HOME</span>/containers
<span class="nv">$ </span><span class="nb">mv </span>pytorch_25.04-py3.sif <span class="nv">$PROJ_HOME</span>/containers
</code></pre></div></div>

<h3 id="4-コンテナを実行する">4. Run the Container</h3>

<p>Next, run the downloaded container. Specify the <code class="language-plaintext highlighter-rouge">--nv</code> option to enable GPU support, and use the <code class="language-plaintext highlighter-rouge">--bind</code> option to mount your project directory inside the container. This lets you access your project files from within the container.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    /bin/bash
</code></pre></div></div>

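<p>Some HPC environments are configured to mount home directories and Lustre-based file storage automatically. In such cases, the <code class="language-plaintext highlighter-rouge">--bind</code> option is not necessary. Refer to your HPC documentation to confirm whether these paths are mounted.</p>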

<h3 id="5-必要なライブラリを追加インストールする">5. Install Additional Required Libraries</h3>
<p>PyTorch NGC containers come with a basic PyTorch environment, but libraries such as Transformers must be installed separately. I recommend using a virtual environment for these extra installations. Here is an example of installing the Transformers library using <code class="language-plaintext highlighter-rouge">pip</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python3 <span class="nt">-m</span> venv <span class="nt">--system-site-packages</span> <span class="nv">$PROJ_HOME</span>/envs/test
<span class="nv">$ </span><span class="nb">source</span> <span class="nv">$PROJ_HOME</span>/envs/test/bin/activate
<span class="nv">$ </span><span class="nb">unset </span>PIP_CONSTRAINT
<span class="nv">$ </span>pip <span class="nb">install </span>transformers
</code></pre></div></div>
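<p>These commands create a virtual environment in <code class="language-plaintext highlighter-rouge">envs/test</code> within your project directory and install the Transformers library there. The <code class="language-plaintext highlighter-rouge">--system-site-packages</code> option makes the packages already installed in the container available inside the virtual environment.</p>

<p>Unsetting the <code class="language-plaintext highlighter-rouge">PIP_CONSTRAINT</code> environment variable makes <code class="language-plaintext highlighter-rouge">pip</code> ignore the version lock that the PyTorch NGC container applies, for the current shell session only. This is the key difference from the Docker workflow: the lock itself lives in <code class="language-plaintext highlighter-rouge">/etc/pip/constraint.txt</code>, but under Apptainer that file may not be editable (for example, because the image is mounted read-only), so unsetting the variable is the approach used here.</p>

<p>Once the installation is complete, exit the container.</p>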

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">exit</span>
</code></pre></div></div>

<h3 id="6-実際にコンテナを使用する">6. Use the Container for Your Tasks</h3>
<h4 id="インタラクティブジョブの場合">For Interactive Jobs</h4>
<p>When running the container in an interactive job, you can use the same command shown in step 4:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    /bin/bash
</code></pre></div></div>

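<p>If you need to set library environment variables within the container, use the following as a reference. The paths may differ depending on your environment. To find the CUDA version used by a container, refer to the <a href="https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html">official support matrix</a>. The 25.04-py3 container uses CUDA 12.9, so the variables are set as follows:</p>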

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CUDA_HOME</span><span class="o">=</span>/usr/local/cuda-12.9
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">TMPDIR</span><span class="o">=</span>/tmp
</code></pre></div></div>

<h4 id="ジョブスクリプトの場合">For Job Scripts</h4>
<p>When running the container from a job script, use the <code class="language-plaintext highlighter-rouge">apptainer exec</code> command as shown below. Here is an example of a Slurm job script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH --job-name=pytorch_job</span>
<span class="c">#SBATCH --output=pytorch_job.out</span>
<span class="c">#SBATCH --error=pytorch_job.err</span>
<span class="c">#SBATCH --ntasks=1</span>
<span class="c">#SBATCH --cpus-per-task=4</span>
<span class="c">#SBATCH --gres=gpu:1</span>
<span class="c">#SBATCH --time=01:00:00</span>
<span class="c">#SBATCH --partition=gpu</span>

module load Apptainer/&lt;version&gt;
<span class="nv">PROJ_HOME</span><span class="o">=</span>/path/to/your/project
apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    python3 <span class="nv">$PROJ_HOME</span>/scripts/train.py
</code></pre></div></div>

<p>If you need to pass many command-line arguments, I recommend splitting the shell script into an execution script (inner) and a Slurm job script (wrapper). Here is an example:</p>

<h5 id="実行用スクリプト">Execution Script</h5>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">source</span> <span class="nv">$PROJ_HOME</span>/envs/test/bin/activate

<span class="nb">export </span><span class="nv">CUDA_HOME</span><span class="o">=</span>/usr/local/cuda-12.9
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">TMPDIR</span><span class="o">=</span>/tmp

python3 <span class="nv">$PROJ_HOME</span>/scripts/train.py <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>  
</code></pre></div></div>

<h5 id="slurmジョブスクリプト">Slurm Job Script</h5>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH --job-name=pytorch_job</span>
<span class="c">#SBATCH --output=pytorch_job.out</span>
<span class="c">#SBATCH --error=pytorch_job.err</span>
<span class="c">#SBATCH --ntasks=1</span>
<span class="c">#SBATCH --cpus-per-task=4</span>
<span class="c">#SBATCH --gres=gpu:1</span>
<span class="c">#SBATCH --time=01:00:00</span>
<span class="c">#SBATCH --partition=gpu</span>

module load Apptainer/&lt;version&gt;
<span class="nb">export </span><span class="nv">PROJ_HOME</span><span class="o">=</span>/path/to/your/project
apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    <span class="nv">$PROJ_HOME</span>/scripts/run.sh <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

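<p>With this setup, the Slurm job script calls the execution script, which in turn runs your Python script.</p>

<h2 id="まとめ">Conclusion</h2>
<p>This post explained how to use PyTorch NGC containers with Apptainer.</p>

<p>PyTorch NGC containers come with most of the necessary libraries, such as CUDA, pre-installed, so you can set up an experimental environment quickly. If you have not used them before, give them a try.</p>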
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This post was inspired by <a href="https://zenn.dev/turing_motors/articles/fdea210f6ed6e9">[Tips] NGC PyTorchのversion lockを解除する方法</a> (“[Tips] How to remove the version lock in NGC PyTorch”). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="Tips" /><category term="Transformers" /><summary type="html"><![CDATA[PyTorch NGCコンテナをApptainerで使う際のTipsです。]]></summary></entry><entry xml:lang="en_US"><title type="html">Using PyTorch NGC Containers with Apptainer</title><link href="https://gucci-j.github.io/post/en/apptainer-pytorch/" rel="alternate" type="text/html" title="Using PyTorch NGC Containers with Apptainer" /><published>2025-07-19T00:00:00+00:00</published><updated>2025-07-19T16:00:00+00:00</updated><id>https://gucci-j.github.io/post/en/apptainer-pytorch</id><content type="html" xml:base="https://gucci-j.github.io/post/en/apptainer-pytorch/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>This post provides tips for using <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch">PyTorch NGC containers</a> provided by NVIDIA with Apptainer.</p>

<p>PyTorch NGC containers come with NVIDIA libraries like CUDA and cuDNN pre-installed, making it easy to train models using GPUs. However, since they are provided as Docker containers, they need to be converted for Apptainer use in HPC environments. Additionally, the procedure for installing extra libraries differs slightly from using Docker.</p>

<p>This post focuses on these differences and explains how to use PyTorch NGC containers with Apptainer.</p>

<h2 id="prerequisites">Prerequisites</h2>

<p>This guide assumes Apptainer is installed in your HPC environment. If it is not, please ask your system administrator to install it.</p>

<h2 id="steps">Steps</h2>

<h3 id="1-load-apptainer">1. Load Apptainer</h3>

<p>While Apptainer may be loaded automatically in some cases, if you need to load it manually, run the following command. Note that this command may vary depending on your HPC environment, so adjust it as necessary.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>module load Apptainer/&lt;version&gt;
</code></pre></div></div>

<h3 id="2-download-the-pytorch-ngc-container">2. Download the PyTorch NGC Container</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer pull docker://nvcr.io/nvidia/pytorch:&lt;version&gt;
</code></pre></div></div>

<p>Here, <code class="language-plaintext highlighter-rouge">&lt;version&gt;</code> specifies the PyTorch NGC container tag you want to use. You can check the available tags on the <a href="https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags">NGC Catalog page</a>. For example, to use the 25.04-py3 tag, run the following command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer pull docker://nvcr.io/nvidia/pytorch:25.04-py3
</code></pre></div></div>

<p>The converted image is saved as <code class="language-plaintext highlighter-rouge">pytorch_&lt;version&gt;.sif</code>, where <code class="language-plaintext highlighter-rouge">&lt;version&gt;</code> is the tag you pulled. For example, running the command above generates a file named <code class="language-plaintext highlighter-rouge">pytorch_25.04-py3.sif</code>.</p>
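<p>As a rough illustration of this naming rule, the sketch below derives the expected file name from an image reference. The helper <code class="language-plaintext highlighter-rouge">sif_name</code> is hypothetical and not part of Apptainer; it only mirrors the example above, where pulling the <code class="language-plaintext highlighter-rouge">25.04-py3</code> tag yields <code class="language-plaintext highlighter-rouge">pytorch_25.04-py3.sif</code>.</p>

```shell
# Hypothetical helper (not an Apptainer command): guess the default file
# name that `apptainer pull` produces for a docker:// image reference.
# The body runs in a subshell so that `ref` does not leak into the caller.
sif_name() (
    ref="${1##*/}"                                           # keep "pytorch:25.04-py3"
    printf '%s.sif\n' "$(printf '%s' "$ref" | tr ':' '_')"   # ":" becomes "_"
)

sif_name docker://nvcr.io/nvidia/pytorch:25.04-py3
# prints: pytorch_25.04-py3.sif
```

<p>Knowing the name in advance can be handy when a script pulls the image and moves it in one go.</p>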

<h3 id="3-move-the-container-to-an-appropriate-directory">3. Move the Container to an Appropriate Directory</h3>

<p>By default, the downloaded container file is saved in the current directory. I recommend moving it to a project directory or a specific folder according to your preference. In my case, I move it to a <code class="language-plaintext highlighter-rouge">containers</code> folder within my project directory as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$PROJ_HOME</span>/containers
<span class="nv">$ </span><span class="nb">mv </span>pytorch_25.04-py3.sif <span class="nv">$PROJ_HOME</span>/containers
</code></pre></div></div>

<h3 id="4-run-the-container">4. Run the Container</h3>

<p>Next, execute the downloaded container. Specify the <code class="language-plaintext highlighter-rouge">--nv</code> option to enable GPU support. Also, use the <code class="language-plaintext highlighter-rouge">--bind</code> option to mount your project directory inside the container. This allows you to access project files from within the container.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    /bin/bash
</code></pre></div></div>

<p>In some HPC environments, home directories and Lustre-based file storage might be automatically mounted. In such cases, the <code class="language-plaintext highlighter-rouge">--bind</code> option is not necessary. Refer to your HPC documentation to confirm whether this is the case.</p>

<h3 id="5-install-additional-required-libraries">5. Install Additional Required Libraries</h3>

<p>PyTorch NGC containers come with a basic PyTorch environment, but if you want to use libraries like Transformers or others, you will need to install them additionally. I recommend using a virtual environment for these extra installations. Here is an example of installing the Transformers library using <code class="language-plaintext highlighter-rouge">pip</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python3 <span class="nt">-m</span> venv <span class="nt">--system-site-packages</span> <span class="nv">$PROJ_HOME</span>/envs/test
<span class="nv">$ </span><span class="nb">source</span> <span class="nv">$PROJ_HOME</span>/envs/test/bin/activate
<span class="nv">$ </span><span class="nb">unset </span>PIP_CONSTRAINT
<span class="nv">$ </span>pip <span class="nb">install </span>transformers
</code></pre></div></div>

<p>This command creates a virtual environment in <code class="language-plaintext highlighter-rouge">envs/test</code> within your project directory and installs the Transformers library there. The <code class="language-plaintext highlighter-rouge">--system-site-packages</code> option allows you to use system packages from within the container.</p>

<p>Unsetting the <code class="language-plaintext highlighter-rouge">PIP_CONSTRAINT</code> environment variable makes <code class="language-plaintext highlighter-rouge">pip</code> ignore the version lock that the PyTorch NGC container applies, for the current shell session only. This is the key difference from the Docker workflow: the lock itself lives in <code class="language-plaintext highlighter-rouge">/etc/pip/constraint.txt</code>, but under Apptainer that file may not be editable (for example, because the image is mounted read-only), so unsetting the variable is the approach used here.</p>
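<p>If you would like to see what is being overridden before you drop it, a small check along the following lines may help. This is only a sketch: it assumes nothing beyond the <code class="language-plaintext highlighter-rouge">PIP_CONSTRAINT</code> variable described above, and it changes nothing inside the image.</p>

```shell
# Report the constraint file pip is currently pointed at, if any, then
# clear the variable for this shell session only.
if [ -n "${PIP_CONSTRAINT:-}" ]; then
    echo "PIP_CONSTRAINT is set to: ${PIP_CONSTRAINT}"
    unset PIP_CONSTRAINT
else
    echo "PIP_CONSTRAINT is not set; nothing to do"
fi
```

<p>Because the variable is only cleared in the running shell, the check needs to be repeated whenever you start a new shell in the container and want to install more packages.</p>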

<p>Once the installation is complete, exit the container.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">exit</span>
</code></pre></div></div>

<h3 id="6-use-the-container-for-your-tasks">6. Use the Container for Your Tasks</h3>

<h4 id="for-interactive-jobs">For Interactive Jobs</h4>

<p>When running the container for an interactive job, you can use the same command shown in step 4:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    /bin/bash
</code></pre></div></div>

<p>If you need to set library environment variables within the container, refer to the following example. However, paths may vary depending on your environment. For the CUDA version used by the container, refer to the <a href="https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html">official support matrix</a>. For 25.04-py3, CUDA 12.9 is used, so set it as follows:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">CUDA_HOME</span><span class="o">=</span>/usr/local/cuda-12.9
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">TMPDIR</span><span class="o">=</span>/tmp
</code></pre></div></div>
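<p>Since the CUDA path differs between container releases, one option is to probe for the directory instead of hard-coding it. The sketch below is an assumption-laden variant of the exports above: the candidate list (the versioned directory first, then an unversioned <code class="language-plaintext highlighter-rouge">/usr/local/cuda</code>) is a guess about the image layout and should be adapted to your container.</p>

```shell
# Assumption: the toolkit lives under /usr/local, either as a versioned
# directory or as /usr/local/cuda. Adjust the candidates for your image.
for candidate in /usr/local/cuda-12.9 /usr/local/cuda; do
    if [ -d "$candidate" ]; then
        export CUDA_HOME="$candidate"
        break
    fi
done

if [ -n "${CUDA_HOME:-}" ]; then
    # Prepend lib64 without leaving a trailing ":" when the variable was empty.
    export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    echo "CUDA_HOME=$CUDA_HOME"
else
    echo "No CUDA directory found; CUDA_HOME left unset" >&2
fi
```

<p>The <code class="language-plaintext highlighter-rouge">${LD_LIBRARY_PATH:+...}</code> expansion only adds the separator when the variable already has a value, which avoids an empty entry in the search path.</p>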

<h4 id="for-job-scripts">For Job Scripts</h4>

<p>When running the container using a job script, use the <code class="language-plaintext highlighter-rouge">apptainer exec</code> command as shown below. Here’s an example of a Slurm job script:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH --job-name=pytorch_job</span>
<span class="c">#SBATCH --output=pytorch_job.out</span>
<span class="c">#SBATCH --error=pytorch_job.err</span>
<span class="c">#SBATCH --ntasks=1</span>
<span class="c">#SBATCH --cpus-per-task=4</span>
<span class="c">#SBATCH --gres=gpu:1</span>
<span class="c">#SBATCH --time=01:00:00</span>
<span class="c">#SBATCH --partition=gpu</span>

module load Apptainer/&lt;version&gt;
<span class="nv">PROJ_HOME</span><span class="o">=</span>/path/to/your/project
apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    python3 <span class="nv">$PROJ_HOME</span>/scripts/train.py
</code></pre></div></div>

<p>If you need to specify many command-line arguments, I recommend separating the shell script into an execution script (internal) and a Slurm job script (wrapper). Here is an example:</p>

<h5 id="execution-script">Execution Script</h5>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>

<span class="nb">source</span> <span class="nv">$PROJ_HOME</span>/envs/test/bin/activate

<span class="nb">export </span><span class="nv">CUDA_HOME</span><span class="o">=</span>/usr/local/cuda-12.9
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span><span class="nv">$CUDA_HOME</span>/lib64:<span class="nv">$LD_LIBRARY_PATH</span>
<span class="nb">export </span><span class="nv">TMPDIR</span><span class="o">=</span>/tmp

python3 <span class="nv">$PROJ_HOME</span>/scripts/train.py <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<h5 id="slurm-job-script">Slurm Job Script</h5>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c">#SBATCH --job-name=pytorch_job</span>
<span class="c">#SBATCH --output=pytorch_job.out</span>
<span class="c">#SBATCH --error=pytorch_job.err</span>
<span class="c">#SBATCH --ntasks=1</span>
<span class="c">#SBATCH --cpus-per-task=4</span>
<span class="c">#SBATCH --gres=gpu:1</span>
<span class="c">#SBATCH --time=01:00:00</span>
<span class="c">#SBATCH --partition=gpu</span>

module load Apptainer/&lt;version&gt;
<span class="nb">export </span><span class="nv">PROJ_HOME</span><span class="o">=</span>/path/to/your/project
apptainer <span class="nb">exec</span> <span class="se">\</span>
    <span class="nt">--bind</span> <span class="nv">$PROJ_HOME</span>:<span class="nv">$PROJ_HOME</span> <span class="se">\</span>
    <span class="nt">--nv</span> <span class="nv">$PROJ_HOME</span>/containers/pytorch_25.04-py3.sif <span class="se">\</span>
    <span class="nv">$PROJ_HOME</span>/scripts/run.sh <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
</code></pre></div></div>

<p>By doing this, you can call the execution script from the Slurm job script to run your Python script.</p>
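<p>The part that makes this split work is <code class="language-plaintext highlighter-rouge">"$@"</code>, which forwards every argument unchanged, including arguments that contain spaces. The toy example below shows the mechanism in isolation; the functions <code class="language-plaintext highlighter-rouge">inner</code> and <code class="language-plaintext highlighter-rouge">wrapper</code> are hypothetical stand-ins for the execution script and the job script, and the flags are made up.</p>

```shell
# Stand-ins for the execution script (inner) and the job script (wrapper).
inner()   { printf '[%s]\n' "$@"; }   # print each received argument on its own line
wrapper() { inner "$@"; }             # forward all arguments untouched

wrapper --lr 1e-4 --run-name "my first run"
# prints:
# [--lr]
# [1e-4]
# [--run-name]
# [my first run]
```

<p>With a real job script, arguments placed after the script name on the <code class="language-plaintext highlighter-rouge">sbatch</code> command line are passed to the script, so a call such as <code class="language-plaintext highlighter-rouge">sbatch job.sh --lr 1e-4</code> (where <code class="language-plaintext highlighter-rouge">job.sh</code> is a placeholder name) should reach <code class="language-plaintext highlighter-rouge">train.py</code> the same way.</p>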

<h2 id="conclusion">Conclusion</h2>

<p>This post explained how to use PyTorch NGC containers with Apptainer.</p>

<p>PyTorch NGC containers come with most of the necessary libraries like CUDA pre-installed, allowing you to quickly set up an experimental environment. If you have not used them before, give them a try!</p>]]></content><author><name></name></author><category term="english" /><category term="Tips" /><category term="Transformers" /><summary type="html"><![CDATA[Tips for using PyTorch NGC containers with Apptainer.]]></summary></entry><entry xml:lang="en_US"><title type="html">Resubmission does not work as intended in ARR</title><link href="https://gucci-j.github.io/post/en/arr-resubmission/" rel="alternate" type="text/html" title="Resubmission does not work as intended in ARR" /><published>2025-06-02T00:00:00+00:00</published><updated>2025-06-02T10:00:00+00:00</updated><id>https://gucci-j.github.io/post/en/arr-resubmission</id><content type="html" xml:base="https://gucci-j.github.io/post/en/arr-resubmission/"><![CDATA[<p><strong>DISCLAIMER</strong>: <strong>This post expresses my personal opinion and experience with ACL Rolling Review (ARR). It does not represent the views of any organisation I am affiliated with.</strong></p>

<p>This post is a critique of the resubmission feature in ACL Rolling Review (ARR), which is supposed to be a core feature of the system. In particular, I focus on what happens when authors resubmit a paper and request the same set of reviewers. I hope this post raises awareness of the issues with the resubmission process in ARR and increases the number of reviewers who follow the guidelines when reviewing resubmissions.</p>

<h2 id="how-resubmission-should-work-in-arr">How resubmission <em>should</em> work in ARR</h2>
<p>In ARR, when you resubmit a paper, you can request the same set of reviewers as before. This is a feature that is supposed to make the process smoother for authors who have already received feedback from those reviewers and have made changes based on that feedback.</p>

<p>The <a href="https://aclrollingreview.org/reviewerguidelines">ARR Reviewer Guidelines</a> state that:</p>
<blockquote>
  <ul>
    <li>“You will need to reread the paper and comment on to what extent the authors successfully responded to the previous reviews. Please indicate that you read their revision notes and say whether you feel weaknesses were adequately addressed or not.”</li>
    <li>“In your review, you should refrain from raising new issues that were not discussed in the first round of reviews — otherwise, if resubmissions get a new round of weaknesses identified, then authors will never be able to move forward with commitment.”</li>
  </ul>
</blockquote>

<p>Specifically for substitute reviewers, the guidelines state:</p>
<blockquote>
  <ul>
    <li>“you should then edit your draft review, where (i) you explain the reasons why the revisions did not adequately address previously identified weaknesses; (ii) you explain the reasons why newly raised issues represent critical issues of soundness; and (iii) you reframe any other newly raised issues as suggestions for improvement.”</li>
  </ul>
</blockquote>

<p>Great. So the idea seems to be that reviewers should focus on the changes made by the authors in response to the previous reviews, and not raise new issues that were not discussed in the first round of reviews. Furthermore, another important point is that even substitute reviewers should not raise new issues that were not discussed in the first round of reviews, unless they are critical issues of soundness.</p>

<h2 id="how-resubmission-actually-works-in-arr">How resubmission actually works in ARR</h2>
<p>Unfortunately, the intended process for resubmissions in ARR often diverges from reality. In practice, some (many?) reviewers do not follow these guidelines, raise new issues that were not discussed in the first round of reviews, and assign low scores to the resubmitted paper, even if the authors have made significant changes based on the previous reviews.</p>

<p>ARR does offer a good feature to combat this kind of behaviour: <a href="https://aclrollingreview.org/authors#step2.2">Review issue reporting</a>. With this feature, authors can report any issues with the reviews they receive, including issues with resubmissions. A meta-reviewer then looks into the reported issues and is expected to take them into account when making their recommendation.</p>

<p>However, a significant hurdle remains: the complex nature of conference acceptance. Even if a meta-reviewer acknowledges a reviewer’s deviation from guidelines and provides a constructive evaluation, a paper can still be, or is highly likely to be, rejected because of the initial low scores from reviewers.</p>

<p>One typical example is a paper I submitted to ARR and then resubmitted with the same set of reviewers. All the reviewers raised new issues that were not discussed in the first round of reviews, and one of them wrote a review that was little more than a copy of the previous one, without taking the changes into account. I reported these issues, and the meta-reviewer agreed that the guidelines had not been followed (even providing a constructive evaluation with no “Summary of suggested revisions”, which implies the revisions were sufficient). The paper was rejected anyway, most likely because of the low scores the reviewers assigned.</p>

<p>The absence of a “Summary of suggested revisions” typically signifies that authors have adequately addressed previous concerns. Yet, in this case, the paper was still rejected, leaving us without a clear path forward, effectively leaving the paper “dead in the water” within the ARR system.</p>

<h2 id="discussion">Discussion</h2>
<p>While I understand that no system is perfect and that ARR keeps improving, I believe that the resubmission process in ARR needs significant improvement. The current system does not effectively address the issues that arise when reviewers do not follow the guidelines, and it can lead to unfair outcomes for authors who have made efforts to improve their work based on previous feedback.</p>

<p>While it might not be a problem with the ARR system per se, I believe that the resubmission process should be more robust and better enforced. If this is not possible, then I would think that the resubmission feature should be removed altogether (i.e. no sticky reviews), as it does not work as intended and can lead to frustration for authors.</p>

<p>In this sense, I also agree that “Experiment with Making Findings Decisions in ARR,” which is proposed in the <a href="https://www.aclweb.org/adminwiki/images/9/9e/COPR-5-cycle-report-to-publish.pdf">ACL Peer Review Report</a>, could be another good idea to reduce this kind of problem.</p>]]></content><author><name></name></author><category term="english" /><category term="Opinion" /><summary type="html"><![CDATA[A critique of the resubmission feature in ARR.]]></summary></entry><entry xml:lang="ja_JP"><title type="html">SentencePieceベースのトークナイザの学習設定を抽出する</title><link href="https://gucci-j.github.io/post/vocab-expansion-gemma2/" rel="alternate" type="text/html" title="SentencePieceベースのトークナイザの学習設定を抽出する" /><published>2024-12-26T00:00:00+00:00</published><updated>2024-12-27T11:00:00+00:00</updated><id>https://gucci-j.github.io/post/vocab-expansion-gemma2</id><content type="html" xml:base="https://gucci-j.github.io/post/vocab-expansion-gemma2/"><![CDATA[<h2 id="はじめに">はじめに</h2>

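<p>In vocabulary expansion for language adaptation, it is common to first train an auxiliary tokenizer on target-language data.</p>

<p>For the expansion to be effective, it is important to train the auxiliary tokenizer with the same (or similar) settings as the source tokenizer being expanded.</p>

<p>However, neither the SentencePiece documentation nor prior papers clearly explain how to extract the training settings of a tokenizer.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>This post therefore briefly explains how to extract the training settings of a tokenizer and apply them to an auxiliary tokenizer.</p>

<p>This post is based on a <a href="https://github.com/gucci-j/chat-cve/blob/main/preprocessing/src/training/train_tokenizer_gemma2_sp.py">vocabulary expansion example for Gemma 2</a>.</p>

<h2 id="sentencepieceモデルの学習設定の抽出">Extracting the Training Settings of a SentencePiece Model</h2>

<p>First, load the model file of the source tokenizer.</p>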

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sentencepiece</span> <span class="k">as</span> <span class="n">spm</span>
<span class="n">sp_model</span> <span class="o">=</span> <span class="n">spm</span><span class="p">.</span><span class="n">SentencePieceProcessor</span><span class="p">()</span>
<span class="n">sp_model</span><span class="p">.</span><span class="n">load</span><span class="p">(</span>
    <span class="s">"/path/to/models--google--gemma-2-9b/snapshots/33c193028431c2fde6c6e51f29e6f17b60cbfac6/tokenizer.model"</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here, we assume that the source tokeniser is <a href="https://huggingface.co/google/gemma-2-9b">Gemma 2</a> and use its <a href="https://huggingface.co/google/gemma-2-9b/blob/main/tokenizer.model"><code class="language-plaintext highlighter-rouge">tokenizer.model</code></a>.</p>

<p>Next, we load the SentencePiece model proto and obtain its parameters.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentencepiece</span> <span class="kn">import</span> <span class="n">sentencepiece_model_pb2</span> <span class="k">as</span> <span class="n">sp_pb2_model</span>
<span class="n">spm</span> <span class="o">=</span> <span class="n">sp_pb2_model</span><span class="p">.</span><span class="n">ModelProto</span><span class="p">()</span>
<span class="n">spm</span><span class="p">.</span><span class="n">ParseFromString</span><span class="p">(</span><span class="n">sp_model</span><span class="p">.</span><span class="n">serialized_model_proto</span><span class="p">())</span>
</code></pre></div></div>

<p>Finally, we extract the training settings as follows.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">google.protobuf.json_format</span> <span class="kn">import</span> <span class="n">MessageToDict</span>
<span class="n">training_configs</span> <span class="o">=</span> <span class="n">MessageToDict</span><span class="p">(</span><span class="n">spm</span><span class="p">)[</span><span class="s">'trainerSpec'</span><span class="p">]</span>
</code></pre></div></div>

<p>Here, <code class="language-plaintext highlighter-rouge">training_configs</code> is a dictionary that contains the training settings of the source tokeniser.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="s">'modelPrefix'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
 <span class="s">'modelType'</span><span class="p">:</span> <span class="s">'BPE'</span><span class="p">,</span>
 <span class="s">'vocabSize'</span><span class="p">:</span> <span class="mi">256000</span><span class="p">,</span>
 <span class="s">'selfTestSampleSize'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'inputFormat'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
 <span class="s">'characterCoverage'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
 <span class="s">'inputSentenceSize'</span><span class="p">:</span> <span class="s">'0'</span><span class="p">,</span>
 <span class="s">'seedSentencepieceSize'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'shrinkingFactor'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
 <span class="s">'numThreads'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'numSubIterations'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'maxSentenceLength'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'shuffleInputSentence'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'maxSentencepieceLength'</span><span class="p">:</span> <span class="mi">16</span><span class="p">,</span>
 <span class="s">'splitByUnicodeScript'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'splitByWhitespace'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'splitByNumber'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'treatWhitespaceAsSuffix'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
 <span class="s">'splitDigits'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'allowWhitespaceOnlyPieces'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'userDefinedSymbols'</span><span class="p">:</span> <span class="p">[</span><span class="s">'&lt;mask&gt;'</span><span class="p">,</span>
  <span class="s">'&lt;2mass&gt;'</span><span class="p">,</span>
  <span class="s">'[@BOS@]'</span><span class="p">,</span>
  <span class="p">...</span>
  <span class="s">'&lt;/sup&gt;'</span><span class="p">,</span>
  <span class="s">'&lt;/code&gt;'</span><span class="p">],</span>
 <span class="s">'vocabularyOutputPieceScore'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'hardVocabLimit'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'useAllVocab'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
 <span class="s">'byteFallback'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'requiredChars'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
 <span class="s">'unkId'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
 <span class="s">'bosId'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
 <span class="s">'eosId'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
 <span class="s">'padId'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'unkSurface'</span><span class="p">:</span> <span class="s">' ⁇ '</span><span class="p">,</span>
 <span class="s">'unkPiece'</span><span class="p">:</span> <span class="s">'&lt;unk&gt;'</span><span class="p">,</span>
 <span class="s">'bosPiece'</span><span class="p">:</span> <span class="s">'&lt;bos&gt;'</span><span class="p">,</span>
 <span class="s">'eosPiece'</span><span class="p">:</span> <span class="s">'&lt;eos&gt;'</span><span class="p">,</span>
 <span class="s">'padPiece'</span><span class="p">:</span> <span class="s">'&lt;pad&gt;'</span><span class="p">,</span>
 <span class="s">'trainExtremelyLargeCorpus'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'enableDifferentialPrivacy'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
 <span class="s">'differentialPrivacyNoiseLevel'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
 <span class="s">'differentialPrivacyClippingThreshold'</span><span class="p">:</span> <span class="s">'0'</span><span class="p">}</span>
</code></pre></div></div>
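
<p>In the output above, <code class="language-plaintext highlighter-rouge">inputSentenceSize</code> and <code class="language-plaintext highlighter-rouge">differentialPrivacyClippingThreshold</code> appear as strings, because the protobuf JSON mapping serialises 64-bit integers as strings. The following is a minimal sketch of casting them back to integers before reuse. The helper name <code class="language-plaintext highlighter-rouge">restore_int64</code> and the key set are assumptions made for illustration, based only on the output shown here.</p>

```python
# Keys that hold 64-bit integers in the proto and therefore arrive as strings.
# This set is an assumption derived from the dictionary printed above.
INT64_KEYS = {"inputSentenceSize", "differentialPrivacyClippingThreshold"}


def restore_int64(configs: dict) -> dict:
    """Return a copy of configs with the known 64-bit fields cast back to int."""
    restored = dict(configs)
    for key in INT64_KEYS:
        if key in restored:
            restored[key] = int(restored[key])
    return restored


# Hypothetical excerpt of the extracted settings.
example = {"modelType": "BPE", "vocabSize": 256000, "inputSentenceSize": "0"}
print(restore_int64(example))
```

<p>The original dictionary is left untouched, so the raw extracted values stay available for inspection.</p>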

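<p>With the training settings above, we can train an auxiliary tokeniser for Gemma 2, as shown in <a href="https://github.com/gucci-j/chat-cve/blob/main/preprocessing/src/training/train_tokenizer_gemma2_sp.py">the Gemma 2 vocabulary expansion example mentioned earlier</a>.</p>

<h2 id="まとめ">Summary</h2>

<p>In this post, I briefly described how to extract the training settings of the source tokeniser and apply them to an auxiliary tokeniser. I hope you find it useful.</p>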
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
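      <p>As far as I remember, this is officially discouraged, so the lack of documentation may be unsurprising. Nevertheless, given how widely vocabulary expansion is used, I thought it would be useful to leave this as a note, which led me to write this post. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>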
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>For details, see the <a href="https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto">SentencePiece repository</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="Tips" /><category term="Transformers" /><summary type="html"><![CDATA[Gemma2等を語彙拡張したい方向けのメモです。]]></summary></entry><entry xml:lang="en_US"><title type="html">Extracting SentencePiece tokeniser training settings for vocabulary expansion</title><link href="https://gucci-j.github.io/post/en/vocab-expansion-gemma2/" rel="alternate" type="text/html" title="Extracting SentencePiece tokeniser training settings for vocabulary expansion" /><published>2024-12-26T00:00:00+00:00</published><updated>2024-12-27T11:00:00+00:00</updated><id>https://gucci-j.github.io/post/en/vocab-expansion-gemma2</id><content type="html" xml:base="https://gucci-j.github.io/post/en/vocab-expansion-gemma2/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>In vocabulary expansion for target language adaptation, we usually first train an auxiliary tokeniser on target language data. For effective expansion, it is important to use the same training settings as the source tokeniser. However, the SentencePiece documentation and papers do not always clearly explain how to do this. In this post, I briefly describe how to extract the training settings of the source tokeniser and apply them to the auxiliary tokeniser.</p>

<p>This post is based on <a href="https://github.com/gucci-j/chat-cve/blob/main/preprocessing/src/training/train_tokenizer_gemma2_sp.py">my script for vocabulary expansion with Gemma 2</a>.</p>

<h2 id="extracting-sentencepiece-training-settings">Extracting SentencePiece training settings</h2>

<p>To extract the training settings of the source tokeniser, we first load the tokeniser model file as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sentencepiece</span> <span class="k">as</span> <span class="n">spm</span>
<span class="n">sp_model</span> <span class="o">=</span> <span class="n">spm</span><span class="p">.</span><span class="n">SentencePieceProcessor</span><span class="p">()</span>
<span class="n">sp_model</span><span class="p">.</span><span class="n">load</span><span class="p">(</span>
    <span class="s">"/path/to/models--google--gemma-2-9b/snapshots/33c193028431c2fde6c6e51f29e6f17b60cbfac6/tokenizer.model"</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here, we assume that the source tokeniser is from <a href="https://huggingface.co/google/gemma-2-9b">Gemma 2</a> and we use its <a href="https://huggingface.co/google/gemma-2-9b/blob/main/tokenizer.model"><code class="language-plaintext highlighter-rouge">tokenizer.model</code></a>.</p>

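<p>Next, we load the SentencePiece model proto, which stores the model's parameters, including the trainer specification. Note that the snippet rebinds the name <code class="language-plaintext highlighter-rouge">spm</code> from the <code class="language-plaintext highlighter-rouge">sentencepiece</code> module to the parsed proto.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>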

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sentencepiece</span> <span class="kn">import</span> <span class="n">sentencepiece_model_pb2</span> <span class="k">as</span> <span class="n">sp_pb2_model</span>
<span class="n">spm</span> <span class="o">=</span> <span class="n">sp_pb2_model</span><span class="p">.</span><span class="n">ModelProto</span><span class="p">()</span>
<span class="n">spm</span><span class="p">.</span><span class="n">ParseFromString</span><span class="p">(</span><span class="n">sp_model</span><span class="p">.</span><span class="n">serialized_model_proto</span><span class="p">())</span>
</code></pre></div></div>

<p>Finally, we can extract the training settings as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">google.protobuf.json_format</span> <span class="kn">import</span> <span class="n">MessageToDict</span>
<span class="n">training_configs</span> <span class="o">=</span> <span class="n">MessageToDict</span><span class="p">(</span><span class="n">spm</span><span class="p">)[</span><span class="s">'trainerSpec'</span><span class="p">]</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">training_configs</code> variable is a dictionary that contains the training settings of the source tokeniser (as shown below).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="s">'modelPrefix'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
 <span class="s">'modelType'</span><span class="p">:</span> <span class="s">'BPE'</span><span class="p">,</span>
 <span class="s">'vocabSize'</span><span class="p">:</span> <span class="mi">256000</span><span class="p">,</span>
 <span class="s">'selfTestSampleSize'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'inputFormat'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
 <span class="s">'characterCoverage'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
 <span class="s">'inputSentenceSize'</span><span class="p">:</span> <span class="s">'0'</span><span class="p">,</span>
 <span class="s">'seedSentencepieceSize'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'shrinkingFactor'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
 <span class="s">'numThreads'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'numSubIterations'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'maxSentenceLength'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'shuffleInputSentence'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'maxSentencepieceLength'</span><span class="p">:</span> <span class="mi">16</span><span class="p">,</span>
 <span class="s">'splitByUnicodeScript'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'splitByWhitespace'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'splitByNumber'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'treatWhitespaceAsSuffix'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
 <span class="s">'splitDigits'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'allowWhitespaceOnlyPieces'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'userDefinedSymbols'</span><span class="p">:</span> <span class="p">[</span><span class="s">'&lt;mask&gt;'</span><span class="p">,</span>
  <span class="s">'&lt;2mass&gt;'</span><span class="p">,</span>
  <span class="s">'[@BOS@]'</span><span class="p">,</span>
  <span class="p">...</span>
  <span class="s">'&lt;/sup&gt;'</span><span class="p">,</span>
  <span class="s">'&lt;/code&gt;'</span><span class="p">],</span>
 <span class="s">'vocabularyOutputPieceScore'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'hardVocabLimit'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'useAllVocab'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
 <span class="s">'byteFallback'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'requiredChars'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
 <span class="s">'unkId'</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span>
 <span class="s">'bosId'</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
 <span class="s">'eosId'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
 <span class="s">'padId'</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span>
 <span class="s">'unkSurface'</span><span class="p">:</span> <span class="s">' ⁇ '</span><span class="p">,</span>
 <span class="s">'unkPiece'</span><span class="p">:</span> <span class="s">'&lt;unk&gt;'</span><span class="p">,</span>
 <span class="s">'bosPiece'</span><span class="p">:</span> <span class="s">'&lt;bos&gt;'</span><span class="p">,</span>
 <span class="s">'eosPiece'</span><span class="p">:</span> <span class="s">'&lt;eos&gt;'</span><span class="p">,</span>
 <span class="s">'padPiece'</span><span class="p">:</span> <span class="s">'&lt;pad&gt;'</span><span class="p">,</span>
 <span class="s">'trainExtremelyLargeCorpus'</span><span class="p">:</span> <span class="bp">True</span><span class="p">,</span>
 <span class="s">'enableDifferentialPrivacy'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span>
 <span class="s">'differentialPrivacyNoiseLevel'</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
 <span class="s">'differentialPrivacyClippingThreshold'</span><span class="p">:</span> <span class="s">'0'</span><span class="p">}</span>
</code></pre></div></div>

<p>We can now use these settings to train the auxiliary tokeniser as in <a href="https://github.com/gucci-j/chat-cve/blob/main/preprocessing/src/training/train_tokenizer_gemma2_sp.py">my script for vocabulary expansion with Gemma 2</a>.</p>
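
<p>One practical detail: <code class="language-plaintext highlighter-rouge">MessageToDict</code> returns lowerCamelCase keys by default, whereas <code class="language-plaintext highlighter-rouge">SentencePieceTrainer.train</code> takes snake_case keyword arguments such as <code class="language-plaintext highlighter-rouge">vocab_size</code>. The sketch below renames the keys. The helper names are hypothetical and are not taken from the linked script, and the values themselves may still need adjusting before training. Alternatively, passing <code class="language-plaintext highlighter-rouge">preserving_proto_field_name=True</code> to <code class="language-plaintext highlighter-rouge">MessageToDict</code> keeps the original field names.</p>

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a lowerCamelCase key such as 'vocabSize' to 'vocab_size'."""
    return re.sub(r"[A-Z]", lambda m: "_" + m.group(0).lower(), name)


def snake_case_keys(configs: dict) -> dict:
    """Return a copy of configs whose keys are converted to snake_case."""
    return {to_snake_case(key): value for key, value in configs.items()}


# Hypothetical excerpt of the extracted settings.
example = {"vocabSize": 256000, "byteFallback": True, "splitDigits": True}
print(snake_case_keys(example))
```

<p>Only the keys are changed here. Which extracted values to keep, override, or drop is a separate decision that depends on the target corpus.</p>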

<h2 id="conclusion">Conclusion</h2>

<p>In this post, I briefly described how to extract the training settings of the source tokeniser and apply them to the auxiliary tokeniser for vocabulary expansion. I hope this helps!</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>For more details, see the <a href="https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto">SentencePiece repository</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="english" /><category term="Tips" /><category term="Transformers" /><summary type="html"><![CDATA[How to extract SentencePiece tokeniser training settings for vocabulary expansion.]]></summary></entry><entry xml:lang="ja_JP"><title type="html">一時的にMACアドレスを変更する</title><link href="https://gucci-j.github.io/post/temporal-mac-address-change/" rel="alternate" type="text/html" title="一時的にMACアドレスを変更する" /><published>2024-10-31T00:00:00+00:00</published><updated>2024-10-31T06:42:00+00:00</updated><id>https://gucci-j.github.io/post/temporal-mac-address-change</id><content type="html" xml:base="https://gucci-j.github.io/post/temporal-mac-address-change/"><![CDATA[<p>どのような場面かは人によるかと思いますが、macOS上で一時的にMACアドレスを変更したい場合の方法を備忘録として残しておきます。</p>

<h2 id="前提">Prerequisites</h2>
<ul>
  <li>macOS Apple Silicon (Sonoma 14.7)</li>
</ul>

<h2 id="手順">Steps</h2>
<p>The following three steps change the apparent MAC address. Note that the address reverts to the original after a reboot.</p>

<h3 id="1-wi-fiをオンにする">1. Turn Wi-Fi on</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>networksetup <span class="nt">-setairportpower</span> en0 on
</code></pre></div></div>
<blockquote>
  <p>Replace <code class="language-plaintext highlighter-rouge">en0</code> with the interface name in your environment.</p>
</blockquote>

<h3 id="2-macアドレスを変更する">2. Change the MAC address</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>ifconfig en0 ether 00:11:22:33:44:55
</code></pre></div></div>
<blockquote>
  <p>Replace <code class="language-plaintext highlighter-rouge">00:11:22:33:44:55</code> with the MAC address you want to use.</p>
</blockquote>
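
<p>If you do not have a particular address in mind, one option is to generate a random one. The following is a minimal sketch in Python. The function name <code class="language-plaintext highlighter-rouge">random_local_mac</code> is hypothetical. It sets the locally administered bit and clears the multicast bit of the first octet, which is the usual convention for self-assigned unicast addresses.</p>

```python
import random


def random_local_mac() -> str:
    """Return a random unicast, locally administered MAC address."""
    octets = [random.randrange(256) for _ in range(6)]
    # Set the locally administered bit (0x02) and clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{octet:02x}" for octet in octets)


print(random_local_mac())
```

<p>The printed value can then be passed to the <code class="language-plaintext highlighter-rouge">ifconfig</code> command above in place of the placeholder address.</p>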

<h3 id="3-ネットワークを検出する">3. Detect the network</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>networksetup <span class="nt">-detectnewhardware</span>
</code></pre></div></div>

<h2 id="参考">References</h2>
<ul>
  <li><a href="https://support.apple.com/en-gb/guide/remote-desktop/apdd0c5a2d5/mac">https://support.apple.com/en-gb/guide/remote-desktop/apdd0c5a2d5/mac</a></li>
</ul>]]></content><author><name></name></author><category term="Tips" /><summary type="html"><![CDATA[macOSで一時的にMACアドレスを変更するTipsです。]]></summary></entry><entry xml:lang="en_US"><title type="html">Vocabulary expansion for non-SentencePiece based BPE tokeniser</title><link href="https://gucci-j.github.io/post/en/vocab-expansion/" rel="alternate" type="text/html" title="Vocabulary expansion for non-SentencePiece based BPE tokeniser" /><published>2024-06-24T00:00:00+00:00</published><updated>2024-06-24T00:00:00+00:00</updated><id>https://gucci-j.github.io/post/en/vocab-expansion</id><content type="html" xml:base="https://gucci-j.github.io/post/en/vocab-expansion/"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Since the advent of LLaMA2, continued pre-training, i.e. additional training on target language data, has been widely used to build LLMs for specific languages. It is often combined with vocabulary expansion, which adds target language tokens to the tokeniser. The main effect of vocabulary expansion is to reduce overfragmentation<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, resulting in better inference efficiency.</p>

<p>The idea of vocabulary expansion itself is quite simple, but the way it is implemented depends on how the target tokeniser is implemented. In this post, I share the procedure for expanding the vocabulary of a non-SentencePiece based BPE tokeniser.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<h2 id="background">Background</h2>
<p>The main difference between SentencePiece-based (e.g. LLaMA2, Mistral) and non-SentencePiece based (e.g. LLaMA3, OLMo) BPE tokenisers is whether they are byte-level or not.</p>

<p>The former, SentencePiece-based BPE, often uses the byte-fallback option, which decomposes characters missing from the vocabulary into UTF-8 byte tokens and thus prevents UNK tokens.</p>
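
<p>As a rough illustration of byte fallback, the sketch below shows the byte pieces that a single out-of-vocabulary character would be split into. It mimics only the <code class="language-plaintext highlighter-rouge">&lt;0xXX&gt;</code> piece format used by SentencePiece and does not call the library. The function name is hypothetical.</p>

```python
def byte_fallback_pieces(char: str) -> list:
    """Return the byte pieces that a character is split into under byte fallback."""
    return [f"<0x{byte:02X}>" for byte in char.encode("utf-8")]


# The Greek letter alpha takes two bytes in UTF-8.
print(byte_fallback_pieces("α"))  # ['<0xCE>', '<0xB1>']
```

<p>A character that is in the vocabulary is kept as a single token, so this path is taken only for unseen characters.</p>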

<p>On the other hand, recent non-SentencePiece based BPE tokenisers are typically based on byte-level BPE, which converts the input to UTF-8 encoded byte sequences before tokenisation. Since tokenisation is performed on byte sequences, no UNK tokens are generated.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
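
<p>To make the byte-level view concrete, here is a small sketch of the GPT-2 style byte-to-character table, which, to my understanding, is also what the <code class="language-plaintext highlighter-rouge">ByteLevel</code> pre-tokeniser in <code class="language-plaintext highlighter-rouge">tokenizers</code> uses. The function names are illustrative, and the code reimplements the mapping in plain Python instead of calling the library.</p>

```python
def bytes_to_unicode() -> dict:
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own code point.
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes (control characters, space, etc.) are shifted past U+00FF.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text: str) -> str:
    """Render text as the character sequence a byte-level BPE model operates on."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(to_byte_level("α"))   # Î±
print(to_byte_level(" a"))  # Ġa
```

<p>Since every possible byte has an entry in the table, any input string can be represented, which is why no UNK token is needed. It also explains why non-Latin tokens in a byte-level vocabulary look garbled when printed directly.</p>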

<p>Therefore, even if the same algorithm is used between the two, the pre- and post-processing of strings is slightly different. This difference can also be seen in the metadata of the <code class="language-plaintext highlighter-rouge">transformers</code> tokenisers, specifically the <code class="language-plaintext highlighter-rouge">pre_tokenizer</code> and <code class="language-plaintext highlighter-rouge">decoder</code> parts.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>

<span class="c1"># LLaMA2
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Llama-2-7b-hf"</span><span class="p">)</span>
<span class="n">tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"pre_tokenizer"</span><span class="p">])</span>
<span class="c1"># None
</span><span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"decoder"</span><span class="p">])</span>
<span class="c1"># {'type': 'Sequence', 'decoders': [{'type': 'Replace', 'pattern': {'String': '▁'}, 'content': ' '}, {'type': 'ByteFallback'}, {'type': 'Fuse'}, {'type': 'Strip', 'content': ' ', 'start': 1, 'stop': 0}]}
</span>
<span class="c1"># LLaMA3
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Meta-Llama-3-8B"</span><span class="p">)</span>
<span class="n">tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"pre_tokenizer"</span><span class="p">])</span>
<span class="c1"># {'type': 'Sequence', 'pretokenizers': [{'type': 'Split', 'pattern': {'Regex': "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"}, 'behavior': 'Isolated', 'invert': False}, {'type': 'ByteLevel', 'add_prefix_space': False, 'trim_offsets': True, 'use_regex': False}]}
</span><span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"decoder"</span><span class="p">])</span>
<span class="c1"># {'type': 'ByteLevel', 'add_prefix_space': True, 'trim_offsets': True, 'use_regex': True}
</span></code></pre></div></div>

<h2 id="implementation">Implementation</h2>
<p>Here, we expand the vocabulary of a source tokeniser <code class="language-plaintext highlighter-rouge">tokenizer</code> with the help of an auxiliary tokeniser <code class="language-plaintext highlighter-rouge">aux_tokenizer</code> trained on a target language.</p>

<p>We use Greek as an example, where the effect of vocabulary expansion is relatively easy to see.</p>

<h3 id="1-load-tokenisers-and-their-metadata">1. Load tokenisers and their metadata</h3>
<p>First, we load the source tokeniser and the auxiliary tokeniser, and get the source vocabulary together with the merge rules of both tokenisers.</p>

<p>In the example below, we use LLaMA3 as the source tokeniser and an auxiliary tokeniser trained on the Greek CC-100 subcorpus ($2^{20}$ sentences randomly sampled) with a vocabulary size of 50k tokens. The rest of the training settings are the same as LLaMA3.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">copy</span>

<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="kn">from</span> <span class="nn">tokenizers.models</span> <span class="kn">import</span> <span class="n">BPE</span>

<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Meta-Llama-3-8B"</span><span class="p">)</span>
<span class="n">vocab</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">get_vocab</span><span class="p">()</span>
<span class="n">tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="n">merges</span> <span class="o">=</span> <span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"model"</span><span class="p">][</span><span class="s">"merges"</span><span class="p">]</span>

<span class="n">aux_tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"atsuki-yamaguchi/cc100-el-50k"</span><span class="p">)</span>
<span class="n">aux_tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">aux_tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="n">aux_merges</span> <span class="o">=</span> <span class="n">aux_tokenizer_json</span><span class="p">[</span><span class="s">"model"</span><span class="p">][</span><span class="s">"merges"</span><span class="p">]</span>
</code></pre></div></div>

<h3 id="2-expand-vocabulary-and-merge-rules">2. Expand vocabulary and merge rules</h3>
<p>We then add new target tokens and the corresponding merge rules to the source tokeniser’s vocabulary and merge rule list. Here, we add up to 10k new tokens.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># merge the tokenizers
</span><span class="n">num_new_token</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">max_new_token</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">ret_vocab</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span>
<span class="n">ret_merges</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">old_merges</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">merges</span><span class="p">)</span>
<span class="k">for</span> <span class="n">merge</span> <span class="ow">in</span> <span class="n">aux_merges</span><span class="p">:</span>
    <span class="c1"># vocab
</span>    <span class="n">token_1</span><span class="p">,</span> <span class="n">token_2</span> <span class="o">=</span> <span class="n">merge</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)</span>
    <span class="n">token</span> <span class="o">=</span> <span class="n">token_1</span> <span class="o">+</span> <span class="n">token_2</span>
    <span class="k">if</span> <span class="n">num_new_token</span> <span class="o">&lt;</span> <span class="n">max_new_token</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">token_1</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span> <span class="c1"># both are new
</span>            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_1</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_2</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span> <span class="o">+</span> <span class="mi">1</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">2</span>
        <span class="k">elif</span> <span class="n">token_1</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span> <span class="c1"># new + existing
</span>            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_1</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">elif</span> <span class="n">token_1</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span> <span class="c1"># existing + new
</span>            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_2</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span> <span class="c1"># both are existing tokens
</span>            <span class="k">pass</span>
        <span class="k">if</span> <span class="n">token</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span>
            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="c1"># merge
</span>    <span class="k">if</span> <span class="n">merge</span> <span class="ow">in</span> <span class="n">merges</span><span class="p">:</span>
        <span class="n">old_merges</span><span class="p">.</span><span class="n">remove</span><span class="p">(</span><span class="n">merge</span><span class="p">)</span>
        <span class="n">ret_merges</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">merge</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_1</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span>
        <span class="n">ret_merges</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">merge</span><span class="p">)</span>
</code></pre></div></div>
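<p>To see what the loop does without downloading any tokeniser, here is a minimal sketch that runs the same logic on toy data. The function name <code class="language-plaintext highlighter-rouge">expand_vocab</code> and the toy vocabulary are hypothetical and only serve as illustration. The sketch also uses a set for the membership test, on the assumption that scanning a list of several hundred thousand merge rules for every auxiliary rule would be slow.</p>

```python
def expand_vocab(vocab, merges, aux_merges, max_new_token):
    """Toy version of the merging loop above, on plain Python containers."""
    ret_vocab = dict(vocab)
    source_merges = set(merges)  # set lookup instead of scanning a list
    ret_merges = []
    num_new_token = 0
    for merge in aux_merges:
        token_1, token_2 = merge.split(" ")
        token = token_1 + token_2
        # vocab: register unseen pieces while the budget is not used up
        if max_new_token > num_new_token:
            for piece in (token_1, token_2, token):
                if piece not in ret_vocab:
                    ret_vocab[piece] = len(vocab) + num_new_token
                    num_new_token += 1
        # merge: keep rules shared with the source, or whose pieces all exist
        if merge in source_merges:
            ret_merges.append(merge)
        elif all(p in ret_vocab for p in (token_1, token_2, token)):
            ret_merges.append(merge)
    kept = set(ret_merges)
    old_merges = [m for m in merges if m not in kept]
    return ret_vocab, ret_merges + old_merges


# Hypothetical toy data, not taken from any real tokeniser.
vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "ca": 4}
merges = ["a b", "c a"]
aux_merges = ["a b", "b c", "ab c", "bc a"]

new_vocab, new_merges = expand_vocab(vocab, merges, aux_merges, max_new_token=2)
print(new_vocab)
print(new_merges)
```

<p>With a budget of two new tokens, <code class="language-plaintext highlighter-rouge">bc</code> and <code class="language-plaintext highlighter-rouge">abc</code> are added, while the rule <code class="language-plaintext highlighter-rouge">bc a</code> is dropped because its merged token never entered the vocabulary.</p>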

<h3 id="3-retrain-bpe-tokeniser">3. Retrain BPE tokeniser</h3>
<p>Despite the heading, no training takes place in this step. We create a new BPE model from the expanded vocabulary and merge rules, and overwrite the model of the source tokeniser with it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># rebuild the BPE model; merge rules taken from the auxiliary tokeniser come first
</span><span class="n">merges</span> <span class="o">=</span> <span class="n">ret_merges</span> <span class="o">+</span> <span class="n">old_merges</span>
<span class="n">vocab</span> <span class="o">=</span> <span class="n">ret_vocab</span>
<span class="n">tokenizer</span><span class="p">.</span><span class="n">backend_tokenizer</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">BPE</span><span class="p">(</span>
    <span class="n">vocab</span><span class="o">=</span><span class="n">vocab</span><span class="p">,</span>
    <span class="n">merges</span><span class="o">=</span><span class="p">[(</span><span class="n">merge</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span> <span class="n">merge</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span> <span class="k">for</span> <span class="n">merge</span> <span class="ow">in</span> <span class="n">merges</span><span class="p">],</span>
    <span class="n">fuse_unk</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>
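<p>The order of the merge list matters, because BPE applies the rule that appears earliest in the list first. The following is a simplified pure-Python sketch of that behaviour, not the actual implementation in the <code class="language-plaintext highlighter-rouge">tokenizers</code> library; the function <code class="language-plaintext highlighter-rouge">apply_bpe</code> and the two toy rules are hypothetical.</p>

```python
def apply_bpe(word, merges):
    """Simplified greedy BPE: always apply the lowest-ranked adjacent pair."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no applicable merge rule is left
        merged, skip = [], False
        for left, right in zip(symbols, symbols[1:] + [None]):
            if skip:  # this symbol was consumed by the previous merge
                skip = False
            elif (left, right) == best:
                merged.append(left + right)
                skip = True
            else:
                merged.append(left)
        symbols = merged
    return symbols


aux_rule, source_rule = ("b", "c"), ("a", "b")
print(apply_bpe("abc", [aux_rule, source_rule]))  # ['a', 'bc']
print(apply_bpe("abc", [source_rule, aux_rule]))  # ['ab', 'c']
```

<p>Because <code class="language-plaintext highlighter-rouge">ret_merges</code> is placed before <code class="language-plaintext highlighter-rouge">old_merges</code>, the rules taken from the auxiliary tokeniser take priority. This favours the target language, but it may also change how some source-language strings are segmented, so it is worth checking a few English examples as well.</p>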

<h3 id="4-save-the-tokeniser">4. Save the tokeniser</h3>
<p>Finally, we save the tokeniser to an output directory.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># save
</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="s">"/path/to/output/dir"</span><span class="p">)</span>
</code></pre></div></div>
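<p>Before relying on the saved tokeniser, it may be worth checking that the token ids in the expanded vocabulary are unique and contiguous, since the ids are typically used to index rows of the embedding matrix. Below is a minimal sketch of such a check; the helper <code class="language-plaintext highlighter-rouge">find_id_problems</code> is hypothetical and would be called with <code class="language-plaintext highlighter-rouge">ret_vocab</code> in practice.</p>

```python
from collections import Counter


def find_id_problems(vocab):
    """Return (duplicated ids, missing ids) for a token-to-id mapping."""
    counts = Counter(vocab.values())
    duplicated = sorted(i for i, n in counts.items() if n > 1)
    missing = sorted(set(range(max(counts) + 1)) - set(counts))
    return duplicated, missing


# Hypothetical toy vocabularies for illustration.
print(find_id_problems({"a": 0, "b": 1, "ab": 2}))  # ([], [])
print(find_id_problems({"a": 0, "b": 1, "ab": 3}))  # ([], [2])
print(find_id_problems({"a": 0, "b": 1, "ab": 1}))  # ([1], [])
```

<p>As far as I can tell, a gap could in principle arise if a merge rule had two identical halves that were both new, because the loop would then assign that token twice and leave one id unused. With a byte-level vocabulary this should be rare, since all single-byte tokens already exist.</p>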

<h2 id="efficacy-of-vocabulary-expansion">Efficacy of vocabulary expansion</h2>
<p>We measure the number of tokens before and after vocabulary expansion with the following example.</p>

<blockquote>
  <p>Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Meta-Llama-3-8B"</span><span class="p">)</span>
<span class="n">modified_tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"/path/to/output/dir"</span><span class="p">)</span>

<span class="n">text</span> <span class="o">=</span> <span class="s">"Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο"</span>

<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="c1"># 81
</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">modified_tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="c1"># 46
</span></code></pre></div></div>

<p>Adding 10k target tokens to the vocabulary of the LLaMA3 tokeniser reduced the token count for this example from 81 to 46, a reduction of 35 tokens (about 43%).</p>
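<p>The reduction can also be expressed relative to the original count, which makes results easier to compare across sentences of different lengths. A minimal sketch using the two counts reported above (the helper name <code class="language-plaintext highlighter-rouge">token_reduction</code> is hypothetical):</p>

```python
def token_reduction(tokens_before, tokens_after):
    """Return the absolute and relative reduction in token count."""
    saved = tokens_before - tokens_after
    return saved, saved / tokens_before


saved, ratio = token_reduction(81, 46)  # counts reported above
print(saved)           # 35
print(f"{ratio:.1%}")  # 43.2%
```

<p>Note that this is a single sentence, so the figure is only indicative; averaging over a held-out corpus would give a more reliable estimate.</p>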

<h2 id="summary">Summary</h2>
<p>There are many examples of vocabulary expansion for SentencePiece-based BPE tokenisers, but I had not come across practical examples for non-SentencePiece-based tokenisers, which is why I wrote this post. I hope it is of some help.</p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Most LLMs are trained on English-centric data, so non-English text tends to be split into more tokens than comparable English text. For more information, see <a href="https://aclanthology.org/2023.emnlp-main.614/">Ahia et al. (2023)</a> and other references. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>For how to expand the vocabulary of a SentencePiece-based BPE tokeniser, see the <a href="https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb">explanation</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>For more details, see <a href="https://github.com/karpathy/minbpe">minbpe</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>It is convenient to use <code class="language-plaintext highlighter-rouge">tokenizer.train_new_from_iterator()</code> for training. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>The merge rules are sorted by token frequency (higher to lower), so you can add a new token in order of token frequency by processing the list sequentially. For more information, see the <a href="https://github.com/huggingface/transformers/issues/4777">issue</a>. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="english" /><category term="Tips" /><category term="Transformers" /><summary type="html"><![CDATA[A note on how to do vocabulary expansion with LLaMA3, OLMo, etc.]]></summary></entry><entry xml:lang="en_US"><title type="html">Introduction</title><link href="https://gucci-j.github.io/post/en/Introduction/" rel="alternate" type="text/html" title="Introduction" /><published>2024-06-23T00:00:00+00:00</published><updated>2025-01-01T14:00:00+00:00</updated><id>https://gucci-j.github.io/post/en/Introduction</id><content type="html" xml:base="https://gucci-j.github.io/post/en/Introduction/"><![CDATA[<h2 id="about-this-site">About this site</h2>
<p>This blog covers topics related to machine learning and natural language processing.</p>

<p>You can find the list of all articles on the <a href="/sitemap/en/">site map (tag list)</a>.
Note that some articles are written in Japanese and not available in English.</p>

<h2 id="contact">Contact</h2>
<p>For inquiries or job requests, please contact me via the <a href="/contact/en/">contact form</a> or DM on <a href="https://twitter.com/_gucciiiii">X (formerly Twitter)</a>.</p>

<h2 id="privacy-policy-and-disclaimer">Privacy policy and disclaimer</h2>
<p>For the disclaimer and privacy policy of this site, please see the <a href="/privacy/en/">Privacy Policy</a>.</p>]]></content><author><name></name></author><category term="english" /><category term="General" /><summary type="html"><![CDATA[Introduction to my blog and the privacy policy.]]></summary></entry><entry xml:lang="ja_JP"><title type="html">Expanding the vocabulary of a non-SentencePiece-based BPE tokeniser</title><link href="https://gucci-j.github.io/post/vocab-expansion/" rel="alternate" type="text/html" title="Expanding the vocabulary of a non-SentencePiece-based BPE tokeniser" /><published>2024-06-20T00:00:00+00:00</published><updated>2024-06-20T00:00:00+00:00</updated><id>https://gucci-j.github.io/post/vocab-expansion</id><content type="html" xml:base="https://gucci-j.github.io/post/vocab-expansion/"><![CDATA[<h2 id="はじめに">Introduction</h2>
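<p>Since the release of LLaMA2, it has become common to build language-specific models by continuing to train English-centric models on target-language data, often together with vocabulary expansion.</p>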

<p>The aim of vocabulary expansion is to reduce fragmentation<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, which is expected to improve inference efficiency.</p>

<p>The idea of vocabulary expansion is simple, but how to carry it out depends on how the tokeniser is implemented. This post shares a procedure for expanding the vocabulary of a non-SentencePiece-based BPE tokeniser. (For SentencePiece-based BPE tokenisers, there is an <a href="https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb">explanation</a> in the SentencePiece repository.)</p>

<h2 id="前提">Background</h2>
<p>The main difference between SentencePiece-based BPE tokenisers (used by LLaMA2, Mistral, etc.) and non-SentencePiece-based ones (LLaMA3, OLMo, etc.) is whether they operate at the byte level.</p>

<p>SentencePiece-based BPE tokenisers enable the byte-fallback option, which prevents UNK tokens (i.e. input that cannot be tokenised) from occurring.</p>

<p>In contrast, recent non-SentencePiece-based BPE tokenisers are byte-level BPE: they convert the input into a UTF-8 encoded byte sequence before tokenising it. Because tokenisation operates on bytes, UNK tokens never occur.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>Therefore, even though the algorithm is the same, the pre- and post-processing of strings differ slightly. This difference can be seen in the <code class="language-plaintext highlighter-rouge">pre_tokenizer</code> and <code class="language-plaintext highlighter-rouge">decoder</code> fields of the <code class="language-plaintext highlighter-rouge">transformers</code> tokeniser metadata.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>

<span class="c1"># LLaMA2
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Llama-2-7b-hf"</span><span class="p">)</span>
<span class="n">tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"pre_tokenizer"</span><span class="p">])</span>
<span class="c1"># None
</span><span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"decoder"</span><span class="p">])</span>
<span class="c1"># {'type': 'Sequence', 'decoders': [{'type': 'Replace', 'pattern': {'String': '▁'}, 'content': ' '}, {'type': 'ByteFallback'}, {'type': 'Fuse'}, {'type': 'Strip', 'content': ' ', 'start': 1, 'stop': 0}]}
</span>
<span class="c1"># LLaMA3
</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Meta-Llama-3-8B"</span><span class="p">)</span>
<span class="n">tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"pre_tokenizer"</span><span class="p">])</span>
<span class="c1"># {'type': 'Sequence', 'pretokenizers': [{'type': 'Split', 'pattern': {'Regex': "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"}, 'behavior': 'Isolated', 'invert': False}, {'type': 'ByteLevel', 'add_prefix_space': False, 'trim_offsets': True, 'use_regex': False}]}
</span><span class="k">print</span><span class="p">(</span><span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"decoder"</span><span class="p">])</span>
<span class="c1"># {'type': 'ByteLevel', 'add_prefix_space': True, 'trim_offsets': True, 'use_regex': True}
</span></code></pre></div></div>
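<p>To make the byte-level view concrete, the short sketch below shows how a Greek word looks once it is encoded as UTF-8. It uses only the Python standard library; the word is the first word of the Greek example sentence used later in this post.</p>

```python
text = "Μου"  # first word of the Greek example sentence
data = text.encode("utf-8")

print(len(text))   # 3
print(len(data))   # 6
print(list(data))  # [206, 156, 206, 191, 207, 133]

# Decoding the bytes restores the original string without loss.
print(data.decode("utf-8") == text)  # True
```

<p>Each Greek letter occupies two bytes, so a byte-level tokeniser with no suitable merge rules would need up to two tokens per letter. Since every byte value from 0 to 255 can be represented in a byte-level vocabulary, no input maps to UNK, but text in under-represented scripts becomes long. This is the fragmentation that vocabulary expansion aims to reduce.</p>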

<h2 id="実装">Implementation</h2>
<p>First, prepare the source tokeniser to be expanded, <code class="language-plaintext highlighter-rouge">tokenizer</code>, and an auxiliary tokeniser trained on the target language, <code class="language-plaintext highlighter-rouge">aux_tokenizer</code>.</p>

<p>Here we use Greek as an example, a language for which the effect of vocabulary expansion is relatively easy to see.</p>

<h3 id="1-読み込み">1. Load the tokenisers</h3>
<p>We load the source tokeniser and the auxiliary target-language tokeniser, and obtain their merge rules and vocabulary dictionary.</p>

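<p>In the example below, LLaMA3 is the source tokeniser.
The auxiliary tokeniser has a vocabulary size of 50k tokens and was trained on $2^{20}$ sentences sampled at random from the Greek portion of the CC-100 corpus. All other training settings follow LLaMA3.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>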

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">copy</span>

<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="kn">from</span> <span class="nn">tokenizers.models</span> <span class="kn">import</span> <span class="n">BPE</span>

<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Meta-Llama-3-8B"</span><span class="p">)</span>
<span class="n">vocab</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">get_vocab</span><span class="p">()</span>
<span class="n">tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="n">merges</span> <span class="o">=</span> <span class="n">tokenizer_json</span><span class="p">[</span><span class="s">"model"</span><span class="p">][</span><span class="s">"merges"</span><span class="p">]</span>

<span class="n">aux_tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"atsuki-yamaguchi/cc100-el-50k"</span><span class="p">)</span>
<span class="n">aux_tokenizer_json</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">aux_tokenizer</span><span class="p">.</span><span class="n">_tokenizer</span><span class="p">.</span><span class="n">to_str</span><span class="p">())</span>
<span class="n">aux_merges</span> <span class="o">=</span> <span class="n">aux_tokenizer_json</span><span class="p">[</span><span class="s">"model"</span><span class="p">][</span><span class="s">"merges"</span><span class="p">]</span>
</code></pre></div></div>
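<p>One caveat: the code in this post assumes that each merge rule is stored as a single string such as <code class="language-plaintext highlighter-rouge">"a b"</code>. To my knowledge, newer versions of the <code class="language-plaintext highlighter-rouge">tokenizers</code> library may serialise each rule as a two-element list instead, in which case <code class="language-plaintext highlighter-rouge">merge.split(" ")</code> would fail. The hypothetical helper below is a minimal sketch that normalises both forms to strings; it is not part of any library.</p>

```python
def normalise_merges(raw_merges):
    """Return merge rules as "left right" strings, whatever the input form."""
    normalised = []
    for merge in raw_merges:
        if isinstance(merge, str):
            normalised.append(merge)  # already in the "left right" form
        else:
            left, right = merge  # assumed to be a two-element list or tuple
            normalised.append(f"{left} {right}")
    return normalised


# Hypothetical toy input mixing both forms.
print(normalise_merges(["a b", ["c", "d"], ("e", "f")]))
# ['a b', 'c d', 'e f']
```

<p>Applying it to both <code class="language-plaintext highlighter-rouge">merges</code> and <code class="language-plaintext highlighter-rouge">aux_merges</code> right after loading would let the rest of the code run unchanged, provided that no token itself contains a space.</p>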

<h3 id="2-語彙とマージルールの追加">2. Add vocabulary and merge rules</h3>
<p>We then add new target tokens and the corresponding merge rules to the source tokeniser’s vocabulary dictionary and merge rule list. Here, we add up to 10k new tokens.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># merge the tokenizers
</span><span class="n">num_new_token</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">max_new_token</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">ret_vocab</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span>
<span class="n">ret_merges</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">old_merges</span> <span class="o">=</span> <span class="n">copy</span><span class="p">.</span><span class="n">copy</span><span class="p">(</span><span class="n">merges</span><span class="p">)</span>
<span class="k">for</span> <span class="n">merge</span> <span class="ow">in</span> <span class="n">aux_merges</span><span class="p">:</span>
    <span class="c1"># vocab
</span>    <span class="n">token_1</span><span class="p">,</span> <span class="n">token_2</span> <span class="o">=</span> <span class="n">merge</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">)</span>
    <span class="n">token</span> <span class="o">=</span> <span class="n">token_1</span> <span class="o">+</span> <span class="n">token_2</span>
    <span class="k">if</span> <span class="n">num_new_token</span> <span class="o">&lt;</span> <span class="n">max_new_token</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">token_1</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span> <span class="c1"># both are new
</span>            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_1</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_2</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span> <span class="o">+</span> <span class="mi">1</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">2</span>
        <span class="k">elif</span> <span class="n">token_1</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span> <span class="c1"># new + existing
</span>            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_1</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">elif</span> <span class="n">token_1</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span> <span class="c1"># existing + new
</span>            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token_2</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span> <span class="c1"># both are existing tokens
</span>            <span class="k">pass</span>
        <span class="k">if</span> <span class="n">token</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span>
            <span class="n">ret_vocab</span><span class="p">[</span><span class="n">token</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">+</span> <span class="n">num_new_token</span>
            <span class="n">num_new_token</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="c1"># merge
</span>    <span class="k">if</span> <span class="n">merge</span> <span class="ow">in</span> <span class="n">merges</span><span class="p">:</span>
        <span class="n">old_merges</span><span class="p">.</span><span class="n">remove</span><span class="p">(</span><span class="n">merge</span><span class="p">)</span>
        <span class="n">ret_merges</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">merge</span><span class="p">)</span>
    <span class="k">elif</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_1</span> <span class="ow">in</span> <span class="n">ret_vocab</span> <span class="ow">and</span> <span class="n">token_2</span> <span class="ow">in</span> <span class="n">ret_vocab</span><span class="p">:</span>
        <span class="n">ret_merges</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">merge</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="3-トークナイザの再学習">3. Retrain the tokeniser</h3>
<p>We create an instance of the BPE tokeniser from the expanded vocabulary and merge rules, and overwrite the source tokeniser with it.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># rebuild the BPE model; merge rules taken from the auxiliary tokeniser come first
</span><span class="n">merges</span> <span class="o">=</span> <span class="n">ret_merges</span> <span class="o">+</span> <span class="n">old_merges</span>
<span class="n">vocab</span> <span class="o">=</span> <span class="n">ret_vocab</span>
<span class="n">tokenizer</span><span class="p">.</span><span class="n">backend_tokenizer</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">BPE</span><span class="p">(</span>
    <span class="n">vocab</span><span class="o">=</span><span class="n">vocab</span><span class="p">,</span>
    <span class="n">merges</span><span class="o">=</span><span class="p">[(</span><span class="n">merge</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)[</span><span class="mi">0</span><span class="p">],</span> <span class="n">merge</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">' '</span><span class="p">)[</span><span class="mi">1</span><span class="p">])</span> <span class="k">for</span> <span class="n">merge</span> <span class="ow">in</span> <span class="n">merges</span><span class="p">],</span>
    <span class="n">fuse_unk</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<h3 id="4-保存">4. Save the tokeniser</h3>
<p>Finally, we save the overwritten tokeniser, and we are done.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># save
</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="s">"/path/to/output/dir"</span><span class="p">)</span>
</code></pre></div></div>

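<h2 id="効果">Efficacy of vocabulary expansion</h2>
<p>We measure how many tokens the following example sentence yields with the tokeniser before and after vocabulary expansion.</p>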

<blockquote>
  <p>Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"meta-llama/Meta-Llama-3-8B"</span><span class="p">)</span>
<span class="n">modified_tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s">"/path/to/output/dir"</span><span class="p">)</span>

<span class="n">text</span> <span class="o">=</span> <span class="s">"Μου είπαν ότι, θα έπρεπε να καλέσω έναν άντρα στο τέλος για να συναντηθούμε. Ερώτηση: Ο τύπος εμφανίστηκε λίγο αργά. Αληθές, Ψευδές, ή Κανένα από τα δύο; Απάντηση: Κανένα από τα δύο"</span>

<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="c1"># 81
</span>
<span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">modified_tokenizer</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">)))</span>
<span class="c1"># 46
</span></code></pre></div></div>

<p>As a result, adding 10k new tokens to the LLaMA3 tokeniser reduced the token count for this example from 81 to 46, a reduction of 35 tokens.</p>

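<h2 id="おわりに">Summary</h2>
<p>There are many examples of vocabulary expansion for SentencePiece-based BPE tokenisers, but I had rarely seen practical examples for other tokenisers, which is why I wrote this post.
I hope it is of some help.</p>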
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
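      <p>Most large language models are trained mainly on English data, and it has been reported that the number of tokens increases when they are used with languages they do not support well. For details, see <a href="https://aclanthology.org/2023.emnlp-main.614/">Ahia et al. (2023)</a> and other references. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>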
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>For more details, see <a href="https://github.com/karpathy/minbpe">minbpe</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>It is convenient to use <code class="language-plaintext highlighter-rouge">tokenizer.train_new_from_iterator()</code> for training. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>The merge rules are sorted by token frequency, so processing the list sequentially adds tokens to the vocabulary in order of frequency. For more information, see the transformers <a href="https://github.com/huggingface/transformers/issues/4777">issue</a> and related discussions. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="Tips" /><category term="Transformers" /><summary type="html"><![CDATA[A note for those who want to expand the vocabulary of LLaMA3 and similar models.]]></summary></entry><entry xml:lang="ja_JP"><title type="html">Using VS Code [Remote SSH] with older Linux</title><link href="https://gucci-j.github.io/post/vscode-remote-ssh/" rel="alternate" type="text/html" title="Using VS Code [Remote SSH] with older Linux" /><published>2024-02-01T00:00:00+00:00</published><updated>2024-02-19T10:00:00+00:00</updated><id>https://gucci-j.github.io/post/vscode-remote-ssh</id><content type="html" xml:base="https://gucci-j.github.io/post/vscode-remote-ssh/"><![CDATA[<p><strong>Update (2024/02/19)</strong>: As a temporary measure, it is once again possible to connect to hosts whose OS does not meet the minimum system requirements. This is said to remain available for about a year, after which connections should fail again, so I am keeping the steps below as a workaround.</p>

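<p>Starting with VS Code release 1.86, the minimum system requirements changed, and it became impossible to connect to servers running operating systems that are no longer supported (e.g. CentOS 7, Ubuntu 18.04). Since upgrading the remote server is difficult unless you are its administrator, I am noting down a client-side workaround here.</p>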

<h2 id="前提">Prerequisites</h2>
<ul>
  <li>macOS Apple Silicon</li>
</ul>

<h2 id="手順">Steps</h2>
<h3 id="1-vs-code-185-のダウンロード">1. Download VS Code 1.85</h3>
<p>On Apple Silicon, you can download it from the link below.</p>

<p><a href="https://vscode.download.prss.microsoft.com/dbazure/download/stable/8b3775030ed1a69b13e4f4c628c612102e30a681/VSCode-darwin-arm64.zip">VSCode-darwin-arm64.zip</a></p>

<h3 id="2-適当な場所で解凍">2. Unzip to a location of your choice</h3>
<p>I placed it under <code class="language-plaintext highlighter-rouge">$HOME/Applications</code>.</p>

<h3 id="3-portable-mode-の有効化">3. Enable Portable Mode</h3>
<p>This step lets the 1.85 copy coexist with your current VS Code installation.</p>

<ol>
  <li>Create <code class="language-plaintext highlighter-rouge">code-portable-data</code> in the directory that contains <code class="language-plaintext highlighter-rouge">Visual Studio Code.app</code>
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> <span class="nv">$HOME</span>/Applications
<span class="nb">mkdir </span>code-portable-data
</code></pre></div>    </div>
  </li>
  <li>Remove the quarantine attribute from the app
    <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xattr <span class="nt">-dr</span> com.apple.quarantine Visual<span class="se">\ </span>Studio<span class="se">\ </span>Code.app
</code></pre></div>    </div>
  </li>
  <li>
    <p>After launching the app, click the gear ⚙️ icon in the lower left, open Settings, and type "update" into the search box</p>
  </li>
  <li>Turn off every update-related setting</li>
</ol>
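
<p>As an alternative to clicking through the settings UI, the update-related options can probably be written to the portable copy's <code class="language-plaintext highlighter-rouge">settings.json</code> directly. The following is only a minimal sketch under several assumptions that are not part of the steps above: that the app sits in <code class="language-plaintext highlighter-rouge">$HOME/Applications</code> (the <code class="language-plaintext highlighter-rouge">APP_DIR</code> variable is a hypothetical name), that portable mode keeps user settings under <code class="language-plaintext highlighter-rouge">code-portable-data/user-data/User</code>, and that the three keys shown are the update-related settings you want to turn off. It does not touch an existing settings file.</p>

```shell
# Hypothetical location: change APP_DIR to wherever you unzipped the app.
APP_DIR="${APP_DIR:-$HOME/Applications}"
# Assumed portable-mode layout: user settings live under user-data/User.
SETTINGS_DIR="$APP_DIR/code-portable-data/user-data/User"
SETTINGS_FILE="$SETTINGS_DIR/settings.json"

mkdir -p "$SETTINGS_DIR"

if [ -e "$SETTINGS_FILE" ]; then
  # Never overwrite existing settings; merge the keys by hand instead.
  echo "settings.json already exists; add the update keys manually."
else
  # Disable app updates and automatic extension updates.
  printf '%s\n' \
    '{' \
    '  "update.mode": "none",' \
    '  "extensions.autoUpdate": false,' \
    '  "extensions.autoCheckUpdates": false' \
    '}' | tee "$SETTINGS_FILE"
fi
```

<p>After running it, reopen the settings search for "update" in the portable copy to confirm that the values were picked up.</p>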

<h3 id="4-remote-ssh-をインストール">4. Install Remote SSH</h3>
<p>The portable copy starts out empty, so install Remote SSH again, set up the rest of your environment, and you are done.</p>

<h2 id="参考">References</h2>
<ul>
  <li><a href="https://code.visualstudio.com/docs/remote/faq#_can-i-run-vs-code-server-on-older-linux-distributions">VS Code Remote Development FAQ</a></li>
  <li><a href="https://code.visualstudio.com/docs/editor/portable">VS Code Portable Mode</a></li>
  <li><a href="https://github.com/microsoft/vscode/issues/203967">Related GitHub issue</a></li>
</ul>]]></content><author><name></name></author><category term="Tips" /><summary type="html"><![CDATA[Tips for connecting to CentOS 7 with VS Code Server.]]></summary></entry></feed>