Comparing the Performance of LLMs: A Deep Dive into RoBERTa, Llama 2, and Mistral for Disaster Tweets Analysis with LoRA



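Introduction

In the fast-moving world of natural language processing (NLP), we often find ourselves comparing language models to determine which one works best for a specific task. This blog post compares three models: RoBERTa, Mistral-7b, and Llama-2-7b. We used them to tackle a common problem: classifying tweets about disasters. Mistral and Llama 2 are large models with 7 billion parameters. In contrast, RoBERTa-large (355M parameters) is a relatively small model, used as the baseline for the comparison study.

In this blog, we used a PEFT (Parameter-Efficient Fine-Tuning) technique, LoRA (Low-Rank Adaptation of Large Language Models), to fine-tune the pre-trained models on a sequence classification task. LoRA is designed to significantly reduce the number of trainable parameters while maintaining strong downstream task performance.

The main objective of this blog post is to implement LoRA fine-tuning for sequence classification tasks using three pre-trained models from Hugging Face: meta-llama/Llama-2-7b-hf, mistralai/Mistral-7B-v0.1, and roberta-large.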

Hardware Used

Number of nodes: 1
Number of GPUs per node: 1
GPU type: A6000
GPU memory: 48GB

Goals

Implement fine-tuning of pre-trained LLMs using the LoRA PEFT method.
Learn how to use the HuggingFace APIs (transformers, peft, and datasets).
Set up hyperparameter tuning and experiment logging using Weights & Biases.

Dependencies

datasets evaluate peft scikit-learn torch transformers wandb

Note: To reproduce the reported results, check the pinned versions in the wandb reports.

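Pre-trained Models

RoBERTa

RoBERTa (Robustly Optimized BERT Approach) is an advanced variant of the BERT model proposed by the Meta AI research team. BERT is a transformer-based language model that uses self-attention mechanisms to build contextual word representations and is trained with a masked language modeling objective. BERT is an encoder model, used for natural language understanding tasks such as sequence classification and token classification.

RoBERTa is a popular model for fine-tuning and a suitable baseline for our experiments. For more information, see the Hugging Face model card.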

Llama 2

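Llama 2 models belong to a family of large language models (LLMs) introduced by Meta AI. They vary in size, with parameter counts ranging from 7 billion to 70 billion.

Llama 2 is an auto-regressive language model based on the transformer decoder architecture. To generate text, Llama 2 processes a sequence of words as input and iteratively predicts the next token using a sliding window. Its architecture differs slightly from models like GPT-3. For instance, Llama 2 uses the SwiGLU activation function rather than ReLU, and rotary positional embeddings in place of absolute learnable positional embeddings.

The recently released Llama 2 introduced architectural refinements to better leverage very long sequences, extending the context length to 4,096 tokens and using grouped-query attention (GQA) decoding.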

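Mistral 7B

Mistral 7B v0.1, with 7.3 billion parameters, is the first LLM introduced by Mistral AI. The main novel techniques used in the Mistral 7B architecture are:

Sliding window attention: Replaces full attention (quadratic compute cost) with sliding-window attention (linear compute cost), in which each token can attend to at most 4,096 tokens from the previous layer. This mechanism lets Mistral 7B handle longer sequences, because higher layers can access historical information beyond the 4,096-token window.
Grouped-query attention: Also used in Llama 2, this technique shares each key/value head across a group of query heads. This shrinks the key and value vectors that are cached for previously decoded tokens in the sequence and so optimizes the inference process (reduced processing time).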
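To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It builds a causal attention mask in which each position sees only the most recent positions inside the window, and gives a rough bound on how far information can travel once layers are stacked. The function names and the exact window convention (a token attends to itself and the window - 1 tokens before it) are assumptions chosen for illustration, not Mistral's actual implementation.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[i][j] == True when query position i may attend to key position j."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def max_reach(window, num_layers):
    """Rough upper bound on how many positions back information can flow across layers."""
    return window * num_layers


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    # "x" marks an allowed (query, key) pair, "." a masked one
    print("".join("x" if allowed else "." for allowed in row))

# With a 4,096-token window, stacking layers extends the reach well beyond the window
print(max_reach(window=4096, num_layers=32))
```

Each row of the mask has at most three allowed positions, so the cost per token stays constant as the sequence grows, which is where the linear compute cost comes from.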

LoRA

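PEFT, Parameter-Efficient Fine-Tuning, is a collection of techniques for fine-tuning large models using a much smaller set of trainable parameters while preserving the performance levels typically achieved through full fine-tuning.

LoRA, Low-Rank Adaptation, is a PEFT method that shares similarities with adapter layers. Its primary goal is to reduce the number of trainable parameters of the model. LoRA works by learning a low-rank update matrix while keeping the pre-trained weights frozen.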
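The low-rank update can be sketched in a few lines of plain Python. Instead of learning a full d_out x d_in update, LoRA learns two small matrices B (d_out x r) and A (r x d_in) and adds their scaled product to the output of the frozen weight. The helper names and toy sizes below are assumptions for illustration; real implementations such as the peft library operate on tensors.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters in the pair (A, B) for one adapted weight matrix."""
    return r * (d_in + d_out)


# Toy example: a 2x3 frozen weight adapted with rank r=1
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5]]   # shape (r, d_in)
B = [[0.0], [0.0]]      # shape (d_out, r); zero at the start, so the update is initially a no-op
x = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x, alpha=16, r=1))

# A 4096x4096 projection: size of a full update vs. a rank-16 LoRA update
print(4096 * 4096, lora_param_count(4096, 4096, r=16))
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and training only has to move the small matrices A and B.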

Setup

RoBERTa has a maximum sequence length of 512, so we set MAX_LEN=512 for all models to ensure a fair comparison.

MAX_LEN = 512
roberta_checkpoint = "roberta-large"
mistral_checkpoint = "mistralai/Mistral-7B-v0.1"
llama_checkpoint = "meta-llama/Llama-2-7b-hf"

Data Preparation

Data Loading

We load the dataset from Hugging Face:

from datasets import load_dataset

dataset = load_dataset("mehdiiraqui/twitter_disaster")

Now, let's split the dataset into training and validation sets, and then add the test set:

from datasets import Dataset

# Split the dataset into training and validation sets
data = dataset['train'].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "val"
data['val'] = data.pop("test")
# Add the original test split to the dataset dictionary
data['test'] = dataset['test']
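The 80/20 split can be pictured with a small plain-Python sketch. It shuffles row indices with a fixed seed and cuts them at 80%, which is conceptually what train_test_split does. The helper name split_indices is hypothetical, and the shuffle here will not reproduce the exact rows that the datasets library selects.

```python
import math
import random


def split_indices(num_rows, train_fraction, seed):
    """Shuffle row indices reproducibly and split them into train and validation parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_train = math.floor(num_rows * train_fraction)
    return indices[:num_train], indices[num_train:]


# 7,613 is the number of labelled rows in the original training split
train_idx, val_idx = split_indices(num_rows=7613, train_fraction=0.8, seed=42)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters for a model comparison like this one: all three models then see the same training and validation rows.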

Here's an overview of the dataset:

DatasetDict({
    train: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 6090
    })
    val: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 1523
    })
    test: Dataset({
        features: ['id', 'keyword', 'location', 'text', 'target'],
        num_rows: 3263
    })
})

Let's check the data distribution:

import pandas as pd
data['train'].to_pandas().info()
data['test'].to_pandas().info()

Training dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        7613 non-null   int64
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64
dtypes: int64(2), object(3)
memory usage: 297.5+ KB

Test dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        3263 non-null   int64
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
 4   target    3263 non-null   int64
dtypes: int64(2), object(3)
memory usage: 127.6+ KB

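Target distribution in the training dataset

target
0    4342
1    3271
Name: count, dtype: int64

As the classes are not balanced, we compute positive and negative weights to use later in the loss calculation. Note that the counts above (4,342 + 3,271 = 7,613) and the weights below correspond to the full training set of 7,613 rows rather than the 6,090-row split.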

pos_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[1])
neg_weights = len(data['train'].to_pandas()) / (2 * data['train'].to_pandas().target.value_counts()[0])

The final weights are:

POS_WEIGHT, NEG_WEIGHT = (1.1637114032405993, 0.8766697374481806)
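As a worked check of the weighting formula, the sketch below recomputes both weights from the class counts in the target distribution. Each class gets total / (number of classes x class count), so the rarer positive class receives the larger weight. The helper name balanced_class_weights is an assumption introduced for this example.

```python
def balanced_class_weights(class_counts):
    """Weight each class by total / (num_classes * count), so rarer classes weigh more."""
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    return {label: total / (num_classes * count) for label, count in class_counts.items()}


# Class counts taken from the target distribution of the training data
weights = balanced_class_weights({0: 4342, 1: 3271})
print(weights)
```

With perfectly balanced classes both weights would be exactly 1, so the distance from 1 is a quick indicator of how skewed the data is.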

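Next, we compute the maximum length of the text column:

# Maximum number of characters
max_char = data['train'].to_pandas()['text'].str.len().max()
# Maximum number of words
max_words = data['train'].to_pandas()['text'].str.split().str.len().max()

The maximum number of characters is 152. The maximum number of words is 31.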

Data Processing

Let's look at one row of the training data:

data['train'][0]

{'id': 5285, 'keyword': 'fear', 'location': 'Thibodaux, LA', 'text': 'my worst fear. https://t.co/iH8UDz8mq3', 'target': 0}

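The data comprises a keyword, a location, and the text of the tweet. For the sake of simplicity, we select the text feature as the only input to the LLM.

At this stage, we have prepared the training, validation, and test sets in the HuggingFace format expected by the pre-trained LLMs. The next step is to define the tokenized datasets, using the appropriate tokenizer to transform the text feature into two tensors: a sequence of token IDs and an attention mask. As each model has its own tokenizer, we need to define three different datasets.

We start by defining the RoBERTa dataloader:

Load the tokenizer: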

from transformers import AutoTokenizer
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta_checkpoint, add_prefix_space=True)

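Note: The RoBERTa tokenizer has been trained to treat spaces as part of the token. As a result, the first word of a sentence is encoded differently if it is not preceded by a white space. To make sure the first word includes a space, we set add_prefix_space=True. To keep the pre-processing consistent across all three models, we also set this parameter to 'True' for Llama 2 and Mistral 7b.

We define the preprocessing function for converting one row of the dataframe:

def roberta_preprocessing_function(examples):
    return roberta_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

By applying the preprocessing function to the first example of our training dataset, we obtain the tokenized inputs (input_ids) and the attention mask: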

roberta_preprocessing_function(data['train'][0])

{'input_ids': [0, 127, 2373, 2490, 4, 1205, 640, 90, 4, 876, 73, 118, 725, 398, 13083, 329, 398, 119, 1343, 246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now, let's apply the preprocessing function to the entire dataset:

col_to_delete = ['id', 'keyword', 'location', 'text']
# Apply the preprocessing function and remove the undesired columns
roberta_tokenized_datasets = data.map(roberta_preprocessing_function, batched=True, remove_columns=col_to_delete)
# Rename the target column to "label" to follow the HuggingFace convention
roberta_tokenized_datasets = roberta_tokenized_datasets.rename_column("target", "label")
# Set the format to torch
roberta_tokenized_datasets.set_format("torch")

Note: We removed the undesired columns from the data: id, keyword, location, and text. The text column is removed because it has already been converted into input IDs and an attention mask.

We can have a look at the tokenized training dataset:

roberta_tokenized_datasets['train'][0]

{'label': tensor(0), 'input_ids': tensor([    0,   127,  2373,  2490,     4,  1205,   640,    90,     4,   876,
           73,   118,   725,   398, 13083,   329,   398,   119,  1343,   246,
            2]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])}

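To generate the training batches, we also need to pad the rows of a given batch to the maximum length found in that batch. For that, we use the DataCollatorWithPadding class:

# Data collator for padding a batch of examples to the maximum length seen in the batch
from transformers import DataCollatorWithPadding

roberta_data_collator = DataCollatorWithPadding(tokenizer=roberta_tokenizer)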
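What the collator does to each batch can be sketched in plain Python: every sequence is extended to the length of the longest one in the batch, and the attention mask marks padded positions with 0 so the model ignores them. The function name pad_batch, the right-side padding, and the toy token IDs are assumptions for illustration; the real collator additionally converts the result to tensors.

```python
def pad_batch(sequences, pad_token_id):
    """Pad every sequence to the longest one in the batch and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths; pad_token_id=1 is an illustrative value
batch = pad_batch([[0, 127, 2373, 2], [0, 4, 2]], pad_token_id=1)
print(batch)
```

Padding per batch instead of to MAX_LEN keeps batches short for tweets, which rarely come close to 512 tokens.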

Follow the same steps to prepare the data for the Mistral 7B and Llama 2 models:

Note: Llama 2 and Mistral 7B don't have a default pad_token_id, so we use the eos_token_id for padding as well.

Mistral 7B:

# Load the Mistral 7B tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding

mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_checkpoint, add_prefix_space=True)
mistral_tokenizer.pad_token_id = mistral_tokenizer.eos_token_id
mistral_tokenizer.pad_token = mistral_tokenizer.eos_token

def mistral_preprocessing_function(examples):
    return mistral_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

mistral_tokenized_datasets = data.map(mistral_preprocessing_function, batched=True, remove_columns=col_to_delete)
mistral_tokenized_datasets = mistral_tokenized_datasets.rename_column("target", "label")
mistral_tokenized_datasets.set_format("torch")

# Data collator for padding a batch of examples to the maximum length seen in the batch
mistral_data_collator = DataCollatorWithPadding(tokenizer=mistral_tokenizer)

Llama 2:

# Load the Llama 2 tokenizer
from transformers import AutoTokenizer, DataCollatorWithPadding
llama_tokenizer = AutoTokenizer.from_pretrained(llama_checkpoint, add_prefix_space=True)
llama_tokenizer.pad_token_id = llama_tokenizer.eos_token_id
llama_tokenizer.pad_token = llama_tokenizer.eos_token

# Llama preprocessing function
def llama_preprocessing_function(examples):
    return llama_tokenizer(examples['text'], truncation=True, max_length=MAX_LEN)

# Tokenized datasets, processed in batches
llama_tokenized_datasets = data.map(llama_preprocessing_function, batched=True, remove_columns=col_to_delete)
llama_tokenized_datasets = llama_tokenized_datasets.rename_column("target", "label")
llama_tokenized_datasets.set_format("torch")

# Data collator
llama_data_collator = DataCollatorWithPadding(tokenizer=llama_tokenizer)

Now that we have prepared the tokenized datasets, the next section shows how to load the pre-trained LLM checkpoints and how to set up the LoRA weights.

Models

RoBERTa

Load the RoBERTa checkpoint for the classification task

We load the pre-trained RoBERTa model with a sequence classification head using the Hugging Face AutoModelForSequenceClassification class:

from transformers import AutoModelForSequenceClassification
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta_checkpoint, num_labels=2)

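LoRA setup for the RoBERTa classifier

We import the LoRA configuration and set some parameters for the RoBERTa classifier.

TaskType: Sequence classification
r (rank): Rank of the decomposition matrices
lora_alpha: Alpha parameter used to scale the learned weights. The LoRA paper advises fixing alpha at 16.
lora_dropout: Dropout probability of the LoRA layers
bias: Whether bias terms are trained along with the LoRA layers ("none" leaves all biases frozen)

The code below uses the values recommended by the LoRA paper. Later in this post, we tune these parameters with wandb.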

from peft import get_peft_model, LoraConfig, TaskType
roberta_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
)
roberta_model = get_peft_model(roberta_model, roberta_peft_config)
roberta_model.print_trainable_parameters()

We can see that the trainable parameters represent only 0.64% of the RoBERTa model parameters:

trainable params: 2,299,908 || all params: 356,610,052 || trainable%: 0.6449363911929212

Mistral

Load the checkpoint for the classification model

Let's load the pre-trained Mistral-7B model with a sequence classification head:

from transformers import AutoModelForSequenceClassification
import torch
mistral_model = AutoModelForSequenceClassification.from_pretrained(
  pretrained_model_name_or_path=mistral_checkpoint,
  num_labels=2,
  device_map="auto")

For Mistral 7B, we have to add the padding token ID, as it is not defined by default.

mistral_model.config.pad_token_id = mistral_model.config.eos_token_id

LoRA setup for the Mistral 7B classifier

For the Mistral 7B model, we need to specify the target_modules (the query and value projections of the attention modules):

from peft import get_peft_model, LoraConfig, TaskType

mistral_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=2, lora_alpha=16, lora_dropout=0.1, bias="none",
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

mistral_model = get_peft_model(mistral_model, mistral_peft_config)
mistral_model.print_trainable_parameters()

The trainable parameters represent only 0.024% of the Mistral model parameters:

trainable params: 1,720,320 || all params: 7,112,380,416 || trainable%: 0.02418768259540745

Llama 2

Load the checkpoint for the classification model

Let's load the pre-trained Llama 2 model with a sequence classification head:

from transformers import AutoModelForSequenceClassification
import torch

llama_model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=llama_checkpoint,
    num_labels=2,
    device_map="auto",
    offload_folder="offload",
    trust_remote_code=True,
)

For Llama 2, we have to add the padding token ID, as it is not defined by default.

llama_model.config.pad_token_id = llama_model.config.eos_token_id

LoRA setup for the Llama 2 classifier

We define LoRA for Llama 2 with the same target modules as Mistral. Note that this configuration uses r=16 and lora_dropout=0.05, whereas the Mistral configuration uses r=2 and lora_dropout=0.1.

from peft import get_peft_model, LoraConfig, TaskType

llama_peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=16, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=[
        "q_proj",
        "v_proj",
    ],
)

llama_model = get_peft_model(llama_model, llama_peft_config)
llama_model.print_trainable_parameters()

The trainable parameters represent only about 0.13% of the Llama 2 model parameters:

trainable params: 8,404,992 || all params: 6,615,748,608 || trainable%: 0.1270452143516515
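The percentages printed for the three models are simply the ratio of trainable to total parameters. The short sketch below recomputes them from the counts reported above; the dictionary layout and the helper name trainable_percent are assumptions made for this example.

```python
# Parameter counts as reported by print_trainable_parameters() for each model
reported = {
    "roberta-large": (2_299_908, 356_610_052),
    "Mistral-7B-v0.1": (1_720_320, 7_112_380_416),
    "Llama-2-7b-hf": (8_404_992, 6_615_748_608),
}


def trainable_percent(trainable, total):
    """Share of parameters that receive gradient updates, in percent."""
    return 100 * trainable / total


for name, (trainable, total) in reported.items():
    print(f"{name}: {trainable_percent(trainable, total):.4f}% trainable")
```

Seen side by side, the 7B models train a far smaller share of their weights than RoBERTa-large, even though the absolute number of trainable parameters is of the same order of magnitude.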

At this point, we have defined the tokenized datasets for training and set up the LLMs with their LoRA layers. The following section shows how to launch training using the HuggingFace Trainer class.

Setting Up the Trainer

Evaluation metrics

First, we define the performance metrics used to compare the three models: F1 score, recall, precision, and accuracy:

import evaluate
import numpy as np

def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred  # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer expects a dictionary whose keys are the metric names and whose values are the scores
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
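For readers who want to see what these four metrics measure, here is a dependency-free sketch for the binary case, with label 1 (disaster tweet) as the positive class. It is an illustration of the definitions, not a replacement for the evaluate package; the function name binary_metrics and the toy predictions are assumptions.

```python
def binary_metrics(predictions, labels):
    """Compute precision, recall, F1 and accuracy for the positive class (label 1)."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(labels)
    return {"precision": precision, "recall": recall, "f1-score": f1, "accuracy": accuracy}


# Toy predictions: 2 true positives, 1 false positive, 1 false negative, 2 true negatives
print(binary_metrics(predictions=[1, 1, 1, 0, 0, 0], labels=[1, 1, 0, 1, 0, 0]))
```

On imbalanced data, F1 is usually the more informative number, since accuracy can look good even when the minority class is poorly predicted.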

Custom Trainer for Weighted Loss

As mentioned at the beginning of this post, the distribution of positive and negative classes is imbalanced. We therefore need to train the models with a weighted cross-entropy loss. The Trainer class does not accept a custom loss directly, because it expects to get the loss from the model's outputs.

So, we need to define a custom WeightedCELossTrainer that overrides the compute_loss method to calculate the weighted cross-entropy loss from the model's predictions and the input labels:

import torch
from transformers import Trainer

class WeightedCELossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # **kwargs absorbs extra arguments that newer transformers versions pass to compute_loss
        labels = inputs.pop("labels")
        # Get the model's predictions
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Compute the custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=torch.tensor([neg_weights, pos_weights], device=model.device, dtype=logits.dtype))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
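The effect of the class weights on the loss can be reproduced without PyTorch. The sketch below computes a weighted cross-entropy by hand: each example's negative log-likelihood is multiplied by the weight of its true class, and the sum is normalized by the sum of those weights, which is how the default mean reduction behaves in PyTorch when class weights are given. The function name and the toy logits are assumptions for illustration.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-example cross-entropy; each example counts with its class weight."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log of the softmax normalizer
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[label]  # negative log-likelihood of the true class
        weighted_sum += class_weights[label] * nll
        weight_total += class_weights[label]
    return weighted_sum / weight_total


# Class weights in the order [negative, positive], as computed earlier in the post
weights = [0.8766697374481806, 1.1637114032405993]
logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
labels = [0, 1, 1]
print(weighted_cross_entropy(logits, labels, weights))
```

Mistakes on positive examples contribute more to this loss than mistakes on negative ones, which nudges the model away from simply favoring the majority class.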

Trainer Setup

Let's set the training arguments and the trainer for the three models.

RoBERTa

The first important step is to move the model to the GPU device for training.

roberta_model = roberta_model.cuda()
roberta_model.device

This prints the following:

device(type='cuda', index=0)

Then, we set the training arguments:

from transformers import TrainingArguments

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="roberta-large-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type="constant",
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=False,
    gradient_checkpointing=True,
)

Finally, we define the RoBERTa trainer by providing the model, the training arguments, and the tokenized datasets:

roberta_trainer = WeightedCELossTrainer(
    model=roberta_model,
    args=training_args,
    train_dataset=roberta_tokenized_datasets['train'],
    eval_dataset=roberta_tokenized_datasets["val"],
    data_collator=roberta_data_collator,
    compute_metrics=compute_metrics,
)

Mistral-7B

As with RoBERTa, we initialize the WeightedCELossTrainer as follows:

from transformers import TrainingArguments, Trainer

mistral_model = mistral_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="mistral-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type="constant",
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)

mistral_trainer = WeightedCELossTrainer(
    model=mistral_model,
    args=training_args,
    train_dataset=mistral_tokenized_datasets['train'],
    eval_dataset=mistral_tokenized_datasets["val"],
    data_collator=mistral_data_collator,
    compute_metrics=compute_metrics,
)

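Note: We need to enable half-precision training by setting fp16 to True. The main reason is that Mistral-7B is large: in full float32 precision its weights alone take roughly 28 GB, which leaves limited headroom on a single 48 GB GPU for activations and training state.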
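A quick back-of-the-envelope calculation shows why precision matters at this scale. The sketch below estimates the memory needed for the weights alone at 4 bytes (float32) and 2 bytes (float16) per parameter. It ignores activations, gradients, and optimizer state, which come on top during training, so treat the figures as lower bounds; the helper name weights_memory_gb is an assumption for this example.

```python
def weights_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


mistral_params = 7_112_380_416  # total parameter count reported by print_trainable_parameters()

print(f"float32: {weights_memory_gb(mistral_params, 4):.1f} GB")
print(f"float16: {weights_memory_gb(mistral_params, 2):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which frees memory for batches of activations on a 48 GB card.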

Llama 2

As with Mistral 7B, we define the trainer as follows:

from transformers import TrainingArguments, Trainer

llama_model = llama_model.cuda()

lr = 1e-4
batch_size = 8
num_epochs = 5

training_args = TrainingArguments(
    output_dir="llama-lora-token-classification",
    learning_rate=lr,
    lr_scheduler_type="constant",
    warmup_ratio=0.1,
    max_grad_norm=0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.001,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="wandb",
    fp16=True,
    gradient_checkpointing=True,
)

llama_trainer = WeightedCELossTrainer(
    model=llama_model,
    args=training_args,
    train_dataset=llama_tokenized_datasets['train'],
    eval_dataset=llama_tokenized_datasets["val"],
    data_collator=llama_data_collator,
    compute_metrics=compute_metrics,
)

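Hyperparameter Tuning

We used the Wandb Sweep API with a Bayesian search strategy (30 runs) to tune the hyperparameters. The tuned hyperparameters are as follows.

For more information, you can check the Wandb experiment report in the Resources section.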
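For orientation, a Bayesian sweep is described to Wandb as a plain dictionary. The sketch below shows the general shape of such a definition. The parameter names, the value ranges, the metric name, and the project name are hypothetical placeholders chosen for illustration; they are not the settings used in the experiments reported here.

```python
# Hypothetical sweep definition: parameter names, ranges and the metric name are
# placeholders for illustration, not the settings used in the experiments.
sweep_config = {
    "method": "bayes",  # Bayesian search strategy
    "metric": {"name": "eval/f1-score", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "lora_r": {"values": [2, 4, 8, 16]},
        "lora_dropout": {"values": [0.0, 0.05, 0.1]},
    },
}

NUM_RUNS = 30  # the post reports 30 runs

# With wandb installed and a `train` function defined, the sweep could be launched as:
#   import wandb
#   sweep_id = wandb.sweep(sweep_config, project="my-project")
#   wandb.agent(sweep_id, function=train, count=NUM_RUNS)
print(sweep_config["method"], NUM_RUNS)
```

The train function would read the sampled values from the run configuration, build the LoraConfig and TrainingArguments from them, and run the trainer so that the chosen metric gets logged.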

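Results

Conclusion

In this blog post, we compared the performance of three large language models (LLMs), RoBERTa, Mistral 7b, and Llama 2, on disaster tweet classification using LoRA. The performance results show that RoBERTa outperforms Mistral 7B and Llama 2 by a large margin. This raises the question of whether a complex, large LLM is really needed for a task as simple as binary classification of short sequences.

One lesson from this study is that the LLM should be chosen according to the specific requirements of the project, the available resources, and the performance needs.

The study also shows that, for relatively simple prediction tasks, base models such as RoBERTa remain competitive.

Finally, we showed that the LoRA method can be applied to both encoder (RoBERTa) and decoder (Llama 2 and Mistral 7B) models.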