LLM Fine-tuning#

To improve is to change; to be perfect is to change often - Winston Churchill

Large language models (LLMs) have demonstrated remarkable general capabilities, but tailoring them to specific tasks or domains may require fine-tuning – adjusting model weights by further training on task-specific data. We examine the fine-tuning of Meta’s Llama-3.1 model using tools from the Hugging Face ecosystem, applying efficient techniques such as quantization and low-rank adaptation (LoRA) to an industry text classification task using firm-level 10-K filings.

# By: Terence Lim, 2020-2025 (terence-lim.github.io)
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import os
from tqdm import tqdm
from pathlib import Path
from pprint import pprint
import textwrap
import warnings
import bitsandbytes as bnb
import torch
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          pipeline, 
                          logging)
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from finds.database import SQL, RedisDB
from finds.unstructured import Edgar
from finds.structured import BusDay, CRSP, PSTAT
from finds.readers import Sectoring
from finds.utils import Store
from secret import paths, CRSP_DATE, credentials
logging.set_verbosity_error() 
NUM_TRAIN_EPOCHS = 2             # number of fine-tuning epochs
RESUME_FROM_CHECKPOINT = False   # set True to resume from the last saved checkpoint
MAX_SEQ_LENGTH = 1024            # maximum sequence length in tokens
LOGGING_STEPS = 200              # log training metrics every this many steps
VERBOSE = 0
sql = SQL(**credentials['sql'], verbose=VERBOSE)
bd = BusDay(sql)
rdb = RedisDB(**credentials['redis'])
crsp = CRSP(sql, bd, rdb, verbose=VERBOSE)
pstat = PSTAT(sql, bd, verbose=VERBOSE)
ed = Edgar(paths['10X'], zipped=True, verbose=0)
store = Store('assets', ext='pkl')
permnos = list(store.load('nouns').keys()) 
print(f"{len(permnos)=}")   # comparable sample
len(permnos)=3474

Meta Llama-3.1 model#

Meta’s Llama 3.1 is an open-weight large language model family released in July 2024 under the Llama 3.1 Community License, which permits broad use, including many commercial applications. Key highlights include:

  • Model variants:

    • 8B: 8 billion parameters.

    • 70B: 70 billion parameters.

    • 405B: 405 billion parameters.

  • Context length of up to 128,000 tokens.

  • Pre-trained on over 15 trillion tokens sourced from publicly available datasets.

  • Fine-tuned using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF).

  • Multilingual support, including English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.

https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

base_model = 'meta-llama/Llama-3.1-8B-Instruct'
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
max_memory = round(gpu_stats.total_memory / (1024**3), 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")

def cuda_memory(title, trainer_stats=None):
    """Show final memory and optional trainer stats"""
    if torch.cuda.is_available():
        device = torch.device('cuda')
        total_memory = torch.cuda.get_device_properties(device).total_memory
        reserved_memory = torch.cuda.memory_reserved(device)
        allocated_memory = torch.cuda.memory_allocated(device)
        free_memory = total_memory - reserved_memory
        print(f'------ {title.upper()} ------')
        if trainer_stats:
            print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
        print(f"Total memory: {total_memory / (1024**3):.2f} GB")
        print(f"Reserved memory: {reserved_memory / (1024**3):.2f} GB")
        print(f"Allocated memory: {allocated_memory / (1024**3):.2f} GB")
        print(f"Free memory: {free_memory / (1024**3):.2f} GB")
GPU = NVIDIA GeForce RTX 3080 Laptop GPU. Max memory = 15.739 GB.

Supervised fine-tuning (SFT)#

Supervised fine-tuning adapts a pre-trained language model by further training it on labeled input–output pairs with a standard supervised objective. Common use cases include:

  • Instruction tuning: The model learns to follow new instructions

  • Chatbot fine-tuning (e.g., with help-desk data)

  • Domain adaptation (e.g., legal, medical)
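
Concretely, each labeled example pairs a prompt with its desired completion. Below is a minimal, hypothetical sketch of one such record; the field names are illustrative, not a required schema:

# A hypothetical SFT training example: an instruction-style prompt paired with
# its target completion. During SFT the two are concatenated and the model is
# trained with the usual next-token cross-entropy objective on the sequence.
example = {
    "prompt": ("Classify the sentiment of this review as positive or negative.\n"
               "Review: The battery lasts all day and the screen is gorgeous.\n"
               "Sentiment:"),
    "completion": " positive",
}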

Huggingface framework#

Several ecosystems support fine-tuning and training of LLMs. The Hugging Face Ecosystem includes:

  • transformers: Model architectures and training components.

  • Transformers Reinforcement Learning (trl): Training large language models with supervised fine-tuning (e.g. SFTTrainer, used below) and reinforcement learning techniques, especially alignment methods such as RLHF (Reinforcement Learning with Human Feedback) and DPO (Direct Preference Optimization).

  • bitsandbytes: Enables efficient low-bit model quantization, allowing large language models to run on limited GPU memory without much loss in performance.

  • Parameter-Efficient Fine-Tuning (peft): Tools to fine-tune large language models by training only a small number of additional parameters.

  • Accelerate: Distributed training optimization.

  • datasets: For loading, processing, and managing datasets

The Hugging Face Hub provides access to over 100,000 pre-trained transformer models, along with tools for fine-tuning them efficiently using low memory and quantized weights.

If you encounter a gated model repository on Hugging Face, the model requires manual access approval from its authors before you can download or use it. Log in to your huggingface.co account, go to the model page, and click the “Request Access” button; approval may take up to a few days. Once authorized, make sure your Hugging Face token is set in your environment (e.g. via huggingface-cli login); see https://huggingface.co/settings/tokens
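
Once access is approved, you can authenticate from the command line or in Python. A minimal sketch using the huggingface_hub client (the token string is a placeholder):

# Authenticate to the Hugging Face Hub so gated checkpoints can be downloaded;
# equivalent to running `huggingface-cli login` in a terminal.
from huggingface_hub import login

login(token="hf_...")   # placeholder: paste a token from https://huggingface.co/settings/tokens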

# Locations to save fine-tuned model weights
output_dir = str(Path(paths['scratch'], "fine-tuned-model"))   # training checkpoints
model_dir = str(Path(paths['scratch'], "Llama-3.1-8B-Instruct-FF-Sector"))  # final model
from trl import SFTConfig
args = SFTConfig(
    output_dir=output_dir,                    # directory to save checkpoints and repository id
    num_train_epochs=NUM_TRAIN_EPOCHS,        # number of training epochs
    per_device_train_batch_size=2,            # batch size per device during training
    gradient_accumulation_steps=4,            # steps to accumulate before a backward/update pass
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    logging_strategy="steps",                 # or "no" or "epoch"
    logging_steps=LOGGING_STEPS,
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="tensorboard",
    max_seq_length=MAX_SEQ_LENGTH,            # maximum sequence length in tokens
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)

Tokenizer#

The AutoTokenizer in Hugging Face is a smart utility that automatically loads the correct tokenizer for a given pretrained model.

# Load the tokenizer and set the pad token id. 
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token_id = tokenizer.eos_token_id
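
A quick check of how the tokenizer behaves on a short prompt can be useful; a small sketch (the exact ids and tokens depend on the Llama-3.1 vocabulary):

# Tokenize a short string to inspect the ids the model will actually see.
sample = "Classify the text into one of these labels:"
encoded = tokenizer(sample)
print(len(encoded['input_ids']))   # number of tokens (a BOS token is typically prepended)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'])[:5])   # first few tokens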

Quantization#

Quantization converts high-precision data to lower precision, for instance by representing model weights and activations as 4-bit or 8-bit integers instead of 32-bit floating point numbers. The bitsandbytes library for efficient low-bit quantization is integrated with Hugging Face and works seamlessly with parameter-efficient fine-tuning methods such as QLoRA.

# Load the Llama-3.1-8b-instruct model in 4-bit quantization to save GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)
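
As a rough back-of-the-envelope check of why 4-bit loading matters on a 16 GB GPU (illustrative arithmetic only; it ignores activations, the KV cache, and quantization constants):

# Approximate memory for the weights of an 8B-parameter model (rough estimate).
n_params = 8e9
print(f"fp16 weights: {n_params * 2 / 1024**3:.1f} GB")    # 2 bytes per weight, about 14.9 GB
print(f"nf4 weights : {n_params * 0.5 / 1024**3:.1f} GB")  # 4 bits per weight, about 3.7 GB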

AutoModel#

The AutoModel class in Hugging Face is a convenient interface that loads the correct model architecture from a model name or path. Its task-specific variants also attach the appropriate model head (e.g., classification layer, decoder head), e.g.:

Class                                Task                                     Output
AutoModel                            Base model (no head)                     Hidden states
AutoModelForSequenceClassification   Text classification (e.g. sentiment)     Class logits
AutoModelForTokenClassification      Token labeling (e.g. NER, POS)           Token-level logits
AutoModelForQuestionAnswering        Extractive QA                            Start/end logits for answer spans
AutoModelForCausalLM                 Text generation (GPT-style)              Next-token logits
AutoModelForMaskedLM                 Mask filling (BERT-style)                Predictions for masked tokens
AutoModelForSeq2SeqLM                Translation, summarization (T5, BART)    Generated sequences
AutoModelForMultipleChoice           Multiple-choice QA (e.g. SWAG)           Choice logits
AutoModelForVision2Seq               Image captioning                         Generated text
AutoModelForImageClassification      Vision tasks                             Class logits
AutoModelForSpeechSeq2Seq            Speech translation                       Generated text from audio

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="float16",
    quantization_config=bnb_config, 
)
model.config.use_cache = False
model.config.pretraining_tp = 1

Parameter-efficient fine-tuning#

Parameter-Efficient Fine-Tuning (PEFT) is both a technique and a Hugging Face library for adapting large language models (LLMs) to new tasks by training only a small subset of parameters. Instead of updating the entire model, the base (pretrained) model is kept frozen, and lightweight, trainable components called adapters are added. These adapters typically involve only a few million parameters, making fine-tuning faster and more memory-efficient.

  • Low-rank factorization: a compression technique that approximates a large weight matrix by the product of two much smaller, lower-rank matrices, requiring fewer parameters and computations.

  • LoRA: a small number of trainable low-rank matrices are added to the model’s attention layers; the original weights are frozen and only these adapters are fine-tuned (see the parameter-count sketch after this list).

  • QLoRA: combines LoRA with quantization: the base model is loaded in 4-bit precision, reducing memory usage dramatically with little loss in performance.
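
A minimal sketch of the parameter arithmetic behind LoRA; the dimensions mirror a Llama attention projection, and the rank r matches the LoRA configuration below, but the numbers are purely illustrative:

# LoRA replaces the update of a frozen d_out x d_in weight W with two small
# trainable matrices: delta_W = (alpha / r) * B @ A, where B is d_out x r and A is r x d_in.
d_out, d_in, r = 4096, 4096, 64       # e.g. a q_proj layer and LoRA rank r
full = d_out * d_in                   # parameters if W itself were trained
lora = d_out * r + r * d_in           # parameters in the two adapter matrices
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3f}")
# full: 16,777,216  lora: 524,288  ratio: 0.031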

# Extract the linear module names from the model using the bitsandbytes library. 
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
modules = find_all_linear_names(model)
modules
['q_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj', 'k_proj']
# Configure LoRA for the target modules, task type, and other training arguments 
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)
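
As an optional sanity check (not needed for training, since SFTTrainer applies peft_config itself), one could wrap the model with the adapters and report how few parameters LoRA leaves trainable; a hedged sketch, left commented out:

# Optional inspection only: wrap the model with the LoRA adapters and report
# trainable vs. total parameters. SFTTrainer below applies peft_config itself,
# so this step is illustrative and can be skipped.
# from peft import get_peft_model
# peft_model = get_peft_model(model, peft_config)
# peft_model.print_trainable_parameters()   # prints "trainable params: ... || all params: ..."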

Industry text classification#

We fine-tune the model to classify firms into ten Fama-French sector categories based on their business descriptions in 10-K filings. The text for each U.S.-domiciled common stock is drawn from the Business Description section of its most recent 10-K filing.

Load 10-K business description text for industry classification task

# Retrieve universe of stocks
beg, end = bd.begyr(CRSP_DATE), bd.endyr(CRSP_DATE)
print(f"{beg=}, {end=}")
univ = crsp.get_universe(bd.endyr(CRSP_DATE, -1))

# lookup company names
comnam = crsp.build_lookup(source='permno', target='comnam', fillna="")
univ['comnam'] = comnam(univ.index)

# lookup ticker symbols
ticker = crsp.build_lookup(source='permno', target='ticker', fillna="")
univ['ticker'] = ticker(univ.index)

# lookup sic codes from Compustat, and map to FF 10-sector code
sic = pstat.build_lookup(source='lpermno', target='sic', fillna=0)
industry = Series(sic[univ.index], index=univ.index)
industry = industry.where(industry > 0, univ['siccd'])
sectors = Sectoring(sql, scheme='codes10', fillna='')   # supplement from crosswalk
univ['sector'] = sectors[industry]

# retrieve latest year's bus10K's
item, form = 'bus10K', '10-K'
rows = DataFrame(ed.open(form=form, item=item))
rows = rows[rows['date'].between(beg, end)]\
    .drop_duplicates(subset=['permno'], keep='last')\
    .set_index('permno')\
    .reindex(permnos)

# split documents into train/test sets
labels = univ.loc[permnos, 'sector']
class_labels = np.unique(labels)
print(f"{class_labels=}")

train_index, test_index = train_test_split(permnos,
                                           stratify=labels,
                                           random_state=42,
                                           test_size=0.2)
beg=20240102, end=20241231
class_labels=array(['Durbl', 'Enrgy', 'HiTec', 'Hlth', 'Manuf', 'NoDur', 'Other',
       'Shops', 'Telcm', 'Utils'], dtype=object)

HuggingFace dataset module#

The training data are converted to LLM instruction statements and wrapped in a Hugging Face Dataset. This class can be conveniently created from many different sources, including data files of various formats or a generator function, as sketched below.
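
For instance, a Dataset can also be built lazily from a generator; a toy sketch (the names gen and toy_ds, and the records, are illustrative; the notebook below uses Dataset.from_pandas instead):

# Illustrative only: build a Dataset from a Python generator of dict records.
def gen():
    for text in ["example document one", "example document two"]:
        yield {"text": text}

toy_ds = Dataset.from_generator(gen)
print(toy_ds)   # Dataset with a single 'text' feature and two rows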

# Create LLM instruction statement
MAX_CHARS = MAX_SEQ_LENGTH * 2
class_text = "'" + "' or '".join(class_labels) + "'"
def generate_prompt(permno, test=False):
    text = ed[rows.loc[permno, 'pathname']].replace('\n','')[:MAX_CHARS]
    return f"""
Classify the text into one of these {len(class_labels)} classification labels:
{class_text} 
and return the answer as the label.
text: {text}
label: {'' if test else univ.loc[permno, 'sector']}""".strip()
cuda_memory('before dataset')
------ BEFORE DATASET ------
Total memory: 15.74 GB
Reserved memory: 6.83 GB
Allocated memory: 5.63 GB
Free memory: 8.91 GB
X_train = DataFrame(columns=['text'], index=train_index,
                    data=[generate_prompt(permno, test=False) for permno in train_index])
X_test = DataFrame(columns=['text'], index=test_index,
                   data=[generate_prompt(permno, test=True) for permno in test_index])
y_test = [univ.loc[permno, 'sector'] for permno in test_index]

train_data = Dataset.from_pandas(X_train[["text"]])
test_data = Dataset.from_pandas(X_test[["text"]])
print(textwrap.fill(train_data['text'][3]))
Classify the text into one of these 10 classification labels: 'Durbl'
or 'Enrgy' or 'HiTec' or 'Hlth' or 'Manuf' or 'NoDur' or 'Other' or
'Shops' or 'Telcm' or 'Utils'  and return the answer as the label.
text: ITEM 1. BUSINESS  OVERVIEW  B. RILEY FINANCIAL, INC. (NASDAQ:
RILY) (THE COMPANY IS A DIVERSIFIED FINANCIAL SERVICES PLATFORM THAT
DELIVERS TAILORED SOLUTIONS TO MEET THE STRATEGIC, OPERATIONAL, AND
CAPITAL NEEDS OF ITS CLIENTS AND PARTNERS. WE OPERATE THROUGH SEVERAL
CONSOLIDATED SUBSIDIARIES (COLLECTIVELY, B. RILEY THAT PROVIDE
INVESTMENT BANKING, BROKERAGE, WEALTH MANAGEMENT, ASSET MANAGEMENT,
DIRECT LENDING, BUSINESS ADVISORY, VALUATION, AND ASSET DISPOSITION
SERVICES TO A BROAD CLIENT BASE SPANNING PUBLIC AND PRIVATE COMPANIES,
FINANCIAL SPONSORS, INVESTORS, FINANCIAL INSTITUTIONS, LEGAL AND
PROFESSIONAL SERVICES FIRMS, AND INDIVIDUALS.   THE COMPANY
OPPORTUNISTICALLY INVESTS IN AND ACQUIRES COMPANIES OR ASSETS WITH
ATTRACTIVE RISK-ADJUSTED RETURN PROFILES TO BENEFIT OUR SHAREHOLDERS.
WE OWN AND OPERATE SEVERAL UNCORRELATED CONSUMER BUSINESSES AND INVEST
IN BRANDS ON A PRINCIPAL BASIS. OUR APPROACH IS FOCUSED ON HIGH
QUALITY COMPANIES AND ASSETS IN INDUSTRIES IN WHICH WE HAVE EXTENSIVE
KNOWLEDGE AND CAN BENEFIT FROM OUR EXPERIENCE TO MAKE OPERATIONAL
IMPROVEMENTS AND MAXIMIZE FREE CASH FLOW. OUR PRINCIPAL INVESTMENTS
OFTEN LEVERAGE THE FINANCIAL, RESTRUCTURING, AND OPERATIONAL EXPERTISE
OF OUR PROFESSIONALS WHO WORK COLLABORATIVELY ACROSS DISCIPLINES.   WE
REFER TO B. RILEY AS A PLATFORM BECAUSE OF THE UNIQUE COMPOSITION OF
OUR BUSINESS. OUR PLATFORM HAS GROWN CONSIDERABLY AND BECOME MORE
DIVERSIFIED OVER THE PAST SEVERAL YEARS. WE HAVE INCREASED OUR MARKET
SHARE AND EXPANDED THE DEPTH AND BREADTH OF OUR BUSINESSES BOTH
ORGANICALLY AND THROUGH OPPORTUNISTIC ACQUISITIONS. OUR INCREASINGLY
DIVERSIFIED PLATFORM ENABLES US TO INVEST OPPORTUNISTICALLY AND TO
DELIVER STRONG LONG-TERM INVESTMENT PERFORMANCE THROUGHOUT A RANGE OF
ECONOMIC CYCLES.   OUR PLATFORM IS COMPRISED OF MORE THAN 2,700
AFFILIATED PROFESSIONALS, INCLUDING EMPLOYEES AND INDEPENDENT
CONTRACTORS. WE ARE HEADQUARTERED IN LOS ANGELES, CALIFORNIA AND
MAINTAIN OFFICES THROUGHOUT THE U.S., INCLUDING IN NEW YORK, CHICAGO,
METRO DISTRICT OF COLUMBIA, AT label: Other
# verify max_seq_length sufficient
curr_max = 0
for row, data in enumerate(train_data):
    tokenized = tokenizer.tokenize(data['text'])
    curr_max = max(curr_max, len(tokenized))
#    print(f"{row=}, {len(tokenized)=}")
assert curr_max < args.max_seq_length
print(curr_max, f"{MAX_SEQ_LENGTH=}")
820 MAX_SEQ_LENGTH=1024
cuda_memory('after dataset')
------ AFTER DATASET ------
Total memory: 15.74 GB
Reserved memory: 6.83 GB
Allocated memory: 5.63 GB
Free memory: 8.91 GB

Pipeline#

Hugging Face’s pipeline function enables one-line inference by simply specifying the model, tokenizer, generation parameters (e.g. sampling methodology, maximum new tokens), and task, e.g.:

  • “text-classification”: Sentiment analysis, topic labeling

  • “token-classification”: Named Entity Recognition (NER), POS tagging

  • “question-answering”: Extractive QA from context

  • “text-generation”: Generate text (GPT-style)

  • “summarization”: Generate summaries from long text

# Use the text-generation pipeline to predict sector labels from the prompt text
def generate(prompt, model=model, tokenizer=tokenizer, verbose=False):
    """Generate a response"""
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer,
                    do_sample=False,
                    top_p=None,
                    top_k=None,
                    return_full_text=False,
                    max_new_tokens=4,     # just enough tokens to emit a sector label
                    temperature=None)     # greedy decoding since do_sample=False
    result = pipe(prompt)
    answer = result[0]['generated_text'].split("label:")[-1].strip()
    if verbose:
        print(f"{len(prompt)=}, {result=}, {answer=}")
    return answer

def predict(test, model, tokenizer, verbose=False):
    """Predict test set"""
    y_pred = []
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        answer = generate(prompt, model, tokenizer, verbose=verbose)
        # Determine the predicted category
        for category in class_labels:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")
    return y_pred

Create a function that uses the predicted and true labels to compute the overall accuracy, a classification report, and a confusion matrix.

def evaluate(y_true, y_pred):
    mapping = {label: idx for idx, label in enumerate(class_labels)}

    def map_func(x):
        return mapping.get(x, -1)  # map unrecognized predictions (e.g. 'none') to -1
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    labels = list(mapping.values())
    target_names = list(mapping.keys())
    if -1 in y_pred_mapped:
        labels += [-1]
        target_names += ['none']
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')

    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped,
                                         target_names=target_names,
                                         labels=labels, zero_division=0.0)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped,
                                   labels=labels)
    print('\nConfusion Matrix:')
    print(conf_matrix)

Evaluate accuracy before fine-tuning the model

y_pred = predict(X_test, model, tokenizer)
Series(y_pred).value_counts()
100%|██████████| 695/695 [05:45<00:00,  2.01it/s]
Manuf    217
NoDur    184
HiTec    109
Other     65
none      54
Hlth      24
Utils     15
Telcm     14
Shops      8
Enrgy      4
Durbl      1
Name: count, dtype: int64
evaluate(y_test, y_pred)
Accuracy: 0.203

Classification Report:
              precision    recall  f1-score   support

       Durbl       0.00      0.00      0.00        33
       Enrgy       0.50      0.10      0.17        20
       HiTec       0.25      0.19      0.22       139
        Hlth       0.88      0.13      0.22       164
       Manuf       0.22      0.70      0.34        69
       NoDur       0.03      0.21      0.06        28
       Other       0.25      0.10      0.15       153
       Shops       0.75      0.10      0.17        62
       Telcm       0.36      0.56      0.43         9
       Utils       0.67      0.56      0.61        18
        none       0.00      0.00      0.00         0

    accuracy                           0.20       695
   macro avg       0.35      0.24      0.21       695
weighted avg       0.44      0.20      0.21       695


Confusion Matrix:
[[ 0  0  2  0 18 10  2  1  0  0  0]
 [ 0  2  0  0  8  8  2  0  0  0  0]
 [ 0  0 27  0 43 38 18  0  8  4  1]
 [ 0  0 73 21 15 22 11  0  1  1 20]
 [ 0  0  3  0 48 12  5  1  0  0  0]
 [ 1  0  0  0 17  6  4  0  0  0  0]
 [ 0  0  4  1 41 64 16  0  0  0 27]
 [ 0  0  0  1 26 16  7  6  0  0  6]
 [ 0  0  0  0  0  4  0  0  5  0  0]
 [ 0  2  0  1  1  4  0  0  0 10  0]
 [ 0  0  0  0  0  0  0  0  0  0  0]]

Trainer#

Create the model trainer using training arguments, a LoRA configuration, and a dataset.

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    peft_config=peft_config,
    processing_class=tokenizer,
)
# Initiate model training
cuda_memory('before training')
trainer_stats = trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)
------ BEFORE TRAINING ------
Total memory: 15.74 GB
Reserved memory: 11.04 GB
Allocated memory: 8.22 GB
Free memory: 4.70 GB
{'loss': 1.1984, 'grad_norm': 0.1371612697839737, 'learning_rate': 0.0001670747898848231, 'num_tokens': 1091299.0, 'mean_token_accuracy': 0.7146163220703602, 'epoch': 0.5755395683453237}
{'loss': 1.1205, 'grad_norm': 0.16719305515289307, 'learning_rate': 8.029070592154895e-05, 'num_tokens': 2179799.0, 'mean_token_accuracy': 0.7273549642927366, 'epoch': 1.1496402877697842}
{'loss': 1.034, 'grad_norm': 0.19266854226589203, 'learning_rate': 9.47361624665869e-06, 'num_tokens': 3270551.0, 'mean_token_accuracy': 0.7437317748367787, 'epoch': 1.725179856115108}
{'train_runtime': 9602.662, 'train_samples_per_second': 0.579, 'train_steps_per_second': 0.072, 'train_loss': 1.103452600044888, 'num_tokens': 3784895.0, 'mean_token_accuracy': 0.7479746815689067, 'epoch': 1.99568345323741}
# Save trained model and tokenizer
model.config.use_cache = True
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
cuda_memory('after training', trainer_stats=trainer_stats)
------ AFTER TRAINING ------
9602.662 seconds used for training.
Total memory: 15.74 GB
Reserved memory: 14.43 GB
Allocated memory: 8.26 GB
Free memory: 1.31 GB

Evaluation#

y_pred = predict(X_test, model, tokenizer, verbose=False)
Series(y_pred).value_counts()
  0%|          | 0/695 [00:00<?, ?it/s]/home/terence/env3.11/lib/python3.11/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
100%|██████████| 695/695 [08:21<00:00,  1.39it/s]
Hlth     168
Other    156
HiTec    140
Manuf     59
Shops     59
NoDur     34
Durbl     29
Enrgy     21
Utils     19
Telcm     10
Name: count, dtype: int64
evaluate(y_test, y_pred)
Accuracy: 0.829

Classification Report:
              precision    recall  f1-score   support

       Durbl       0.83      0.73      0.77        33
       Enrgy       0.90      0.95      0.93        20
       HiTec       0.79      0.80      0.80       139
        Hlth       0.89      0.91      0.90       164
       Manuf       0.80      0.68      0.73        69
       NoDur       0.59      0.71      0.65        28
       Other       0.85      0.86      0.85       153
       Shops       0.83      0.79      0.81        62
       Telcm       0.90      1.00      0.95         9
       Utils       0.84      0.89      0.86        18

    accuracy                           0.83       695
   macro avg       0.82      0.83      0.83       695
weighted avg       0.83      0.83      0.83       695


Confusion Matrix:
[[ 24   0   6   0   1   1   0   1   0   0]
 [  0  19   0   0   1   0   0   0   0   0]
 [  0   2 111   7   2   2  12   2   1   0]
 [  0   0   9 149   1   1   3   1   0   0]
 [  5   0   4   2  47   6   2   2   0   1]
 [  0   0   0   1   2  20   2   3   0   0]
 [  0   0  10   4   5   1 132   1   0   0]
 [  0   0   0   4   0   3   4  49   0   2]
 [  0   0   0   0   0   0   0   0   9   0]
 [  0   0   0   1   0   0   1   0   0  16]]
# merge and save model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

del model
del trainer
torch.cuda.empty_cache()
cuda_memory('after empty')
# Reload base model and tokenizer to cpu
device_map = "cpu"
tokenizer = AutoTokenizer.from_pretrained(base_model)
base_model_reload = AutoModelForCausalLM.from_pretrained(
        base_model,
        return_dict=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map=device_map,
        trust_remote_code=True,
)
# Merge adapter with base model
from peft import PeftModel
model = PeftModel.from_pretrained(base_model_reload, output_dir, device_map=device_map)
model = model.merge_and_unload()
# Save the merged model
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)
# Reload merged model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        return_dict=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
)
# Check it is working
y_pred = predict(X_test, model, tokenizer)
evaluate(y_test, y_pred)

References:

Philipp Krähenbühl, 2025, “AI395T Advances in Deep Learning” course materials, retrieved from https://ut.philkr.net/advances_in_deeplearning/

Tim Dettmers, 2022, “bitsandbytes: 8-bit Optimizers and Quantization for PyTorch”, GitHub repository: TimDettmers/bitsandbytes

DataCamp tutorial, “Fine-tuning Llama 3.1”, https://www.datacamp.com/tutorial/fine-tuning-llama-3-1