LLM Fine-tuning#
To improve is to change; to be perfect is to change often - Winston Churchill
Large language models (LLMs) have demonstrated remarkable general capabilities, but tailoring them to specific tasks or domains may require fine-tuning – adjusting model weights by further training on task-specific data. We examine the fine-tuning of Meta’s Llama-3.1 model using tools from the Hugging Face ecosystem, applying efficient techniques such as quantization and low-rank adaptation (LoRA) to an industry text classification task using firm-level 10-K filings.
# By: Terence Lim, 2020-2025 (terence-lim.github.io)
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import os
from tqdm import tqdm
from pathlib import Path
from pprint import pprint
import textwrap
import warnings
import bitsandbytes as bnb
import torch
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
pipeline,
logging)
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score,
classification_report,
confusion_matrix)
from sklearn.model_selection import train_test_split
from finds.database import SQL, RedisDB
from finds.unstructured import Edgar
from finds.structured import BusDay, CRSP, PSTAT
from finds.readers import Sectoring
from finds.utils import Store
from secret import paths, CRSP_DATE, credentials
logging.set_verbosity_error()
NUM_TRAIN_EPOCHS = 2            # number of fine-tuning epochs
RESUME_FROM_CHECKPOINT = False  # whether to resume training from the last saved checkpoint
MAX_SEQ_LENGTH = 1024           # maximum token length of training examples
LOGGING_STEPS = 200
VERBOSE = 0
sql = SQL(**credentials['sql'], verbose=VERBOSE)
bd = BusDay(sql)
rdb = RedisDB(**credentials['redis'])
crsp = CRSP(sql, bd, rdb, verbose=VERBOSE)
pstat = PSTAT(sql, bd, verbose=VERBOSE)
ed = Edgar(paths['10X'], zipped=True, verbose=0)
store = Store('assets', ext='pkl')
permnos = list(store.load('nouns').keys())
print(f"{len(permnos)=}") # comparable sample
len(permnos)=3474
Meta Llama-3.1 model#
Meta’s Llama 3.1 is an open-source large language model released in July 2024 under the Llama 3.1 Community License, permitting broad use, including commercial applications. Key highlights include:
Model variants:
8B: 8 billion parameters.
70B: 70 billion parameters.
405B: 405 billion parameters.
Context length of up to 128,000 tokens.
Pre-trained on over 15 trillion tokens sourced from publicly available datasets.
Fine-tuned using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF).
Multilingual support, including English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
base_model = 'meta-llama/Llama-3.1-8B-Instruct'
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
max_memory = round(gpu_stats.total_memory / (1024**3), 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
def cuda_memory(title, trainer_stats=None):
"""Show final memory and optional trainer stats"""
if torch.cuda.is_available():
device = torch.device('cuda')
total_memory = torch.cuda.get_device_properties(device).total_memory
reserved_memory = torch.cuda.memory_reserved(device)
allocated_memory = torch.cuda.memory_allocated(device)
free_memory = total_memory - reserved_memory
print(f'------ {title.upper()} ------')
if trainer_stats:
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"Total memory: {total_memory / (1024**3):.2f} GB")
print(f"Reserved memory: {reserved_memory / (1024**3):.2f} GB")
print(f"Allocated memory: {allocated_memory / (1024**3):.2f} GB")
print(f"Free memory: {free_memory / (1024**3):.2f} GB")
GPU = NVIDIA GeForce RTX 3080 Laptop GPU. Max memory = 15.739 GB.
Supervised fine-tuning (SFT)#
Supervised fine-tuning (SFT) enhances a pre-trained language model by further training it on labeled input–output pairs with standard supervised learning; an example record format is sketched after the list below. Common use cases include:
Instruction tuning: The model learns to follow new instructions
Chatbot fine-tuning (e.g., with help-desk data)
Domain adaptation (e.g., legal, medical)
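For concreteness, a minimal sketch of the record format that SFT consumes (hypothetical toy examples, not the 10-K data used later): each labeled pair is flattened into a single text string, and the model is trained with the usual next-token cross-entropy loss.
# Hypothetical toy SFT records: prompt and desired completion in one string
sft_examples = [
    {"text": "Instruction: Translate to French.\nInput: Good morning.\nAnswer: Bonjour."},
    {"text": "Instruction: Classify the sentiment.\nInput: I loved this film!\nAnswer: positive"},
]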
Huggingface framework#
Several ecosystems support fine-tuning and training of LLMs. The Hugging Face ecosystem includes:
transformers: Model architectures and training components.
trl (Transformers Reinforcement Learning): Training large language models (LLMs) with reinforcement learning techniques, especially for alignment tasks like RLHF (Reinforcement Learning with Human Feedback) and DPO (Direct Preference Optimization).
bitsandbytes: Enables efficient low-bit model quantization, allowing large language models to run on limited GPU memory without much loss in performance.
peft (Parameter-Efficient Fine-Tuning): Tools to fine-tune large language models by training only a small number of additional parameters.
accelerate: Distributed training optimization.
datasets: Loading, processing, and managing datasets.
It provides access to 100k+ pre-trained transformer models, along with tools for efficient fine-tuning of these models using low memory and quantized weights.
If you encounter a gated model repository on Hugging Face, the model requires manual access approval from its authors before you can download or use it. Log in to your huggingface.co account, go to the model page, and click the “Request Access” button; approval may take up to a few days. Once authorized, make sure your Hugging Face token is set in your environment (e.g. via huggingface-cli login); see https://huggingface.co/settings/tokens
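As an alternative to the CLI, a minimal sketch of authenticating programmatically, assuming the access token has been stored in an HF_TOKEN environment variable:
# Log in to the Hugging Face Hub from Python (token assumed to be in HF_TOKEN)
import os
from huggingface_hub import login
login(token=os.environ["HF_TOKEN"])   # equivalent to running `huggingface-cli login`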
# Locations to save fine-tuned model weights
output_dir = str(Path(paths['scratch'], "fine-tuned-model")) # training checkpoints
model_dir = str(Path(paths['scratch'], "Llama-3.1-8B-Instruct-FF-Sector")) # final model
from trl import SFTConfig
args = SFTConfig(
output_dir=output_dir, # directory to save and repository id
    num_train_epochs=NUM_TRAIN_EPOCHS,     # number of training epochs
    per_device_train_batch_size=2,         # batch size per device during training
    gradient_accumulation_steps=4,         # accumulate gradients before each backward/update pass
    gradient_checkpointing=True,           # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",             # paged AdamW optimizer for memory efficiency
    logging_strategy="steps",              # log by steps (alternatives: "no" or "epoch")
    logging_steps=LOGGING_STEPS,           # how often (in steps) to log training metrics
learning_rate=2e-4, # learning rate, based on QLoRA paper
weight_decay=0.001,
fp16=True,
bf16=False,
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
max_steps=-1,
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
group_by_length=False,
lr_scheduler_type="cosine", # use cosine learning rate scheduler
report_to="tensorboard",
    max_seq_length=MAX_SEQ_LENGTH,         # maximum sequence length (verified against the tokenized data below)
packing=False,
dataset_kwargs={
"add_special_tokens": False,
"append_concat_token": False,
}
)
Tokenizer#
The AutoTokenizer utility in Hugging Face automatically loads the correct tokenizer for a given pretrained model.
# Load the tokenizer and set the pad token id.
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token_id = tokenizer.eos_token_id
Quantization#
Quantization converts high-precision data to lower precision, for instance by representing model weights and activation values as 4-bit or 8-bit integers instead of 32-bit floating-point numbers. The bitsandbytes library for efficient low-bit model quantization is integrated with Hugging Face and works seamlessly with parameter-efficient fine-tuning methods such as QLoRA.
# Load the Llama-3.1-8b-instruct model in 4-bit quantization to save GPU memory
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=False,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
)
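A rough back-of-the-envelope estimate of the savings from 4-bit weights (a sketch that ignores activations, the KV cache, and per-block quantization constants):
# Approximate weight-memory footprint of an 8B-parameter model
n_params = 8e9
print(f"fp16 weights:  ~{n_params * 2 / 1024**3:.1f} GB")    # ~14.9 GB
print(f"4-bit weights: ~{n_params * 0.5 / 1024**3:.1f} GB")  # ~3.7 GB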
AutoModel#
The AutoModel class in Hugging Face is a convenient interface that automatically loads the correct model architecture from a model name or path. Its variants also attach the correct model head (e.g., classification layer, decoder head) for a specific task, e.g.
| Class | Task | Output |
|---|---|---|
| AutoModel | Base model (no head) | Hidden states |
| AutoModelForSequenceClassification | Text classification (e.g. sentiment) | Class logits |
| AutoModelForTokenClassification | Token labeling (e.g. NER, POS) | Token-level logits |
| AutoModelForQuestionAnswering | Extractive QA | Start/end logits for answer spans |
| AutoModelForCausalLM | Text generation (GPT-style) | Next-token logits |
| AutoModelForMaskedLM | Mask filling (BERT-style) | Predictions for masked tokens |
| AutoModelForSeq2SeqLM | Translation, summarization (T5, BART) | Generated sequences |
| AutoModelForMultipleChoice | Multiple-choice QA (e.g. SWAG) | Choice logits |
| AutoModelForVision2Seq | Image captioning | Generated text |
| AutoModelForImageClassification | Vision tasks | Class logits |
| AutoModelForSpeechSeq2Seq | Speech translation | Generated text from audio |
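For example, a minimal sketch of loading a different task-specific variant, which attaches the appropriate head automatically (checkpoint name chosen only for illustration); the causal-LM variant used for this task is loaded next:
# Load a base checkpoint with a randomly initialized 10-way classification head
from transformers import AutoModelForSequenceClassification
clf = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10)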
model = AutoModelForCausalLM.from_pretrained(
base_model,
device_map="auto",
torch_dtype="float16",
quantization_config=bnb_config,
)
model.config.use_cache = False
model.config.pretraining_tp = 1
Parameter-efficient fine-tuning#
Parameter-Efficient Fine-Tuning (PEFT) is both a technique and a Hugging Face library for adapting large language models (LLMs) to new tasks by training only a small subset of parameters. Instead of updating the entire model, the base (pretrained) model is kept frozen, and lightweight, trainable components called adapters are added. These adapters typically involve only a few million parameters, making fine-tuning faster and more memory-efficient.
Low-rank factorization: This compression technique decomposes a large weight matrix into smaller, lower-rank matrices, yielding a more compact approximation that requires fewer parameters and computations.
LoRA: A small number of trainable low-rank matrices are added to the model’s attention layers; the original weights are frozen and only these adapters are fine-tuned (a parameter-count sketch follows this list).
QLoRA: Combines LoRA with quantization; the base model is converted to 4-bit precision, reducing memory usage dramatically without losing much performance.
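A rough sketch of the parameter savings for a single attention projection (matrix dimensions assumed for illustration; the rank matches the LoraConfig below):
# LoRA replaces updates to a frozen (d_out x d_in) weight W with two small
# trainable matrices B (d_out x r) and A (r x d_in), using W + B @ A
d_in = d_out = 4096        # e.g. a q_proj matrix in an 8B-parameter model
r = 64                     # LoRA rank
full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(f"full fine-tuning: {full_params:,} parameters per matrix")   # 16,777,216
print(f"LoRA rank {r}:    {lora_params:,} parameters per matrix "
      f"({lora_params / full_params:.1%})")                         # 524,288 (3.1%)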
# Extract the names of the 4-bit linear modules from the model using the bitsandbytes library.
def find_all_linear_names(model):
cls = bnb.nn.Linear4bit
lora_module_names = set()
for name, module in model.named_modules():
if isinstance(module, cls):
names = name.split('.')
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if 'lm_head' in lora_module_names: # needed for 16 bit
lora_module_names.remove('lm_head')
return list(lora_module_names)
modules = find_all_linear_names(model)
modules
['q_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj', 'k_proj']
# Configure LoRA for the target modules, task type, and other training arguments
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0,
r=64,
bias="none",
task_type="CAUSAL_LM",
target_modules=modules,
)
Industry text classification#
We fine-tune the model for classifying firms into ten Fama-French sector categories based on their business descriptions in 10-K filings. The text data for each U.S.-domiciled common stock is drawn from the most recent year’s Business Description section of their 10-K filings.
Load 10-K business description text for industry classification task
# Retrieve universe of stocks
beg, end = bd.begyr(CRSP_DATE), bd.endyr(CRSP_DATE)
print(f"{beg=}, {end=}")
univ = crsp.get_universe(bd.endyr(CRSP_DATE, -1))
# lookup company names
comnam = crsp.build_lookup(source='permno', target='comnam', fillna="")
univ['comnam'] = comnam(univ.index)
# lookup ticker symbols
ticker = crsp.build_lookup(source='permno', target='ticker', fillna="")
univ['ticker'] = ticker(univ.index)
# lookup sic codes from Compustat, and map to FF 10-sector code
sic = pstat.build_lookup(source='lpermno', target='sic', fillna=0)
industry = Series(sic[univ.index], index=univ.index)
industry = industry.where(industry > 0, univ['siccd'])
sectors = Sectoring(sql, scheme='codes10', fillna='') # supplement from crosswalk
univ['sector'] = sectors[industry]
# retrieve latest year's bus10K's
item, form = 'bus10K', '10-K'
rows = DataFrame(ed.open(form=form, item=item))
rows = rows[rows['date'].between(beg, end)]\
.drop_duplicates(subset=['permno'], keep='last')\
.set_index('permno')\
.reindex(permnos)
# split documents into train/test sets
labels = univ.loc[permnos, 'sector']
class_labels = np.unique(labels)
print(f"{class_labels=}")
train_index, test_index = train_test_split(permnos,
stratify=labels,
random_state=42,
test_size=0.2)
beg=20240102, end=20241231
class_labels=array(['Durbl', 'Enrgy', 'HiTec', 'Hlth', 'Manuf', 'NoDur', 'Other',
'Shops', 'Telcm', 'Utils'], dtype=object)
HuggingFace dataset module#
The training data are converted to LLM instruction statements and wrapped in a Hugging Face Dataset, which can be conveniently created from many sources, including data files of various formats, in-memory objects, or a generator function.
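For illustration, a minimal sketch of two other ways to construct a Dataset (toy data, not the 10-K sample used below):
# Build small Datasets from an in-memory dict and from a generator function
from datasets import Dataset

toy = Dataset.from_dict({"text": ["example one", "example two"]})

def gen():
    for i in range(3):
        yield {"text": f"generated example {i}"}

toy_gen = Dataset.from_generator(gen)
print(toy, toy_gen)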
# Create LLM instruction statement
MAX_CHARS = MAX_SEQ_LENGTH * 2
class_text = "'" + "' or '".join(class_labels) + "'"
def generate_prompt(permno, test=False):
text = ed[rows.loc[permno, 'pathname']].replace('\n','')[:MAX_CHARS]
return f"""
Classify the text into one of these {len(class_labels)} classification labels:
{class_text}
and return the answer as the label.
text: {text}
label: {'' if test else univ.loc[permno, 'sector']}""".strip()
cuda_memory('before dataset')
------ BEFORE DATASET ------
Total memory: 15.74 GB
Reserved memory: 6.83 GB
Allocated memory: 5.63 GB
Free memory: 8.91 GB
X_train = DataFrame(columns=['text'], index=train_index,
data=[generate_prompt(permno, test=False) for permno in train_index])
X_test = DataFrame(columns=['text'], index=test_index,
data=[generate_prompt(permno, test=True) for permno in test_index])
y_test = [univ.loc[permno, 'sector'] for permno in test_index]
train_data = Dataset.from_pandas(X_train[["text"]])
test_data = Dataset.from_pandas(X_test[["text"]])
print(textwrap.fill(train_data['text'][3]))
Classify the text into one of these 10 classification labels: 'Durbl'
or 'Enrgy' or 'HiTec' or 'Hlth' or 'Manuf' or 'NoDur' or 'Other' or
'Shops' or 'Telcm' or 'Utils' and return the answer as the label.
text: ITEM 1. BUSINESS OVERVIEW B. RILEY FINANCIAL, INC. (NASDAQ:
RILY) (THE COMPANY IS A DIVERSIFIED FINANCIAL SERVICES PLATFORM THAT
DELIVERS TAILORED SOLUTIONS TO MEET THE STRATEGIC, OPERATIONAL, AND
CAPITAL NEEDS OF ITS CLIENTS AND PARTNERS. WE OPERATE THROUGH SEVERAL
CONSOLIDATED SUBSIDIARIES (COLLECTIVELY, B. RILEY THAT PROVIDE
INVESTMENT BANKING, BROKERAGE, WEALTH MANAGEMENT, ASSET MANAGEMENT,
DIRECT LENDING, BUSINESS ADVISORY, VALUATION, AND ASSET DISPOSITION
SERVICES TO A BROAD CLIENT BASE SPANNING PUBLIC AND PRIVATE COMPANIES,
FINANCIAL SPONSORS, INVESTORS, FINANCIAL INSTITUTIONS, LEGAL AND
PROFESSIONAL SERVICES FIRMS, AND INDIVIDUALS. THE COMPANY
OPPORTUNISTICALLY INVESTS IN AND ACQUIRES COMPANIES OR ASSETS WITH
ATTRACTIVE RISK-ADJUSTED RETURN PROFILES TO BENEFIT OUR SHAREHOLDERS.
WE OWN AND OPERATE SEVERAL UNCORRELATED CONSUMER BUSINESSES AND INVEST
IN BRANDS ON A PRINCIPAL BASIS. OUR APPROACH IS FOCUSED ON HIGH
QUALITY COMPANIES AND ASSETS IN INDUSTRIES IN WHICH WE HAVE EXTENSIVE
KNOWLEDGE AND CAN BENEFIT FROM OUR EXPERIENCE TO MAKE OPERATIONAL
IMPROVEMENTS AND MAXIMIZE FREE CASH FLOW. OUR PRINCIPAL INVESTMENTS
OFTEN LEVERAGE THE FINANCIAL, RESTRUCTURING, AND OPERATIONAL EXPERTISE
OF OUR PROFESSIONALS WHO WORK COLLABORATIVELY ACROSS DISCIPLINES. WE
REFER TO B. RILEY AS A PLATFORM BECAUSE OF THE UNIQUE COMPOSITION OF
OUR BUSINESS. OUR PLATFORM HAS GROWN CONSIDERABLY AND BECOME MORE
DIVERSIFIED OVER THE PAST SEVERAL YEARS. WE HAVE INCREASED OUR MARKET
SHARE AND EXPANDED THE DEPTH AND BREADTH OF OUR BUSINESSES BOTH
ORGANICALLY AND THROUGH OPPORTUNISTIC ACQUISITIONS. OUR INCREASINGLY
DIVERSIFIED PLATFORM ENABLES US TO INVEST OPPORTUNISTICALLY AND TO
DELIVER STRONG LONG-TERM INVESTMENT PERFORMANCE THROUGHOUT A RANGE OF
ECONOMIC CYCLES. OUR PLATFORM IS COMPRISED OF MORE THAN 2,700
AFFILIATED PROFESSIONALS, INCLUDING EMPLOYEES AND INDEPENDENT
CONTRACTORS. WE ARE HEADQUARTERED IN LOS ANGELES, CALIFORNIA AND
MAINTAIN OFFICES THROUGHOUT THE U.S., INCLUDING IN NEW YORK, CHICAGO,
METRO DISTRICT OF COLUMBIA, AT label: Other
# verify max_seq_length sufficient
curr_max = 0
for row, data in enumerate(train_data):
tokenized = tokenizer.tokenize(data['text'])
curr_max = max(curr_max, len(tokenized))
# print(f"{row=}, {len(tokenized)=}")
assert curr_max < args.max_seq_length
print(curr_max, f"{MAX_SEQ_LENGTH=}")
820 MAX_SEQ_LENGTH=1024
cuda_memory('after dataset')
------ AFTER DATASET ------
Total memory: 15.74 GB
Reserved memory: 6.83 GB
Allocated memory: 5.63 GB
Free memory: 8.91 GB
Pipeline#
Hugging Face’s pipeline function enables one-line inference by specifying the task, model, tokenizer, and generation parameters (e.g. sampling methodology, maximum new tokens). Supported tasks include the following, with a minimal example after the list:
“text-classification”: Sentiment analysis, topic labeling
“token-classification”: Named Entity Recognition (NER), POS tagging
“question-answering”: Extractive QA from context
“text-generation”: Generate text (GPT-style)
“summarization”: Generate summaries from long text
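For instance, a minimal sketch of one-line inference for a different task (the sentiment checkpoint is named only for illustration):
# One-line sentiment analysis pipeline
from transformers import pipeline
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Fine-tuning improved our accuracy considerably."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]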
# Use the text generation pipeline to predict labels from the “text”
def generate(prompt, model=model, tokenizer=tokenizer, verbose=False):
"""Generate a response"""
pipe = pipeline(task="text-generation",
model=model,
tokenizer=tokenizer,
do_sample=False,
top_p=None,
top_k=None,
return_full_text=False,
max_new_tokens=4, # 2
temperature=None) # 0.1
result = pipe(prompt)
answer = result[0]['generated_text'].split("label:")[-1].strip()
if verbose:
print(f"{len(prompt)=}, {result=}, {answer=}")
return answer
def predict(test, model, tokenizer, verbose=False):
"""Predict test set"""
y_pred = []
for i in tqdm(range(len(test))):
prompt = test.iloc[i]["text"]
answer = generate(prompt, model, tokenizer, verbose=verbose)
# Determine the predicted category
for category in class_labels:
if category.lower() in answer.lower():
y_pred.append(category)
break
else:
y_pred.append("none")
return y_pred
Create a function that uses the predicted and true labels to compute the overall accuracy, classification report, and confusion matrix.
def evaluate(y_true, y_pred):
mapping = {label: idx for idx, label in enumerate(class_labels)}
def map_func(x):
return mapping.get(x, -1) # Map to -1 if not found, should not occur with correct data
y_true_mapped = np.vectorize(map_func)(y_true)
y_pred_mapped = np.vectorize(map_func)(y_pred)
labels = list(mapping.values())
target_names = list(mapping.keys())
if -1 in y_pred_mapped:
labels += [-1]
target_names += ['none']
# Calculate accuracy
accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
print(f'Accuracy: {accuracy:.3f}')
# Generate classification report
class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped,
target_names=target_names,
labels=labels, zero_division=0.0)
print('\nClassification Report:')
print(class_report)
# Generate confusion matrix
conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped,
labels=labels)
print('\nConfusion Matrix:')
print(conf_matrix)
Evaluate accuracy before fine-tuning the model
y_pred = predict(X_test, model, tokenizer)
Series(y_pred).value_counts()
100%|██████████| 695/695 [05:45<00:00, 2.01it/s]
Manuf 217
NoDur 184
HiTec 109
Other 65
none 54
Hlth 24
Utils 15
Telcm 14
Shops 8
Enrgy 4
Durbl 1
Name: count, dtype: int64
evaluate(y_test, y_pred)
Accuracy: 0.203
Classification Report:
precision recall f1-score support
Durbl 0.00 0.00 0.00 33
Enrgy 0.50 0.10 0.17 20
HiTec 0.25 0.19 0.22 139
Hlth 0.88 0.13 0.22 164
Manuf 0.22 0.70 0.34 69
NoDur 0.03 0.21 0.06 28
Other 0.25 0.10 0.15 153
Shops 0.75 0.10 0.17 62
Telcm 0.36 0.56 0.43 9
Utils 0.67 0.56 0.61 18
none 0.00 0.00 0.00 0
accuracy 0.20 695
macro avg 0.35 0.24 0.21 695
weighted avg 0.44 0.20 0.21 695
Confusion Matrix:
[[ 0 0 2 0 18 10 2 1 0 0 0]
[ 0 2 0 0 8 8 2 0 0 0 0]
[ 0 0 27 0 43 38 18 0 8 4 1]
[ 0 0 73 21 15 22 11 0 1 1 20]
[ 0 0 3 0 48 12 5 1 0 0 0]
[ 1 0 0 0 17 6 4 0 0 0 0]
[ 0 0 4 1 41 64 16 0 0 0 27]
[ 0 0 0 1 26 16 7 6 0 0 6]
[ 0 0 0 0 0 4 0 0 5 0 0]
[ 0 2 0 1 1 4 0 0 0 10 0]
[ 0 0 0 0 0 0 0 0 0 0 0]]
Trainer#
Create the model trainer using training arguments, a LoRA configuration, and a dataset.
trainer = SFTTrainer(
model=model,
args=args,
train_dataset=train_data,
peft_config=peft_config,
# dataset_text_field="text",
processing_class=tokenizer
)
# Initiate model training
cuda_memory('before training')
trainer_stats = trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)
------ BEFORE TRAINING ------
Total memory: 15.74 GB
Reserved memory: 11.04 GB
Allocated memory: 8.22 GB
Free memory: 4.70 GB
{'loss': 1.1984, 'grad_norm': 0.1371612697839737, 'learning_rate': 0.0001670747898848231, 'num_tokens': 1091299.0, 'mean_token_accuracy': 0.7146163220703602, 'epoch': 0.5755395683453237}
{'loss': 1.1205, 'grad_norm': 0.16719305515289307, 'learning_rate': 8.029070592154895e-05, 'num_tokens': 2179799.0, 'mean_token_accuracy': 0.7273549642927366, 'epoch': 1.1496402877697842}
{'loss': 1.034, 'grad_norm': 0.19266854226589203, 'learning_rate': 9.47361624665869e-06, 'num_tokens': 3270551.0, 'mean_token_accuracy': 0.7437317748367787, 'epoch': 1.725179856115108}
{'train_runtime': 9602.662, 'train_samples_per_second': 0.579, 'train_steps_per_second': 0.072, 'train_loss': 1.103452600044888, 'num_tokens': 3784895.0, 'mean_token_accuracy': 0.7479746815689067, 'epoch': 1.99568345323741}
# Save trained model and tokenizer
model.config.use_cache = True
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
cuda_memory('after training', trainer_stats=trainer_stats)
------ AFTER TRAINING ------
9602.662 seconds used for training.
Total memory: 15.74 GB
Reserved memory: 14.43 GB
Allocated memory: 8.26 GB
Free memory: 1.31 GB
Evaluation#
y_pred = predict(X_test, model, tokenizer, verbose=False)
Series(y_pred).value_counts()
0%| | 0/695 [00:00<?, ?it/s]/home/terence/env3.11/lib/python3.11/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
100%|██████████| 695/695 [08:21<00:00, 1.39it/s]
Hlth 168
Other 156
HiTec 140
Manuf 59
Shops 59
NoDur 34
Durbl 29
Enrgy 21
Utils 19
Telcm 10
Name: count, dtype: int64
evaluate(y_test, y_pred)
Accuracy: 0.829
Classification Report:
precision recall f1-score support
Durbl 0.83 0.73 0.77 33
Enrgy 0.90 0.95 0.93 20
HiTec 0.79 0.80 0.80 139
Hlth 0.89 0.91 0.90 164
Manuf 0.80 0.68 0.73 69
NoDur 0.59 0.71 0.65 28
Other 0.85 0.86 0.85 153
Shops 0.83 0.79 0.81 62
Telcm 0.90 1.00 0.95 9
Utils 0.84 0.89 0.86 18
accuracy 0.83 695
macro avg 0.82 0.83 0.83 695
weighted avg 0.83 0.83 0.83 695
Confusion Matrix:
[[ 24 0 6 0 1 1 0 1 0 0]
[ 0 19 0 0 1 0 0 0 0 0]
[ 0 2 111 7 2 2 12 2 1 0]
[ 0 0 9 149 1 1 3 1 0 0]
[ 5 0 4 2 47 6 2 2 0 1]
[ 0 0 0 1 2 20 2 3 0 0]
[ 0 0 10 4 5 1 132 1 0 0]
[ 0 0 0 4 0 3 4 49 0 2]
[ 0 0 0 0 0 0 0 0 9 0]
[ 0 0 0 1 0 0 1 0 0 16]]
# merge and save model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
del model
del trainer
torch.cuda.empty_cache()
cuda_memory('after empty')
# Reload base model and tokenizer to cpu
device_map = "cpu"
tokenizer = AutoTokenizer.from_pretrained(base_model)
base_model_reload = AutoModelForCausalLM.from_pretrained(
base_model,
return_dict=True,
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
device_map=device_map, # "cpu", # "auto",
trust_remote_code=True,
)
# Merge adapter with base model
from peft import PeftModel
model = PeftModel.from_pretrained(base_model_reload, output_dir, device_map=device_map)
model = model.merge_and_unload()
# Save the merged model
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)
# Reload merged model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    return_dict=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
# Check it is working
y_pred = predict(X_test, model, tokenizer)
evaluate(y_test, y_pred)
References:
Philipp Krähenbühl, 2025, “AI395T Advances in Deep Learning course materials”, retrieved from https://ut.philkr.net/advances_in_deeplearning/
Tim Dettmers, 2022, “bitsandbytes: 8-bit Optimizers and Quantization for PyTorch”, GitHub repository: TimDettmers/bitsandbytes