Sunday, October 15, 2023

Step by Step Fine-Tuning Mistral 7B with Custom Dataset

Large Language Models are trained on huge amounts of data. The Falcon 40B model, for example, has 40 billion parameters and was trained on 1 trillion tokens. This training took around two months on 384 GPUs on AWS.

If you want to use these LLMs on your own data, you need to adapt or fine-tune them. Fine-tuning a model larger than 10B parameters is an expensive and time-consuming task.

This is where Hugging Face's PEFT library comes in handy. PEFT stands for Parameter-Efficient Fine-Tuning. We can use a fine-tuning technique called QLoRA to train LLMs on our own dataset in far less time and with far fewer resources. QLoRA stands for Quantized Low-Rank Adaptation; it lets us train a small set of added parameters on top of a quantized model without losing much model quality. After training is completed, there is no need to save the entire model, because the base model remains frozen and only the small adapter has changed.
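To get a feel for how small that trainable portion is, here is a rough count for a single weight matrix. The 4096 x 4096 shape and the rank of 16 are assumptions picked for illustration, not values taken from this post.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in a LoRA adapter: B is (d_out x r) and A is (r x d_in)."""
    return r * (d_out + d_in)


# Assumed example shape: one 4096 x 4096 projection matrix with rank r = 16.
d_out, d_in, r = 4096, 4096, 16
full = d_out * d_in
adapter = lora_param_count(d_out, d_in, r)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters")
print(f"trainable fraction: {adapter / full:.2%}")
```

Under these assumptions the adapter holds less than 1% of the parameters of the matrix it modifies, which is why only the adapter needs to be saved after training.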

Python Package Installation:


We begin by installing all the required dependencies. 

- The Hugging Face Transformer Reinforcement Learning (TRL) library simplifies supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF) setups.

- Transformers is a Python library that makes downloading and training state-of-the-art ML models easy.

- Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code.

- Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters.

- Datasets is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks. 

- Bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers and quantization functions.

- einops stands for Einstein-Inspired Notation for operations. It is an open-source python framework for writing deep learning code in a new and better way.

- Tiktoken is an open-source tool developed by OpenAI for tokenizing text. Tokenization splits a text string into a list of tokens. Tokens can be letters, words, or groups of words.

- By using wandb, you can track, compare, explain and reproduce machine learning experiments.

- xFormers is a PyTorch based library which hosts flexible Transformers parts.

- SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training.

!pip install -q trl transformers accelerate peft datasets bitsandbytes einops tiktoken wandb xformers sentencepiece
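If you want to confirm that the installation worked before going further, a small check like the following can help. It is only a sketch, and it assumes each package's import name is the same as its pip name.

```python
import importlib.util

# Import names for the pip packages above (assumed to match the pip names).
REQUIRED = ["trl", "transformers", "accelerate", "peft", "datasets",
            "bitsandbytes", "einops", "tiktoken", "wandb", "xformers",
            "sentencepiece"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_packages(REQUIRED))
```

An empty list means every package can be found; anything listed needs to be installed again.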

Prepare Dataset:


I will be using the Gath_baize dataset, which contains approximately 210k prompts, to train Mistral-7B. The dataset is a mixture of data from the Alpaca, Stack Overflow, medical, and Quora datasets. The load_dataset call below loads the full train split, because we are going to use this dataset for training. If we only wanted to test, we would pass split="test" instead. To keep the run short, we then shuffle the data and keep a sample of 50 examples.

from datasets import load_dataset

gathbaize = load_dataset("gathnex/Gath_baize",split="train")



gathbaize_sampled = gathbaize.shuffle(seed=42).select(range(50))
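Conceptually, the shuffle-then-select step draws a reproducible random subset. The plain-Python sketch below shows the idea on a stand-in list; `sample_rows` is a hypothetical helper, and it will not pick the same rows as the datasets library does for the same seed.

```python
import random


def sample_rows(rows, n, seed):
    """Shuffle a copy of `rows` with a fixed seed and keep the first `n`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]


rows = [f"prompt {i}" for i in range(1000)]  # stand-in for the real dataset
subset = sample_rows(rows, n=50, seed=42)
print(len(subset))
```

Because the seed is fixed, running this again returns the same 50 rows, which is what makes small experiments comparable.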


Check for GPU:


The NVIDIA System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
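In a notebook the usual check is simply to run `!nvidia-smi` in a cell. If you prefer to do it from Python, the following sketch calls the same utility and returns nothing when no NVIDIA driver is installed; `gpu_summary` is a hypothetical helper name.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one line per GPU from nvidia-smi, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


print(gpu_summary() or "No NVIDIA GPU detected")
```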


Create LLM Model:


- Torch (PyTorch) is an open-source ML library used for creating deep neural networks.

- AutoModelForCausalLM loads auto-regressive (causal) language models. Auto-regressive models predict the next value from the values that came before it.

- A tokenizer is responsible for preprocessing text into an array of numbers as inputs to a model.

- The bitsandbytes library simplifies model quantization, making it more accessible and user-friendly.

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from peft import prepare_model_for_kbit_training  # prepares the quantized model for fine-tuning

model_name = "ybelkada/Mistral-7B-v0.1-bf16-sharded"

- BitsAndBytesConfig holds the quantization settings used for QLoRA. QLoRA reduces the memory usage of LLM finetuning without performance tradeoffs compared to standard 16-bit model finetuning. QLoRA uses 4-bit quantization to compress a pretrained language model. The LM parameters are then frozen and a relatively small number of trainable parameters are added to the model in the form of Low-Rank Adapters. During finetuning, QLoRA backpropagates gradients through the frozen 4-bit quantized pretrained language model into the Low-Rank Adapters. The LoRA layers are the only parameters being updated during training.

- The basic way to load a model in 4bit is to pass the argument load_in_4bit=True

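- There are different variants of 4-bit quantization, such as NF4 (4-bit NormalFloat) and plain FP4. NF4 is recommended for better model performance. Note that BitsAndBytesConfig uses "fp4" as the default bnb_4bit_quant_type, so NF4 has to be requested explicitly.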

- You can change the compute dtype of the quantized model by just changing the bnb_4bit_compute_dtype argument. A dtype (data type) object describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.

- bnb_4bit_use_double_quant uses a second quantization after the first one to save an additional 0.4 bits per parameter. 

bnb_config = BitsAndBytesConfig(

    load_in_4bit= True,

    bnb_4bit_quant_type= "nf4",

    bnb_4bit_compute_dtype= torch.bfloat16,

    bnb_4bit_use_double_quant= False,

)
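To see why 4-bit loading matters, here is some back-of-the-envelope arithmetic on weight storage. It assumes a round 7 billion parameters and ignores quantization constants, activations, and optimizer state, so treat the numbers as rough estimates only.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumed round figure for a "7B" model

print(f"16-bit weights: {weight_memory_gb(n_params, 16):.2f} GB")
print(f"4-bit weights:  {weight_memory_gb(n_params, 4):.2f} GB")
print(f"0.4 bits per parameter is about {weight_memory_gb(n_params, 0.4):.2f} GB")
```

The last line is the extra saving that double quantization would bring at 0.4 bits per parameter, which is modest next to the jump from 16-bit to 4-bit.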


- trust_remote_code: whether or not to allow custom models defined on the Hub in their own modeling files. Enable it only for repositories you trust.

model = AutoModelForCausalLM.from_pretrained(

    model_name,

    quantization_config= bnb_config,

    trust_remote_code= True,

)

- use_cache controls whether the model returns its past key/value states to speed up generation. This cache is not needed during training and does not work together with gradient checkpointing, so we disable it.

- Setting config.pretraining_tp to a value other than 1 activates the more accurate but slower computation of the linear layers; 1 keeps the standard computation.

- Gradient checkpointing saves memory by recomputing activations during the backward pass instead of storing them, at the cost of extra compute. It is mainly needed when training would otherwise run out of memory (OOM).

model.config.use_cache = False

model.config.pretraining_tp = 1


model = prepare_model_for_kbit_training(model)

Create LLM Tokenizer:


- pad_token is a special token used to make arrays of tokens the same size for batching purposes.

- eos_token is a special token used as an end-of-sentence token.

- bos_token is a special token representing the beginning of a sentence.

tokenizer = AutoTokenizer.from_pretrained(model_name,trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token

tokenizer.add_eos_token = True

tokenizer.add_bos_token, tokenizer.add_eos_token
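To illustrate what padding does when sequences of different lengths are batched, here is a minimal plain-Python sketch. The token ids are made up, and `pad_batch` is a hypothetical helper rather than part of the tokenizer API.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids; 2 stands in for the EOS id that is reused as the pad id.
ids, mask = pad_batch([[1, 15, 27, 2], [1, 42, 2]], pad_id=2)
print(ids)   # [[1, 15, 27, 2], [1, 42, 2, 2]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell a real EOS token apart from padding, even though both share the same id here.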

from peft import LoraConfig, TaskType

- LoraConfig allows you to control how LoRA is applied to the base model through the following parameters:

lora_alpha: LoRA scaling factor.

r: the rank of the update matrices, expressed in int. Lower rank results in smaller update matrices with fewer trainable parameters.

bias: Specifies if the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.

target_modules: The modules (for example, attention blocks) to which the LoRA update matrices are applied.

lora_dropout: The probability that each neuron's output is set to zero during training, used to prevent overfitting.

peft_config = LoraConfig(

    # Set lora_alpha, r, lora_dropout, bias and task_type here, as described above.

    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],

)


from peft import get_peft_model

model = get_peft_model(model,peft_config)
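To make the roles of r and lora_alpha more concrete, here is a tiny plain-Python sketch of what a LoRA-wrapped linear layer computes: the frozen output plus a low-rank update scaled by lora_alpha / r. The matrices are made-up toy values and the helper names are hypothetical; the real layers work on tensors and also apply dropout.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for a single input vector."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Tiny made-up example: 2x2 frozen weight, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # shape (r, d_in)
B = [[0.5], [0.0]]  # shape (d_out, r); LoRA initialises B to zeros in practice
x = [2.0, 3.0]

print(lora_forward(W, A, B, x, lora_alpha=2, r=1))  # [7.0, 3.0]
```

Only A and B would receive gradients during training; W stays untouched, which is what keeps the base model frozen.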

from transformers import TrainingArguments

- num_train_epochs (`float`, *optional*, defaults to 3.0): total number of training epochs to perform.

- per_device_train_batch_size is the batch size per GPU/TPU core/CPU for training.

- Gradient accumulation (gradient_accumulation_steps) is a technique that simulates a larger batch size by accumulating gradients from multiple small batches before performing a weight update. This technique can be helpful in scenarios where the available memory is limited, and the batch size that can fit in memory is small.

- learning_rate tells the optimizer how far to move the weights in the direction opposite to the gradient for a mini-batch.

- warmup_ratio is the ratio of total training steps used for a linear warmup from 0 to learning_rate.

- max_steps: if set to a positive number, the total number of training steps to perform; it overrides num_train_epochs. The value -1 used below leaves num_train_epochs in control.

training_arguments = TrainingArguments(

    output_dir= "./results",

    num_train_epochs= 1,

    per_device_train_batch_size= 8,

    gradient_accumulation_steps= 2,

    optim = "paged_adamw_8bit",

    save_steps= 5000,

    logging_steps= 30,

    learning_rate= 2e-4,

    weight_decay= 0.001,

    fp16= False,

    bf16= False,

    max_grad_norm= 0.3,

    max_steps= -1,

    warmup_ratio= 0.3,

    group_by_length= True,

    lr_scheduler_type= "constant",

)
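As a quick sanity check on these settings, the arithmetic below works out the effective batch size and how many steps a given warmup ratio covers. It assumes a single GPU, and the total of 100 steps is an arbitrary figure for illustration.

```python
import math

per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_devices = 1  # assumption: a single GPU

# Number of examples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 16


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Steps spent ramping the learning rate up from 0."""
    return math.ceil(total_steps * warmup_ratio)


print(warmup_steps(total_steps=100, warmup_ratio=0.3))  # 30
```

One caveat: as far as I know, the plain "constant" scheduler keeps the learning rate fixed from the first step, so the warmup ratio only has an effect with a scheduler that supports warmup, such as "constant_with_warmup".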


from trl import SFTTrainer

- The SFTTrainer is a light wrapper around the transformers Trainer to easily fine-tune language models or adapters on a custom dataset.

- max_seq_length: the maximum sequence length to use for the `ConstantLengthDataset` and for automatically creating the dataset. Defaults to `512`.

- SFTTrainer supports example packing, where multiple short examples are packed in the same input sequence to increase training efficiency.
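The idea behind packing can be sketched in a few lines of plain Python: tokenized examples are joined into one stream, separated by the EOS token, and the stream is cut into fixed-length chunks. This is a simplified illustration with made-up token ids and a hypothetical `pack_examples` helper; TRL's own implementation differs in its details.

```python
def pack_examples(tokenized_examples, eos_id, seq_length):
    """Concatenate examples (separated by EOS) and cut into fixed-size chunks."""
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens + [eos_id])
    # Drop the incomplete tail so every chunk has exactly seq_length tokens.
    n_chunks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]


# Made-up token ids; 0 stands in for the EOS id.
examples = [[5, 6], [7, 8, 9], [10]]
print(pack_examples(examples, eos_id=0, seq_length=4))
# [[5, 6, 0, 7], [8, 9, 0, 10]]
```

With packing, no positions are wasted on padding; without it, each example occupies its own sequence.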


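trainer = SFTTrainer(

    model= model,

    train_dataset= gathbaize_sampled,

    tokenizer= tokenizer,

    args= training_arguments,

    # Also pass dataset_text_field (the dataset column holding the text) and max_seq_length here.

    packing= False,

)

trainer.train()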



Saving the Model:


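trained_model_dir = './trained_model'

# Saves only the LoRA adapter weights and config, not the frozen base model.
model.save_pretrained(trained_model_dir)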


Load the Trained Model:


from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained(trained_model_dir)

trained_model = AutoModelForCausalLM.from_pretrained(

    config.base_model_name_or_path,

    quantization_config= bnb_config,

    trust_remote_code= True,

)

trained_model = PeftModel.from_pretrained(trained_model,trained_model_dir)

trained_model_tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path,trust_remote_code=True)

trained_model_tokenizer.pad_token = trained_model_tokenizer.eos_token

Create Generation Config for Prediction:


generation_config = trained_model.generation_config

generation_config.max_new_tokens = 1024

generation_config.temperature = 0.7

generation_config.top_p = 0.7

generation_config.num_return_sequences = 1

generation_config.pad_token_id = trained_model_tokenizer.pad_token_id

generation_config.eos_token_id = trained_model_tokenizer.eos_token_id
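To show what the temperature and top_p settings do to the next-token distribution, here is a small plain-Python sketch. The logits are made up and `top_p_filter` is a hypothetical helper; the library's own implementation differs in its details, and these settings only come into play when sampling is enabled (do_sample=True).

```python
import math


def top_p_filter(logits, temperature, top_p):
    """Return {token_index: probability} kept by temperature + nucleus filtering."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    kept, cumulative = {}, 0.0
    for idx in sorted(range(len(probs)), key=lambda i: probs[i], reverse=True):
        kept[idx] = probs[idx]
        cumulative += probs[idx]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    mass = sum(kept.values())
    return {idx: p / mass for idx, p in kept.items()}


print(top_p_filter([1.0, 0.8, 0.1, -1.0], temperature=0.7, top_p=0.7))
```

A lower temperature sharpens the distribution, and a lower top_p keeps fewer candidate tokens, so both push generation toward more predictable text.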


Model Inference:


device = 'cuda:0'

query = 'large text to be summarized'

user_prompt = 'Explain large language models'

system_prompt = 'The conversation between Human and AI assistant named MyMistral\n'

B_INST, E_INST = "[INST]", "[/INST]"

prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n{E_INST}"

encodings = trained_model_tokenizer(prompt, return_tensors='pt').to(device)


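with torch.inference_mode():

    outputs = trained_model.generate(

        input_ids= encodings.input_ids,

        attention_mask= encodings.attention_mask,

        generation_config= generation_config,

    )

outputs = trained_model_tokenizer.decode(outputs[0],skip_special_tokens=True)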

