Saturday, August 26, 2023

Tutorial - How Tokenization Works in LLMs

Tokenization means splitting large text into smaller units that LLMs can 'digest.' It converts text into numbers, because LLMs understand numbers, not text. Every model has its own tokenizer, so when you are using a model, make sure to use its correct tokenizer, otherwise the model's output could be wrong.




Let's see how the tokenizer works in a demo. I am using Google Colab. Let's first install some prerequisite libraries.

We are using AutoTokenizer from the transformers library, which automatically finds the right tokenizer for your model.


Let's tokenize some text:


!pip install transformers

!pip install datasets

!pip install huggingface_hub

import pandas as pd

import datasets

from pprint import pprint

from transformers import AutoTokenizer


from huggingface_hub import notebook_login

notebook_login()   # log in to the Hugging Face Hub (needed if the model repo is gated)


tokenizer = AutoTokenizer.from_pretrained("TinyPixel/Llama-2-7B-bf16-sharded")   # you can also try "stabilityai/stablecode-instruct-alpha-3b"


text = "I am in Sydney."


tokenized_text = tokenizer(text)["input_ids"]


tokenized_text


untokenized_text = tokenizer.decode(tokenized_text)

untokenized_text
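
If you want to see the actual sub-word pieces rather than just the numbers, the tokenizer also has tokenize() and convert_tokens_to_ids(). A small sketch (the exact pieces depend on the model's vocabulary):


tokens = tokenizer.tokenize(text)   # the sub-word strings the text gets split into

tokens

tokenizer.convert_tokens_to_ids(tokens)   # the same numbers as input_ids, minus any special tokens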


In the real world there will be a lot of text, so let's see an example of that:


list_text = ["I am in Sydney", "Near Bronte Beach", "Not near Blue mountains", "wow"]

tokenized_text = tokenizer(list_text)

tokenized_text["input_ids"]


As you can see, the lists in this output are not the same length. Models need every list of tokens to be the same length, because the batch is turned into a tensor of fixed shape. So the next step is to make all of these lists the same size. To do that, we first find the maximum length among the lists, and then extend each shorter list to that length. This process is called padding.
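
To make the idea concrete, here is a rough manual sketch (just for illustration; I am using the tokenizer's eos_token_id as the pad value, which is the same choice we make below):


ids_lists = tokenizer(list_text)["input_ids"]

max_len = max(len(ids) for ids in ids_lists)   # length of the longest list in the batch

padded = [ids + [tokenizer.eos_token_id] * (max_len - len(ids)) for ids in ids_lists]

padded   # every list now has the same length


In practice you don't do this by hand; the tokenizer can pad for you with padding=True: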


tokenizer.pad_token = tokenizer.eos_token   # this tokenizer has no pad token by default, so we reuse the EOS token

tokenized_texts_longest = tokenizer(list_text, padding=True)

tokenized_texts_longest["input_ids"]
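
Along with input_ids, the tokenizer also returns an attention_mask, where the padded positions are marked with 0 so the model knows to ignore them. You can check both like this:


[len(ids) for ids in tokenized_texts_longest["input_ids"]]   # all lists are the same length now

tokenized_texts_longest["attention_mask"]   # 0s mark the padded positions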


Another thing is that every model has a maximum context length, which is a limit on the number of tokens it can take. So we also need to truncate the tokens to fit within that limit. This is how you do it:


tokenized_texts_final = tokenizer(list_text, max_length=3, truncation=True, padding=True)

tokenized_texts_final["input_ids"]
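
Here max_length=3 is just to make the truncation visible. In practice you would truncate to the model's real limit; the tokenizer stores it in model_max_length, and you can also pad everything up to a fixed length with padding="max_length". A minimal sketch (512 is just an example limit, not necessarily this model's actual one):


tokenizer.model_max_length   # the model's token limit (some tokenizers report a very large placeholder if it wasn't saved)

tokenized_texts_512 = tokenizer(list_text, max_length=512, truncation=True, padding="max_length")

[len(ids) for ids in tokenized_texts_512["input_ids"]]   # every list is now exactly 512 tokens long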
