I want to understand the role of truncation and padding in Huggingface Transformers pretrained models, and in any fine-tuning model built on top of them. I therefore played around with these parameters, but I could not find a way to set them when the input is my own text rather than a loaded dataset.
As a starting point, the code that shows what this question is about is in the guide Huggingface - Transformers - Fine-tune a pretrained model. There, you pass the truncation parameter to a function that handles each example. If a dataset has 10 examples (10 quoted sentences, each in its own row), this tokenize_function is applied to every example.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
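To see what this produces, one can look at a single tokenized example (a quick check I would run; the exact keys depend on the tokenizer):

# Inspect one tokenized example to see what the map step added
example = tokenized_datasets["train"][0]
print(example.keys())             # e.g. label, text, input_ids, token_type_ids, attention_mask
print(len(example["input_ids"]))  # with padding="max_length" this should be the model's max length (512 for bert-base-cased)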
Afterwards, you can go on; the Trainer class takes care of the rest:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
Now how would that be done if I did not load the dataset from the datasets module, but just had an input text like "A dog jumps over the wall"? I want to take the German GPT-2 (dbmdz/german-gpt2) and fine-tune a text generation model on top of it.
How can I tell the tokenizer that it should or should not truncate, and then feed that output to the fine-tuning model so that training runs through?
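For context, this is how I understand the truncation and padding flags when calling the tokenizer directly on a single string (a minimal check; max_length=12 is just an arbitrary value for illustration):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
tok.pad_token = tok.eos_token  # GPT-2 tokenizers have no pad token by default

sentence = "Wie ist das Wetter heute?"
# Without truncation/padding the output simply has one id per (sub)word token
print(tok(sentence)["input_ids"])
# truncation=True cuts anything longer than max_length; padding="max_length" fills shorter inputs up to max_length
print(tok(sentence, truncation=True, padding="max_length", max_length=12)["input_ids"])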
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer

text = "Wie ist das Wetter heute?"
model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
I am not sure why, but I had to add these two lines of code (presumably because the GPT-2 tokenizer does not come with a padding token):
# Check if the tokenizer has a padding token
if tokenizer.pad_token is None:
    # If not, assign the eos_token as the padding token
    tokenizer.pad_token = tokenizer.eos_token
# HERE: --> make the text a dataset that can be tokenized with or without truncation
# THIS BLOCK NEEDS TO BE REPLACED BY THE NEEDED CODE
# output object: dataset
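One idea I had for filling this gap is to wrap the string in a datasets.Dataset (a sketch using Dataset.from_dict; I do not know whether this is the intended way):

from datasets import Dataset

# Wrap the single input text in a dataset with one "text" column and one row
dataset = Dataset.from_dict({"text": [text]})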
Here is again the function from the guide above, which is called by the map function of the dataset object. This is likely not the only way to get the text above tokenized with or without truncation:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments(output_dir="test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)
trainer.train()
I could make a dataset object by loading the text from a text file instead, like this:
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TextDataset, Trainer, TrainingArguments

model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

file_path = './myfile.txt'
train_dataset = TextDataset(  # or LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=file_path,
    block_size=512,
    overwrite_cache=True,
    # truncation=True,  # cannot be done in this object
    # padding=True,     # cannot be done in this object
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    save_steps=save_steps,
)
model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
But then I do not know how to set the truncation or padding arguments: uncommenting truncation=True leads to TypeError: __init__() got an unexpected keyword argument 'truncation', and the same happens with padding.
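One way to confirm which keyword arguments the class accepts is to inspect its constructor (a quick check using the standard library; it shows that truncation and padding are simply not parameters of TextDataset):

import inspect
from transformers import TextDataset

# Print the constructor signature to see which keyword arguments are accepted
# (truncation and padding are not among them, which explains the TypeError)
print(inspect.signature(TextDataset.__init__))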
And when I try the truncation afterwards:
train_dataset = dataset.map(
    lambda examples: tokenize_function(examples, tokenizer, bln_truncation),
    batched=True,
)
It throws an error:
AttributeError: 'TextDataset' object has no attribute 'map'
The same with
dataset = LineByLineTextDataset(  # or TextDataset(
    tokenizer=tokenizer,
    file_path='myfile.txt',
    block_size=128,
    # truncation=True,  # cannot be done in this object
    # padding=True,     # cannot be done in this object
)
Out:
AttributeError: 'LineByLineTextDataset' object has no attribute 'map'
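To rule out that I was just calling it wrong, I checked what kind of objects these classes are (a minimal check; as far as I can tell they are plain PyTorch datasets, not datasets.Dataset objects, which is why there is no map method):

import torch
from transformers import LineByLineTextDataset, TextDataset

# Both classes are torch.utils.data.Dataset subclasses, not datasets.Dataset,
# so they do not provide a map() method
print(issubclass(TextDataset, torch.utils.data.Dataset))            # True
print(issubclass(LineByLineTextDataset, torch.utils.data.Dataset))  # True
print(hasattr(TextDataset, "map"))                                  # False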
Therefore, I tried it with the load_dataset function of the datasets package, so that I get a dataset class that has the map() method:
from datasets import load_dataset
dataset = load_dataset("text", data_files=file_path)
# with this `['train']` key as a first test
train_dataset = dataset['train']
But the fine-tuning at trainer.train() throws:
File ~/.local/lib/python3.9/site-packages/datasets/dataset_dict.py:48, in DatasetDict.__getitem__(self, k)
44 available_suggested_splits = [
45 str(split) for split in (Split.TRAIN, Split.TEST, Split.VALIDATION) if split in self
46 ]
47 suggested_split = available_suggested_splits[0] if available_suggested_splits else list(self)[0]
---> 48 raise KeyError(
49 f"Invalid key: {k}. Please first select a split. For example: "
50 f"`my_dataset_dictionary['{suggested_split}'][{k}]`. "
51 f"Available splits: {sorted(self)}"
52 )
KeyError: "Invalid key: 0. Please first select a split. For example: my_dataset_dictionary['train'][0]. Available splits: ['train']"
And if I code it without this ['train'] key, since I want to have the full data and not a slice:
from datasets import load_dataset
dataset = load_dataset("text", data_files=file_path)
train_dataset = dataset
The fine-tuning at trainer.train() throws:
File /srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/_utils.py:644, in ExceptionWrapper.reraise(self)
640 except TypeError:
641 # If the exception takes multiple arguments, don't try to
642 # instantiate since we don't know how to
643 raise RuntimeError(msg) from None
--> 644 raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(input, kwargs)
File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(args, kwargs)
File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1046, in forward
transformer_outputs = self.transformer(
File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, kwargs)
File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 889, in forward
outputs = block(
File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(args, kwargs)
File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 390, in forward
attn_outputs = self.attn(
File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(args, kwargs)
File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, kwargs)
File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/pytorch_utils.py", line 107, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)
All of this coding runs in circles when I ask ChatGPT 3.5 for help (I did not test a higher version). It begins to subclass classes like class MyTrainer(Trainer) or class MyDataCollator(DataCollatorForLanguageModeling), introduces new functions like collate_function(batch, tokenizer), or overrides and overloads methods. After trying this for hours, I do not get to any code where I can simply set the truncation, to begin with, let alone the padding, and obtain a working fine-tuned text generation model from this tokenized short input text.
How should I code a Huggingface fine-tuning model with the Trainer class so that I can set the truncation and padding arguments when tokenizing my own short input text, and so that trainer.train() afterwards runs through?
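In other words, the shape of the solution I am hoping for is something like the following (just a sketch of what I imagine, pieced together from the attempts above; I do not know whether this is correct or whether the labels are handled properly by the data collator):

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "dbmdz/german-gpt2"
text = "Wie ist das Wetter heute?"

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the raw string in a dataset so that map() is available
dataset = Dataset.from_dict({"text": [text]})

# This is where I would like to control truncation and padding myself
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# mlm=False so the collator builds causal-LM labels from the input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=1,
                                  per_device_train_batch_size=1)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)
trainer.train()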
PS: why I ask
I want to know this because I feed a text generation fine-tuning model with a short essay and then ask it questions (even though it is not a Question Answering model). I do this because a Question Answering model only cuts pieces out of the text; there is no free-form text. Even though those answers are not bad either, they do not read as if a human being were making judgements; it is more a fishing for some keywords and their embedding. In short, the output is not good, and I already tried a lot of hyperparameters. I do not split the text into sentences but load the whole text as one so-called "example" of the dataset. I do not know whether this is good; I just hope that the model understands the text better if it is one full text in one "example".
I ask it questions about the whole text, not just about some split sentences. Therefore, I want to make sure that the tokenizer works in such a way that the whole text is read and tokenized without dropping any of it. To check this, I want to change truncation from the default False to True, simply to see whether the output gets worse. If it is not worse, I would know that the text gets truncated to the max_length of the model anyway. In short, with this question I want to find out whether I need to change the code and split the text into many examples or not, but I also want to play around with these parameters and learn whether truncation plays any role for the output at all.