How can you get a Huggingface fine-tuning model with the Trainer class from your own text where you can set the arguments for truncation and padding?

Question

I want to find out the role of truncation and padding in Huggingface Transformers pretrained models and any fine-tuning model on top of that. Therefore I played around with these parameters, but I could not find a way to set them for my own text input for the fine-tuning model.

To see a default guide, the code that can help understand what this question is about is at Huggingface - Transformers - Fine-tune a pretrained mode. Here, you can pass the truncation parameter to a function that takes care of each example. If a dataset has 10 examples (10 quoted sentences, each in a new row), you will apply this function tokenize_function to each example.

from datasets import load_dataset
from transformers import AutoTokenizer
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
return tokenizer(examples[&quot;text&quot;], padding=&quot;max_length&quot;, truncation=True)



tokenized_datasets = dataset.map(tokenize_function, batched=True)

Afterwards, you can go on, the Trainer class takes care of the rest:

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()

Now how would that be done if I did not load the dataset from the datasets module but if I had just an input text like "A dog jumps over the wall"? I want to take the German GPT2 and run a text generation model as a fine-tuning model on top of that.

How could I tell the tokenizer that it should or should not truncate, and then take that output and run the fine-tuning model on it so that it runs through?

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
text = "Wie ist das Wetter heute?"
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained(model_name)
Unclear why, I had to add these two lines of code:
Check if the tokenizer has a padding token
if tokenizer.pad_token is None:
    # If not, assign the eos_token as the padding token
    tokenizer.pad_token = tokenizer.eos_token

HERE: --> make the text a dataset that can be tokenized with or without truncation
THIS BLOCK NEEDS TO BE REPLACED BY THE NEEDED CODE
output object: dataset

And here is again the function of the guide above that is called by the map function
of the dataset object. This is likely not the only way to get the text above
tokenized with or without truncation.
def tokenize_function(examples):
return tokenizer(examples[&quot;text&quot;], padding=&quot;max_length&quot;, truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)
training_args = TrainingArguments(output_dir="test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
)
trainer.train()

I could make a dataset object by loading the text from a text file instead, like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, TextDataset, Trainer, TrainingArguments
model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
file_path = './myfile.txt'
train_dataset = TextDataset( #LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=file_path,
    block_size=512,
    overwrite_cache=True,
    # truncation=True, # cannot be done in this object
    # padding=True, # cannot be done in this object
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    save_steps=save_steps,
)
model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

But then I do not know how to set the truncation or padding arguments, uncommenting truncation=True leads to TypeError: __init__() got an unexpected keyword argument 'truncation', same with padding.

And when I try the truncation afterwards:

    train_dataset = dataset.map(
        lambda examples: tokenize_function(examples, tokenizer, bln_truncation),
        batched=True,
    )

It throws an error:

AttributeError: 'TextDataset' object has no attribute 'map'

The same with

    dataset = LineByLineTextDataset( #LineByLineTextDataset( #TextDataset(
        tokenizer=tokenizer,
        file_path='myfile.txt',
        block_size=128,
#         truncation=True, # cannot be done in this object
#         padding=True, # cannot be done in this object
    )

Out:

AttributeError: 'LineByLineTextDataset' object has no attribute 'map'

Therefore, I tried it with the load_dataset module of the datasets package to get the map() method of the class:

    from datasets import load_dataset
dataset = load_dataset(&quot;text&quot;, data_files=file_path)

# with this `['train']` key as a first test
train_dataset = dataset['train']

But the fine-tuning at trainer.train() throws:

File ~/.local/lib/python3.9/site-packages/datasets/dataset_dict.py:48, in DatasetDict.__getitem__(self, k)
     44 available_suggested_splits = [
     45     str(split) for split in (Split.TRAIN, Split.TEST, Split.VALIDATION) if split in self
     46 ]
     47 suggested_split = available_suggested_splits[0] if available_suggested_splits else list(self)[0]
---> 48 raise KeyError(
     49     f"Invalid key: {k}. Please first select a split. For example: "
     50     f"`my_dataset_dictionary['{suggested_split}'][{k}]`. "
     51     f"Available splits: {sorted(self)}"
     52 )
KeyError: "Invalid key: 0. Please first select a split. For example: my_dataset_dictionary['train'][0]. Available splits: ['train']"

And if I code it without this ['train'] key since I want to have the full data and not a slice:

    from datasets import load_dataset
dataset = load_dataset(&quot;text&quot;, data_files=file_path)

train_dataset = dataset

The fine-tuning at trainer.train() throws:

File /srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/_utils.py:644, in ExceptionWrapper.reraise(self)
    640 except TypeError:
    641     # If the exception takes multiple arguments, don't try to
    642     # instantiate since we don't know how to
    643     raise RuntimeError(msg) from None
--> 644 raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(input, kwargs)
  File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(args, kwargs)
  File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1046, in forward
    transformer_outputs = self.transformer(
  File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, kwargs)
  File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 889, in forward
    outputs = block(
  File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(args, kwargs)
  File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 390, in forward
    attn_outputs = self.attn(
  File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(args, kwargs)
  File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 312, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/srv/home/seid/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, kwargs)
  File "/srv/home/tester/.local/lib/python3.9/site-packages/transformers/pytorch_utils.py", line 107, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

All of this coding runs in circles when I ask ChatGPT 3.5 for help (I did not test a higher version). It will begin to subclass classes like class MyTrainer(Trainer) or class MyDataCollator(DataCollatorForLanguageModeling) and tries new functions like collate_function(batch, tokenizer) or overrides or overloads methods. After trying this for hours, I do not get to any code where I can just set the truncation, to begin with, let alone the padding, and get a working fine-tuned text generation model from this tokenized short input text.

How should I code a Huggingface fine-tuning model with the Trainer class where I can set the arguments for the truncation and padding parameters to tokenize my own short input text and where afterwards trainer.train() runs through?

PS: why I ask

I want to know this since I feed a text generation fine-tuning model with a short essay and ask it questions (even if it is not a Question Answering model). I do this since a Question Answering model is only cutting pieces from the text, there is no free speech, and even though the answers are not bad either, they are not as if a human being would make judgements, it is more a fishing for some keywords and their embedding. In short, the output is not good, I already tried a lot of hyperparameters. I do not split the text into sentences but load the whole text as one so-called "example" of the dataset. I do not know whether this is good, I just hope that the model understands the text better if it is one full text in one "example".

I ask it questions regarding the whole text, not just some split sentences. Therefore, I want to make sure that the tokenizer works such that the whole text is read and tokenized without dropping any text. To check this, I want to change truncation from the default False to True only to see whether the output is worse. If it is not worse, I would know that it gets truncated to the max_length of the model. In short, I want to find out with this question here whether I need to change the code and split the text into many examples or not, but I also want to play around with the parameters and want to know whether truncation plays any role for the output at all.

Valentas · Answer 1 · 2024-01-12T12:38:46.183

If you use transformers example scripts, for example, summarization, you can control padding and truncation with command line arguments --pad_to_max_length, --max_source_length, --max_target_length (which are just passed as Trainer or tokenizer parameters).

Of course, if you set --max_source_length unreasonably high, you will soon run out of your GPU memory.

To use your own dataset, specify --train_file and --validation_file as described in their README.

If you want to use this in your own Python code, you can see how it is implemented in run_summarization.py.

For other types of tasks explore the content of other examples in this repository.

questionto42 · Accepted Answer · 2024-01-29T13:50:14.077

Bad workaround: TextDataset classes

No control of truncation/padding in Transformer TextDataset classes

At least for TextDataset and LineByLineTextDataset, there just might be no such attribute:

TextDataset can’t set max_seq_length?:

TextDatasetForNextSentencePrediction (TDFNSP) in combination with a BertModel cannot handle sequences longer than 512. TDFNSP does not truncate or pad. The input for TDFNSP should be sentences separated by a whiteline (\n), and an additional whiteline between documents. If you preprocess your data this way, I cannot imagine one single sentence to be longer than 512 tokens. If this is the case, then it is not a sentence.

Yet, after some time of trying to read the sentences that I marked with a \n as the eos_token (end of sentence), I can say that TextDataset does not care about them. Only the TDFNSP needs them. This does not help here to get a split into sentences or other blocks where the rest of the space is filled with padding. The Textdataset just puts everything into blocks and cares only about not cutting any token in halves, but it does not care about sentence endings. This makes sense since its aim is to read large text files into memory, it is not a sentencizer, let alone a tokenizer.

Here is another link that might be a hint for us, though it is again about the TDFNSP: TextDatasetForNextSentencePrediction does not seem to contain truncate function unlike LineByLineWithSOPTextDataset.

TextDateForNextSentencePrediction, unlike LineByLineWithSOPTextDataset, does not have the truncate feature like truncate_seq_pair in the create_examples_from_document function. The code then was fixed with:

     def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens): 
         """Truncates a pair of sequences to a maximum sequence length.""" 
         while True: 
             total_length = len(tokens_a) + len(tokens_b) 
             if total_length <= max_num_tokens: 
                 break 
             trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b 
             assert len(trunc_tokens) >= 1 
             # We want to sometimes truncate from the front and sometimes from the 
             # back to add more randomness and avoid biases. 
             if random.random() < 0.5: 
                 del trunc_tokens[0]

From this, we see that there is quite a mess in the code. If you need to patch it to get some truncation done, it can also be that the other class "TextDataset" does not have fully grown code.

The TextDataset splits the text input by the block_size=512 into blocks. And for the very last block, there is a sort of truncation without padding: the very last block that does not reach 512 tokens is just dropped. For example, I lose a length of 250 tokens when I load my text into a TextDataset.

TextDataset needs a bad workaround and is outdated

One could take the TextDataset and add 511 padding "[PAD]" tags hardcoded at the end of the text file so that the last block will always have just padding tags when it gets dropped. I have tried adding this inside the TextDataset by sub-classing it and overriding some methods, but I could not get it to work since the TextDataset seems to lose this last block already at the very beginning, it is not just a matter of a loop that stops one block too early. One could fix it with more work, but it is outdated anyway, you can check this in the next link.

Fix: Making a dataset for the `Trainer` class with the text tokenizer from the pretrained model and the `datasets` package

tokenizer from the pretrained model

If the TextDataset loses its last block, I thought I should check the tokenizer again to load the text from a variable and make that a Dataset object. The good thing is that the tokenizer function of the model has all the parameters that I need:

from datasets import load_dataset
path=r'myfile.txt'
context = load_dataset("text", data_files=path)
text = context['train']['text'][0]
encoded = tokenizer.encode_plus(
    text,
    max_length=512, truncation=False, padding="max_length", return_tensors="pt")

If you then run tokenizer.decode(encoded['input_ids'][0][-300:]), you see that it has the missing text since it is not split into blocks at all, while if you ran it with truncation=True, the text would be cut already after 512 tokens, thus, most of the text would not become an input of the model.

More on the parameters of the `tokenizer.encode_plus()`

I have not digged into all of the parameters of this method. The encode_plus() method is more powerful than encode(), see what's difference between tokenizer.encode and tokenizer.encode_plus in Hugging Face.


> PreTrainedTokenizerBase.encode_plus(self, text, text_pair,
> add_special_tokens, padding, truncation, max_length, stride,
> is_split_into_words, pad_to_multiple_of, return_tensors,
> return_token_type_ids, return_attention_mask,
> return_overflowing_tokens, return_special_tokens_mask,
> return_offsets_mapping, return_length, verbose, **kwargs)

And these parameters of tokenizer.encode_plus() are outdated but might show that there are strategies:

> Backward compatibility for truncation_strategy, pad_to_max_length,  
> padding_strategy, truncation_strategy, max_length,  
> kwargs = self._get_padding_truncation_strategies(...

Fixed code with `from datasets import load_dataset`

And here is how you can put that in the Trainer class as the argument for "train_dataset".

We must switch to the from dataset import Dataset class, as the warning (see Outdated Transformers TextDataset class drops last block when text overlaps. Replace by datasets Dataset class as input of Trainer train_dataset?)

/srv/home/my_user/.local/lib/python3.9/site-packages/transformers/data/datasets/language_modeling.py:54: FutureWarning: This dataset will be removed from the library soon, preprocessing should be handled with the Datasets library. You can have a look at this example script for pointers: https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py

tells you to, but we will need only from datasets import load_dataset instead. You can make the needed dataset from your own text file for the Trainer class where you can also handle the needed parameters, see again Outdated Transformers TextDataset class drops last block when text overlaps. Replace by datasets Dataset class as input of Trainer train_dataset?.

The code that runs through:

from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset
model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
file_path = './myfile.txt'
padding = "max_length"
bln_truncation = False
dataset = load_dataset("text", data_files={"train": file_path})
block_size = 512
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=block_size, eos_token="\n", pad_token="[PAD]")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=padding, truncation=bln_truncation)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=10000,
)
model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
)
trainer.train()

Leave Trainer class and switch to native PyTorch/Keras?

The Pytorch Trainer class has an open issue since August 2020 that you cannot set the device, it will always default to "GPU 0": Transformers Trainer: "RuntimeError: module must have its parameters ... on device cuda:6 (device_ids[0]) but found one of them on device: cuda:0", and as we saw above, it is quite hard to feed with the right self-built dataset. Even if I got it to work now with the datasets package and this small workaround of setting "GPU 0" as the device and in front of the device lists, native PyTorch/Keras might be a good choice if this black box Pytorch Trainer class seems to be buggy again. Yet, mind that the Trainer class seems to be first choice at Huggingface Fine-tune a pretrained model.

How can you get a Huggingface fine-tuning model with the Trainer class from your own text where you can set the arguments for truncation and padding?

Unclear why, I had to add these two lines of code:

Check if the tokenizer has a padding token

HERE: --> make the text a dataset that can be tokenized with or without truncation

THIS BLOCK NEEDS TO BE REPLACED BY THE NEEDED CODE

output object: dataset

And here is again the function of the guide above that is called by the map function

of the dataset object. This is likely not the only way to get the text above

tokenized with or without truncation.

PS: why I ask

2 Answers2

Bad workaround: TextDataset classes

No control of truncation/padding in Transformer TextDataset classes

TextDataset needs a bad workaround and is outdated

Fix: Making a dataset for the `Trainer` class with the text tokenizer from the pretrained model and the `datasets` package

tokenizer from the pretrained model

More on the parameters of the `tokenizer.encode_plus()`

Fixed code with `from datasets import load_dataset`

Leave Trainer class and switch to native PyTorch/Keras?

Linked

How can you get a Huggingface fine-tuning model with the Trainer class from your own text where you can set the arguments for truncation and padding?

Unclear why, I had to add these two lines of code:

Check if the tokenizer has a padding token

HERE: --> make the text a dataset that can be tokenized with or without truncation

THIS BLOCK NEEDS TO BE REPLACED BY THE NEEDED CODE

output object: dataset

And here is again the function of the guide above that is called by the map function

of the dataset object. This is likely not the only way to get the text above

tokenized with or without truncation.

PS: why I ask

2 Answers2

Bad workaround: TextDataset classes

No control of truncation/padding in Transformer TextDataset classes

TextDataset needs a bad workaround and is outdated

Fix: Making a dataset for the Trainer class with the text tokenizer from the pretrained model and the datasets package

tokenizer from the pretrained model

More on the parameters of the tokenizer.encode_plus()

Fixed code with from datasets import load_dataset

Leave Trainer class and switch to native PyTorch/Keras?

Linked

Fix: Making a dataset for the `Trainer` class with the text tokenizer from the pretrained model and the `datasets` package

More on the parameters of the `tokenizer.encode_plus()`

Fixed code with `from datasets import load_dataset`