Yes, you need to care about truncation/padding/max numbers for your fine-tuning dataset
The question here is just about the max number, but since you also need to care about how you steer the preprocessing, see this linked question for code and more insight: How can you get a Huggingface fine-tuning model with the Trainer class from your own text where you can set the arguments for truncation and padding?
Unless you load a pretrained dataset with Auto classes, you should care about such parameters. Check whether your text is fully tokenized by decoding the input_ids that must be present in the dataset. Also keep in mind that the model has its own setup after tokenizing, which seems to outweigh the preprocessing setup: the preprocessing only hands the right input over to the training, and the model training reads and understands this input in its own way.
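A minimal sketch of that decode check, assuming the "dbmdz/german-gpt2" checkpoint used later in this answer and an illustrative sample text:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

sample_text = "Ein kurzer Beispielsatz für die Prüfung der Tokenisierung."
encoded = tokenizer(sample_text, truncation=True, max_length=512)

# Decode the input_ids back to text and compare with the original;
# if the end of your text is missing, it was truncated away.
print(tokenizer.decode(encoded["input_ids"]))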
You need to care about truncation and the like during both preprocessing and training. Check samples and try to understand what is going on. I still do not know exactly how the eos_token and pad_token come into play during fine-tuning and which arguments change what (one common pattern is sketched below); the next chapter is only about the max numbers. My guess is that changing the max number of tokens alone is not enough to steer the ship.
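Since the eos_token/pad_token question stays open here, this is only a hedged sketch of one pattern often used for GPT-2-style models, which ship without a pad token: reusing the eos token for padding. Treat it as an assumption/convention, not the authoritative setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token            # reuse eos as pad (assumption/convention)
    model.config.pad_token_id = model.config.eos_token_id

batch = tokenizer(["kurzer Satz", "ein etwas längerer Satz"],
                  padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
print(batch["input_ids"].shape)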
Auto classes (of GPT2) and DistilBert: max numbers
This picks up the other answer and checks the max numbers of the tokenizer and model objects for two example models ("german-gpt2" and "distilbert-base-cased-distilled-squad").
It shows that the AutoTokenizer is empty when it is loaded, which is why tokenizer.model_max_length is the highest number the data type can hold. That value does not play any role, and that is also why loading the model works at all: if you try loading the tokenizer with GPT2Tokenizer, it fails, but loading it with the empty AutoTokenizer instead, as the model card asks you to, works. From this I can see, without any further reading, that the model outweighs the settings of the tokenizer. Large text input will be split into blocks that the model allows, and how the tokenizer itself is set up does not seem to play a role. One could say that the AutoTokenizer is the tokenizer of the model's settings.
The checks also show that you sometimes have to search for the right model max variable yourself: the max number is called n_positions in the "german-gpt2" config but max_position_embeddings in the "distilbert-base-cased-distilled-squad" config, and yet you can ask both models for model.config.max_position_embeddings.
model_name = "dbmdz/german-gpt2"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer.max_len_sentences_pair)
print(tokenizer.max_len_single_sentence)
print(tokenizer.max_model_input_sizes)
print(tokenizer.model_max_length)
1000000000000000019884624838656
1000000000000000019884624838656
{'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
1000000000000000019884624838656
But once you set the tokenizer with your own arguments, the output changes:
tokenizer = AutoTokenizer.from_pretrained(
model_name, eos_token="<|endoftext|>", pad_token="[PAD]", model_max_length=512)
print(tokenizer.max_len_sentences_pair)
print(tokenizer.max_len_single_sentence)
print(tokenizer.max_model_input_sizes)
print(tokenizer.model_max_length)
512
512
{'gpt2': 1024, 'gpt2-medium': 1024, 'gpt2-large': 1024, 'gpt2-xl': 1024, 'distilgpt2': 1024}
512
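A short sketch, under the same arguments as above, showing that truncation=True without an explicit max_length now caps the encoding at the 512 tokens set via model_max_length (the long input text is just an illustration):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "dbmdz/german-gpt2", eos_token="<|endoftext|>",
    pad_token="[PAD]", model_max_length=512)

long_text = "ein Wort " * 2000
ids = tokenizer(long_text, truncation=True)["input_ids"]
print(len(ids))  # expected: 512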
Yet you do not seem to need this tokenizer setting at all, since the model itself tells you in which block size it reads the input:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.max_position_embeddings
1024
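If you want a single number to work with, a hedged sketch is to combine both sources and ignore the "unset" sentinel that an unconfigured tokenizer reports (the threshold used here is an arbitrary assumption):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

candidates = [model.config.max_position_embeddings]
if tokenizer.model_max_length < int(1e12):  # skip the huge placeholder value
    candidates.append(tokenizer.model_max_length)

effective_max = min(candidates)
print(effective_max)  # 1024 for this checkpoint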
Here is an insight into the max variables of DistilBert:
model_name = "distilbert-base-cased-distilled-squad"
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
print(tokenizer.max_len_sentences_pair)
print(tokenizer.max_len_single_sentence)
print(tokenizer.max_model_input_sizes)
print(tokenizer.model_max_length)
from transformers import DistilBertModel
model = DistilBertModel.from_pretrained(model_name)
model.config.max_position_embeddings
509
510
{'distilbert-base-uncased': 512, 'distilbert-base-uncased-distilled-squad': 512, 'distilbert-base-cased': 512, 'distilbert-base-cased-distilled-squad': 512, 'distilbert-base-german-cased': 512, 'distilbert-base-multilingual-cased': 512}
512
512
And if you check model.config of the GPT-2 model, you find these max variables; n_positions should be the same as max_position_embeddings:
"n_ctx": 1024,
"n_embd": 768,
...
"n_positions": 1024,
Deep dive
Needed for the checks:
import inspect
Example 1: German GPT2 (with Auto classes)
You have to load this model with Auto classes, see the model card section "Using the model". It does not ask you to pass any arguments to the AutoTokenizer object, and the class is empty but still works since the model config outweighs the tokenizer settings.
AutoTokenizer
model_name = "dbmdz/german-gpt2"
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
inspect.signature(AutoTokenizer)
<Signature ()>
dir(AutoTokenizer)
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'from_pretrained',
'register']
from transformers import AutoModel
model = AutoModel.from_pretrained(model_name)
model.config.max_position_embeddings
1024
dir(AutoModel)
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_model_mapping',
'from_config',
'from_pretrained',
'register']
GPT2Tokenizer
Even though this throws an error for the tokenizer of the german-gpt2 model, we can still check what its class and parameters look like.
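For reference, a minimal sketch of that check, which falls back to the AutoTokenizer recommended by the model card if the slow GPT2Tokenizer load indeed raises for this checkpoint:
from transformers import AutoTokenizer, GPT2Tokenizer

model_name = "dbmdz/german-gpt2"
try:
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
except Exception as err:  # whether this raises depends on the checkpoint's tokenizer files
    print(f"GPT2Tokenizer failed: {err}")
    tokenizer = AutoTokenizer.from_pretrained(model_name)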
from transformers import GPT2Tokenizer
inspect.signature(GPT2Tokenizer)
<Signature (vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', pad_token=None, add_prefix_space=False, add_bos_token=False, **kwargs)>
dir(GPT2Tokenizer)
['SPECIAL_TOKENS_ATTRIBUTES',
'__annotations__',
'__call__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__len__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_add_tokens',
'_auto_class',
'_batch_encode_plus',
'_batch_prepare_for_model',
'_build_conversation_input_ids',
'_convert_id_to_token',
'_convert_token_to_id',
'_convert_token_to_id_with_added_voc',
'_create_or_get_repo',
'_create_trie',
'_decode',
'_encode_plus',
'_eventual_warn_about_too_long_sequence',
'_eventually_correct_t5_max_length',
'_from_pretrained',
'_get_padding_truncation_strategies',
'_get_repo_url_from_name',
'_pad',
'_push_to_hub',
'_save_pretrained',
'_set_processor_class',
'_tokenize',
'add_special_tokens',
'add_tokens',
'additional_special_tokens',
'additional_special_tokens_ids',
'all_special_ids',
'all_special_tokens',
'all_special_tokens_extended',
'as_target_tokenizer',
'batch_decode',
'batch_encode_plus',
'bos_token',
'bos_token_id',
'bpe',
'build_inputs_with_special_tokens',
'clean_up_tokenization',
'cls_token',
'cls_token_id',
'convert_ids_to_tokens',
'convert_tokens_to_ids',
'convert_tokens_to_string',
'create_token_type_ids_from_sequences',
'decode',
'encode',
'encode_plus',
'eos_token',
'eos_token_id',
'from_pretrained',
'get_added_vocab',
'get_special_tokens_mask',
'get_vocab',
'is_fast',
'mask_token',
'mask_token_id',
'max_len_sentences_pair',
'max_len_single_sentence',
'max_model_input_sizes',
'model_input_names',
'num_special_tokens_to_add',
'pad',
'pad_token',
'pad_token_id',
'pad_token_type_id',
'padding_side',
'prepare_for_model',
'prepare_for_tokenization',
'prepare_seq2seq_batch',
'pretrained_init_configuration',
'pretrained_vocab_files_map',
'push_to_hub',
'register_for_auto_class',
'sanitize_special_tokens',
'save_pretrained',
'save_vocabulary',
'sep_token',
'sep_token_id',
'slow_tokenizer_class',
'special_tokens_map',
'special_tokens_map_extended',
'tokenize',
'truncate_sequences',
'truncation_side',
'unk_token',
'unk_token_id',
'vocab_files_names',
'vocab_size']
(AutoTokenizer) tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
inspect.signature(tokenizer)
<Signature (text: Union[str, List[str], List[List[str]]], text_pair: Union[str, List[str], List[List[str]], NoneType] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) -> transformers.tokenization_utils_base.BatchEncoding>
dir(tokenizer)
['SPECIAL_TOKENS_ATTRIBUTES',
'__annotations__',
'__call__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__len__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_add_tokens',
'_additional_special_tokens',
'_auto_class',
'_batch_encode_plus',
'_bos_token',
'_build_conversation_input_ids',
'_cls_token',
'_convert_encoding',
'_convert_id_to_token',
'_convert_token_to_id_with_added_voc',
'_create_or_get_repo',
'_decode',
'_decode_use_source_tokenizer',
'_encode_plus',
'_eos_token',
'_eventual_warn_about_too_long_sequence',
'_eventually_correct_t5_max_length',
'_from_pretrained',
'_get_padding_truncation_strategies',
'_get_repo_url_from_name',
'_mask_token',
'_pad',
'_pad_token',
'_pad_token_type_id',
'_processor_class',
'_push_to_hub',
'_save_pretrained',
'_sep_token',
'_set_processor_class',
'_tokenizer',
'_unk_token',
'add_prefix_space',
'add_special_tokens',
'add_tokens',
'additional_special_tokens',
'additional_special_tokens_ids',
'all_special_ids',
'all_special_tokens',
'all_special_tokens_extended',
'as_target_tokenizer',
'backend_tokenizer',
'batch_decode',
'batch_encode_plus',
'bos_token',
'bos_token_id',
'build_inputs_with_special_tokens',
'can_save_slow_tokenizer',
'clean_up_tokenization',
'cls_token',
'cls_token_id',
'convert_ids_to_tokens',
'convert_tokens_to_ids',
'convert_tokens_to_string',
'create_token_type_ids_from_sequences',
'decode',
'decoder',
'deprecation_warnings',
'encode',
'encode_plus',
'eos_token',
'eos_token_id',
'from_pretrained',
'get_added_vocab',
'get_special_tokens_mask',
'get_vocab',
'init_inputs',
'init_kwargs',
'is_fast',
'mask_token',
'mask_token_id',
'max_len_sentences_pair',
'max_len_single_sentence',
'max_model_input_sizes',
'model_input_names',
'model_max_length',
'name_or_path',
'num_special_tokens_to_add',
'pad',
'pad_token',
'pad_token_id',
'pad_token_type_id',
'padding_side',
'prepare_for_model',
'prepare_seq2seq_batch',
'pretrained_init_configuration',
'pretrained_vocab_files_map',
'push_to_hub',
'register_for_auto_class',
'sanitize_special_tokens',
'save_pretrained',
'save_vocabulary',
'sep_token',
'sep_token_id',
'set_truncation_and_padding',
'slow_tokenizer_class',
'special_tokens_map',
'special_tokens_map_extended',
'tokenize',
'train_new_from_iterator',
'truncate_sequences',
'truncation_side',
'unk_token',
'unk_token_id',
'verbose',
'vocab',
'vocab_files_names',
'vocab_size']
(AutoTokenizer) tokenizer.encode
inspect.signature(tokenizer.encode)
<Signature (text: Union[str, List[str], List[int]], text_pair: Union[str, List[str], List[int], NoneType] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, return_tensors: Union[str, transformers.utils.generic.TensorType, NoneType] = None, **kwargs) -> List[int]>
dir(tokenizer.encode)
['__call__',
'__class__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__func__',
'__ge__',
'__get__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__self__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__']
(AutoTokenizer) tokenizer.encode_plus
inspect.signature(tokenizer.encode_plus)
<Signature (text: Union[str, List[str], List[int]], text_pair: Union[str, List[str], List[int], NoneType] = None, add_special_tokens: bool = True, padding: Union[bool, str, transformers.utils.generic.PaddingStrategy] = False, truncation: Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = False, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Union[str, transformers.utils.generic.TensorType, NoneType] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs) -> transformers.tokenization_utils_base.BatchEncoding>
dir(tokenizer.encode_plus)
['__call__',
'__class__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__func__',
'__ge__',
'__get__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__self__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__']
AutoConfig
model.config
GPT2Config {
"_name_or_path": "dbmdz/german-gpt2",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.0,
"bos_token_id": 50256,
"embd_pdrop": 0.0,
"eos_token_id": 50256,
"gradient_checkpointing": false,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.0,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"torch_dtype": "float32",
"transformers_version": "4.19.2",
"use_cache": true,
"vocab_size": 50265
}
Example 2: DistilBert
DistilBertTokenizer
from transformers import DistilBertTokenizer
inspect.signature(DistilBertTokenizer)
<Signature (vocab_file, do_lower_case=True, do_basic_tokenize=True, never_split=None, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)>
dir(DistilBertTokenizer)
['SPECIAL_TOKENS_ATTRIBUTES',
'__annotations__',
'__call__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__len__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_add_tokens',
'_auto_class',
'_batch_encode_plus',
'_batch_prepare_for_model',
'_convert_id_to_token',
'_convert_token_to_id',
'_convert_token_to_id_with_added_voc',
'_create_or_get_repo',
'_create_trie',
'_decode',
'_encode_plus',
'_eventual_warn_about_too_long_sequence',
'_eventually_correct_t5_max_length',
'_from_pretrained',
'_get_padding_truncation_strategies',
'_get_repo_url_from_name',
'_pad',
'_push_to_hub',
'_save_pretrained',
'_set_processor_class',
'_tokenize',
'add_special_tokens',
'add_tokens',
'additional_special_tokens',
'additional_special_tokens_ids',
'all_special_ids',
'all_special_tokens',
'all_special_tokens_extended',
'as_target_tokenizer',
'batch_decode',
'batch_encode_plus',
'bos_token',
'bos_token_id',
'build_inputs_with_special_tokens',
'clean_up_tokenization',
'cls_token',
'cls_token_id',
'convert_ids_to_tokens',
'convert_tokens_to_ids',
'convert_tokens_to_string',
'create_token_type_ids_from_sequences',
'decode',
'do_lower_case',
'encode',
'encode_plus',
'eos_token',
'eos_token_id',
'from_pretrained',
'get_added_vocab',
'get_special_tokens_mask',
'get_vocab',
'is_fast',
'mask_token',
'mask_token_id',
'max_len_sentences_pair',
'max_len_single_sentence',
'max_model_input_sizes',
'model_input_names',
'num_special_tokens_to_add',
'pad',
'pad_token',
'pad_token_id',
'pad_token_type_id',
'padding_side',
'prepare_for_model',
'prepare_for_tokenization',
'prepare_seq2seq_batch',
'pretrained_init_configuration',
'pretrained_vocab_files_map',
'push_to_hub',
'register_for_auto_class',
'sanitize_special_tokens',
'save_pretrained',
'save_vocabulary',
'sep_token',
'sep_token_id',
'slow_tokenizer_class',
'special_tokens_map',
'special_tokens_map_extended',
'tokenize',
'truncate_sequences',
'truncation_side',
'unk_token',
'unk_token_id',
'vocab_files_names',
'vocab_size']
DistilBertModel
dir(DistilBertModel)
['T_destination',
'__annotations__',
'__call__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattr__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__setstate__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_apply',
'_auto_class',
'_backward_compatibility_gradient_checkpointing',
'_call_impl',
'_can_retrieve_inputs_from_name',
'_convert_head_mask_to_5d',
'_create_or_get_repo',
'_expand_inputs_for_generation',
'_from_config',
'_get_backward_hooks',
'_get_backward_pre_hooks',
'_get_decoder_start_token_id',
'_get_logits_processor',
'_get_logits_warper',
'_get_name',
'_get_repo_url_from_name',
'_get_resized_embeddings',
'_get_resized_lm_head',
'_get_stopping_criteria',
'_hook_rss_memory_post_forward',
'_hook_rss_memory_pre_forward',
'_init_weights',
'_keys_to_ignore_on_load_missing',
'_keys_to_ignore_on_load_unexpected',
'_keys_to_ignore_on_save',
'_load_from_state_dict',
'_load_pretrained_model',
'_load_pretrained_model_low_mem',
'_maybe_warn_non_full_backward_hook',
'_merge_criteria_processor_list',
'_named_members',
'_prepare_attention_mask_for_generation',
'_prepare_decoder_input_ids_for_generation',
'_prepare_encoder_decoder_kwargs_for_generation',
'_prepare_input_ids_for_generation',
'_prepare_model_inputs',
'_prune_heads',
'_push_to_hub',
'_register_load_state_dict_pre_hook',
'_register_state_dict_hook',
'_reorder_cache',
'_replicate_for_data_parallel',
'_resize_token_embeddings',
'_save_to_state_dict',
'_set_default_torch_dtype',
'_slow_forward',
'_tie_encoder_decoder_weights',
'_tie_or_clone_weights',
'_update_model_kwargs_for_generation',
'_version',
'add_memory_hooks',
'add_module',
'adjust_logits_during_generation',
'apply',
'base_model',
'base_model_prefix',
'beam_sample',
'beam_search',
'bfloat16',
'buffers',
'call_super_init',
'children',
'compute_transition_beam_scores',
'config_class',
'constrained_beam_search',
'cpu',
'create_extended_attention_mask_for_decoder',
'cuda',
'device',
'double',
'dtype',
'dummy_inputs',
'dump_patches',
'estimate_tokens',
'eval',
'extra_repr',
'float',
'floating_point_ops',
'forward',
'framework',
'from_pretrained',
'generate',
'get_buffer',
'get_extended_attention_mask',
'get_extra_state',
'get_head_mask',
'get_input_embeddings',
'get_output_embeddings',
'get_parameter',
'get_position_embeddings',
'get_submodule',
'gradient_checkpointing_disable',
'gradient_checkpointing_enable',
'greedy_search',
'group_beam_search',
'half',
'init_weights',
'invert_attention_mask',
'ipu',
'is_gradient_checkpointing',
'is_parallelizable',
'load_state_dict',
'load_tf_weights',
'main_input_name',
'modules',
'named_buffers',
'named_children',
'named_modules',
'named_parameters',
'num_parameters',
'parameters',
'post_init',
'prepare_inputs_for_generation',
'prune_heads',
'push_to_hub',
'register_backward_hook',
'register_buffer',
'register_for_auto_class',
'register_forward_hook',
'register_forward_pre_hook',
'register_full_backward_hook',
'register_full_backward_pre_hook',
'register_load_state_dict_post_hook',
'register_module',
'register_parameter',
'register_state_dict_pre_hook',
'requires_grad_',
'reset_memory_hooks_state',
'resize_position_embeddings',
'resize_token_embeddings',
'retrieve_modules_from_names',
'sample',
'save_pretrained',
'set_extra_state',
'set_input_embeddings',
'share_memory',
'state_dict',
'supports_gradient_checkpointing',
'tie_weights',
'to',
'to_empty',
'train',
'type',
'xpu',
'zero_grad']
model.config
DistilBertConfig {
"_name_or_path": "distilbert-base-cased-distilled-squad",
"activation": "gelu",
"architectures": [
"DistilBertForQuestionAnswering"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"initializer_range": 0.02,
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"output_past": true,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": true,
"tie_weights_": true,
"transformers_version": "4.19.2",
"vocab_size": 28996
}
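As a closing sketch, tokenizing with truncation for the DistilBert checkpoint keeps the sequence within max_position_embeddings (the long input text is only an illustration):
from transformers import DistilBertModel, DistilBertTokenizer

model_name = "distilbert-base-cased-distilled-squad"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertModel.from_pretrained(model_name)

long_text = "one word " * 2000
ids = tokenizer(long_text, truncation=True)["input_ids"]

print(len(ids))                                           # 512
print(len(ids) <= model.config.max_position_embeddings)   # True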