
I'm using Hugging Face (via Unsloth) to do inference on Llama-3-8B. Here is how I load the model:

import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = torch.float16,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 42,
    use_rslora = False,
    use_dora = False,
    loftq_config = None,
)

My issue is that when I prompt the model, it outputs "prompt + generation" rather than just the generation. Here is where I prompt it:

inputs = tokenizer(prompts[:2], return_tensors = "pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
predictions = tokenizer.batch_decode(outputs, skip_special_tokens = True)
print(predictions[0])

E.g. if the input prompt, i.e. prompts[0], is "What color is grass?", the output is something like "What color is grass? Green.", but I want it to just be "Green.".

1 Answer


Given that Llama is a decoder-only model, the sequences returned by generate (i.e. outputs) contain the prompt tokens followed by the newly generated tokens. You can, nevertheless, decode only the new tokens easily like this:

predictions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],  # skip the prompt tokens
    skip_special_tokens = True
)