
I'm using Hugging Face (via Unsloth) to do inference on Llama-3-8B. Here is how I load the model:

import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = torch.float16,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 42,
    use_rslora = False,
    use_dora = False,
    loftq_config = None,
)

My issue is that when I prompt the model, it outputs "prompt + generation" rather than just the generation. Here is where I prompt it:

inputs = tokenizer(prompts[:2], return_tensors = "pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
predictions = tokenizer.batch_decode(outputs, skip_special_tokens = True)
print(predictions[0])

E.g. if the input prompt, i.e. prompts[0], is "What color is grass?", the output is something like "What color is grass? Green.", but I want it to just be "Green.".

1 Answer


Given that Llama is a decoder-only model, the sequences returned by generate (i.e. outputs) contain the prompt tokens followed by the newly generated tokens. You can, nevertheless, decode only the new tokens easily like this:

predictions = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:],  # skip the prompt tokens
    skip_special_tokens = True
)