I'm using Unsloth (on top of Hugging Face Transformers) to run inference on Llama-3-8B. Here is how I load the model:
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = torch.float16,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 42,
    use_rslora = False,
    use_dora = False,
    loftq_config = None,
)
My issue is that when I prompt the model, it outputs the prompt followed by the generation rather than just outputting the generation. Here is where I prompt it:
inputs = tokenizer(prompts[:2], return_tensors = "pt", padding = True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
predictions = tokenizer.batch_decode(outputs, skip_special_tokens = True)
print(predictions[0])
For example, if the input prompt (i.e. prompts[0]) is "What color is grass?", the output looks like "What color is grass? Green.", but I want it to just be "Green.".
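The only workaround I can think of is slicing off the prompt tokens before decoding, roughly like the sketch below. The slicing by inputs["input_ids"].shape[1] is my own assumption about how generate lays out the output (prompt tokens first, then new tokens), not something I found in the docs:

# Assumption: generate() returns [prompt tokens + new tokens] per sequence,
# so dropping the first prompt_len columns leaves only the generated part.
prompt_len = inputs["input_ids"].shape[1]
generated_only = outputs[:, prompt_len:]
predictions = tokenizer.batch_decode(generated_only, skip_special_tokens = True)
print(predictions[0])  # hopefully just "Green."

Is there a cleaner, built-in way to get only the newly generated text?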