
I have built an OpenAI + LlamaIndex based model which takes multiple PDFs and is able to give chatbot-style responses. I want to evaluate the LLM for accuracy. I already know methods such as ROUGE and BLEU. Is there any other way to evaluate the model?

1 Answer


Is there any other way to evaluate the model?

Assuming you have a test set with gold answers, you can use a text generation metric. See Evaluation of Text Generation: A Survey. Typical metrics: TF-IDF cosine similarity, ROUGE, BLEU, BERTScore, Sentence-BERT, and more recently, GPTScore and G-Eval. Note that they have some serious limitations when evaluating GPT output {1,2}.
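For illustration, here is a minimal sketch (not part of the original answer) of how a few of these reference-based metrics can be computed with the Hugging Face evaluate library and sentence-transformers; the predictions and references lists are made-up placeholders standing in for your chatbot's answers and the gold answers from your test set.

```python
# Sketch: scoring predicted answers against gold answers with ROUGE,
# BERTScore and Sentence-BERT cosine similarity.
# pip install evaluate rouge_score bert_score sentence-transformers
import evaluate
from sentence_transformers import SentenceTransformer, util

# Hypothetical test set: chatbot predictions paired with gold answers.
predictions = ["The invoice total is 1,200 USD, due on March 3."]
references = ["The total amount of the invoice is $1,200 and it is due March 3."]

# N-gram overlap with the reference (ROUGE-1/2/L).
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# Token-level similarity in contextual embedding space (BERTScore).
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))

# Sentence-level cosine similarity (Sentence-BERT).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_pred = model.encode(predictions, convert_to_tensor=True)
emb_ref = model.encode(references, convert_to_tensor=True)
print(util.cos_sim(emb_pred, emb_ref))
```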


More recent works:


References:


Since comparing the semantic similarity of the predicted answer with a reference answer has some limitations, one may look at complementing the analysis with a human evaluation. The following papers survey criteria used for human evaluation of generated texts (thanks to Mengjiao Zhang for pointing me to refs {3,4}):

From {3}:

(image: table of human evaluation criteria for generated text, from {3})

Franck Dernoncourt