I have built an OpenAI-/LlamaIndex-based model that takes multiple PDFs and gives chatbot-style responses. I want to evaluate the LLM for accuracy. I already know methods such as ROUGE and BLEU. Is there any other way to evaluate the model?
> Is there any other way to evaluate the model?
Assuming you have a test set with gold answers, you can use a text generation metric; see Evaluation of Text Generation: A Survey. Typical metrics include TF-IDF cosine similarity, ROUGE, BLEU, BERTScore, Sentence-BERT, and, more recently, GPTScore and G-Eval. Note that these metrics have some serious limitations when evaluating GPT output {1,2}.
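If you want a quick start on the embedding-based metrics, here is a minimal sketch using Sentence-BERT and BERTScore; the packages (`sentence-transformers`, `bert-score`), the model name `all-MiniLM-L6-v2`, and the example sentences are my own assumptions for illustration:

```python
# Score predicted answers against gold answers with Sentence-BERT cosine
# similarity and BERTScore. Requires: pip install sentence-transformers bert-score
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

predictions = ["The report covers Q3 2022 revenue."]           # chatbot outputs
references  = ["The document discusses revenue for Q3 2022."]  # gold answers

# Sentence-BERT: embed both texts and compare with cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
pred_emb = model.encode(predictions, convert_to_tensor=True)
ref_emb = model.encode(references, convert_to_tensor=True)
cosine = util.cos_sim(pred_emb, ref_emb).diagonal()
print("Sentence-BERT cosine:", cosine.tolist())

# BERTScore: token-level similarity aggregated into precision/recall/F1.
P, R, F1 = bert_score(predictions, references, lang="en")
print("BERTScore F1:", F1.tolist())
```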
More recent works:
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges. Code: https://github.com/baaivision/JudgeLM. Paper: https://arxiv.org/abs/2310.17631
References:
- {1} Goyal, Tanya, Junyi Jessy Li, and Greg Durrett. "News Summarization and Evaluation in the Era of GPT-3." arXiv preprint arXiv:2209.12356 (2022).
- {2} Zhang, Tianyi, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. "Benchmarking Large Language Models for News Summarization." arXiv preprint arXiv:2301.13848 (2023).
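To give an idea of the LLM-as-a-judge direction behind GPTScore, G-Eval, and JudgeLM, here is a rough sketch that asks an OpenAI model to grade the chatbot's answer against the gold answer; the prompt wording, the 1-5 scale, and the model name `gpt-4o-mini` are assumptions for illustration, not prescribed by those papers:

```python
# LLM-as-a-judge sketch: grade a predicted answer against a reference answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, gold_answer: str, predicted_answer: str) -> str:
    prompt = (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"System answer: {predicted_answer}\n"
        "On a scale of 1-5, how factually consistent and complete is the "
        "system answer compared to the reference? Reply with the score only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What quarter does the report cover?",
            "It covers the third quarter of 2022.",
            "The report covers Q3 2022."))
```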
Since comparing the semantic similarity of the predicted answer with a reference answer has some limitations, one may complement the analysis with a human evaluation. The following papers survey criteria used for the human evaluation of generated text (thanks to Mengjiao Zhang for pointing me to refs {3,4}):
- {3} A Survey of Evaluation Metrics Used for NLG Systems
- {4} Perturbation CheckLists for Evaluating NLG Evaluation Metrics (unfortunately, they forgot about QA)
From {3}: (figure summarizing human evaluation criteria, not reproduced here)
