
I have built an OpenAI + LlamaIndex based model which takes multiple PDFs and is able to give chatbot-style responses. I want to evaluate the LLM for accuracy. I already know methods such as ROUGE and BLEU. Is there any other way to evaluate the model?

1 Answer


Is there any other way to evaluate the model?

Assuming you have a test set with gold answers, you can use a text generation metric. See Evaluation of Text Generation: A Survey. Typical metrics: TF-IDF cosine similarity, ROUGE, BLEU, BERTScore, Sentence-BERT, and more recently, GPTScore and G-Eval. Note that they have some serious limitations when evaluating GPT output {1,2}.
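For illustration, here is a minimal sketch (not part of the original answer) of how a few of these reference-based metrics can be computed with the Hugging Face evaluate library and sentence-transformers; the predictions and references lists are made-up placeholders standing in for your chatbot's answers and the gold answers from your test set.

```python
# Sketch: scoring predicted answers against gold answers with ROUGE,
# BERTScore and Sentence-BERT cosine similarity.
# pip install evaluate rouge_score bert_score sentence-transformers
import evaluate
from sentence_transformers import SentenceTransformer, util

# Hypothetical test set: chatbot predictions paired with gold answers.
predictions = ["The invoice total is 1,200 USD, due on March 3."]
references = ["The total amount of the invoice is $1,200 and it is due March 3."]

# N-gram overlap with the reference (ROUGE-1/2/L).
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# Token-level similarity in contextual embedding space (BERTScore).
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))

# Sentence-level cosine similarity (Sentence-BERT).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_pred = model.encode(predictions, convert_to_tensor=True)
emb_ref = model.encode(references, convert_to_tensor=True)
print(util.cos_sim(emb_pred, emb_ref))
```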


More recent works:


References:


Since comparing the semantic similarity of the predicted answer with a reference answer has some limitations, one may look at complementing the analysis with a human evaluation. The following papers survey criteria used for human evaluation of generated texts (thanks to Mengjiao Zhang for pointing me to refs {3,4}):

From {3}:

(image: table of human evaluation criteria for generated text, from {3})

Franck Dernoncourt