
I'm trying to improve my chat app:

Using previous (pre-processed) chat interactions from my domain, I have built a tool that offers the user 5 possible utterances for a given chat context, for example:

Raw: "Hi John."

Context: hi [[USER_NAME]]
Utterances: [Hi, Hello, How are you, Hi there, Hello again]


Of course, the results are not always relevant, for example:

Raw: "Hi John. How are you? I am fine, are you in the office?"

Context: hi [[USER_NAME]] how are you i am fine are you in the office
Utterances: [Yes, No, Hi, Yes i am, How are you]

I am using Elasticsearch with the TF/IDF similarity model and an index structured like so:

{
  "_index": "engagements",
  "_type": "context",
  "_id": "48",
  "_score": 1,
  "_source": {
    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
    "utterance": "Yes I am"
  }
}
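
To make the setup concrete, retrieval against such an index could be done with a simple match query against the context field (a sketch using the standard Elasticsearch query DSL; field names match the document above, and size limits the result to the top 5 candidates):

```
GET /engagements/_search
{
  "query": {
    "match": {
      "context": "hi [[USER_NAME]] how are you i am fine are you in the office"
    }
  },
  "size": 5
}
```

With the TF/IDF similarity, the hits come back ranked by lexical overlap with the query context, which explains why short, frequent utterances from similar contexts ("Yes", "No") surface alongside the known-good one.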

Problem: I know for sure that for the context "hi [[USER_NAME]] how are you i am fine are you in the office" the utterance "Yes I am" is relevant; however, "Yes" and "No" are relevant too, because they appeared in a similar context.

I am trying to use this excellent video as a starting point.

Q: How can I measure precision and recall, if all I know (from my raw data) is just one true utterance?

1 Answer


Precision and recall are "hard" metrics: they measure whether the model's prediction is exactly the same as the target label.

Systems like yours can often use a more flexible metric such as the top-5 error rate: the model is considered to have generated a correct response if the target label is among the model's top 5 predictions.
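
This is easy to compute even with only one true utterance per context. A minimal sketch (the helper name and the example data are illustrative, taken from the question's second context):

```python
def top_k_accuracy(predictions, targets, k=5):
    """Fraction of contexts whose single true utterance appears
    in the model's top-k candidate list (exact string match)."""
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:k])
    return hits / len(targets)

# One context, its 5 retrieved candidates, and the one known-good utterance.
predicted = [["Yes", "No", "Hi", "Yes I am", "How are you"]]
true_utterances = ["Yes I am"]

print(top_k_accuracy(predicted, true_utterances))  # 1.0 -- "Yes I am" is in the top 5
```

In practice you would likely normalize case and whitespace before comparing, since the raw utterances ("Yes I am" vs. "Yes i am") may differ only in formatting.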

Brian Spiering