
I'm trying to improve my chat app:

Using previous (pre-processed) chat interactions from my domain, I have built a tool that offers the user 5 possible utterances for a given chat context, for example:

Raw: "Hi John."

Context: hi [[USER_NAME]]
Utterances: [Hi, Hello, How are you, Hi there, Hello again]


Of course, the results are not always relevant, for example:

Raw: "Hi John. How are you? I am fine, are you in the office?"

Context: hi [[USER_NAME]] how are you i am fine are you in the office
Utterances: [Yes, No, Hi, Yes i am, How are you]

I am using Elasticsearch with the TF/IDF similarity model and an index structured like so:

{
  "_index": "engagements",
  "_type": "context",
  "_id": "48",
  "_score": 1,
  "_source": {
    "context": "hi [[USER_NAME]] how are you i am fine are you in the office",
    "utterance": "Yes I am"
  }
}
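
To make the setup concrete, retrieval against such an index could be done with a simple match query against the context field (a sketch using the standard Elasticsearch query DSL; field names match the document above, and size limits the result to the top 5 candidates):

```
GET /engagements/_search
{
  "query": {
    "match": {
      "context": "hi [[USER_NAME]] how are you i am fine are you in the office"
    }
  },
  "size": 5
}
```

With the TF/IDF similarity, the hits come back ranked by lexical overlap with the query context, which explains why short, frequent utterances from similar contexts ("Yes", "No") surface alongside the known-good one.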

Problem: I know for sure that for the context "hi [[USER_NAME]] how are you i am fine are you in the office" the utterance "Yes I am" is relevant; however, "Yes" and "No" are relevant too, because they appeared in a similar context.

I am trying to use this excellent video as a starting point.

Q: How can I measure precision and recall, if all I know (from my raw data) is just one true utterance?

1 Answer


Precision and recall are "hard" metrics: they measure whether the model's prediction is exactly the same as the target label.

Systems like yours can often use a more flexible metric such as the top-5 error rate: the model is considered to have generated a correct response if the target label is among the model's top 5 predictions.
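
This is easy to compute even with only one true utterance per context. A minimal sketch (the helper name and the example data are illustrative, taken from the question's second context):

```python
def top_k_accuracy(predictions, targets, k=5):
    """Fraction of contexts whose single true utterance appears
    in the model's top-k candidate list (exact string match)."""
    hits = sum(1 for preds, target in zip(predictions, targets)
               if target in preds[:k])
    return hits / len(targets)

# One context, its 5 retrieved candidates, and the one known-good utterance.
predicted = [["Yes", "No", "Hi", "Yes I am", "How are you"]]
true_utterances = ["Yes I am"]

print(top_k_accuracy(predicted, true_utterances))  # 1.0 -- "Yes I am" is in the top 5
```

In practice you would likely normalize case and whitespace before comparing, since the raw utterances ("Yes I am" vs. "Yes i am") may differ only in formatting.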

Brian Spiering