I have a folder of about 60k PDF documents that I would like to rank by relevance to a query, so that the most relevant results are surfaced, very much like a search engine. I understand that Learning to Rank is a supervised approach that requires features generated from query-document pairs. The problem, however, is that none of the documents are labelled. How many queries would I need to even begin training the model?
There are different ways to look at this:
- You can apply a totally unsupervised method, like computing a TF-IDF vector for the query and ranking every document by its similarity (e.g. cosine) to that vector. This requires no training at all, but without labels you can't even evaluate the method (a minimal sketch is given after this list).
- You can use an already implemented system like Elasticsearch, which handles indexing and ranking out of the box (see the brief sketch below).
- You can train a supervised ranking model with any number of samples, but obviously it will work much better with a large number (the LightGBM sketch below shows what this looks like once labels exist). The first difficulty is to generate a sample of queries that is as representative as possible. The second is to find a way to select the top document(s) for every query: done manually, the annotator would need to read 60k documents (ouch!), and that is before even considering the subjectivity and potential ambiguity of a query.
- You could try some form of semi-supervised learning or active learning, for instance progressively refining the model from user feedback, if that works for your use case.
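
To make the unsupervised option concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer. It assumes the PDFs have already been converted to plain-text strings (PDF text extraction is a separate problem), and the example documents are placeholders:

```python
# Minimal unsupervised ranking sketch: TF-IDF vectors + cosine similarity.
# Assumes `documents` holds the extracted text of each PDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "quarterly financial report for 2020",
    "employee onboarding checklist",
    "contract renewal terms and conditions",
]  # placeholder texts; in practice the extracted text of the 60k PDFs

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)  # shape: (n_docs, vocab_size)

def rank(query, top_k=10):
    """Return (document index, similarity score) pairs, best first."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    order = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]

print(rank("financial report"))
```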
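
For the second option, a rough sketch with the Elasticsearch Python client (8.x) could look like the following; the index name, field name and documents are made-up placeholders, and a running Elasticsearch instance is assumed:

```python
# Rough sketch of indexing and querying with Elasticsearch (Python client 8.x).
# Assumes an Elasticsearch server is running at localhost:9200.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index the extracted text of each PDF (one document per file).
es.index(index="pdfs", id="doc-1", document={"text": "quarterly financial report for 2020"})
es.index(index="pdfs", id="doc-2", document={"text": "employee onboarding checklist"})

# Elasticsearch ranks results itself (BM25 by default), so no training is needed.
results = es.search(index="pdfs", query={"match": {"text": "financial report"}})
for hit in results["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```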
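
And for the supervised option, here is a hedged sketch of what training a Learning to Rank model could look like once labels exist, using LightGBM's LGBMRanker (LambdaRank objective). The features and relevance labels below are random placeholders: in a real setup each row would be a query-document pair with features such as BM25 or TF-IDF similarity, and the labels would come from annotation or user feedback.

```python
# Sketch of a Learning to Rank model with LightGBM's LGBMRanker.
# The data is random and purely illustrative.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_features = 50, 20, 5

X = rng.normal(size=(n_queries * docs_per_query, n_features))  # query-document features
y = rng.integers(0, 3, size=n_queries * docs_per_query)        # graded relevance labels (0-2)
group = [docs_per_query] * n_queries                           # docs per query, in row order

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)

# At query time: compute the same features for (query, candidate document) pairs
# and sort the candidates by predicted score.
scores = ranker.predict(X[:docs_per_query])
print(scores.argsort()[::-1])
```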
Erwan