Posted to dev@lucene.apache.org by Peng Cheng <rh...@gmail.com> on 2014/02/07 22:59:46 UTC

Best practice for feature engineering of learning-to-rank based search engine? (using Lucene-Solr)

Hi Honoured Contributors,

May I ask for your suggestions on learning-to-rank based search in Solr?
Personally I believe that the accuracy of learning-to-rank combined with
the performance of Solr can be a killer combination. In addition, I've
recently noticed a lot of compatibility improvements to Solr from the
Mahout side. Perhaps we should do the same? I have long experience using
Mahout, so I'll be happy to work on this part.

I ask on behalf of a text/metadata search engine project that has a
large, hand-picked dataset of best matches to use as ground truth. My
goal is to find an ensemble ranking model that achieves the highest nDCG
on that ground truth. The underlying feature set is obtained from
various string/tokenstream similarities (tf-idf, Levenshtein, etc.)
between many fields of the query and the indexed documents, plus the
user's vectorized info (e.g. registration time, age, collaborative
filtering scores).

While I have no problem generating those features by extending Query and
ValueSource, I ran into many problems exporting them as vectors or
feeding them into a learning-to-rank classifier. Namely:

1. I cannot use two similarities in one index, e.g. I cannot use the
tf-idf and BM25 scores as two features and ensemble them into one score.
What I did was create a new ValueSource and move part of the BM25 code
into it; I wonder if there is an easier way (see the first sketch after
this list).

2. The only way to 'see' those features in one place is to combine the
weak subqueries into a CustomScoreQuery, then use the explain() function
to look into them. This is quite unhandy, as the Explanation class is
designed for manual tweaking rather than machine learning:
- Descriptions are far too elaborate and differ between documents for
the same feature.
- It cannot explain any feature from a nested or parent document; their
explain() implementations are half-done.
Eventually I had to extend CustomScoreQuery with a custom explain()
function that emits a fixed description for each subquery, and write a
converter from Explanation to Mahout's Vector class (see the second
sketch after this list). This took a lot of time, and I feel this part
should be streamlined and standardized, considering how popular
learning-to-rank has become.
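
For problem 1, one workaround I'm considering (a minimal, untested
sketch): index the same text into two fields and route each field to
its own Similarity via PerFieldSimilarityWrapper, so that the tf-idf
and BM25 scores can be collected as separate features. The "_bm25"
field-name suffix below is just my own convention:

import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

// Routes fields ending in "_bm25" to BM25 and everything else to
// classic tf-idf (DefaultSimilarity). Set this on both the
// IndexWriterConfig and the IndexSearcher so indexing and scoring agree.
public class PerFieldFeatureSimilarity extends PerFieldSimilarityWrapper {
  private final Similarity tfidf = new DefaultSimilarity();
  private final Similarity bm25 = new BM25Similarity();

  @Override
  public Similarity get(String fieldName) {
    return fieldName.endsWith("_bm25") ? bm25 : tfidf;
  }
}

Indexing the body text into both "body" and "body_bm25" and querying
each field with its own subquery should then yield two independent
feature scores per document.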
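
For problem 2, here is roughly what my converter looks like (a
simplified sketch). The "feature:<index>" labels are the fixed
descriptions my custom explain() emits for each subquery; they are my
own convention, not anything built into Lucene:

import org.apache.lucene.search.Explanation;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Walks an Explanation tree and copies every node whose description
// carries a fixed "feature:<index>" label into the corresponding
// slot of a Mahout DenseVector.
public class ExplanationVectorizer {
  private final int numFeatures;

  public ExplanationVectorizer(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  public Vector toVector(Explanation root) {
    Vector v = new DenseVector(numFeatures);
    fill(root, v);
    return v;
  }

  private void fill(Explanation e, Vector v) {
    String desc = e.getDescription();
    if (desc != null && desc.startsWith("feature:")) {
      int idx = Integer.parseInt(desc.substring("feature:".length()).trim());
      v.set(idx, e.getValue());
    }
    Explanation[] details = e.getDetails();  // may be null in Lucene 4.x
    if (details != null) {
      for (Explanation child : details) {
        fill(child, v);
      }
    }
  }
}

Each resulting vector then becomes one training example for the
learning-to-rank classifier.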

Regards,
Peng