Posted to dev@mahout.apache.org by peng <pc...@uowmail.edu.au> on 2014/02/09 23:26:32 UTC

Learning to rank support in Mahout and Solr integration?

This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr 
these are queries/function queries).
2. Test those scorers on a ground-truth dataset, generating feature 
vectors for the top-n results annotated by humans.
3. Use an existing classifier/regressor (e.g. support vector ranking, 
GBDT, random forest, etc.) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a 
BooleanQuery with custom boost factors for a linear model, or a 
CustomScoreQuery with a custom scoring function for a non-linear model), 
push it to the Solr server, register it with a QParser, and push it to 
production. Done.
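The two export targets in step 4 can be sketched at the level of the scoring math. This is a toy illustration with hypothetical names and numbers, not actual Solr/Lucene API:

```python
# Toy sketch: how a learned ranking model maps onto the two ensemble
# query shapes. A linear model becomes per-clause boost factors (the
# BooleanQuery case); a non-linear model becomes an arbitrary function
# of the feature vector (the CustomScoreQuery case).

def linear_rank_score(features, boosts):
    # Weighted sum of weak-scorer outputs, one learned boost per clause.
    return sum(b * f for b, f in zip(boosts, features))

def custom_rank_score(features, model):
    # Arbitrary learned scoring function over the feature vector.
    return model(features)

# Two hypothetical weak scorers (say, a text-match score and a recency
# score) combined with learned boosts.
doc_features = [2.0, 1.0]
print(linear_rank_score(doc_features, [0.7, 0.3]))
print(custom_rank_score(doc_features, lambda fv: max(fv)))
```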

But I didn't find this workflow easy to implement with the Mahout-Solr 
integration (is it discouraged for some reason?). Namely, there is no 
pipeline from scorer results to a Mahout-compatible vector form, and no 
pipeline from the ranking model back to an ensemble query. (I only 
found the lucene2seq class and the upcoming recommendation support, 
which don't quite fit this scenario.) So what's the best practice for 
easily implementing a realtime learning-to-rank search engine in this 
case? I've worked at a bunch of startups, and such an application seems 
to be in high demand. (Remember the Solr-based collaborative filtering 
model proposed by Dr Dunning? This is its content-based counterpart.)

I'm looking to streamline this process to make my upcoming work easier. 
I think Mahout/Solr is the undisputed instrument of choice due to its 
scalability and the machine learning background of many of its top 
committers. Can we talk about it at some point?

Yours Peng

Learning to rank support in Mahout and Solr integration?

Posted by peng <pc...@uowmail.edu.au>.
I was working on a large-scale learning-to-rank search engine, but found 
that there is no out-of-the-box solution under Apache, and it takes a 
lot of work to 're-invent the wheel'. I first asked for support in the 
Mahout community; then Ahmet pointed out that it's actually a 
Lucene-Solr feature suggestion and advised carrying on the discussion 
here, to see if people have the same business requirement and how their 
solutions compare to mine.

I found a few interesting proposals on the internet, but none of them 
fits the definition of 'learning-to-rank':

http://www.opensourceconnections.com/2013/04/04/complete-n00bs-guide-to-enhancing-solrlucene-search-with-mahouts-machine-learning/ 
(Pretty much LSI, not learning-to-rank; term-vector features only)

http://www.slideshare.net/lucenerevolution/text-classification-with-lucenesolr-apache-hadoop-and-libsvm 
(Again, text classification; term-vector-ish features only)

https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F2717472%2Fdiscovery-1-19-12.pptx 
(This prefers to calculate relevance externally with an ML library and 
use it as a boost factor; an ideal solution for recommendation-engine 
integration, but learning-to-rank? Not really)

http://www.cs.cornell.edu/People/tj/career/ (Made by the authors of 
SVMrank, one of the earliest research outcomes in LTR; they have two 
implementations on top of Lucene (searching arXiv and the Cornell 
Library), but unfortunately they don't publish their source code.)

http://www.slideshare.net/LucidImagination/bialecki-andrzej-clickthroughrelevancerankinginsolrlucidworksenterprise 
(Seems to be a solution similar to the previous one, enhanced with 
automatic clickthrough collection and a self-improving ability, but 
again, LucidWorks Enterprise is not open source)

If you are interested in a reusable component for LTR in Solr, please 
share your ideas.

-Peng



-------- Original Message --------
Subject: 	Re: Learning to rank support in Mahout and Solr integration?
Date: 	Sun, 09 Feb 2014 21:35:17 -0500
From: 	peng <pc...@uowmail.edu.au>
To: 	dev@mahout.apache.org



Hi Dr Dunning,

Thanks a lot! I was trying to make the model generalizable enough, but
I'm also afraid I may 'abuse' it a bit. Here is my existing solution:

1. Wrap any scorer in a ValueSource (many exist out-of-the-box in
Lucene-Solr; extensions are possible, but they don't have to be
registered with a ValueSourceParser, as they won't be used independently).
2. Extend CustomScoreQuery to have a flat and straightforward
explanation form. Use this as a wrapper for filters (as SubQ) and
scorers (as FunctionQ).
3. Write a converter to print the flat explanation as Mahout-compatible
vectors.
4. Run a job to 'explain()' those ground truths on an index and dump
the resulting vectors.
5. (Optional) Run other jobs to get non-content-based score vectors.
6. Join them, feed them into a classifier/regressor, and do some model
selection.
7. (I haven't done anything from this point on.) Try to 'migrate' this
model into another CustomScoreQuery, which has a strong scorer that
ensembles features the way the model suggests.
8. Push it into the SolrCloud server and register it with a QParser.
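Steps 2-3 above (flattening an explanation and converting it to a vector) might look like this in miniature. The nested-dict shape is only a stand-in for Lucene's Explanation tree, and all descriptions and values are made up:

```python
# Hypothetical sketch: collect (description, value) pairs from the
# leaves of a flat explanation tree, yielding one named feature per
# weak scorer. Inner nodes are assumed to be combiners and are skipped.

def flatten_explanation(expl, out=None):
    if out is None:
        out = {}
    details = expl.get("details", [])
    if not details:                  # leaf: one weak scorer's output
        out[expl["description"]] = expl["value"]
    for child in details:
        flatten_explanation(child, out)
    return out

# A made-up explanation for one document.
expl = {
    "description": "custom score: sum of",
    "value": 3.2,
    "details": [
        {"description": "bm25(title)", "value": 2.4, "details": []},
        {"description": "recency", "value": 0.8, "details": []},
    ],
}
print(flatten_explanation(expl))
```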

What I found to be hard:

1. Explanation is kind of abused here; it's only designed for manual
tweaking. I constantly run into problems where the 'explain()'
implementation was looked down upon by developers and stub code was
used as filler. Notably, ToParentBlockJoin won't show nested scores,
and ToChildBlockJoin simply doesn't work.
2. There is no automatic way to 'migrate' the model into an ensemble
query. Though I haven't proceeded that far, I'm already afraid of the
difficulty.
3. As a NoSQL database optimized to the core for text processing, Solr
is not at all intuitive to extend, and extensions are hard to debug and
maintain. We try to keep this part minimal but still get stuck at some
point.

The environment is built on CDH 5.0 beta2 with YARN and Cloudera Search
(Solr 4.4); some bugs then forced me to uninstall it and install
SolrCloud 4.6. I wonder if there are more 'out-of-the-box' solutions?

Yours Peng

On Sun 09 Feb 2014 05:53:20 PM EST, Ted Dunning wrote:
> I think that this is a bit of an idiosyncratic model for learning to rank,
> but it is a reasonably viable one.
>
> It would be good to have a discussion of what you find hard or easy and
> what you think is needed to make this work.
>
> Let's talk.




Re: Learning to rank support in Mahout and Solr integration?

Posted by Ted Dunning <te...@gmail.com>.
I think that this is a bit of an idiosyncratic model for learning to rank,
but it is a reasonably viable one.

It would be good to have a discussion of what you find hard or easy and
what you think is needed to make this work.

Let's talk.


