Posted to solr-user@lucene.apache.org by Jianxiong Dong <jd...@gmail.com> on 2017/04/14 16:51:36 UTC

extract multi-features for one solr feature extractor in solr learning to rank

Hi,
    I found that solr learning-to-rank (LTR) supports only ONE feature
for a given feature extractor.

See interface:

https://github.com/apache/lucene-solr/blob/master/solr/contrib/ltr/src/java/org/apache/solr/ltr/feature/Feature.java

Line (281, 282) (in FeatureScorer)
@Override
      public abstract float score() throws IOException;

I have a use case: given a <query, doc> pair, I would like to extract
multiple features (e.g. 100 features). In the current framework, I have to
define 100 features in features.json, and there is also more cost for
scored doc iterations.

I would like to have an interface:

public abstract Map<String, Float> score() throws IOException;

It would help support sparse feature vectors.

Can anybody provide some insight?

Thanks

Jianxiong

Re: extract multi-features for one solr feature extractor in solr learning to rank

Posted by "alessandro.benedetti" <a....@sease.io>.

Hi Jianxiong, this is definitely interesting.
Briefly reviewing the paper you linked, the use case seems clear:
you want a similar "family" of features to be calculated on each field.
Taking the TF feature as an example, you may want to define only one
feature in features.json that includes all the fields involved:

{
    "store" : "MyFeatureStore",
    "name" : "query_term_frequency",
    "class" : "com.apache.solr.ltr.feature.TermCountFeature",
    "params" : {
       "fields" : ["field1","field2","field3"],
       "terms" : "${user_terms}"
    }
}

Then, under the hood, you would like this feature to be translated into N
features in the feature vector.

You have a few options here:

1) Out of the box: you create the features.json programmatically. Your
client app takes a simplified features.json as input and expands it
automatically based on your custom config (I used this approach to encode
categorical features as N binary features).

2) You dive deep into the code and add this flexibility to the plugin.
This will involve modifying how the feature vector is currently
generated.
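
Option 1 could be sketched roughly as follows. This is only an illustration in Python, not part of the LTR plugin: the expand_feature helper, the "fields" param and the naming scheme for the expanded features are all assumptions, and TermCountFeature is the custom class from the example above, not a class shipped with Solr.

```python
import json


def expand_feature(compact):
    """Expand one compact definition that has a "fields" list into one
    feature definition per field, each carrying a single "field" param.
    (Hypothetical helper; the compact format is an assumption.)"""
    params = dict(compact["params"])
    fields = params.pop("fields")
    expanded = []
    for field in fields:
        expanded.append({
            "store": compact["store"],
            # Assumed naming scheme: <compact name>_<field name>
            "name": f'{compact["name"]}_{field}',
            "class": compact["class"],
            "params": {"field": field, **params},
        })
    return expanded


compact = {
    "store": "MyFeatureStore",
    "name": "query_term_frequency",
    "class": "com.apache.solr.ltr.feature.TermCountFeature",
    "params": {
        "fields": ["field1", "field2", "field3"],
        "terms": "${user_terms}",
    },
}

features = expand_feature(compact)
# The resulting list is what would be uploaded as features.json.
print(json.dumps(features, indent=2))
```

The plugin then sees three ordinary features, so nothing changes on the Solr side; the cost of evaluating N features per document stays the same.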

Cheers



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

Re: extract multi-features for one solr feature extractor in solr learning to rank

Posted by Jianxiong Dong <jd...@gmail.com>.

Hi Michael,
     Thanks for the very valuable feedback.

> You can pass in different params in the
> features.json config for each feature, even though they use the same
> feature class.
I used this idea to extract some of the features described in this paper
(https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/letor3.pdf),
e.g. features 1-15 in Table 2 are just <query, doc> term features in
various forms.

{
    "store" : "MyFeatureStore",
    "name" : "term_count_1",
    "class" : "com.apache.solr.ltr.feature.TermCountFeature",
    "params" : {
       "field" : "a_text",
       "terms" : "${user_terms}",
       "method"  : "1"
    }
  },

{
    "store" : "MyFeatureStore",
    "name" : "term_count_2",
    "class" : "com.apache.solr.ltr.feature.TermCountFeature",
    "params" : {
       "field" : "a_text",
       "terms" : "${user_terms}",
       "method"  : "2"
    }
  },

where the method id corresponds to one of the features in Table 2 (1-15).
Although those features share the same class, the differences between
them are minor.  In a production deployment, this overhead may not be an
issue: after feature selection, probably only a small number of features
will be useful.

Another use case:
use a convolutional neural network or an LSTM to extract an embedded
feature vector for both the query and the document, where the dimension
of the embedded feature vectors would be 50-100. Then we feed those
features into learning-to-rank models.

> Your performance point about 100 features vs 1 feature is true,
> and pull requests to improve the plugin's performance and usability would
> be more than welcome!
I will run some performance benchmarks for a few use cases to determine
whether supporting multiple features from one feature class is worthwhile.
If it is, I will share the results and create a pull request.

Thanks

Jianxiong

On 4/18/17, Michael Nilsson <mn...@gmail.com> wrote:
> [...]


Re: extract multi-features for one solr feature extractor in solr learning to rank

Posted by Michael Nilsson <mn...@gmail.com>.

Hi Jianxiong,

What you say is true.  If you want 100 different feature values extracted,
you need to specify 100 different features in the features.json config so
that there is a direct mapping of features in and features out.  However,
you more than likely need to only implement 1 feature class that you will
use for those 100 feature values.  You can pass in different params in the
features.json config for each feature, even though they use the same
feature class.  In some cases you might be able to just have 1 feature
output 1 value that changes per document, if you can collapse those
features together.  This 2nd option may or may not work for you depending
on your data, what you are trying to bucket, and what algorithm you are
trying to use because not all algorithms can easily handle this case.  To
illustrate:

*A) Multiple binary features using the same 1 class*
{
    "name" : "isProductCheap",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : {
      "fq": [ "price:[0 TO 100]" ]
    }
},{
    "name" : "isProductExpensive",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : {
      "fq": [ "price:[101 TO 1000]" ]
    }
},{
    "name" : "isProductCrazyExpensive",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : {
      "fq": [ "price:[1001 TO *]" ]
    }
}
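
A script that generates option A might look something like the sketch below. It is an illustration only: the price_bucket_features helper and the bucket list are assumptions, and the output simply reproduces the three definitions above.

```python
import json


def price_bucket_features(buckets):
    """Build one binary SolrFeature definition per (name, low, high) bucket.
    A high bound of None is written as the open-ended '*'."""
    features = []
    for name, low, high in buckets:
        upper = "*" if high is None else str(high)
        features.append({
            "name": name,
            "class": "org.apache.solr.ltr.feature.SolrFeature",
            "params": {"fq": [f"price:[{low} TO {upper}]"]},
        })
    return features


# Bucket boundaries taken from the example above.
buckets = [
    ("isProductCheap", 0, 100),
    ("isProductExpensive", 101, 1000),
    ("isProductCrazyExpensive", 1001, None),
]

bucket_features = price_bucket_features(buckets)
print(json.dumps(bucket_features, indent=4))
```

Adding a bucket is then a one-line change to the list rather than another hand-written block in features.json.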

*B) 1 feature that outputs different values (some algorithms don't handle
discrete features well)*
{
    "name" : "productPricePoint",
    "class" : "org.apache.solr.ltr.feature.MyPricePointFeature",
    "params" : {

      // Either hard code price map in MyPricePointFeature.java, or
      // pass it in through params for flexible customization,
      // and return different values for cheap, expensive, and
      // crazyExpensive

    }
}
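
The mapping inside such a feature could be as simple as the sketch below. This is Python rather than a Solr plugin class, and it shows only the bucketing logic: MyPricePointFeature is hypothetical, and the bounds and the values 1.0, 2.0 and 3.0 are assumptions chosen to mirror the buckets in option A.

```python
import bisect

# Assumed inclusive upper bounds of the "cheap" and "expensive" buckets.
PRICE_UPPER_BOUNDS = [100, 1000]
# Assumed output values for cheap, expensive and crazyExpensive.
PRICE_POINT_VALUES = [1.0, 2.0, 3.0]


def price_point(price):
    """Collapse a price into a single discrete feature value."""
    # bisect_left returns the index of the first bound >= price,
    # or len(bounds) when the price exceeds every bound.
    index = bisect.bisect_left(PRICE_UPPER_BOUNDS, price)
    return PRICE_POINT_VALUES[index]


for price in (50, 100, 101, 1000, 1001):
    print(price, price_point(price))
```

Unlike the integer ranges in option A, this version also assigns a value to prices that fall between 100 and 101 (they count as expensive).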

The 2 options above satisfy most use cases, which is what we were targeting.
In my specific use case, I opted for option A, and wrote a simple script that
generates the features.json so I wouldn't have to write 100 similar features
by hand.  You also mentioned that you want to extract features sparsely.  You
can change the configuration of the Feature Transformer
<http://lucene.apache.org/solr/6_5_0/solr-ltr/org/apache/solr/ltr/response/transform/LTRFeatureLoggerTransformerFactory.html>
to return features that actually triggered in a sparse format
<https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank#LearningToRank-Advancedoptions>.
Your performance point about 100 features vs 1 feature is true, and pull
requests to improve the plugin's performance and usability would be more than
welcome!
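
On the client side, sparse output can be turned back into a dense vector for training. The sketch below assumes the logged value is a string of comma-separated name=value pairs in which features that did not trigger are omitted; the actual delimiters depend on how the transformer is configured, and parse_sparse_features is a hypothetical helper, not part of Solr.

```python
def parse_sparse_features(logged, feature_names):
    """Turn a sparse 'name=value,name=value' string into a dense list
    ordered by feature_names; features missing from the string get 0.0.
    Names that are not in feature_names are ignored in the output."""
    values = dict.fromkeys(feature_names, 0.0)
    if logged:
        for pair in logged.split(","):
            name, _, value = pair.partition("=")
            values[name] = float(value)
    return [values[name] for name in feature_names]


names = ["isProductCheap", "isProductExpensive", "isProductCrazyExpensive"]
dense = parse_sparse_features("isProductExpensive=1.0", names)
print(dense)
```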

-Michael
