Posted to java-user@lucene.apache.org by scott w <sc...@gmail.com> on 2009/10/08 16:54:33 UTC

Question about how to speed up custom scoring

I am trying to come up with a performant query that will let me use a
custom score, where the custom score is a sum-product over a set of
query-time weights and each weight is applied only if the query-time term
exists in the document. For example, if I have a doc with three fields,
company=Microsoft, city=Redmond, and size=large, I may want to score that
document with the function (company==Microsoft ? 0.3 : 0) +
(size==large ? 0.5 : 0) to get a score of 0.8. Attached is a subclass I
have tested that implements this, with one extra component: it allows the
relevance score to be combined in.
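In plain Java (no Lucene types; the terms and weights below are just the example's), the desired function is a presence-gated sum of query-time weights, blended with the relevance score by a bias:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SumProductScoreSketch {

    /** Sum of the query-time weights whose term is present in the document,
     *  linearly combined with the relevance score via the bias. */
    static float customScore(Set<String> docTerms, Map<String, Float> weights,
                             float bias, float relevanceScore) {
        float termWeightedScore = 0f;
        for (Map.Entry<String, Float> e : weights.entrySet()) {
            // A weight counts only if the document actually has the term.
            if (docTerms.contains(e.getKey())) {
                termWeightedScore += e.getValue();
            }
        }
        return bias * relevanceScore + (1 - bias) * termWeightedScore;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList(
            "company=Microsoft", "city=Redmond", "size=large"));
        Map<String, Float> weights = new LinkedHashMap<>();
        weights.put("company=Microsoft", 0.3f);  // applies: doc has the term
        weights.put("size=large", 0.5f);         // applies: doc has the term
        weights.put("city=Seattle", 0.9f);       // ignored: doc lacks the term
        // bias = 0, so only the term weights contribute: 0.3 + 0.5
        System.out.println(customScore(doc, weights, 0f, 1f));  // → 0.8
    }
}
```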

The problem is that this custom score is not performant at all. For
example, on a small index of 5 million documents, with 10 weights passed
in it does 0.01 req/sec.

Is there a way to compute the same custom score in a much more
performant way?

thanks,
Scott

Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
Right, exactly. I looked into payloads initially and realized they wouldn't
work for my use case.

On Fri, Oct 9, 2009 at 2:00 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Oops, just reread and realized you wanted query time weights.  Payloads are
> an index time thing.
>
>
> On Oct 9, 2009, at 5:49 PM, Grant Ingersoll wrote:
>
>  If you are trying to add specific term weights to terms in the index and
>> then incorporate them into scoring, you might benefit from payloads and the
>> PayloadTermQuery option.  See
>> http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
>>
>> -Grant

Re: Question about how to speed up custom scoring

Posted by Grant Ingersoll <gs...@apache.org>.
Oops, just reread and realized you wanted query-time weights. Payloads are
an index-time thing.

On Oct 9, 2009, at 5:49 PM, Grant Ingersoll wrote:

> If you are trying to add specific term weights to terms in the index  
> and then incorporate them into scoring, you might benefit from  
> payloads and the PayloadTermQuery option.  See http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
>
> -Grant
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Question about how to speed up custom scoring

Posted by Grant Ingersoll <gs...@apache.org>.
If you are trying to add specific term weights to terms in the index  
and then incorporate them into scoring, you might benefit from  
payloads and the PayloadTermQuery option.  See http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

-Grant

On Oct 8, 2009, at 11:56 AM, scott w wrote:

> Oops, forgot to include the class I mentioned. Here it is:
>
> public class QueryTermBoostingQuery extends CustomScoreQuery {
>   private Map<String, Float> queryTermWeights;
>   private float bias;
>   private IndexReader indexReader;
>
>   public QueryTermBoostingQuery(Query q, Map<String, Float> termWeights,
>       IndexReader indexReader, float bias) {
>     super(q);
>     this.indexReader = indexReader;
>     if (bias < 0 || bias > 1) {
>       throw new IllegalArgumentException("Bias must be between 0 and 1");
>     }
>     this.bias = bias;
>     queryTermWeights = termWeights;
>   }
>
>   @Override
>   public float customScore(int doc, float subQueryScore, float valSrcScore) {
>     Document document;
>     try {
>       document = indexReader.document(doc);
>     } catch (IOException e) {
>       throw new SearchException(e);
>     }
>     float termWeightedScore = 0;
>     for (String field : queryTermWeights.keySet()) {
>       String docFieldValue = document.get(field);
>       if (docFieldValue != null) {
>         Float weight = queryTermWeights.get(field);
>         if (weight != null) {
>           termWeightedScore += weight * Float.parseFloat(docFieldValue);
>         }
>       }
>     }
>     return bias * subQueryScore + (1 - bias) * termWeightedScore;
>   }
> }
>


Re: Question about how to speed up custom scoring

Posted by Andrzej Bialecki <ab...@getopt.org>.
Erick Erickson wrote:
> I suspect your problem here is the line:
> document = indexReader.document( doc );
> 
> See the caution in the docs
> 
> You could try using lazy loading (so you don't load all
> the terms of the document, just those you're interested
> in). And I *think* (but it's been a while) that if the terms
> you load are indexed that'll help. But this is mostly
> a guess.

Just to clarify: IndexReader.document(doc) and .document(doc, selector)
load _only_ stored fields; they don't interact at all with the
terms-related part of Lucene.
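The faster route, then, is to avoid loading stored documents per hit entirely: read each weighted field's values into a per-reader float array once, and score from array lookups. That is what Lucene's FieldCache (FieldCache.DEFAULT.getFloats(reader, field)) provides. A plain-Java sketch of the idea (hypothetical names, not the Lucene API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldArrayScoring {

    /** Score one doc from pre-loaded per-field value arrays: a few array
     *  reads per weighted field, no stored-document loading at all. */
    public static float score(Map<String, float[]> fieldValues,
                              Map<String, Float> queryWeights, int doc) {
        float score = 0f;
        for (Map.Entry<String, Float> e : queryWeights.entrySet()) {
            score += fieldValues.get(e.getKey())[doc] * e.getValue();
        }
        return score;
    }

    public static void main(String[] args) {
        // values[docId] for each field, filled once per index (re)open --
        // the role FieldCache plays inside a ValueSourceQuery.
        Map<String, float[]> fieldValues = new LinkedHashMap<>();
        fieldValues.put("model_1_score", new float[] {0.9f, 0.1f});
        fieldValues.put("model_2_score", new float[] {0.3f, 0.8f});

        Map<String, Float> queryWeights = new LinkedHashMap<>();
        queryWeights.put("model_1_score", 0.4f);
        queryWeights.put("model_2_score", 0.7f);

        System.out.println(score(fieldValues, queryWeights, 0));  // 0.9*0.4 + 0.3*0.7 ≈ 0.57
    }
}
```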


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
On Sun, Oct 11, 2009 at 9:10 AM, Jake Mannix <ja...@gmail.com> wrote:

> What do you mean "not something I can plug in on top of my original query"?
>
> Do you mean that you can't do it like the more complex example in the class
> you posted earlier in the thread, where you take a linear combination of
> the
> Map<String, Float> -based score, and the regular text score?
>

Yes exactly.



Yes I think this should work! Thanks for taking the time to clearly write up
a solution. Will report back after testing it out.

best,
Scott

Re: Question about how to speed up custom scoring

Posted by Jake Mannix <ja...@gmail.com>.
What do you mean "not something I can plug in on top of my original query"?

Do you mean that you can't do it like the more complex example in the class
you posted earlier in the thread, where you take a linear combination of the
Map<String, Float> -based score, and the regular text score?

Another option is to just use a CustomScoreQuery, as your other
attempt does:

--------
public class QueryTermBoostingQuery extends CustomScoreQuery {
  private float bias;

  public QueryTermBoostingQuery(Query q, ValueSourceQuery[] vQueries, float bias) {
    super(q, vQueries);
    this.bias = bias;
  }

  @Override
  public float customScore(int doc, float subQueryScore, float[] valSrcScores) {
    float termWeightedScore = 0;
    for (float valSrcScore : valSrcScores) termWeightedScore += valSrcScore;
    return bias * subQueryScore + (1 - bias) * termWeightedScore;
  }
}

And then to use it, you build up some FloatFieldSource instances like
before:

  FloatFieldSource m1 = // defined like I did in my previous message
  FloatFieldSource m2 = //...
  FloatFieldSource m3 = //...

  ValueSourceQuery vsq1 = new ValueSourceQuery(m1);
  vsq1.setBoost(model1Weight);

  // vsq2 and vsq3 also

  ValueSourceQuery[] vsq = { vsq1, vsq2, vsq3  };

  Query textQuery = QueryParser.parse("company:Microsoft");

  Query q = new QueryTermBoostingQuery(textQuery, vsq, bias);
-------

Does this work for you?

  -jake


Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
Haven't tried it yet, but looking at it closer, it looks like it's not
something I can plug in on top of my original query. I am definitely happy
using an approximation for the sake of performance, but I do need the
original results to stay the same.

On Fri, Oct 9, 2009 at 5:32 PM, Jake Mannix <ja...@gmail.com> wrote:

> Great Scott (hah!) - please do report back, even if it just works fine and
> you have no more questions, I'd like to know whether this really is
> what you were after and actually works for you.
>

> Note that the FieldCache is kinda "magic" - it's lazy (so the first query
> will
> be slow and you should fire one off just to warm it up after every time
> you reload an IndexReader), and kinda leaky: the entries in the FieldCache
> stick around until all references to the IndexReader they're keyed on
> get garbage collected (there's a WeakHashMap in the background), so
> don't accidentally hang onto references to those IndexReaders past
> when needed.
>

Good to know.


Re: Question about how to speed up custom scoring

Posted by Jake Mannix <ja...@gmail.com>.
Great Scott (hah!) - please do report back, even if it just works fine and
you have no more questions, I'd like to know whether this really is
what you were after and actually works for you.

Note that the FieldCache is kinda "magic" - it's lazy (so the first query
will be slow, and you should fire one off just to warm it up after every
time you reload an IndexReader), and kinda leaky: the entries in the
FieldCache stick around until all references to the IndexReader they're
keyed on get garbage collected (there's a WeakHashMap in the background),
so don't accidentally hang onto references to those IndexReaders past when
needed.
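That lazy, weakly-keyed behavior can be modeled in a few lines (a hypothetical sketch of the semantics, not the actual FieldCache implementation):

```java
import java.util.Map;
import java.util.WeakHashMap;

public class LazyWeakCache {
    // Keyed on the reader object itself, like FieldCache's WeakHashMap:
    // an entry vanishes once its reader is no longer strongly referenced.
    private final Map<Object, float[]> byReader = new WeakHashMap<>();
    int loads = 0;  // counts the expensive cache fills

    float[] getFloats(Object reader, int maxDoc) {
        return byReader.computeIfAbsent(reader, r -> {
            loads++;                   // first access pays the full cost here
            return new float[maxDoc];  // (a real cache would scan the index)
        });
    }

    public static void main(String[] args) {
        LazyWeakCache cache = new LazyWeakCache();
        Object reader = new Object();     // stand-in for an IndexReader
        cache.getFloats(reader, 1000);    // "warm-up query": slow, fills the cache
        cache.getFloats(reader, 1000);    // later queries hit the cached array
        System.out.println(cache.loads);  // → 1
    }
}
```

Dropping the last strong reference to `reader` makes the entry collectible, which is exactly why holding stale IndexReader references keeps the cache alive.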

  -jake


Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
Thanks Jake! I will test this out and report back soon in case it's helpful
to others. Definitely appreciate the help.

Scott


Re: Question about how to speed up custom scoring

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Oct 9, 2009 at 3:07 PM, scott w <sc...@gmail.com> wrote:

> Example Document:
> model_1_score = 0.9
> model_2_score = 0.3
> model_3_score = 0.7
>
> I want to be able to pass in the following map at query time:
> {model_1_score=0.4, model_2_score=0.7} and have that map get used as input
> to a custom score function that would look like: 0.9*0.4 + 0.3*0.7, so that
> it is summing over the specified fields and multiplying the indexed weight
> by the query-time weight.
>

Ok, now I think I get it.  You do have some index-time floats, and
query-time boosts.

You should be able to do a normal BooleanQuery with three OR'ed-together
ValueSourceQueries boosted by the input weights, it seems, as that is the
way they work: they take the values in different fields and use them as the
score, and the normal boosting technique applies the query-time weighting.

To get good performance with a ValueSourceQuery, I'd suggest using a
FieldCacheSource for speedier processing:

-----------
  String field = "model_1_score";
  float runTimeBoost = 0.5;

  ValueSource model1Source = new FloatFieldSource(field,
    new FieldCache.FloatParser() {
      public float parseFloat(String s) { return Float.parseFloat(s); }
    });

  Query model1Q = new ValueSourceQuery(model1Source);

  model1Q.setBoost(runTimeBoost);

// do this for model2 and model3 as well...

  BooleanQuery bq = new BooleanQuery();
  bq.add(model1Q, Occur.SHOULD);
  bq.add(model2Q, Occur.SHOULD);
  bq.add(model3Q, Occur.SHOULD);
---------------

  I haven't tried this code, but it seems like this is what you are trying
to do...

  -jake

Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
Hi Jake --

Sorry for the confusion. I have two similar but slightly different use
cases in mind, and the example I gave corresponds to one use case while
the code corresponds to the other, slightly more complicated one. Ignore
the original example, and let me restate the one I have in mind so it
hopefully makes more sense:

Example Document:
model_1_score = 0.9
model_2_score = 0.3
model_3_score = 0.7

I want to be able to pass in the following map at query time:
{model_1_score=0.4, model_2_score=0.7} and have that map get used as input
to a custom score function that would look like: 0.9*0.4 + 0.3*0.7, so that
it is summing over the specified fields and multiplying the indexed weight
by the query-time weight. For the purposes of this discussion, ignore the
bit in the code where I also factor in the relevance score and the bias
that determines how much weight to apply to it.
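As arithmetic, this is just a dot product of the indexed per-field values with the query-time map; a tiny plain-Java sketch (field names are the example's):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedFieldScore {

    /** Dot product of indexed per-field values with query-time weights;
     *  fields absent from the query map contribute nothing. */
    public static float score(Map<String, Float> indexedValues,
                              Map<String, Float> queryWeights) {
        float sum = 0f;
        for (Map.Entry<String, Float> e : queryWeights.entrySet()) {
            Float indexed = indexedValues.get(e.getKey());
            if (indexed != null) {
                sum += indexed * e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Float> doc = new LinkedHashMap<>();
        doc.put("model_1_score", 0.9f);
        doc.put("model_2_score", 0.3f);
        doc.put("model_3_score", 0.7f);  // not in the query map, so unused

        Map<String, Float> queryWeights = new LinkedHashMap<>();
        queryWeights.put("model_1_score", 0.4f);
        queryWeights.put("model_2_score", 0.7f);

        System.out.println(score(doc, queryWeights));  // 0.9*0.4 + 0.3*0.7 ≈ 0.57
    }
}
```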

Hopefully that makes more sense.

The other use case I had in mind doesn't care about the indexed value; it
only looks at whether the field is present or not, and then uses the
query-supplied weight to measure the relative importance of that field.

thanks,
Scott



On Fri, Oct 9, 2009 at 2:40 PM, Jake Mannix <ja...@gmail.com> wrote:

> Hey Scott,
>
>  I'm still not sure I understand what your dynamic boosts are for: they
> are the names of fields, right, not terms in the fields?  So in terms
> of your example { company = microsoft, city = redmond, size = big },
> the three possible choices for keys in your map are company, city,
> or size, right?
>
>  So if you passed in { company => 0.5, size => 0.3 } as your map,
> how would you compute the score for the two documents
>
>  { company: microsoft, city: redmond, size: big } and
>  { company: google, city: mountain view, size: big }
>
> Where is the Float that you have stored in the doc, that you're
> accessing via Float.parseFloat( document.get(fieldName) ) ?
>
>  So is it that you have fields in the index, which basically contain
> numeric data, and then you want to multiply those indexed floats
> by query-time values to make the score?
>
>  I'm sorry if I'm not fully following how this works.  Can you restate
> your example showing what you have in the index, and what comes
> in in the query?
>
>  -jake
>
> On Fri, Oct 9, 2009 at 2:18 PM, scott w <sc...@gmail.com> wrote:
>
> > (Apologies if this message gets sent more than once. I received an error
> > sending it the first two times so sent directly to Jake but reposting to
> > group.)
> > Hi Jake --
> >
> > Thanks for the feedback.
> >
> > What I am trying to implement is a way to custom score documents using a
> > scoring function that takes as input a map of fields (which may or may
> not
> > be in any given document) and weights for those fields supplied at query
> > time, and outputs an aggregate score that is based on taking the numeric
> > field weights that are already stored and indexed and then readjusting
> > those
> > weights based on the map.
> >
> > Another thing I would like to do is the same thing but for fields that do
> > not have weights associated with them in the index and so the query time
> > supplied weights essentially get used directly instead of adjusting the
> > already indexed weights.
> >
> > You can think of this is as implementing a form of personalization where
> > you
> > have a default set of weights and you want to adjust them on the fly
> > although our use case is a little different.
> >
> > thanks,
> > Scott
> >
> > On Fri, Oct 9, 2009 at 10:40 AM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > Scott,
> > >
> > >  To reiterate what Erick and Andrzej's said: calling
> > > IndexReader.document(docId)
> > > in your inner scoring loop is the source of your performance problem -
> > > iterating
> > > over all these stored fields is what is killing you.
> > >
> > >  To do this a better way, can you try to explain exactly what this
> Scorer
> > > is
> > > supposed to be doing?  You're extending CustomScoreQuery and which
> > > is usually used with ValueSourceQuery, but you don't use that part, and
> > > ignore the valSrcScore in your computation.
> > >
> > >  Where are the parts of your score coming from?  The termWeight map
> > > is used how exactly?
> > >
> > >  -jake
> > >
> > > On Fri, Oct 9, 2009 at 10:30 AM, scott w <sc...@gmail.com> wrote:
> > >
> > > > Thanks for the suggestions Erick. I am using Lucene 2.3. Terms are
> > stored
> > > > and given Andrzej's comments in the follow up email sounds like it's
> > not
> > > > the
> > > > stored field issue. I'll keep investigating...
> > > >
> > > > thanks,
> > > > Scott
> > > >
> > > > On Thu, Oct 8, 2009 at 8:06 AM, Erick Erickson <
> > erickerickson@gmail.com
> > > > >wrote:
> > > >
> > > > > I suspect your problem here is the line:
> > > > > document = indexReader.document( doc );
> > > > >
> > > > > See the caution in the docs
> > > > >
> > > > > You could try using lazy loading (so you don't load all
> > > > > the terms of the document, just those you're interested
> > > > > in). And I *think* (but it's been a while) that if the terms
> > > > > you load are indexed that'll help. But this is mostly
> > > > > a guess.
> > > > >
> > > > > What version of Lucene are you using???
> > > > >
> > > > > Good luck!
> > > > > Erick
> > > > >
> > > > > On Thu, Oct 8, 2009 at 10:56 AM, scott w <sc...@gmail.com>
> > wrote:
> > > > >
> > > > > > Oops, forgot to include the class I mentioned. Here it is:
> > > > > >
> > > > > > public class QueryTermBoostingQuery extends CustomScoreQuery {
> > > > > >  private Map<String, Float> queryTermWeights;
> > > > > >  private float bias;
> > > > > >  private IndexReader indexReader;
> > > > > >
> > > > > >  public QueryTermBoostingQuery( Query q, Map<String, Float>
> > > > termWeights,
> > > > > > IndexReader indexReader, float bias) {
> > > > > >    super( q );
> > > > > >    this.indexReader = indexReader;
> > > > > >    if (bias < 0 || bias > 1) {
> > > > > >      throw new IllegalArgumentException( "Bias must be between 0
> > and
> > > 1"
> > > > > );
> > > > > >    }
> > > > > >    this.bias = bias;
> > > > > >    queryTermWeights = termWeights;
> > > > > >  }
> > > > > >
> > > > > >  @Override
> > > > > >  public float customScore( int doc, float subQueryScore, float
> > > > > valSrcScore
> > > > > > ) {
> > > > > >    Document document;
> > > > > >    try {
> > > > > >      document = indexReader.document( doc );
> > > > > >    } catch (IOException e) {
> > > > > >      throw new SearchException( e );
> > > > > >    }
> > > > > >    float termWeightedScore = 0;
> > > > > >
> > > > > >    for (String field : queryTermWeights.keySet()) {
> > > > > >      String docFieldValue = document.get( field );
> > > > > >      if (docFieldValue != null) {
> > > > > >        Float weight = queryTermWeights.get( field );
> > > > > >        if (weight != null) {
> > > > > >          termWeightedScore += weight * Float.parseFloat(
> > > docFieldValue
> > > > );
> > > > > >        }
> > > > > >      }
> > > > > >    }
> > > > > >    return bias * subQueryScore + (1 - bias) * termWeightedScore;
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > I am trying to come up with a performant query that will allow
> me
> > > to
> > > > > use
> > > > > > a
> > > > > > > custom score where the custom score is a sum-product over a set
> > of
> > > > > query
> > > > > > > time weights where each weight gets applied only if the query
> > time
> > > > term
> > > > > > > exists in the document . So for example if I have a doc with
> > three
> > > > > > fields:
> > > > > > > company=Microsoft, city=Redmond, and size=large, I may want to
> > > score
> > > > > that
> > > > > > > document according to the following function: city==Microsoft ?
> > .3
> > > :
> > > > 0
> > > > > *
> > > > > > > size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a
> > > subclass
> > > > I
> > > > > > have
> > > > > > > tested that implements this with one extra component which is
> > that
> > > it
> > > > > > allow
> > > > > > > the relevance score to be combined in.
> > > > > > >
> > > > > > > The problem is this custom score is not performant at all. For
> > > > example,
> > > > > > on
> > > > > > > a small index of 5 million documents with 10 weights passed in
> it
> > > > does
> > > > > > 0.01
> > > > > > > req/sec.
> > > > > > >
> > > > > > > Are there ways to make to compute the same custom score but in
> a
> > > much
> > > > > > more
> > > > > > > performant way?
> > > > > > >
> > > > > > > thanks,
> > > > > > > Scott
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Question about how to speed up custom scoring

Posted by Jake Mannix <ja...@gmail.com>.
Hey Scott,

  I'm still not sure I understand what your dynamic boosts are for: they
are the names of fields, right, not terms in the fields?  So in terms
of your example { company = microsoft, city = redmond, size = big },
the three possible choices for keys in your map are company, city,
or size, right?

  So if you passed in { company => 0.5, size => 0.3 } as your map,
how would you compute the score for the two documents

  { company: microsoft, city: redmond, size: big } and
  { company: google, city: mountain view, size: big }

Where is the Float that you have stored in the doc, that you're
accessing via Float.parseFloat( document.get(fieldName) ) ?

  So is it that you have fields in the index, which basically contain
numeric data, and then you want to multiply those indexed floats
by query-time values to make the score?

  I'm sorry if I'm not fully following how this works.  Can you restate
your example showing what you have in the index, and what comes
in in the query?

  -jake

On Fri, Oct 9, 2009 at 2:18 PM, scott w <sc...@gmail.com> wrote:

> (Apologies if this message gets sent more than once. I received an error
> sending it the first two times so sent directly to Jake but reposting to
> group.)
> Hi Jake --
>
> Thanks for the feedback.
>
> What I am trying to implement is a way to custom score documents using a
> scoring function that takes as input a map of fields (which may or may not
> be in any given document) and weights for those fields supplied at query
> time, and outputs an aggregate score that is based on taking the numeric
> field weights that are already stored and indexed and then readjusting
> those
> weights based on the map.
>
> Another thing I would like to do is the same thing but for fields that do
> not have weights associated with them in the index and so the query time
> supplied weights essentially get used directly instead of adjusting the
> already indexed weights.
>
> You can think of this is as implementing a form of personalization where
> you
> have a default set of weights and you want to adjust them on the fly
> although our use case is a little different.
>
> thanks,
> Scott
>
> On Fri, Oct 9, 2009 at 10:40 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Scott,
> >
> >  To reiterate what Erick and Andrzej's said: calling
> > IndexReader.document(docId)
> > in your inner scoring loop is the source of your performance problem -
> > iterating
> > over all these stored fields is what is killing you.
> >
> >  To do this a better way, can you try to explain exactly what this Scorer
> > is
> > supposed to be doing?  You're extending CustomScoreQuery and which
> > is usually used with ValueSourceQuery, but you don't use that part, and
> > ignore the valSrcScore in your computation.
> >
> >  Where are the parts of your score coming from?  The termWeight map
> > is used how exactly?
> >
> >  -jake
> >
> > On Fri, Oct 9, 2009 at 10:30 AM, scott w <sc...@gmail.com> wrote:
> >
> > > Thanks for the suggestions Erick. I am using Lucene 2.3. Terms are
> stored
> > > and given Andrzej's comments in the follow up email sounds like it's
> not
> > > the
> > > stored field issue. I'll keep investigating...
> > >
> > > thanks,
> > > Scott
> > >
> > > On Thu, Oct 8, 2009 at 8:06 AM, Erick Erickson <
> erickerickson@gmail.com
> > > >wrote:
> > >
> > > > I suspect your problem here is the line:
> > > > document = indexReader.document( doc );
> > > >
> > > > See the caution in the docs
> > > >
> > > > You could try using lazy loading (so you don't load all
> > > > the terms of the document, just those you're interested
> > > > in). And I *think* (but it's been a while) that if the terms
> > > > you load are indexed that'll help. But this is mostly
> > > > a guess.
> > > >
> > > > What version of Lucene are you using???
> > > >
> > > > Good luck!
> > > > Erick
> > > >
> > > > On Thu, Oct 8, 2009 at 10:56 AM, scott w <sc...@gmail.com>
> wrote:
> > > >
> > > > > Oops, forgot to include the class I mentioned. Here it is:
> > > > >
> > > > > public class QueryTermBoostingQuery extends CustomScoreQuery {
> > > > >  private Map<String, Float> queryTermWeights;
> > > > >  private float bias;
> > > > >  private IndexReader indexReader;
> > > > >
> > > > >  public QueryTermBoostingQuery( Query q, Map<String, Float>
> > > termWeights,
> > > > > IndexReader indexReader, float bias) {
> > > > >    super( q );
> > > > >    this.indexReader = indexReader;
> > > > >    if (bias < 0 || bias > 1) {
> > > > >      throw new IllegalArgumentException( "Bias must be between 0
> and
> > 1"
> > > > );
> > > > >    }
> > > > >    this.bias = bias;
> > > > >    queryTermWeights = termWeights;
> > > > >  }
> > > > >
> > > > >  @Override
> > > > >  public float customScore( int doc, float subQueryScore, float
> > > > valSrcScore
> > > > > ) {
> > > > >    Document document;
> > > > >    try {
> > > > >      document = indexReader.document( doc );
> > > > >    } catch (IOException e) {
> > > > >      throw new SearchException( e );
> > > > >    }
> > > > >    float termWeightedScore = 0;
> > > > >
> > > > >    for (String field : queryTermWeights.keySet()) {
> > > > >      String docFieldValue = document.get( field );
> > > > >      if (docFieldValue != null) {
> > > > >        Float weight = queryTermWeights.get( field );
> > > > >        if (weight != null) {
> > > > >          termWeightedScore += weight * Float.parseFloat(
> > docFieldValue
> > > );
> > > > >        }
> > > > >      }
> > > > >    }
> > > > >    return bias * subQueryScore + (1 - bias) * termWeightedScore;
> > > > >   }
> > > > > }
> > > > >
> > > > > On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com>
> > wrote:
> > > > >
> > > > > > I am trying to come up with a performant query that will allow me
> > to
> > > > use
> > > > > a
> > > > > > custom score where the custom score is a sum-product over a set
> of
> > > > query
> > > > > > time weights where each weight gets applied only if the query
> time
> > > term
> > > > > > exists in the document . So for example if I have a doc with
> three
> > > > > fields:
> > > > > > company=Microsoft, city=Redmond, and size=large, I may want to
> > score
> > > > that
> > > > > > document according to the following function: city==Microsoft ?
> .3
> > :
> > > 0
> > > > *
> > > > > > size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a
> > subclass
> > > I
> > > > > have
> > > > > > tested that implements this with one extra component which is
> that
> > it
> > > > > allow
> > > > > > the relevance score to be combined in.
> > > > > >
> > > > > > The problem is this custom score is not performant at all. For
> > > example,
> > > > > on
> > > > > > a small index of 5 million documents with 10 weights passed in it
> > > does
> > > > > 0.01
> > > > > > req/sec.
> > > > > >
> > > > > > Are there ways to make to compute the same custom score but in a
> > much
> > > > > more
> > > > > > performant way?
> > > > > >
> > > > > > thanks,
> > > > > > Scott
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
(Apologies if this message gets sent more than once. I received an error
sending it the first two times so sent directly to Jake but reposting to
group.)
Hi Jake --

Thanks for the feedback.

What I am trying to implement is a way to custom-score documents using a
scoring function that takes as input a map of fields (which may or may not be
present in any given document) and weights for those fields supplied at query
time, and outputs an aggregate score by taking the numeric field values that
are already stored and indexed and reweighting them according to the map.

Another thing I would like to do is the same thing but for fields that do
not have weights associated with them in the index and so the query time
supplied weights essentially get used directly instead of adjusting the
already indexed weights.

You can think of this as implementing a form of personalization where you
have a default set of weights and you want to adjust them on the fly
although our use case is a little different.

thanks,
Scott

On Fri, Oct 9, 2009 at 10:40 AM, Jake Mannix <ja...@gmail.com> wrote:

> Scott,
>
>  To reiterate what Erick and Andrzej's said: calling
> IndexReader.document(docId)
> in your inner scoring loop is the source of your performance problem -
> iterating
> over all these stored fields is what is killing you.
>
>  To do this a better way, can you try to explain exactly what this Scorer
> is
> supposed to be doing?  You're extending CustomScoreQuery and which
> is usually used with ValueSourceQuery, but you don't use that part, and
> ignore the valSrcScore in your computation.
>
>  Where are the parts of your score coming from?  The termWeight map
> is used how exactly?
>
>  -jake
>
> On Fri, Oct 9, 2009 at 10:30 AM, scott w <sc...@gmail.com> wrote:
>
> > Thanks for the suggestions Erick. I am using Lucene 2.3. Terms are stored
> > and given Andrzej's comments in the follow up email sounds like it's not
> > the
> > stored field issue. I'll keep investigating...
> >
> > thanks,
> > Scott
> >
> > On Thu, Oct 8, 2009 at 8:06 AM, Erick Erickson <erickerickson@gmail.com
> > >wrote:
> >
> > > I suspect your problem here is the line:
> > > document = indexReader.document( doc );
> > >
> > > See the caution in the docs
> > >
> > > You could try using lazy loading (so you don't load all
> > > the terms of the document, just those you're interested
> > > in). And I *think* (but it's been a while) that if the terms
> > > you load are indexed that'll help. But this is mostly
> > > a guess.
> > >
> > > What version of Lucene are you using???
> > >
> > > Good luck!
> > > Erick
> > >
> > > On Thu, Oct 8, 2009 at 10:56 AM, scott w <sc...@gmail.com> wrote:
> > >
> > > > Oops, forgot to include the class I mentioned. Here it is:
> > > >
> > > > public class QueryTermBoostingQuery extends CustomScoreQuery {
> > > >  private Map<String, Float> queryTermWeights;
> > > >  private float bias;
> > > >  private IndexReader indexReader;
> > > >
> > > >  public QueryTermBoostingQuery( Query q, Map<String, Float>
> > termWeights,
> > > > IndexReader indexReader, float bias) {
> > > >    super( q );
> > > >    this.indexReader = indexReader;
> > > >    if (bias < 0 || bias > 1) {
> > > >      throw new IllegalArgumentException( "Bias must be between 0 and
> 1"
> > > );
> > > >    }
> > > >    this.bias = bias;
> > > >    queryTermWeights = termWeights;
> > > >  }
> > > >
> > > >  @Override
> > > >  public float customScore( int doc, float subQueryScore, float
> > > valSrcScore
> > > > ) {
> > > >    Document document;
> > > >    try {
> > > >      document = indexReader.document( doc );
> > > >    } catch (IOException e) {
> > > >      throw new SearchException( e );
> > > >    }
> > > >    float termWeightedScore = 0;
> > > >
> > > >    for (String field : queryTermWeights.keySet()) {
> > > >      String docFieldValue = document.get( field );
> > > >      if (docFieldValue != null) {
> > > >        Float weight = queryTermWeights.get( field );
> > > >        if (weight != null) {
> > > >          termWeightedScore += weight * Float.parseFloat(
> docFieldValue
> > );
> > > >        }
> > > >      }
> > > >    }
> > > >    return bias * subQueryScore + (1 - bias) * termWeightedScore;
> > > >   }
> > > > }
> > > >
> > > > On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com>
> wrote:
> > > >
> > > > > I am trying to come up with a performant query that will allow me
> to
> > > use
> > > > a
> > > > > custom score where the custom score is a sum-product over a set of
> > > query
> > > > > time weights where each weight gets applied only if the query time
> > term
> > > > > exists in the document . So for example if I have a doc with three
> > > > fields:
> > > > > company=Microsoft, city=Redmond, and size=large, I may want to
> score
> > > that
> > > > > document according to the following function: city==Microsoft ? .3
> :
> > 0
> > > *
> > > > > size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a
> subclass
> > I
> > > > have
> > > > > tested that implements this with one extra component which is that
> it
> > > > allow
> > > > > the relevance score to be combined in.
> > > > >
> > > > > The problem is this custom score is not performant at all. For
> > example,
> > > > on
> > > > > a small index of 5 million documents with 10 weights passed in it
> > does
> > > > 0.01
> > > > > req/sec.
> > > > >
> > > > > Are there ways to make to compute the same custom score but in a
> much
> > > > more
> > > > > performant way?
> > > > >
> > > > > thanks,
> > > > > Scott
> > > > >
> > > >
> > >
> >
>

Re: Question about how to speed up custom scoring

Posted by Jake Mannix <ja...@gmail.com>.
Scott,

  To reiterate what Erick and Andrzej said: calling
IndexReader.document(docId)
in your inner scoring loop is the source of your performance problem -
iterating
over all these stored fields is what is killing you.

  To do this a better way, can you try to explain exactly what this Scorer
is
supposed to be doing?  You're extending CustomScoreQuery, which
is usually used with ValueSourceQuery, but you don't use that part, and
ignore the valSrcScore in your computation.

  Where are the parts of your score coming from?  The termWeight map
is used how exactly?

  -jake

On Fri, Oct 9, 2009 at 10:30 AM, scott w <sc...@gmail.com> wrote:

> Thanks for the suggestions Erick. I am using Lucene 2.3. Terms are stored
> and given Andrzej's comments in the follow up email sounds like it's not
> the
> stored field issue. I'll keep investigating...
>
> thanks,
> Scott
>
> On Thu, Oct 8, 2009 at 8:06 AM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > I suspect your problem here is the line:
> > document = indexReader.document( doc );
> >
> > See the caution in the docs
> >
> > You could try using lazy loading (so you don't load all
> > the terms of the document, just those you're interested
> > in). And I *think* (but it's been a while) that if the terms
> > you load are indexed that'll help. But this is mostly
> > a guess.
> >
> > What version of Lucene are you using???
> >
> > Good luck!
> > Erick
> >
> > On Thu, Oct 8, 2009 at 10:56 AM, scott w <sc...@gmail.com> wrote:
> >
> > > Oops, forgot to include the class I mentioned. Here it is:
> > >
> > > public class QueryTermBoostingQuery extends CustomScoreQuery {
> > >  private Map<String, Float> queryTermWeights;
> > >  private float bias;
> > >  private IndexReader indexReader;
> > >
> > >  public QueryTermBoostingQuery( Query q, Map<String, Float>
> termWeights,
> > > IndexReader indexReader, float bias) {
> > >    super( q );
> > >    this.indexReader = indexReader;
> > >    if (bias < 0 || bias > 1) {
> > >      throw new IllegalArgumentException( "Bias must be between 0 and 1"
> > );
> > >    }
> > >    this.bias = bias;
> > >    queryTermWeights = termWeights;
> > >  }
> > >
> > >  @Override
> > >  public float customScore( int doc, float subQueryScore, float
> > valSrcScore
> > > ) {
> > >    Document document;
> > >    try {
> > >      document = indexReader.document( doc );
> > >    } catch (IOException e) {
> > >      throw new SearchException( e );
> > >    }
> > >    float termWeightedScore = 0;
> > >
> > >    for (String field : queryTermWeights.keySet()) {
> > >      String docFieldValue = document.get( field );
> > >      if (docFieldValue != null) {
> > >        Float weight = queryTermWeights.get( field );
> > >        if (weight != null) {
> > >          termWeightedScore += weight * Float.parseFloat( docFieldValue
> );
> > >        }
> > >      }
> > >    }
> > >    return bias * subQueryScore + (1 - bias) * termWeightedScore;
> > >   }
> > > }
> > >
> > > On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com> wrote:
> > >
> > > > I am trying to come up with a performant query that will allow me to
> > use
> > > a
> > > > custom score where the custom score is a sum-product over a set of
> > query
> > > > time weights where each weight gets applied only if the query time
> term
> > > > exists in the document . So for example if I have a doc with three
> > > fields:
> > > > company=Microsoft, city=Redmond, and size=large, I may want to score
> > that
> > > > document according to the following function: city==Microsoft ? .3 :
> 0
> > *
> > > > size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a subclass
> I
> > > have
> > > > tested that implements this with one extra component which is that it
> > > allow
> > > > the relevance score to be combined in.
> > > >
> > > > The problem is this custom score is not performant at all. For
> example,
> > > on
> > > > a small index of 5 million documents with 10 weights passed in it
> does
> > > 0.01
> > > > req/sec.
> > > >
> > > > Are there ways to make to compute the same custom score but in a much
> > > more
> > > > performant way?
> > > >
> > > > thanks,
> > > > Scott
> > > >
> > >
> >
>

Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
Thanks for the suggestions, Erick. I am using Lucene 2.3. The terms are stored,
and given Andrzej's comments in the follow-up email it sounds like it's not the
stored-field issue. I'll keep investigating...

thanks,
Scott

On Thu, Oct 8, 2009 at 8:06 AM, Erick Erickson <er...@gmail.com>wrote:

> I suspect your problem here is the line:
> document = indexReader.document( doc );
>
> See the caution in the docs
>
> You could try using lazy loading (so you don't load all
> the terms of the document, just those you're interested
> in). And I *think* (but it's been a while) that if the terms
> you load are indexed that'll help. But this is mostly
> a guess.
>
> What version of Lucene are you using???
>
> Good luck!
> Erick
>
> On Thu, Oct 8, 2009 at 10:56 AM, scott w <sc...@gmail.com> wrote:
>
> > Oops, forgot to include the class I mentioned. Here it is:
> >
> > public class QueryTermBoostingQuery extends CustomScoreQuery {
> >  private Map<String, Float> queryTermWeights;
> >  private float bias;
> >  private IndexReader indexReader;
> >
> >  public QueryTermBoostingQuery( Query q, Map<String, Float> termWeights,
> > IndexReader indexReader, float bias) {
> >    super( q );
> >    this.indexReader = indexReader;
> >    if (bias < 0 || bias > 1) {
> >      throw new IllegalArgumentException( "Bias must be between 0 and 1"
> );
> >    }
> >    this.bias = bias;
> >    queryTermWeights = termWeights;
> >  }
> >
> >  @Override
> >  public float customScore( int doc, float subQueryScore, float
> valSrcScore
> > ) {
> >    Document document;
> >    try {
> >      document = indexReader.document( doc );
> >    } catch (IOException e) {
> >      throw new SearchException( e );
> >    }
> >    float termWeightedScore = 0;
> >
> >    for (String field : queryTermWeights.keySet()) {
> >      String docFieldValue = document.get( field );
> >      if (docFieldValue != null) {
> >        Float weight = queryTermWeights.get( field );
> >        if (weight != null) {
> >          termWeightedScore += weight * Float.parseFloat( docFieldValue );
> >        }
> >      }
> >    }
> >    return bias * subQueryScore + (1 - bias) * termWeightedScore;
> >   }
> > }
> >
> > On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com> wrote:
> >
> > > I am trying to come up with a performant query that will allow me to
> use
> > a
> > > custom score where the custom score is a sum-product over a set of
> query
> > > time weights where each weight gets applied only if the query time term
> > > exists in the document . So for example if I have a doc with three
> > fields:
> > > company=Microsoft, city=Redmond, and size=large, I may want to score
> that
> > > document according to the following function: city==Microsoft ? .3 : 0
> *
> > > size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a subclass I
> > have
> > > tested that implements this with one extra component which is that it
> > allow
> > > the relevance score to be combined in.
> > >
> > > The problem is this custom score is not performant at all. For example,
> > on
> > > a small index of 5 million documents with 10 weights passed in it does
> > 0.01
> > > req/sec.
> > >
> > > Are there ways to make to compute the same custom score but in a much
> > more
> > > performant way?
> > >
> > > thanks,
> > > Scott
> > >
> >
>

Re: Question about how to speed up custom scoring

Posted by Erick Erickson <er...@gmail.com>.
I suspect your problem here is the line:
document = indexReader.document( doc );

See the caution in the docs.

You could try using lazy loading (so you don't load all
the terms of the document, just those you're interested
in). And I *think* (but it's been a while) that if the terms
you load are indexed that'll help. But this is mostly
a guess.

What version of Lucene are you using???

Good luck!
Erick
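
For what it's worth, in the Lucene 2.x API the lazy loading Erick describes is
exposed through IndexReader.document(int, FieldSelector), where a
SetBasedFieldSelector or MapFieldSelector names the handful of fields to
materialize and everything else stays on disk (those class names are the 2.x
API as I recall it, so worth checking against the 2.3 javadocs). The pattern
itself, sketched in plain Java without the Lucene classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of selective stored-field loading: only the requested fields
// are "read" (counted here), instead of materializing every stored field
// of the document on each hit.
public class LazyFieldLoadExample {

    static int fieldsRead = 0;  // instrumentation only, to show the saving

    static Map<String, String> loadDocument(Map<String, String> storedFields,
                                            Set<String> fieldsToLoad) {
        Map<String, String> loaded = new HashMap<String, String>();
        for (String field : fieldsToLoad) {
            String value = storedFields.get(field);
            if (value != null) {
                fieldsRead++;  // one simulated disk read per selected field
                loaded.put(field, value);
            }
        }
        return loaded;
    }
}
```

With ten weighted fields out of, say, fifty stored ones, this loads ten values
per hit instead of fifty, which is the whole point of a FieldSelector.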

On Thu, Oct 8, 2009 at 10:56 AM, scott w <sc...@gmail.com> wrote:

> Oops, forgot to include the class I mentioned. Here it is:
>
> public class QueryTermBoostingQuery extends CustomScoreQuery {
>  private Map<String, Float> queryTermWeights;
>  private float bias;
>  private IndexReader indexReader;
>
>  public QueryTermBoostingQuery( Query q, Map<String, Float> termWeights,
> IndexReader indexReader, float bias) {
>    super( q );
>    this.indexReader = indexReader;
>    if (bias < 0 || bias > 1) {
>      throw new IllegalArgumentException( "Bias must be between 0 and 1" );
>    }
>    this.bias = bias;
>    queryTermWeights = termWeights;
>  }
>
>  @Override
>  public float customScore( int doc, float subQueryScore, float valSrcScore
> ) {
>    Document document;
>    try {
>      document = indexReader.document( doc );
>    } catch (IOException e) {
>      throw new SearchException( e );
>    }
>    float termWeightedScore = 0;
>
>    for (String field : queryTermWeights.keySet()) {
>      String docFieldValue = document.get( field );
>      if (docFieldValue != null) {
>        Float weight = queryTermWeights.get( field );
>        if (weight != null) {
>          termWeightedScore += weight * Float.parseFloat( docFieldValue );
>        }
>      }
>    }
>    return bias * subQueryScore + (1 - bias) * termWeightedScore;
>   }
> }
>
> On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com> wrote:
>
> > I am trying to come up with a performant query that will allow me to use
> a
> > custom score where the custom score is a sum-product over a set of query
> > time weights where each weight gets applied only if the query time term
> > exists in the document . So for example if I have a doc with three
> fields:
> > company=Microsoft, city=Redmond, and size=large, I may want to score that
> > document according to the following function: city==Microsoft ? .3 : 0 *
> > size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a subclass I
> have
> > tested that implements this with one extra component which is that it
> allow
> > the relevance score to be combined in.
> >
> > The problem is this custom score is not performant at all. For example,
> on
> > a small index of 5 million documents with 10 weights passed in it does
> 0.01
> > req/sec.
> >
> > Are there ways to make to compute the same custom score but in a much
> more
> > performant way?
> >
> > thanks,
> > Scott
> >
>

Re: Question about how to speed up custom scoring

Posted by scott w <sc...@gmail.com>.
Oops, forgot to include the class I mentioned. Here it is:

public class QueryTermBoostingQuery extends CustomScoreQuery {
  private Map<String, Float> queryTermWeights;
  private float bias;
  private IndexReader indexReader;

  public QueryTermBoostingQuery( Query q, Map<String, Float> termWeights,
IndexReader indexReader, float bias) {
    super( q );
    this.indexReader = indexReader;
    if (bias < 0 || bias > 1) {
      throw new IllegalArgumentException( "Bias must be between 0 and 1" );
    }
    this.bias = bias;
    queryTermWeights = termWeights;
  }

  @Override
  public float customScore( int doc, float subQueryScore, float valSrcScore
) {
    Document document;
    try {
      document = indexReader.document( doc );
    } catch (IOException e) {
      throw new SearchException( e );
    }
    float termWeightedScore = 0;

    for (String field : queryTermWeights.keySet()) {
      String docFieldValue = document.get( field );
      if (docFieldValue != null) {
        Float weight = queryTermWeights.get( field );
        if (weight != null) {
          termWeightedScore += weight * Float.parseFloat( docFieldValue );
        }
      }
    }
    return bias * subQueryScore + (1 - bias) * termWeightedScore;
  }
}
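
The expensive part of the class above is the per-hit indexReader.document(doc)
call, which reads and parses every stored field of each matching document. One
way to avoid it, assuming the weighted fields hold numeric values, is to pull
each field into a float[] indexed by docId once per reader (roughly what
Lucene's FieldCache.DEFAULT.getFloats(reader, field) provides in the 2.x API)
so that customScore reduces to a few array reads. A self-contained sketch of
that idea, with the Lucene types replaced by plain collections:

```java
import java.util.List;
import java.util.Map;

// Sketch: precompute one float[] per weighted field up front, so scoring a
// hit is array lookups instead of a stored-document load plus
// Float.parseFloat per field. In a real implementation the constructor
// would take an IndexReader and fill the arrays from a field cache.
public class PrecomputedFieldScorer {

    private final float[][] fieldValues;  // [weighted field][docId]
    private final float[] weights;        // query-time weight per field
    private final float bias;

    PrecomputedFieldScorer(List<Map<String, Float>> docs,
                           Map<String, Float> queryTermWeights, float bias) {
        this.bias = bias;
        int nFields = queryTermWeights.size();
        fieldValues = new float[nFields][docs.size()];
        weights = new float[nFields];
        int f = 0;
        for (Map.Entry<String, Float> e : queryTermWeights.entrySet()) {
            weights[f] = e.getValue();
            for (int docId = 0; docId < docs.size(); docId++) {
                Float v = docs.get(docId).get(e.getKey());
                fieldValues[f][docId] = (v == null) ? 0f : v;  // missing field scores 0
            }
            f++;
        }
    }

    // Same combination rule as QueryTermBoostingQuery.customScore above.
    float customScore(int doc, float subQueryScore) {
        float termWeightedScore = 0f;
        for (int f = 0; f < weights.length; f++) {
            termWeightedScore += weights[f] * fieldValues[f][doc];
        }
        return bias * subQueryScore + (1 - bias) * termWeightedScore;
    }
}
```

The precomputation costs one pass over the index per field, paid once per
reader rather than once per hit.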

On Thu, Oct 8, 2009 at 7:54 AM, scott w <sc...@gmail.com> wrote:

> I am trying to come up with a performant query that will allow me to use a
> custom score where the custom score is a sum-product over a set of query
> time weights where each weight gets applied only if the query time term
> exists in the document . So for example if I have a doc with three fields:
> company=Microsoft, city=Redmond, and size=large, I may want to score that
> document according to the following function: city==Microsoft ? .3 : 0 *
> size ==large ? 0.5 : 0 to get a score of 0.8. Attached is a subclass I have
> tested that implements this with one extra component which is that it allow
> the relevance score to be combined in.
>
> The problem is this custom score is not performant at all. For example, on
> a small index of 5 million documents with 10 weights passed in it does 0.01
> req/sec.
>
> Are there ways to make to compute the same custom score but in a much more
> performant way?
>
> thanks,
> Scott
>