Posted to java-user@lucene.apache.org by Jeremy Volkman <jv...@gmail.com> on 2010/04/24 05:40:42 UTC

Base score to use for custom query?

I have a situation similar to the following that I'm trying to solve:

I have a field in my document that contains a range of numbers. Say, for
example, the universe of numbers is the range of integers from 0-100. My
field represents a subrange of those numbers in a token stream. So, for
example, if one document contains 20-30, its token stream contains the
terms [20, 21, 22, ..., 29]. Now I can quickly find all documents that
contain some number.
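
As a minimal sketch of the indexing scheme described above (the class and
method names are hypothetical, and the end of the range is assumed to be
exclusive, matching the 20-30 -> [20, ..., 29] example), the expansion of
a range into one token per integer could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: expands a numeric range into
// the list of tokens that would be fed to the field's token stream.
final class RangeTokens {
    /** Expands [start, endExclusive) into one token per integer. */
    static List<String> expand(int start, int endExclusive) {
        List<String> tokens = new ArrayList<String>();
        for (int n = start; n < endExclusive; n++) {
            tokens.add(Integer.toString(n));
        }
        return tokens;
    }
}
```

With this, expand(20, 30) yields the ten tokens "20" through "29".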

The next part of the problem is searching for all documents that intersect
with some subrange of numbers. Somewhat like a range query, but not exactly.
Say I want to search for all documents that touch the range [10, 30]. My
original implementation was to simply create a BooleanQuery full of
TermQuerys for each term in the range I was searching for. While this
returned the proper results, it did so with skewed scores. I'd prefer
documents containing numbers towards the beginning of my search range to be
scored higher than docs towards the end. So, if I had two documents, one
with 10-20, and one with 20-30, and I searched for [19,30], both documents
would be returned, but the second would be much more highly scored due to
its higher number of matched terms.
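
To make the skew concrete, here is a rough sketch that only counts how
many query terms each document matches. It assumes the query range is
inclusive and the document range is end-exclusive, as in the examples
above; real Lucene scores also depend on tf, idf, norms and coord, so the
count is only a proxy. The class and method names are hypothetical.

```java
// Hypothetical helper: counts the query terms a document's range matches.
final class OverlapCount {
    /**
     * Number of integers shared by the document range
     * [docStart, docEndExclusive) and the query range
     * [queryStart, queryEndInclusive].
     */
    static int matchedTerms(int docStart, int docEndExclusive,
                            int queryStart, int queryEndInclusive) {
        int lo = Math.max(docStart, queryStart);
        int hi = Math.min(docEndExclusive - 1, queryEndInclusive);
        return Math.max(0, hi - lo + 1); // 0 when the ranges do not touch
    }
}
```

For the query [19,30], the 10-20 document matches one term (19) while the
20-30 document matches ten (20 through 29), which is why the second
document dominates when matched clauses simply add up.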

So, my plan is to write a custom query which matches documents in
my range in a way such as:

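for (Term term : queryRange) {
    // termDocs() lives on IndexReader, not on the searcher itself
    TermDocs td = searcher.getIndexReader().termDocs(term);
    try {
        while (td.next()) {
            ...
        }
    } finally {
        td.close();
    }
}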

And for each document, set the score to some value proportional to the
matching term's distance from the beginning of the queried range.

My question is: what score should I start at, and what score should I end
at? If I assume that all documents matching the first term in my queried
range have score scoreMax, and all documents matching the last term have
scoreMin, and all documents matching in-between terms have a score between
scoreMax and scoreMin proportional to where they fall within the range, what
should scoreMax and scoreMin be?

My current thought is to start with the value passed to my Weight's
normalize() method, and work down to 0.0.
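
The interpolation described above can be sketched as follows. This is
only an illustration under stated assumptions: the class and method names
are hypothetical, the query range is treated as inclusive at both ends,
and scoreMax and scoreMin are left as parameters because choosing them is
exactly the open question.

```java
// Hypothetical helper: linear interpolation between scoreMax and scoreMin.
final class LinearScore {
    /**
     * Returns scoreMax for the first term of the range, scoreMin for the
     * last term, and a proportional value for terms in between.
     */
    static float scoreFor(int term, int rangeStart, int rangeEnd,
                          float scoreMax, float scoreMin) {
        if (term < rangeStart || term > rangeEnd) {
            throw new IllegalArgumentException("term outside queried range");
        }
        if (rangeEnd == rangeStart) {
            return scoreMax; // single-term range: nothing to interpolate
        }
        // Fraction of the way from the start (0.0) to the end (1.0).
        float t = (term - rangeStart) / (float) (rangeEnd - rangeStart);
        return scoreMax - t * (scoreMax - scoreMin);
    }
}
```

For example, with the range [10, 30], scoreMax = 1.0 and scoreMin = 0.0,
term 10 gets 1.0, term 20 gets 0.5 and term 30 gets 0.0.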

Thanks,
Jeremy

Re: Base score to use for custom query?

Posted by Jeremy Volkman <jv...@gmail.com>.

Hi Hoss,

I didn't end up writing my own query (well I did, but all it does is rewrite
into another query). I found DisjunctionMaxQuery, which seemed a good fit
for what I was trying to do. Instead of TermQuery, I used ConstantScoreQuery
combined with TermsFilter to create queries that weren't dependent upon the
Term's scores. For each ConstantScoreQuery, I set the boost much as you
suggested.

What's the difference in this case between using a DisjunctionMaxQuery,
which is what I've done, and using a BooleanQuery with disabled coord? And,
if I set omit norms, will TermQuery essentially return constant scores for
terms? Does the use of DMQ + CSQ + TermsFilter throw up any red flags in
your experience?
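
As a rough model of the two options being compared (not Lucene code; the
class and method names are hypothetical, and query normalization and
other scoring factors are ignored): a BooleanQuery with coord disabled
adds up the scores of its matching clauses, while a DisjunctionMaxQuery
takes the highest clause score plus a tie-breaker fraction of the rest.

```java
// Hypothetical model of how clause scores are combined; each array entry
// is the (constant) score of one clause that matched the document.
final class ClauseCombining {
    /** BooleanQuery-like combination with coord disabled: plain sum. */
    static float sumScore(float[] matchedClauseScores) {
        float sum = 0f;
        for (float s : matchedClauseScores) {
            sum += s;
        }
        return sum;
    }

    /** DisjunctionMaxQuery-like: max + tieBreaker * (sum of the others). */
    static float disMaxScore(float[] matchedClauseScores, float tieBreaker) {
        if (matchedClauseScores.length == 0) {
            return 0f;
        }
        float sum = 0f;
        float max = matchedClauseScores[0];
        for (float s : matchedClauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }
}
```

Under this model, a document matching clauses boosted 3, 2 and 1 scores 6
with the sum and 3 with the max (tie-breaker 0), so only the max keeps
the score independent of how many terms a document matches.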

Thanks again,
Jeremy

On Tue, Apr 27, 2010 at 2:14 PM, Chris Hostetter
<ho...@fucit.org> wrote:

>
> First off: if you haven't already, make sure you OMIT_NORMS when indexing
> this field, so that you don't have to worry about docs with "lots" of
> numbers scoring low purely because of the fieldNorm.
>
> Second: I wouldn't bother with a custom query. I would stick with your
> BooleanQuery approach, but make sure you do two things:
>
> 1) add a boost to each of your TermQueries based on how far it is from
> the end of the range. So if you have a range like [10, 19], give the 19
> clause a boost of 1, the 18 clause a boost of 2, the 17 clause a boost
> of 3, etc...
>
> 2) disable the coord. There is an option on BooleanQuery to do this, and
> it will make sure docs that only match one clause in your BooleanQuery
> don't get a penalty compared to docs that match many clauses in your
> BooleanQuery -- which is going to be important in ensuring that your
> boosts are useful.
>
> That should get you what you want, and if not then take a look at the
> score explanations and see if anything obvious jumps out -- post a
> followup with your code and the score explanations if you can't solve it
> to your liking.
>
> -Hoss

Re: Base score to use for custom query?

Posted by Chris Hostetter <ho...@fucit.org>.

First off: if you haven't already, make sure you OMIT_NORMS when indexing
this field, so that you don't have to worry about docs with "lots" of
numbers scoring low purely because of the fieldNorm.

Second: I wouldn't bother with a custom query. I would stick with your
BooleanQuery approach, but make sure you do two things:

1) add a boost to each of your TermQueries based on how far it is from
the end of the range. So if you have a range like [10, 19], give the 19
clause a boost of 1, the 18 clause a boost of 2, the 17 clause a boost
of 3, etc...

2) disable the coord. There is an option on BooleanQuery to do this, and
it will make sure docs that only match one clause in your BooleanQuery
don't get a penalty compared to docs that match many clauses in your
BooleanQuery -- which is going to be important in ensuring that your
boosts are useful.

That should get you what you want, and if not then take a look at the
score explanations and see if anything obvious jumps out -- post a
followup with your code and the score explanations if you can't solve it
to your liking.
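
The boost scheme from point 1 can be sketched in plain Java as below.
This is an illustration only: the class and method names are hypothetical,
the range is assumed to be inclusive at both ends, and attaching each
boost to its TermQuery and adding the clauses to a coord-disabled
BooleanQuery is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: computes one boost per term of the queried range.
final class RangeBoosts {
    /** Boost of 1 for the last term, growing by 1 per step toward the start. */
    static float boostFor(int term, int rangeStart, int rangeEnd) {
        if (term < rangeStart || term > rangeEnd) {
            throw new IllegalArgumentException("term outside queried range");
        }
        return rangeEnd - term + 1;
    }

    /** Maps every term in [rangeStart, rangeEnd] to its boost, in range order. */
    static Map<Integer, Float> boosts(int rangeStart, int rangeEnd) {
        Map<Integer, Float> result = new LinkedHashMap<Integer, Float>();
        for (int term = rangeStart; term <= rangeEnd; term++) {
            result.put(term, boostFor(term, rangeStart, rangeEnd));
        }
        return result;
    }
}
```

For the range [10, 19] this gives term 19 a boost of 1, term 18 a boost
of 2, and so on up to a boost of 10 for term 10.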


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org