Posted to java-user@lucene.apache.org by Gili Nachum <gi...@gmail.com> on 2013/05/04 19:46:58 UTC

Best practices in boosting by proximity?

Hi. *I would like for hits that contain the search terms in proximity to
each other to be ranked higher than hits in which the terms are scattered
across the doc.
Wondering if there's a best practice to achieve that?*
I also want every hit to contain all of the search terms (implicit
AND).

*Example:* when users search for: "lannisters always pay their debts", the
4 matching results should be ranked as follows (for simplicity, assume
equal field norms and TF/IDF in all hits):
1. "It is known that *Lannisters always pay their debts*"
2. "... Lannisters ... they sometimes *pay their debts* ... always with you"
3. "*Lannisters always* win ... debts ... pay tax ... their nature"
4. "Lannisters ... always ... pay ... their ... debts"

The first result has all 5 terms in proximity to each other.
The second has 3 terms in proximity.
The third has 2 terms in proximity.
The fourth has none of the terms in proximity to each other.

My current AND query that ignores proximity is: +lannisters +always +pay
+their +debts
So if there are M terms, I was thinking that I could add M-1 SHOULD phrase
queries to the original query:
"lannisters always" "always pay" "pay their" "their debts".

What are the pros and cons? Are there alternatives to consider?
Any Lucene class that helps achieve this?

Thx!
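
The proposal above (all M terms required, plus M-1 optional two-word phrases) can be sketched as a small helper that emits classic query-parser syntax. This is only an illustration: the class and method names are hypothetical, it assumes the parser's default operator is OR so that unprefixed clauses act as SHOULD, and it assumes the terms contain no characters that would need escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProximityQueryBuilder {

    /** Builds: +t1 +t2 ... "t1 t2" "t2 t3" ... (hypothetical helper, not a Lucene API). */
    public static String build(List<String> terms) {
        List<String> clauses = new ArrayList<>();
        // Required clauses: every term must occur (the implicit AND).
        for (String term : terms) {
            clauses.add("+" + term);
        }
        // Optional clauses: one two-word phrase per adjacent pair, M-1 in total.
        for (int i = 0; i + 1 < terms.size(); i++) {
            clauses.add("\"" + terms.get(i) + " " + terms.get(i + 1) + "\"");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lannisters", "always", "pay", "their", "debts");
        // prints: +lannisters +always +pay +their +debts "lannisters always" "always pay" "pay their" "their debts"
        System.out.println(build(terms));
    }
}
```

The required clauses decide which documents match; the optional phrases only add score, so a hit with scattered terms is still returned but ranks below one with adjacent pairs.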

Re: Best practices in boosting by proximity?

Posted by Gili Nachum <gi...@gmail.com>.

Hi Karl,

I guess I must keep the individual terms in my query, alongside the SHOULD
phrases with slop, since I don't want to miss results even if the distance
between the terms is huge.

Slop - I'll enrich the phrases with it.
Shingles - Good idea. I'll index bi-grams if performance becomes an issue.

Indeed, I used query parser syntax, but that's just for the sake of
communication; I'll probably implement this programmatically.

Cheers! (even while responding to the Lucene group ;).


On Sat, May 4, 2013 at 9:51 PM, Karl Wettin <ka...@kodapan.se> wrote:

> I just realized this mail contained several incomplete sentences. I blame
> Norwegian beers. Please allow me to try once again:
>
> The simplest solution is to make use of slop in PhraseQuery,
> SpanNearQuery, etc(?). Also consider combining in-order and unordered
> variants (see SpanNearQuery#isInOrder()) with different query boosts.
>
> Even though slop gives a greater score the closer the terms are, it
> might still make sense in some cases (usually when combined with other
> subqueries) to create a BooleanQuery that contains the same query more
> than once, with a greater boost on a smaller slop.
>
> You could also consider using shingles (even in combination with the
> above) for matching documents where the distance between two terms is zero.
> Generally it's hard to define a best practice. It depends on the corpora
> your index represents, your queries and your needs.
>
> Given your question it looks like you're using the query parser. Try
> something like "your proximity query"~20, but consider the cost of a large
> slop.

Re: Best practices in boosting by proximity?

Posted by Karl Wettin <ka...@kodapan.se>.

I just realized this mail contained several incomplete sentences. I blame Norwegian beers. Please allow me to try once again:

The simplest solution is to make use of slop in PhraseQuery, SpanNearQuery, etc(?). Also consider combining in-order and unordered variants (see SpanNearQuery#isInOrder()) with different query boosts.

Even though slop gives a greater score the closer the terms are, it might still make sense in some cases (usually when combined with other subqueries) to create a BooleanQuery that contains the same query more than once, with a greater boost on a smaller slop.

You could also consider using shingles (even in combination with the above) for matching documents where the distance between two terms is zero. Generally it's hard to define a best practice. It depends on the corpora your index represents, your queries and your needs.
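
To make the shingle idea concrete, here is a toy sketch in plain Java that does not use Lucene at all. It splits text into two-word shingles (bigrams) and counts how many of the query's bigrams a document shares. The class and method names are hypothetical, whitespace tokenization is an assumption, and the "..." in the example documents is simply treated as a filler token. In Lucene itself, ShingleFilter produces such tokens at analysis time, so an adjacent pair becomes a single indexed term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ShingleOverlap {

    /** Returns the two-word shingles of a whitespace-separated text, in order. */
    public static List<String> bigrams(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            out.add(tokens[i] + " " + tokens[i + 1]);
        }
        return out;
    }

    /** Counts the distinct query bigrams that also occur as document bigrams. */
    public static int sharedBigrams(String query, String doc) {
        Set<String> docShingles = new HashSet<>(bigrams(doc));
        int hits = 0;
        for (String shingle : new HashSet<>(bigrams(query))) {
            if (docShingles.contains(shingle)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String query = "lannisters always pay their debts";
        String[] docs = {
            "It is known that Lannisters always pay their debts",
            "... Lannisters ... they sometimes pay their debts ... always with you",
            "Lannisters always win ... debts ... pay tax ... their nature",
            "Lannisters ... always ... pay ... their ... debts"
        };
        // prints 4, 2, 1, 0 (one number per line), matching the desired ranking
        for (String doc : docs) {
            System.out.println(sharedBigrams(query, doc));
        }
    }
}
```

The count is not a Lucene score; it only shows why indexed bigrams reward exact adjacency without any positional matching at query time.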

Given your question it looks like you're using the query parser. Try something like "your proximity query"~20, but consider the cost of a large slop.
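
The suggestion of repeating the same phrase with a greater boost on a smaller slop can also be written in query-parser syntax, where "phrase"~N sets the slop and ^B sets the boost. The helper below is a sketch only: the class name is hypothetical, and the slop and boost values (2 and 20, 4 and 1) are arbitrary placeholders that would need tuning against a real index.

```java
import java.util.ArrayList;
import java.util.List;

public class BoostedSlopLadder {

    /**
     * Emits one optional sloppy-phrase clause per (slop, boost) pair,
     * for example "a b"~2^4. Hypothetical helper, not a Lucene API.
     */
    public static String build(String phrase, int[] slops, int[] boosts) {
        if (slops.length != boosts.length) {
            throw new IllegalArgumentException("slops and boosts must have the same length");
        }
        List<String> clauses = new ArrayList<>();
        for (int i = 0; i < slops.length; i++) {
            clauses.add("\"" + phrase + "\"~" + slops[i] + "^" + boosts[i]);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // prints: "lannisters always pay their debts"~2^4 "lannisters always pay their debts"~20^1
        System.out.println(build("lannisters always pay their debts",
                new int[] {2, 20}, new int[] {4, 1}));
    }
}
```

A document that matches the tight clause also matches the loose one, so it collects both contributions, while a document with the terms further apart earns only the lower-boosted one.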

On 4 May 2013 at 20:41, Karl Wettin wrote:

> The most simple solution is to use of slop in PhraseQuery, SpanNearQuery, etc(?). Also consider permutations of  #isInOrder() with alternative query boosts.
> 
> Even though slop will create a greater score the closer the terms are, it might still in some cases (usually when combined with other subqueries)  make sense to create a BooleanQuery that contains the same query but with a greater boost to a smaller slop. 
> 
> You could also consider using shingles (even in combination with above) for matching documents where the distance between two terms are. Generally it's hard to define a best practice. It depends on the corpora your index represents, your queries and your needs.
> 
> Given your question it looks like you're using the query parser. Try something like "your proximity query"~20, but consider the cost of a great slop.
> 
> 
> 		karl 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Best practices in boosting by proximity?

Posted by Karl Wettin <ka...@kodapan.se>.

The most simple solution is to use of slop in PhraseQuery, SpanNearQuery, etc(?). Also consider permutations of  #isInOrder() with alternative query boosts.

Even though slop will create a greater score the closer the terms are, it might still in some cases (usually when combined with other subqueries)  make sense to create a BooleanQuery that contains the same query but with a greater boost to a smaller slop. 

You could also consider using shingles (even in combination with above) for matching documents where the distance between two terms are. Generally it's hard to define a best practice. It depends on the corpora your index represents, your queries and your needs.

Given your question it looks like you're using the query parser. Try something like "your proximity query"~20, but consider the cost of a great slop.


		karl 
