You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Pelit Mamani <Pe...@timetoknow.com> on 2011/01/16 15:33:10 UTC

Result ordering

Hi,

I'm maintaining some Lucene-based code, and we're trying to get control over result ordering (users aren't happy with the default).
I know how to boost a Field or Document (very useful).
But:


1)      Is there a way to boost "OR" queries, based on the number of matched terms?
So the OR query "lord rings" will first show the document "LORD of the RINGS" (which holds both words), and only later "selected jewels and RINGS" (which only holds one word).
Is that what you call "Term Frequency"? And how do you boost it further?
I did a bit of tinkering and got the impression Lucene would boost it by default, but not enough - it's sometimes overridden by other boosting factors (maybe the boost for short expressions).


2)      Is there a way to boost based on "positions"?
So "LORD of the RINGS" gets precedence over "LORD of the funny golden RINGS", because the search words are positioned closer to each other?


3)      With wildcard searches, is there a way to boost documents that hold an exact match.
So if I search for "ring*", I first see the exact match "story of a RING", and only later "a RINGING failure"

Thanx a bunch.

RE: Result ordering

Posted by Pelit Mamani <Pe...@timetoknow.com>.

Thank you both, Umesh Prasad and Anshum.
You've been a great help.

-----Original Message-----
From: Anshum [mailto:anshumg@gmail.com] 
Sent: Sunday, January 16, 2011 5:26 PM
To: java-user@lucene.apache.org
Subject: Re: Result ordering

Hi Pelit,
Firstly, number of words that match a query in a document is not term frequency. You may get some more idea on the terminologies used in search at http://www.miislita.com/term-vector/term-vector-3.html

Looking at what you're trying to achieve, a few solutions to you would be.
Below, I am assuming your query to be for terms "A B":
1. Look at phrase queries and setting a high slop value so matches like "A B" are weighted more than "A .... B" (separated by a few positions).
2. Also you could have a custom similarity (advanced) http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/SimilarityDelegator.html
and
use the coord value.
3. For wildcard searches, a brute force mechanism could be, you may have an or query  (finally the query is expanded to an OR query anyways) for RING OR
RING* and boost the former part.

Looking at your current query seems like you'd need more understanding on lucene and getting a copy of "Lucene In Action 2nd Ed<http://www.manning.com/hatcher3/>."
would be  a good idea for you and everyone in your position.
Hope that helps.

--
Anshum Gupta
http://ai-cafe.blogspot.com

On Sun, Jan 16, 2011 at 8:03 PM, Pelit Mamani
<Pe...@timetoknow.com>wrote:

> Hi,
>
> I'm maintaining some Lucene-based code, and we're trying to get 
> control over result ordering (users aren't happy with the default).
> I know how to boost a Field or Document (very useful).
> But:
>
>
> 1)      Is there a way to boost "OR" queries, based on the number of
> matched terms?
> So the OR query "lord rings" will first show the document "LORD of the 
> RINGS" (which holds both words), and only later "selected jewels and RINGS"
> (which only holds one word).
> Is that what you call "Term Frequency"? And how do you boost it further?
> I did a bit of tinkering and got the impression Lucene would boost it 
> by default, but not enough - it's sometimes overridden by other 
> boosting factors (maybe the boost for short expressions).
>
>
> 2)      Is there a way to boost based on "positions"?
> So "LORD of the RINGS" gets precedence over "LORD of the funny golden 
> RINGS", because the search words are positioned closer to each other?
>
>
> 3)      With wildcard searches, is there a way to boost documents that hold
> an exact match.
> So if I search for "ring*", I first see the exact match "story of a 
> RING", and only later "a RINGING failure"
>
> Thanx a bunch.
>
>
>
>

************************************************************************************
This footnote confirms that this email message has been scanned by PineApp Mail-SeCure for the presence of malicious code, vandals & computer viruses.
************************************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Result ordering

Posted by Anshum <an...@gmail.com>.

Hi Pelit,
Firstly, number of words that match a query in a document is not term
frequency. You may get some more idea on the terminologies used in search
at
http://www.miislita.com/term-vector/term-vector-3.html

Looking at what you're trying to achieve, a few solutions to you would be.
Below, I am assuming your query to be for terms "A B":
1. Look at phrase queries and setting a high slop value so matches like "A
B" are weighted more than "A .... B" (separated by a few positions).
2. Also you could have a custom similarity (advanced)
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/SimilarityDelegator.html
and
use the coord value.
3. For wildcard searches, a brute force mechanism could be, you may have an
or query  (finally the query is expanded to an OR query anyways) for RING OR
RING* and boost the former part.

Looking at your current query seems like you'd need more understanding on
lucene and getting a copy of "Lucene In Action 2nd
Ed<http://www.manning.com/hatcher3/>."
would be  a good idea for you and everyone in your position.
Hope that helps.

--
Anshum Gupta
http://ai-cafe.blogspot.com

On Sun, Jan 16, 2011 at 8:03 PM, Pelit Mamani
<Pe...@timetoknow.com>wrote:

> Hi,
>
> I'm maintaining some Lucene-based code, and we're trying to get control
> over result ordering (users aren't happy with the default).
> I know how to boost a Field or Document (very useful).
> But:
>
>
> 1)      Is there a way to boost "OR" queries, based on the number of
> matched terms?
> So the OR query "lord rings" will first show the document "LORD of the
> RINGS" (which holds both words), and only later "selected jewels and RINGS"
> (which only holds one word).
> Is that what you call "Term Frequency"? And how do you boost it further?
> I did a bit of tinkering and got the impression Lucene would boost it by
> default, but not enough - it's sometimes overridden by other boosting
> factors (maybe the boost for short expressions).
>
>
> 2)      Is there a way to boost based on "positions"?
> So "LORD of the RINGS" gets precedence over "LORD of the funny golden
> RINGS", because the search words are positioned closer to each other?
>
>
> 3)      With wildcard searches, is there a way to boost documents that hold
> an exact match.
> So if I search for "ring*", I first see the exact match "story of a RING",
> and only later "a RINGING failure"
>
> Thanx a bunch.
>
>
>
>

Re: Result ordering

Posted by Umesh Prasad <um...@gmail.com>.

Hi Pelit,
 My comments are inline.


On Sun, Jan 16, 2011 at 8:03 PM, Pelit Mamani
<Pe...@timetoknow.com>wrote:

> Hi,
>
> I'm maintaining some Lucene-based code, and we're trying to get control
> over result ordering (users aren't happy with the default).
> I know how to boost a Field or Document (very useful).
> But:
>
>
> 1)      Is there a way to boost "OR" queries, based on the number of
> matched terms?
> So the OR query "lord rings" will first show the document "LORD of the
> RINGS" (which holds both words), and only later "selected jewels and RINGS"
> (which only holds one word).
> Is that what you call "Term Frequency"? And how do you boost it further?
> I did a bit of tinkering and got the impression Lucene would boost it by
> default, but not enough - it's sometimes overridden by other boosting
> factors (maybe the boost for short expressions).
>
>
The score of a document does get affected by the fraction of query terms it
actually contains. See
It is encapsulated in
Similarity.cord()<http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Similarity.html#coord%28int,%20int%29>
You can override it as per your needs.


2)      Is there a way to boost based on "positions"?
> So "LORD of the RINGS" gets precedence over "LORD of thet  funny golden
> RINGS", because the search words are positioned closer to each other?
>
>    SpanQueries support positional/proximity matches and allow you to
specify slopes, but checkout  if it assigns a higher score if terms are
closer.  Else if you are creating your queries programatically, then you can
create multiple span queries with different slopes, assign different boost
to them, and join them together as OR clauses.


>
> 3)      With wildcard searches, is there a way to boost documents that hold
> an exact match.
> So if I search for "ring*", I first see the exact match "story of a RING",
> and only later "a RINGING failure"
>
> Thanx a bunch.
>
>
>
>


-- 
---
Thanks & Regards
Umesh Prasad