Posted to java-user@lucene.apache.org by Andy Lee <ag...@earthlink.net> on 2005/10/28 01:13:22 UTC

trying to boost a phrase higher than its individual words

I have a situation where I want to search for individual words in a  
phrase as well as the phrase itself.  For example, if the user enters  
["classical music"] (with quotes) I want to find documents that  
contain "classical music" (the phrase) *and* the individual words  
"classical" and "music".

Of course, I could just search for the individual words and the  
phrase would get found as a consequence.  But I want documents  
containing the phrase to appear first in the search results, since  
the phrase is the user's primary interest.

I've constructed the following query, using boost values...

     [+(content:"classical music"^5.0 content:classical^0.1  
content:music^0.1)]

...but the boost values don't seem to affect the order of the search  
results.

Am I misunderstanding the purpose or proper usage of boosts, and if  
so, can someone explain (at least roughly) how to achieve the desired  
result?

--Andy


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: trying to boost a phrase higher than its individual words

Posted by jian chen <ch...@gmail.com>.

Hi,

It seems what you want to achieve could be implemented using the Cover
Density algorithm. I am not sure whether any existing query class in the
Lucene distribution does this already. In case not, this is what I am
thinking of:

Make a custom query class, called CoverDensityQuery, which is modeled after
PhraseQuery.

The CoverDensityQuery could accept two arguments as its constructor, the
Terms and the numOfTermsMatched.

For example, to search for "classical music", you will first construct
CoverDensityQuery like:
new CoverDensityQuery(new String[]{"classical", "music"}, 2);

This should return all documents that contain both "classical" and "music".
The ranking will be based on covers, where each cover is a span with the two
terms at its ends. The shorter the cover, the higher the rank; the more
covers a document has, the higher the rank.
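
[Editor's sketch] The ranking idea above can be illustrated in plain Java
without any Lucene classes. This is only a rough sketch under assumptions:
the class name CoverDensitySketch, the whitespace-token input, and the
"sum of 1/length over minimal covers" score are all hypothetical choices
made for illustration, not the CoverDensityQuery proposed in this message
and not an existing Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene.
class CoverDensitySketch {

    /**
     * Returns the minimal covers as {start, end} token positions (inclusive).
     * A cover is the shortest span ending at a given position that contains
     * every query term. Assumes the query terms are distinct.
     */
    static List<int[]> covers(String[] tokens, String[] terms) {
        List<String> wanted = Arrays.asList(terms);
        Map<String, Integer> lastSeen = new HashMap<>();
        List<int[]> result = new ArrayList<>();
        int lastStart = -1;
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!wanted.contains(tokens[pos])) {
                continue; // not a query term
            }
            lastSeen.put(tokens[pos], pos);
            if (lastSeen.size() < terms.length) {
                continue; // some query term has not been seen yet
            }
            // The tightest span ending here starts at the oldest "last seen" position.
            int start = pos;
            for (int p : lastSeen.values()) {
                start = Math.min(start, p);
            }
            // Keep only the first (shortest) span for each distinct start.
            if (start > lastStart) {
                result.add(new int[] {start, pos});
                lastStart = start;
            }
        }
        return result;
    }

    /** Shorter covers and more covers both raise the score (assumed formula). */
    static double score(String[] tokens, String[] terms) {
        double total = 0.0;
        for (int[] c : covers(tokens, terms)) {
            total += 1.0 / (c[1] - c[0] + 1);
        }
        return total;
    }
}
```

With this scoring, a document where "classical" and "music" are adjacent
scores higher than one where they are far apart, and a document missing
either term has no cover and scores zero.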

If that does not return enough documents, run a second query like:
new CoverDensityQuery(new String[]{"classical", "music"}, 1);

This should return documents containing either "classical" or "music", but
not both.

The detailed algorithm would be built along the same lines as PhraseQuery.

I will write such a query class in the future, just as a proof of concept
for the cover density algorithm.

Cheers,

Jian

On 10/27/05, Andy Lee <ag...@earthlink.net> wrote:
>
> I have a situation where I want to search for individual words in a
> phrase as well as the phrase itself. For example, if the user enters
> ["classical music"] (with quotes) I want to find documents that
> contain "classical music" (the phrase) *and* the individual words
> "classical" and "music".
>
> Of course, I could just search for the individual words and the
> phrase would get found as a consequence. But I want documents
> containing the phrase to appear first in the search results, since
> the phrase is the user's primary interest.
>
> I've constructed the following query, using boost values...
>
> [+(content:"classical music"^5.0 content:classical^0.1
> content:music^0.1)]
>
> ...but the boost values don't seem to affect the order of the search
> results.
>
> Am I misunderstanding the purpose or proper usage of boosts, and if
> so, can someone explain (at least roughly) how to achieve the desired
> result?
>
> --Andy

Re: trying to boost a phrase higher than its individual words

Posted by Chris Hostetter <ho...@fucit.org>.

: I've constructed the following query, using boost values...
:
:      [+(content:"classical music"^5.0 content:classical^0.1
: content:music^0.1)]
:
: ...but the boost values don't seem to affect the order of the search
: results.

What you're describing should work.
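
[Editor's sketch] As a rough illustration of why the boosts should matter,
here is a toy model in plain Java. It is not Lucene's scoring formula: it
assumes every matching clause has a raw score of 1.0, uses substring
matching in place of term matching, and ignores normalization factors.
The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model; real Lucene scores also involve tf, idf, norms, etc.
class BoostSketch {

    /** Adds up the boost of every clause that the text matches (raw score assumed 1.0). */
    static double score(String text, Map<String, Double> boostedClauses) {
        double total = 0.0;
        for (Map.Entry<String, Double> clause : boostedClauses.entrySet()) {
            if (text.contains(clause.getKey())) { // substring match stands in for a term match
                total += clause.getValue();
            }
        }
        return total;
    }

    /** The three clauses and boosts from the query quoted above. */
    static Map<String, Double> exampleQuery() {
        Map<String, Double> q = new LinkedHashMap<>();
        q.put("classical music", 5.0);
        q.put("classical", 0.1);
        q.put("music", 0.1);
        return q;
    }
}
```

In this model a document containing the phrase scores about 5.2, while one
containing the two words separately scores about 0.2, so the phrase match
sorts first whenever results are ordered by score.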

What does the value of IndexSearcher.explain(q,docId).toString() look like
for the docIds of documents you expect to be early in the list compared to
the output for a docId that is early in the list?



-Hoss



Re: trying to boost a phrase higher than its individual words

Posted by Doug Cutting <cu...@apache.org>.

Erik Hatcher wrote:
> On 28 Oct 2005, at 22:31, Andy Lee wrote:
> 
>> You know what, I was confusing Nutch and Lucene classes (as I've  done 
>> before), in this case the IndexSearcher classes.

Sorry.  The Nutch names are bad.

> I'm continually amazed at Doug's ability to build these using 
> only emacs - how he keeps it all straight himself is unbelievable.  :)

The trick is to forget all phone numbers, people's names, and what time 
you need to pick up kids from school!

Doug


Re: trying to boost a phrase higher than its individual words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 28 Oct 2005, at 22:31, Andy Lee wrote:
> You know what, I was confusing Nutch and Lucene classes (as I've  
> done before), in this case the IndexSearcher classes.  All I could  
> find was the *Nutch* IndexSearcher's getExplanation() method, which  
> I see sends toHtml() rather than toString() to its internal Lucene  
> IndexSearcher.

Good point.  I've found the duplication of class names between Nutch  
and Lucene extremely confusing myself.  I'm continually amazed at  
Doug's brilliance and ability to build these two beautiful tools  
using only emacs - how he keeps it all straight himself is  
unbelievable.  :)

     Erik


Re: trying to boost a phrase higher than its individual words

Posted by Andy Lee <ag...@earthlink.net>.

On Oct 28, 2005, at 8:17 PM, Chris Hostetter wrote:
> One thing to keep in mind is that if you have things you are adding to the
> query to restrict the results, but you don't want them to contribute to
> the score, then try using a Filter instead.  If you can't find an easy way
> to replace a query by a filter, try using a boost of 0.0001 (I'd say use
> a boost of 0, but I'm not sure that all query types handle that as
> correctly as they should)

Thanks for the advice.  I hadn't even noticed the Filter classes  
until very recently.  I really need to take the time to work  
methodically through LIA...

> Really? .. the LIA example i found was in 3.3.1, it just printed out
> explanation.toString() ... that should still work just fine even with the
> trunk of SVN.

You know what, I was confusing Nutch and Lucene classes (as I've done  
before), in this case the IndexSearcher classes.  All I could find  
was the *Nutch* IndexSearcher's getExplanation() method, which I see  
sends toHtml() rather than toString() to its internal Lucene  
IndexSearcher.

--Andy


Re: trying to boost a phrase higher than its individual words

Posted by Chris Hostetter <ho...@fucit.org>.

: Okay, I looked at the explanations and realized part of the problem
: was that I was applying a sort field to the search results, which I

that would definitely cause the boosts to be un-useful :)

: But I also do need to do some tuning, because I'm adding other stuff
: to the query that is also skewing the ranking.

One thing to keep in mind is that if you have things you are adding to the
query to restrict the results, but you don't want them to contribute to
the score, then try using a Filter instead.  If you can't find an easy way
to replace a query by a filter, try using a boost of 0.0001 (I'd say use
a boost of 0, but I'm not sure that all query types handle that as
correctly as they should).
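
[Editor's sketch] The difference between restricting and scoring can be
shown with a small plain-Java model. It assumes nothing about Lucene's
Filter classes; FilterSketch and its search method are hypothetical names
used only to show that the restriction decides which documents are
returned while a separate scorer alone decides their order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Hypothetical model of "filter restricts, query scores"; not a Lucene API.
class FilterSketch {

    /** Keeps only documents accepted by the filter, then orders them by score, highest first. */
    static List<String> search(List<String> docs,
                               Predicate<String> filter,
                               ToDoubleFunction<String> scorer) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (filter.test(doc)) { // membership only; the filter adds nothing to the score
                hits.add(doc);
            }
        }
        hits.sort((a, b) -> Double.compare(scorer.applyAsDouble(b), scorer.applyAsDouble(a)));
        return hits;
    }
}
```

Because the filter never touches the score, it cannot skew the ranking the
way an extra scored clause can.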

: It took me a while to figure out the differences between the
: searcher.explain() example in LIA and the latest changes to the API.
: It was a little annoying that I couldn't find a way to get plain text

Really? .. the LIA example I found was in 3.3.1, it just printed out
explanation.toString() ... that should still work just fine even with the
trunk of SVN.



-Hoss



Re: trying to boost a phrase higher than its individual words

Posted by Andy Lee <ag...@earthlink.net>.

On Oct 28, 2005, at 10:38 AM, Erik Hatcher wrote:
> So in this case a matching document must have both terms?  Or could  
> it just have one or the other?  If it must have both, you could try  
> a PhraseQuery with a slop of Integer.MAX_VALUE.  PhraseQuery scores  
> closer matches higher.

Good to know, thanks.  I saw references to slop but didn't know what  
they meant.  I'll see if this is one way I could solve my problem.

> But as Chris suggested - check the IndexSearcher.explain() for some  
> documents you feel should be ranked higher and work from there.   
> You're on the right track, but some tuning appears necessary.

Okay, I looked at the explanations and realized part of the problem  
was that I was applying a sort field to the search results, which I  
had forgotten.  So of course that affected the display order, duh.   
But I also do need to do some tuning, because I'm adding other stuff  
to the query that is also skewing the ranking.

It took me a while to figure out the differences between the  
searcher.explain() example in LIA and the latest changes to the API.   
It was a little annoying that I couldn't find a way to get plain text  
output -- it seems to be only HTML now.  Finally I wrote a  
convenience method that dumps the HTML to a file, which I view in a  
browser.

Thanks, Chris and Erik!

--Andy


Re: trying to boost a phrase higher than its individual words

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 27 Oct 2005, at 19:13, Andy Lee wrote:

> I have a situation where I want to search for individual words in a  
> phrase as well as the phrase itself.  For example, if the user  
> enters ["classical music"] (with quotes) I want to find documents  
> that contain "classical music" (the phrase) *and* the individual  
> words "classical" and "music".

So in this case a matching document must have both terms?  Or could  
it just have one or the other?  If it must have both, you could try a  
PhraseQuery with a slop of Integer.MAX_VALUE.  PhraseQuery scores  
closer matches higher.
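
[Editor's sketch] To make "scores closer matches higher" concrete: as far
as I know, Lucene's default Similarity weights a sloppy phrase match by
1 / (distance + 1), and the slop itself is set with PhraseQuery.setSlop().
The plain-Java sketch below mirrors that weighting only. The class name
and the minGap helper are hypothetical, and the gap measure is a
simplification that handles just two terms appearing in query order.

```java
// Hypothetical illustration of sloppy-phrase weighting; not Lucene code.
class SloppyPhraseSketch {

    /** Weight of a match whose terms are `distance` positions away from an exact phrase. */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    /**
     * Smallest number of words between `first` and a later `second`,
     * or -1 if `second` never follows `first`.
     */
    static int minGap(String[] tokens, String first, String second) {
        int best = -1;
        int lastFirst = -1;
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(first)) {
                lastFirst = i;
            } else if (tokens[i].equals(second) && lastFirst >= 0) {
                int gap = i - lastFirst - 1;
                if (best < 0 || gap < best) {
                    best = gap;
                }
            }
        }
        return best;
    }
}
```

An exact phrase has a gap of 0 and gets the full weight of 1.0; with two
words in between, the weight drops to one third. With a very large slop
every document containing both terms in order still matches, but tighter
occurrences contribute more.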

> Of course, I could just search for the individual words and the  
> phrase would get found as a consequence.  But I want documents  
> containing the phrase to appear first in the search results, since  
> the phrase is the user's primary interest.
>
> I've constructed the following query, using boost values...
>
>     [+(content:"classical music"^5.0 content:classical^0.1  
> content:music^0.1)]
>
> ...but the boost values don't seem to affect the order of the  
> search results.
>
> Am I misunderstanding the purpose or proper usage of boosts, and if  
> so, can someone explain (at least roughly) how to achieve the  
> desired result?

But as Chris suggested - check the IndexSearcher.explain() for some  
documents you feel should be ranked higher and work from there.   
You're on the right track, but some tuning appears necessary.

     Erik

