Posted to dev@lucene.apache.org by Brian Goetz <br...@quiotix.com> on 2002/09/21 02:55:12 UTC

Keyword boosting

I've got a searching problem which I know lots of other people have run 
across too.  We've got documents which have keywords (which we extract and 
put into a 'keywords' field) and also have body text (which we put in a 
'body' field).

Let's say we search for "text retrieval".  We want to find documents that 
have "text retrieval" in the body OR in the keywords, but we want to weight 
hits on the keywords more heavily.  I can't boost the tokens in the index 
base, so I have to do that through the query.

If I convert a query for phrase Q into this:
   body:Q OR keywords:Q^n
does that do what I want?
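
[Editor's sketch] For concreteness, here is a minimal sketch of the rewrite described above. It only builds the query string; it does not call Lucene. The class name BoostedQuery, the method name build, and the boost value 4.0 are hypothetical choices for illustration, and the sketch assumes the phrase itself contains no double quotes.

```java
// Hypothetical helper, not part of Lucene: it only assembles the query text.
final class BoostedQuery {

    // Expands a phrase into body:"..." OR keywords:"..."^boost.
    // Assumes `phrase` contains no double-quote characters.
    static String build(String phrase, float boost) {
        String quoted = "\"" + phrase + "\"";
        return "body:" + quoted + " OR keywords:" + quoted + "^" + boost;
    }

    public static void main(String[] args) {
        // The boost of 4.0 is an arbitrary illustrative value, not a recommendation.
        System.out.println(build("text retrieval", 4.0f));
    }
}
```

The resulting string would then be handed to whatever query parser the application already uses; the field names are the ones from the message above.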

How should I select the boost factor n?  Are there negative consequences to 
this strategy?  Am I better off doing two queries and merging the results 
myself?

--
Brian Goetz
Quiotix Corporation
brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032

http://www.quiotix.com


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Keyword boosting

Posted by Doug Cutting <cu...@lucene.com>.
Brian Goetz wrote:
> Let's say we search for "text retrieval".  We want to find documents that 
> have "text retrieval" in the body OR in the keywords, but we want to 
> weight hits on the keywords more heavily.  I can't boost the tokens in 
> the index base, so I have to do that through the query.

Tokens in a keyword field will naturally tend to impact a hit more than 
tokens in the body since the keyword field tends to be shorter, and 
Lucene normalizes for the length of the field.
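
[Editor's sketch] To give a rough feel for the size of this effect, the sketch below assumes a length norm of 1/sqrt(number of tokens in the field), which is the form documented for Lucene's default similarity in later releases; whether the version discussed here uses exactly that form is an assumption. The token counts (4 keywords, 400 body tokens) and the class name LengthNormSketch are made up for illustration.

```java
// Illustrative only: assumes lengthNorm = 1 / sqrt(numTokens).
final class LengthNormSketch {

    // Assumed normalization factor for a field with numTokens tokens.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    public static void main(String[] args) {
        double keywordsNorm = lengthNorm(4);   // hypothetical 4-token keywords field
        double bodyNorm = lengthNorm(400);     // hypothetical 400-token body field
        // Ratio of the two factors: how much more a keywords match would weigh
        // from length normalization alone, under the assumed formula.
        System.out.println(keywordsNorm / bodyNorm);
    }
}
```

Under these assumptions a match in the short field weighs roughly ten times as much before any explicit boost is applied. Actual scores also depend on term frequency, inverse document frequency, and how the norm is stored, so this is only a starting point for judging how large a query boost is needed.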

If that's not enough, in the latest CVS version you can boost each field 
of a document separately.

I've been thinking through a re-design of the way Lucene does scoring, 
both in order to provide an API so that folks can change the scoring, 
and to provide more powerful scoring mechanisms.  Stay tuned.

Doug

