Posted to java-user@lucene.apache.org by Rishabh Bajpai <r_...@lycos.com> on 2003/02/07 15:21:24 UTC

boosting of terms while retrieving results?

In one of the previous responses, the factors that make up the score for each term were explained as quoted below.
Can someone tell me how to achieve point (3) below: "There is a boost factor which allows you to boost certain terms at query time (e.g. if you value matches in the title field more than in the body field, boost the title field queries)"?
Is this referring to the ^ operator that we can use when constructing the query?
If not, then how do we achieve this?
If so, what I am not clear on is whether the boost necessarily has to be user-supplied, or whether it can be something we add ourselves when the query is constructed. (Does the user have to enter "this is lucene^ powered", or can we decide internally, in the query parser, what we want to boost before we fetch the results?)
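
To make the second option concrete: in Lucene's query syntax the ^ is followed by a number (for example title:lucene^2), and nothing in that syntax requires the user to be the one who types it. Below is a minimal sketch, assuming the application rewrites the user's terms into a boosted query string before handing it to the query parser. The class name BoostedQueryBuilder, the field names "title" and "body", and the boost value 2.0 are all hypothetical, chosen only for illustration; the sketch does not call Lucene and does not escape special query characters.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: turns plain user terms into a
// field-qualified query string with a boost on the title field.
final class BoostedQueryBuilder {

    // Assumed field names and boost value, for illustration only.
    static final String TITLE_FIELD = "title";
    static final String BODY_FIELD = "body";
    static final float TITLE_BOOST = 2.0f;

    // For each whitespace-separated term, emit "title:term^2.0 body:term".
    // Special query characters in the input are NOT escaped in this sketch.
    static String build(String userInput) {
        StringBuilder query = new StringBuilder();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append(TITLE_FIELD).append(':').append(term)
                 .append('^').append(String.format(Locale.ROOT, "%.1f", TITLE_BOOST))
                 .append(' ')
                 .append(BODY_FIELD).append(':').append(term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // The user types only the terms; the boost is added internally.
        System.out.println(build("lucene powered"));
    }
}
```

The resulting string would then be passed to the query parser in place of the raw user input, so the user never sees or types the boost.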

----------------------------------------------------------------
1. The more frequent a term is in the collection, the lower its weight (significance), because very popular words are present in so many docs that they don't do much to distinguish one document from another.
2. The more frequent a word is in a single document, the higher the document's 'value' when the query contains that word.  So the score goes up for words that are frequent in a document, esp. if they are not frequent in other documents in the collection.
3. There is a boost factor which allows you to boost certain terms at query time (e.g. if you value matches in the title field more than in the body field, boost the title field queries).
4. A normalization factor normalizes things so that longer documents don't have an advantage over shorter ones.
5. You can boost fields at index time (you'll have to use the nightly build instead of the 1.2 release to get this).
6. Also, a keyword in a shorter document is deemed more significant than in a longer one because it represents a larger percentage of the document.
----------------------------------------------------------
The official FAQ page on the Lucene site lists the scoring formula as:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
where:
  score_d   : score for document d
  sum_t     : sum for all terms t
  tf_q      : the square root of the frequency of t in the query
  tf_d      : the square root of the frequency of t in d
  idf_t     : log(numDocs / (docFreq_t + 1)) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t   : the user-specified boost for term t
  coord_q_d : number of terms in both query and document / number of terms in query
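
To see where boost_t enters, the formula above can be evaluated by hand. The following is a rough sketch in plain Java, not Lucene's own code: the class name ScoreSketch, the array-per-term layout, and the sample numbers in main are assumptions made for illustration, and idf_t is read as log(numDocs / (docFreq_t + 1)) + 1.0.

```java
// A sketch of the FAQ formula in plain Java; it does not call Lucene.
final class ScoreSketch {

    // idf_t, reading the FAQ expression as log(numDocs / (docFreq_t + 1)) + 1.0
    // (the grouping of the denominator is an assumption).
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // All arrays are indexed by query term t:
    //   docFreq[t]     number of documents containing t
    //   freqInQuery[t] frequency of t in the query
    //   freqInDoc[t]   frequency of t in document d (0 if absent)
    //   fieldTokens[t] number of tokens in d in the same field as t
    //   boost[t]       boost for t (1.0 means no boost)
    static double score(int numDocs, int[] docFreq, int[] freqInQuery,
                        int[] freqInDoc, int[] fieldTokens, double[] boost) {
        int terms = docFreq.length;
        if (terms == 0) {
            return 0.0;
        }

        // norm_q = sqrt(sum_t((tf_q * idf_t)^2))
        double sumSquares = 0.0;
        for (int t = 0; t < terms; t++) {
            double w = Math.sqrt(freqInQuery[t]) * idf(numDocs, docFreq[t]);
            sumSquares += w * w;
        }
        double normQ = Math.sqrt(sumSquares);

        double sum = 0.0;
        int matched = 0;
        for (int t = 0; t < terms; t++) {
            if (freqInDoc[t] == 0) {
                continue; // tf_d is 0, so the term contributes nothing
            }
            matched++;
            double idfT = idf(numDocs, docFreq[t]);
            double queryWeight = Math.sqrt(freqInQuery[t]) * idfT / normQ;
            double docWeight = Math.sqrt(freqInDoc[t]) * idfT / Math.sqrt(fieldTokens[t]);
            sum += queryWeight * docWeight * boost[t];
        }

        // coord_q_d = query terms found in d / query terms
        double coord = (double) matched / terms;
        return sum * coord;
    }

    public static void main(String[] args) {
        // Made-up statistics for a two-term query against one document.
        int numDocs = 100;
        int[] docFreq = {9, 49};
        int[] inQuery = {1, 1};
        int[] inDoc = {4, 1};
        int[] tokens = {16, 400};

        double plain = score(numDocs, docFreq, inQuery, inDoc, tokens, new double[] {1.0, 1.0});
        double boosted = score(numDocs, docFreq, inQuery, inDoc, tokens, new double[] {2.0, 1.0});
        System.out.println("no boost:              " + plain);
        System.out.println("first term boosted x2: " + boosted);
    }
}
```

Because boost_t multiplies only its own term's contribution inside the sum, doubling the boost of one term doubles that term's share of the score and leaves the other terms' shares unchanged.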


-rishabh



