Posted to java-user@lucene.apache.org by Eugene <ec...@gmail.com> on 2006/03/06 20:06:28 UTC

sumOfSquaredWeights for lengthNorm

Hi,

I would like to override the Similarity class lengthNorm(String 
fieldName, int numTerms) so that it behaves similarly to queryNorm(float 
sumOfSquaredWeights).  So the method signature becomes lengthNorm(String 
fieldName, float sumOfSquaredWeights) where sumOfSquaredWeights = sum of 
the squares of doc term weights.

Looking at the way sumOfSquaredWeights was used in 
org.apache.lucene.search.Query weight method, I would like to have a 
weight method in org.apache.lucene.document.Field (or maybe in 
org.apache.lucene.document.Document) which returns the weight based on 
the terms in the Field. Can anyone tell me how to start?

Thanks.

--
Eugene

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: sumOfSquaredWeights for lengthNorm

Posted by Chris Hostetter <ho...@fucit.org>.
: > 1) the "boosts" associated with Fields and Documents at indexing time,
: > which are combined with the lengthNorm at index time to determine a single
: > "norm"  value for the doc/field pair.
:
: I don't think this is what I want because the lengthNorm is still using
: the # of terms.

You can override the lengthNorm function to return "1.0" for all fields
regardless of length, and then the norm will consist solely of the
boost.
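
A minimal sketch of that idea follows. It does not use the Lucene classes;
SketchSimilarity, DefaultLikeSimilarity and BoostOnlySimilarity are
hypothetical stand-ins that only mimic the shape of
lengthNorm(String fieldName, int numTerms), and the norm() helper assumes the
norm is the product of the document boost, the field boost and the lengthNorm.

```java
// Standalone sketch: these classes imitate the shape of Lucene's
// Similarity.lengthNorm(String, int). They are NOT the Lucene classes.
abstract class SketchSimilarity {
    abstract float lengthNorm(String fieldName, int numTerms);

    // Assumed index-time combination: boosts multiplied with the lengthNorm.
    float norm(String fieldName, int numTerms, float docBoost, float fieldBoost) {
        return docBoost * fieldBoost * lengthNorm(fieldName, numTerms);
    }
}

// Length-based normalization: longer fields get a smaller norm.
class DefaultLikeSimilarity extends SketchSimilarity {
    @Override
    float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Ignores field length entirely, so the norm is just the boosts.
class BoostOnlySimilarity extends SketchSimilarity {
    @Override
    float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}
```

In a real index the same override would go in a Similarity subclass that is
used at indexing time, so that whatever boost is set on the field is the only
thing left in the norm.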

: Yes, I noticed this, but this is not what I want because it's using "idf
: of the terms being queried". What I want is for fieldWeight to use
: 1/sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = the sum of tf^2
: over all terms in the field.

Ah, but when you are building your index, how can lucene know what the tf
for all of the terms in the field are? ... you still have more documents
to add that can affect the tf.

If you know what the frequencies are when you add the document, then you
can square them and use the result as the field boost, and you should have
what you want.
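
As a rough sketch of computing such a value before adding the document, the
code below counts term frequencies in an already-tokenized field and returns
1/sqrt(sum of tf^2), the quantity asked for above. FieldWeightSketch and
inverseTfLength are hypothetical names, and the tokens are assumed to be
exactly the terms the analyzer would produce for the field.

```java
import java.util.HashMap;
import java.util.Map;

class FieldWeightSketch {
    // Returns 1 / sqrt(sum over distinct terms of tf^2) for one field's tokens.
    static float inverseTfLength(String[] tokens) {
        // Count how often each distinct term occurs in the field.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        // Sum the squared frequencies.
        double sumOfSquares = 0.0;
        for (int freq : tf.values()) {
            sumOfSquares += (double) freq * freq;
        }
        // An empty field has nothing to normalize; fall back to a neutral 1.0.
        return sumOfSquares == 0.0 ? 1.0f : (float) (1.0 / Math.sqrt(sumOfSquares));
    }
}
```

The result would then be set as the boost on the field, together with a
lengthNorm that returns 1.0. Keep in mind that the norm is stored with very
limited precision in the index, so nearby values may end up identical.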

: 3) I have another issue with the explanation, which seems to be a bug.
: Below, I've given a printout of the explanation.  Something strange
: happens when I use my own Similarity: it prints out all query terms
: despite some of them not appearing in the doc (see "formulation": the
: docFreq = 0, but it appears in the explanation).

It looks like your tf function is returning non-zero values when the input
is 0, which is going to give you real weird behavior -- including saying
that certain docs match a clause even when they don't.

Even if you want tf to return a constant value, you have to keep in
mind the "non-matching" case and return 0 when the input is 0.
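
A minimal sketch of a constant tf that still respects the non-matching case.
ConstantTfSketch is a hypothetical standalone class; in a custom Similarity
the same guard would go into the tf(float freq) override.

```java
class ConstantTfSketch {
    // Constant tf for any matching term, but 0 when the term does not occur.
    static float tf(float freq) {
        return freq > 0.0f ? 1.0f : 0.0f;
    }
}
```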



-Hoss




Re: sumOfSquaredWeights for lengthNorm

Posted by Eugene <ec...@gmail.com>.
Hi,

My comments in-line.

Chris Hostetter wrote:
> : I would like to override the Similarity class lengthNorm(String
> : fieldName, int numTerms) so that it behaves similarly to queryNorm(float
> : sumOfSquaredWeights).  So the method signature becomes lengthNorm(String
> : fieldName, float sumOfSquaredWeights) where sumOfSquaredWeights = sum of
> : the squares of doc term weights.
> :
> : Looking at the way sumOfSquaredWeights was used in
> : org.apache.lucene.search.Query weight method, I would like to have a
> : weight method in org.apache.lucene.document.Field (or maybe in
> : org.apache.lucene.document.Document) which returns the weight based on
> : the terms in the Field. Can anyone tell me how to start?
> 
> can you explain more what you mean by "doc term weights" ?

> It seems like what you are interested in doing is changing the way norm
> value of a doc/field is determined so that it's determined not just by the
> number of terms in the field, but also by the "weight" of some terms --
> i'm not sure if you mean the terms being queried on, or the terms stored
> in the field for the document

Yes, you got the idea, I mean the terms in the field. I think the term 
weights of the query are already factored into queryNorm. I want to 
normalize based on the field's terms' weights too.

> Two concepts that already exist (and may be useful to you) are:
> 
> 1) the "boosts" associated with Fields and Documents at indexing time,
> which are combined with the lengthNorm at index time to determine a single
> "norm"  value for the doc/field pair.

I don't think this is what I want because the lengthNorm is still using 
the # of terms.

> 2) the idf of the terms being queried on, which is multiplied by the field
> norm as part of the query time scoring (you can see it in the
> fieldWeight in a score Explanation)

Yes, I noticed this, but this is not what I want because it's using "idf 
of the terms being queried". What I want is for fieldWeight to use 
1/sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = the sum of tf^2 
over all terms in the field.


3) I have another issue with the explanation, which seems to be a bug. 
Below, I've given a printout of the explanation.  Something strange 
happens when I use my own Similarity: it prints out all query terms 
despite some of them not appearing in the doc (see "formulation": the 
docFreq = 0, but it appears in the explanation).

Also, the scores don't tally. I printed out the raw score for doc 21 
using the HitCollector and it returns 1.4241531. When I print out the 
explanation, the score is 2.731636. Shouldn't these be the same, since 
neither is a normalized score?

------  Explanation --------
doc id:21      score = 1.4241531

Explanation = 2.731636 = sum of:
......
   0.30496213 = weight(Contents:formulation in 21), product of:
     0.40874794 = queryWeight(Contents:formulation), product of:
       5.9687076 = idf(docFreq=0)
       0.06848182 = queryNorm
     0.74608845 = fieldWeight(Contents:formulation in 21), product of:
       1.0 = tf(termFreq(Contents:formulation)=0)
       5.9687076 = idf(docFreq=0)
       0.125 = fieldNorm(field=Contents, doc=21)
......
------ End of Explanation --------
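
The products in the printout above can be re-derived with a few lines of
arithmetic. The sketch below only multiplies the factors exactly as the
explanation lists them; ExplanationArithmetic is a hypothetical helper and
not part of Lucene.

```java
class ExplanationArithmetic {
    // weight = queryWeight * fieldWeight, following the printed structure:
    //   queryWeight = idf * queryNorm
    //   fieldWeight = tf * idf * fieldNorm
    static float weight(float tf, float idf, float queryNorm, float fieldNorm) {
        float queryWeight = idf * queryNorm;
        float fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // Factors copied from the "formulation" clause; tf is 1.0 although termFreq is 0.
        System.out.println(weight(1.0f, 5.9687076f, 0.06848182f, 0.125f));
        // With tf = 0 for the absent term, the whole clause contributes nothing.
        System.out.println(weight(0.0f, 5.9687076f, 0.06848182f, 0.125f));
    }
}
```

The first call reproduces the clause weight of about 0.305 shown in the
printout. The second shows that the same clause would contribute 0 if tf
were 0 for a term frequency of 0.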


Thanks.

--
Eugene



Re: sumOfSquaredWeights for lengthNorm

Posted by Chris Hostetter <ho...@fucit.org>.
: I would like to override the Similarity class lengthNorm(String
: fieldName, int numTerms) so that it behaves similarly to queryNorm(float
: sumOfSquaredWeights).  So the method signature becomes lengthNorm(String
: fieldName, float sumOfSquaredWeights) where sumOfSquaredWeights = sum of
: the squares of doc term weights.
:
: Looking at the way sumOfSquaredWeights was used in
: org.apache.lucene.search.Query weight method, I would like to have a
: weight method in org.apache.lucene.document.Field (or maybe in
: org.apache.lucene.document.Document) which returns the weight based on
: the terms in the Field. Can anyone tell me how to start?

can you explain more what you mean by "doc term weights" ?

It seems like what you are interested in doing is changing the way norm
value of a doc/field is determined so that it's determined not just by the
number of terms in the field, but also by the "weight" of some terms --
i'm not sure if you mean the terms being queried on, or the terms stored
in the field for the document

Two concepts that already exist (and may be useful to you) are:

1) the "boosts" associated with Fields and Documents at indexing time,
which are combined with the lengthNorm at index time to determine a single
"norm"  value for the doc/field pair.

2) the idf of the terms being queried on, which is multiplied by the field
norm as part of the query time scoring (you can see it in the
fieldWeight in a score Explanation)



-Hoss

