Posted to java-user@lucene.apache.org by Eugene <ec...@gmail.com> on 2006/03/06 20:06:28 UTC
sumOfSquaredWeights for lengthNorm
Hi,
I would like to override the Similarity class's lengthNorm(String
fieldName, int numTerms) so that it behaves similarly to queryNorm(float
sumOfSquaredWeights). The method signature would become lengthNorm(String
fieldName, float sumOfSquaredWeights), where sumOfSquaredWeights is the sum of
the squares of the doc term weights.
Looking at the way sumOfSquaredWeights is used in the
org.apache.lucene.search.Query weight method, I would like to have a
weight method in org.apache.lucene.document.Field (or maybe in
org.apache.lucene.document.Document) that returns the weight based on
the terms in the Field. Can anyone tell me how to start?
Thanks.
--
Eugene
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: sumOfSquaredWeights for lengthNorm
Posted by Chris Hostetter <ho...@fucit.org>.
: > 1) the "boosts" associated with Fields and Documents at indexing time,
: > which are combined with the lengthNorm at index time to determine a single
: > "norm" value for the doc/field pair.
:
: I don't think this is what I want because the lengthNorm is still using
: the # of terms.
You can override the lengthNorm function to return "1.0" for all fields
regardless of length, and then the norm will consist solely of the
boost.
: Yes, I noticed this, but this is not what I want because it's using the "idf
: of the terms being queried". What I want fieldWeight to use is
: 1 / sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = the sum of tf^2 over
: all terms in the field.
Ah, but when you are building your index, how can Lucene know what the tf
values for all of the terms in the field are? ... you still have more documents
to add that can affect the tf.
If you know what the frequencies are when you add the document, then you
can square that and use it as the field boost and you should have what you
want.
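A minimal sketch of that idea, assuming the field's tokens are already available as a String array when the document is built. It computes the quantity asked for above, 1 / sqrt(sum of tf^2 over the field's terms). The class and method names (FieldBoostSketch, cosineBoost) are hypothetical and no Lucene classes are used; the result would presumably be passed to Field.setBoost(float) together with a lengthNorm that returns 1.0. Norms are stored with limited precision, so the indexed value will only approximate what is computed here.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch; FieldBoostSketch and cosineBoost are illustrative names.
final class FieldBoostSketch {

    // Returns 1 / sqrt(sum over distinct terms of tf^2) for one field's tokens.
    static float cosineBoost(String[] tokens) {
        // Count how often each term occurs in the field.
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        double sumOfSquaredWeights = 0.0;
        for (int freq : tf.values()) {
            sumOfSquaredWeights += (double) freq * freq;
        }
        // Empty field: fall back to a neutral boost rather than dividing by zero.
        if (sumOfSquaredWeights == 0.0) {
            return 1.0f;
        }
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        String[] tokens = {"query", "formulation", "query"};
        // tf(query) = 2 and tf(formulation) = 1, so the sum of squares is 5.
        System.out.println(cosineBoost(tokens));
    }
}
```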
: 3) I have another issue with the explanation, which seems to be a bug.
: Below, I've given a printout of the explanation. There's something
: strange: when I use my own Similarity, it prints out all query terms
: despite some of them not appearing in the doc (see "formulation": the
: docFreq = 0, but it appears in the explanation).
It looks like your tf function is returning non-zero values when the input
is 0, which is going to give you really weird behavior -- including saying
that certain docs match a clause even when they don't.
Even if you want tf to return a constant value, you have to keep in
mind the "non-matching" case and return 0 when the input is 0.
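A minimal sketch of such a guard, written as a standalone static method so it runs without Lucene on the classpath. The class name ConstantTfSketch is hypothetical; in a custom Similarity the same body would presumably go into an override of tf(float freq).

```java
// Plain-Java sketch of a constant tf that still reports non-matches as 0.
final class ConstantTfSketch {

    // freq is the term's frequency in the document, 0 when the term is absent.
    static float tf(float freq) {
        // Constant weight for any match, but never a weight for a non-match.
        return freq > 0.0f ? 1.0f : 0.0f;
    }

    public static void main(String[] args) {
        System.out.println(tf(0.0f)); // non-matching case
        System.out.println(tf(3.0f)); // matching case
    }
}
```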
-Hoss
Re: sumOfSquaredWeights for lengthNorm
Posted by Eugene <ec...@gmail.com>.
Hi,
My comments in-line.
Chris Hostetter wrote:
> : I would like to override the Similarity class's lengthNorm(String
> : fieldName, int numTerms) so that it behaves similarly to queryNorm(float
> : sumOfSquaredWeights). The method signature would become lengthNorm(String
> : fieldName, float sumOfSquaredWeights), where sumOfSquaredWeights is the sum of
> : the squares of the doc term weights.
> :
> : Looking at the way sumOfSquaredWeights is used in the
> : org.apache.lucene.search.Query weight method, I would like to have a
> : weight method in org.apache.lucene.document.Field (or maybe in
> : org.apache.lucene.document.Document) that returns the weight based on
> : the terms in the Field. Can anyone tell me how to start?
>
> Can you explain more what you mean by "doc term weights"?
> It seems like what you are interested in doing is changing the way the norm
> value of a doc/field is determined so that it's determined not just by the
> number of terms in the field, but also by the "weight" of some terms --
> I'm not sure if you mean the terms being queried on, or the terms stored
> in the field for the document.
Yes, you got the idea; I mean the terms in the field. I think the term
weights of the query are already factored into queryNorm. I want to
normalize based on the field's term weights too.
> Two concepts that already exist (and may be useful to you) are:
>
> 1) the "boosts" associated with Fields and Documents at indexing time,
> which are combined with the lengthNorm at index time to determine a single
> "norm" value for the doc/field pair.
I don't think this is what I want because the lengthNorm is still using
the # of terms.
> 2) the idf of the terms being queried on, which is multiplied by the field
> norm as part of the query time scoring (you can see it in the
> fieldWeight in a score Explanation)
Yes, I noticed this, but this is not what I want because it's using the "idf
of the terms being queried". What I want fieldWeight to use is
1 / sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = the sum of tf^2 over
all terms in the field.
3) I have another issue with the explanation, which seems to be a bug.
Below, I've given a printout of the explanation. There's something
strange: when I use my own Similarity, it prints out all query terms
despite some of them not appearing in the doc (see "formulation": the
docFreq = 0, but it appears in the explanation).
Also, the scores don't tally. I printed out the raw score for doc 21
using the HitCollector and it returns 1.4241531. When I print the explanation,
the score is 2.731636. Shouldn't these be the same, since neither is a
normalized score?
------ Explanation --------
doc id:21 score = 1.4241531
Explanation = 2.731636 = sum of:
  ......
  0.30496213 = weight(Contents:formulation in 21), product of:
    0.40874794 = queryWeight(Contents:formulation), product of:
      5.9687076 = idf(docFreq=0)
      0.06848182 = queryNorm
    0.74608845 = fieldWeight(Contents:formulation in 21), product of:
      1.0 = tf(termFreq(Contents:formulation)=0)
      5.9687076 = idf(docFreq=0)
      0.125 = fieldNorm(field=Contents, doc=21)
  ......
------ End of Explanation --------
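The figures in this clause can be rechecked by hand. The sketch below, assuming only the product structure shown in the printout, recomputes the clause weight from the printed factors and shows what the same clause would contribute if tf were 0 for a termFreq of 0. The class and method names (ExplanationCheck, clauseWeight) are hypothetical and no Lucene classes are used.

```java
// Plain-Java arithmetic check of the "formulation" clause in the printout.
final class ExplanationCheck {

    // weight = queryWeight * fieldWeight, with
    // queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm.
    static double clauseWeight(double tf, double idf, double queryNorm, double fieldNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double idf = 5.9687076;
        double queryNorm = 0.06848182;
        double fieldNorm = 0.125;
        // With tf = 1.0, as printed, the clause adds to the document's score.
        System.out.println(clauseWeight(1.0, idf, queryNorm, fieldNorm));
        // With tf = 0.0 for a term that does not occur, the clause adds nothing.
        System.out.println(clauseWeight(0.0, idf, queryNorm, fieldNorm));
    }
}
```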
Thanks.
--
Eugene
Re: sumOfSquaredWeights for lengthNorm
Posted by Chris Hostetter <ho...@fucit.org>.
: I would like to override the Similarity class's lengthNorm(String
: fieldName, int numTerms) so that it behaves similarly to queryNorm(float
: sumOfSquaredWeights). The method signature would become lengthNorm(String
: fieldName, float sumOfSquaredWeights), where sumOfSquaredWeights is the sum of
: the squares of the doc term weights.
:
: Looking at the way sumOfSquaredWeights is used in the
: org.apache.lucene.search.Query weight method, I would like to have a
: weight method in org.apache.lucene.document.Field (or maybe in
: org.apache.lucene.document.Document) that returns the weight based on
: the terms in the Field. Can anyone tell me how to start?
Can you explain more what you mean by "doc term weights"?
It seems like what you are interested in doing is changing the way the norm
value of a doc/field is determined so that it's determined not just by the
number of terms in the field, but also by the "weight" of some terms --
I'm not sure if you mean the terms being queried on, or the terms stored
in the field for the document.
Two concepts that already exist (and may be useful to you) are:
1) the "boosts" associated with Fields and Documents at indexing time,
which are combined with the lengthNorm at index time to determine a single
"norm" value for the doc/field pair.
2) the idf of the terms being queried on, which is multiplied by the field
norm as part of the query time scoring (you can see it in the
fieldWeight in a score Explanation)
-Hoss