Posted to solr-user@lucene.apache.org by Andy Newby <an...@gmail.com> on 2011/03/13 21:38:41 UTC

Results driving me nuts!

Hi,

Ok, I'm really really trying to get my head around this, but I just can't :/

Here are 2 example records, both using the query "st patricks" to
search on (matches for the keywords are in **stars** like so, to make
a point of what SHOULD be matching);

keywords: animations mini alphabets **st** **patricks** animated 1
clover  animations mini alphabets **st** **patricks**
description: animated 1 clover

"124966":"
209.23984 = (MATCH) product of:
  418.47968 = (MATCH) sum of:
    418.47968 = (MATCH) sum of:
      212.91336 = (MATCH) weight(keywords:st in 5697), product of:
        0.41379675 = queryWeight(keywords:st), product of:
          7.5798326 = idf(docFreq=233, maxDocs=168578)
          0.05459181 = queryNorm
        514.5361 = (MATCH) fieldWeight(keywords:st in 5697), product of:
          1.4142135 = tf(termFreq(keywords:st)=2)
          7.5798326 = idf(docFreq=233, maxDocs=168578)
          48.0 = fieldNorm(field=keywords, doc=5697)
      205.56633 = (MATCH) weight(keywords:patricks in 5697), product of:
        0.4065946 = queryWeight(keywords:patricks), product of:
          7.447905 = idf(docFreq=266, maxDocs=168578)
          0.05459181 = queryNorm
        505.58057 = (MATCH) fieldWeight(keywords:patricks in 5697), product of:
          1.4142135 = tf(termFreq(keywords:patricks)=2)
          7.447905 = idf(docFreq=266, maxDocs=168578)
          48.0 = fieldNorm(field=keywords, doc=5697)
  0.5 = coord(1/2)

The other one:

desc: a black and white mug of beer with a three leaf clover in it
keywords: saint **patricks** day green irish beer   spel132_bw clip
art holidays **st** **patricks** day
handle drink celebrate clip art holidays **st** **patricks** day

5 matches

"145351":"
193.61652 = (MATCH) product of:
  387.23303 = (MATCH) sum of:
    387.23303 = (MATCH) sum of:
      177.4278 = (MATCH) weight(keywords:st in 25380), product of:
        0.41379675 = queryWeight(keywords:st), product of:
          7.5798326 = idf(docFreq=233, maxDocs=168578)
          0.05459181 = queryNorm
        428.78006 = (MATCH) fieldWeight(keywords:st in 25380), product of:
          1.4142135 = tf(termFreq(keywords:st)=2)
          7.5798326 = idf(docFreq=233, maxDocs=168578)
          40.0 = fieldNorm(field=keywords, doc=25380)
      209.80525 = (MATCH) weight(keywords:patricks in 25380), product of:
        0.4065946 = queryWeight(keywords:patricks), product of:
          7.447905 = idf(docFreq=266, maxDocs=168578)
          0.05459181 = queryNorm
        516.006 = (MATCH) fieldWeight(keywords:patricks in 25380), product of:
          1.7320508 = tf(termFreq(keywords:patricks)=3)
          7.447905 = idf(docFreq=266, maxDocs=168578)
          40.0 = fieldNorm(field=keywords, doc=25380)
  0.5 = coord(1/2)


Now the thing that's getting me is that the record which has 5 occurrences
of "st patricks" gets such a different score!

209.23984
193.61652

(these should be the other way around)

Can anyone try and explain what's going on with this?

BTW, the queries are matched based on a normal "white space" index,
nothing special.

The actual query being used, is as follows:

(keywords:"st" AND keywords:"patricks") OR (description:"st" AND
description:"patricks")

TIA - I'm hoping someone can save my sanity ;)

Cheers
-- 
Andy Newby
andy@ultranerds.com

RE: Results driving me nuts!

Posted by cb...@job.com.
> -----Original Message-----
> From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
> Sent: Sunday, March 13, 2011 6:25 PM
> To: solr-user@lucene.apache.org; andy.newby@gmail.com
> Subject: Re: Results driving me nuts!
> 
> 
> --- On Sun, 3/13/11, Andy Newby <an...@gmail.com> wrote:
> 
> > From: Andy Newby <an...@gmail.com>
> > Subject: Results driving me nuts!
> > To: solr-user@lucene.apache.org
> > Date: Sunday, March 13, 2011, 10:38 PM
> 
> Their fieldNorm values are different. Norm consists of index time boost
> and length normalization.
> 
> http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Similarity.html#formula_norm
> 
> I can see that the one with 5 matches is longer than the other. Shorter
> documents are favored in solr/lucene with length normalization factor.
> 
> 
> 

Also, the term frequency for patricks is different in each document:

for the 1st doc termFreq(keywords:patricks)=2 and for the 2nd doc
termFreq(keywords:patricks)=3.
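
To make the arithmetic in the explain output concrete, here is a minimal
Python sketch that recomputes both scores from the numbers in the original
post. It assumes the classic Lucene formula visible in that output
(tf = sqrt(termFreq), queryWeight = idf * queryNorm, fieldWeight =
tf * idf * fieldNorm). The function and variable names are illustrative and
not part of any Solr API.

```python
import math

# Values copied from the explain output in the original post.
QUERY_NORM = 0.05459181
IDF_ST = 7.5798326
IDF_PATRICKS = 7.447905
COORD = 0.5  # coord(1/2): only one of the two OR'ed clauses matched


def term_weight(term_freq, idf, field_norm):
    """weight = queryWeight * fieldWeight, as shown in the explain output."""
    query_weight = idf * QUERY_NORM
    field_weight = math.sqrt(term_freq) * idf * field_norm
    return query_weight * field_weight


def doc_score(tf_st, tf_patricks, field_norm):
    """Sum the per-term weights, then apply the coord factor."""
    total = (term_weight(tf_st, IDF_ST, field_norm)
             + term_weight(tf_patricks, IDF_PATRICKS, field_norm))
    return COORD * total


score_124966 = doc_score(tf_st=2, tf_patricks=2, field_norm=48.0)
score_145351 = doc_score(tf_st=2, tf_patricks=3, field_norm=40.0)
print(score_124966, score_145351)
```

The extra occurrence of patricks raises tf only from sqrt(2) to sqrt(3),
which is not enough to offset the drop in fieldNorm from 48.0 to 40.0.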





Re: Results driving me nuts!

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 14 March 2011 17:27:05 Jonathan Rochkind wrote:
> Aha.  Yeah, I've read the documentation several times, but still find
> myself confused.
> 
> But do I understand this right now:
> 
> If I do omitNorms="true", but still leave "term freq and positions" in
> default case (ie, NOT omitTermFreqAndPositions="true") ... then a
> document with more occurrences of a search term will still score higher,
> but it'll just be a factor of the raw number of times it occurs, and not
> the percentage of the total field it covers -- that is, N occurrences in
> a short field value will be scored exactly the same as N occurrences in a
> different document with a longer field value.

Yes, if you omitNorms but still use TF (which you do) then (without
considering other score-influencing parameters) documents with the same number
of occurrences will have the same score.

In debugQuery you'll always see tf=1 if you use omitTermFreqAndPositions. If
you use omitNorms you'll always see a norm of 1.
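
As a rough illustration of the above, the sketch below rescores the two
documents from the original post with the norm fixed at 1, which is what
omitNorms would show in debugQuery. It is only a sketch: it assumes the idf
and queryNorm values from the posted explain output stay the same, and the
helper name score is made up for this example.

```python
import math

# idf and queryNorm values taken from the explain output in the original post.
QUERY_NORM = 0.05459181
IDF = {"st": 7.5798326, "patricks": 7.447905}


def score(term_freqs, field_norm, coord=0.5):
    """Classic tf-idf score for one field: coord * sum(queryWeight * fieldWeight)."""
    total = 0.0
    for term, freq in term_freqs.items():
        query_weight = IDF[term] * QUERY_NORM
        field_weight = math.sqrt(freq) * IDF[term] * field_norm
        total += query_weight * field_weight
    return coord * total


doc_124966 = {"st": 2, "patricks": 2}
doc_145351 = {"st": 2, "patricks": 3}

# With norms as indexed (fieldNorm 48.0 and 40.0 in the explain output).
with_norms = (score(doc_124966, 48.0), score(doc_145351, 40.0))
# With omitNorms="true" the norm is always 1, so only tf and idf differ.
without_norms = (score(doc_124966, 1.0), score(doc_145351, 1.0))

print("with norms:   ", with_norms)
print("without norms:", without_norms)
```

Under these assumptions the document with three occurrences of patricks
comes out ahead once the norm no longer contributes.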

> 
> Phew, this stuff is hard for me to talk about clearly. If that made any
> sense, do I have it right?  If so, that's exactly what I want to try
> out, excellent.
> 
> On 3/14/2011 10:48 AM, Markus Jelsma wrote:
> > You can use omitNorms="true" for any given field. Length normalization
> > will be disabled and index-time boosting will not be available any more.
> > 
> > TermFrequencies can also be disabled by setting
> > omitTermFreqAndPositions="true" for any given field. Omitting TF can be
> > very useful if you need an easy way to prevent spam documents from
> > hijacking the score (if you sort on score of course).
> > 
> > http://wiki.apache.org/solr/SchemaXml
> > 
> > On Monday 14 March 2011 15:39:47 Jonathan Rochkind wrote:
> >> On 3/13/2011 6:24 PM, Ahmet Arslan wrote:
> >>> http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Similarity.html#formula_norm
> >>> 
> >>> I can see that the one with 5 matches is longer than the other. Shorter
> >>> documents are favored in solr/lucene with length normalization factor.
> >> 
> >> Is there any easy way to turn this off for a given field?  That is, I
> >> think, to still have the iDF be used, but not the TF. Maybe that's it.
> >> but anyway, to turn off document length normalization, but only for a
> >> certain field?
> >> 
> >> I'm not sure if that's what useNorms does, or if useNorms does _more_
> >> than this, including some things I wouldn't want, or if there is some
> >> other parameter that would do this instead?
> >> 
> >> Thanks for any advice,
> >> 
> >> Jonathan

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Results driving me nuts!

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Aha.  Yeah, I've read the documentation several times, but still find
myself confused.

But do I understand this right now:

If I do omitNorms="true", but still leave "term freq and positions" in
default case (ie, NOT omitTermFreqAndPositions="true") ... then a
document with more occurrences of a search term will still score higher,
but it'll just be a factor of the raw number of times it occurs, and not
the percentage of the total field it covers -- that is, N occurrences in
a short field value will be scored exactly the same as N occurrences in a
different document with a longer field value.

Phew, this stuff is hard for me to talk about clearly. If that made any 
sense, do I have it right?  If so, that's exactly what I want to try 
out, excellent.

On 3/14/2011 10:48 AM, Markus Jelsma wrote:
> You can use omitNorms="true" for any given field. Length normalization will be
> disabled and index-time boosting will not be available any more.
>
> TermFrequencies can also be disabled by setting
> omitTermFreqAndPositions="true" for any given field. Omitting TF can be very
> useful if you need an easy way to prevent spam documents from hijacking the
> score (if you sort on score of course).
>
> http://wiki.apache.org/solr/SchemaXml
>
> On Monday 14 March 2011 15:39:47 Jonathan Rochkind wrote:
>> On 3/13/2011 6:24 PM, Ahmet Arslan wrote:
>>> http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Similarity.html#formula_norm
>>>
>>> I can see that the one with 5 matches is longer than the other. Shorter
>>> documents are favored in solr/lucene with length normalization factor.
>> Is there any easy way to turn this off for a given field?  That is, I
>> think, to still have the iDF be used, but not the TF. Maybe that's it.
>> but anyway, to turn off document length normalization, but only for a
>> certain field?
>>
>> I'm not sure if that's what useNorms does, or if useNorms does _more_
>> than this, including some things I wouldn't want, or if there is some
>> other parameter that would do this instead?
>>
>> Thanks for any advice,
>>
>> Jonathan

Re: Results driving me nuts!

Posted by Markus Jelsma <ma...@openindex.io>.
You can use omitNorms="true" for any given field. Length normalization will be 
disabled and index-time boosting will not be available any more.

TermFrequencies can also be disabled by setting 
omitTermFreqAndPositions="true" for any given field. Omitting TF can be very 
useful if you need an easy way to prevent spam documents from hijacking the 
score (if you sort on score of course).

http://wiki.apache.org/solr/SchemaXml
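
A small sketch of why omitting TF blunts keyword stuffing, assuming the
classic tf = sqrt(termFreq) factor seen in the explain output earlier in
this thread. The function tf_factor and its omit_term_freq flag are
hypothetical names used only for illustration.

```python
import math


def tf_factor(term_freq, omit_term_freq=False):
    """tf component of the score for a term that occurs term_freq times."""
    if term_freq <= 0:
        return 0.0  # the term does not match at all
    if omit_term_freq:
        return 1.0  # every matching document gets the same tf
    return math.sqrt(term_freq)


normal_doc = 2    # term appears twice
stuffed_doc = 50  # term repeated 50 times to game the ranking

print(tf_factor(normal_doc), tf_factor(stuffed_doc))
print(tf_factor(normal_doc, True), tf_factor(stuffed_doc, True))
```

Note that this setting also drops positions, so queries that depend on
positions (phrase queries, for example) would not work as expected on such
a field.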

On Monday 14 March 2011 15:39:47 Jonathan Rochkind wrote:
> On 3/13/2011 6:24 PM, Ahmet Arslan wrote:
> > http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Similarity.html#formula_norm
> > 
> > I can see that the one with 5 matches is longer than the other. Shorter
> > documents are favored in solr/lucene with length normalization factor.
> 
> Is there any easy way to turn this off for a given field?  That is, I
> think, to still have the iDF be used, but not the TF. Maybe that's it.
> but anyway, to turn off document length normalization, but only for a
> certain field?
> 
> I'm not sure if that's what useNorms does, or if useNorms does _more_
> than this, including some things I wouldn't want, or if there is some
> other parameter that would do this instead?
> 
> Thanks for any advice,
> 
> Jonathan

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Results driving me nuts!

Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 3/13/2011 6:24 PM, Ahmet Arslan wrote:
>
> http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Similarity.html#formula_norm
>
> I can see that the one with 5 matches is longer than the other. Shorter documents are favored in solr/lucene with length normalization factor.

Is there any easy way to turn this off for a given field?  That is, I 
think, to still have the iDF be used, but not the TF. Maybe that's it.  
but anyway, to turn off document length normalization, but only for a 
certain field?

I'm not sure if that's what useNorms does, or if useNorms does _more_ 
than this, including some things I wouldn't want, or if there is some 
other parameter that would do this instead?

Thanks for any advice,

Jonathan

Re: Results driving me nuts!

Posted by Ahmet Arslan <io...@yahoo.com>.
--- On Sun, 3/13/11, Andy Newby <an...@gmail.com> wrote:

> From: Andy Newby <an...@gmail.com>
> Subject: Results driving me nuts!
> To: solr-user@lucene.apache.org
> Date: Sunday, March 13, 2011, 10:38 PM

Their fieldNorm values are different (48.0 versus 40.0). The norm combines the index-time boost and length normalization.

http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/Similarity.html#formula_norm

I can see that the one with 5 matches is longer than the other. With the length normalization factor, Solr/Lucene favors shorter documents.
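
To illustrate the length normalization part, here is a minimal Python sketch
assuming the default 1/sqrt(numTerms) length norm and plain whitespace
tokenization of the keywords text as it appears in the original post. The
results will not equal the 48.0 and 40.0 in the explain output, because the
stored fieldNorm also includes the index-time boost (unknown here) and is
kept at reduced precision; only the direction of the difference is the point.

```python
import math


def length_norm(num_terms):
    """Default length normalization: shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_terms)


# keywords values as shown in the original post (assumed to be the full field).
keywords_124966 = ("animations mini alphabets st patricks animated 1 "
                   "clover animations mini alphabets st patricks")
keywords_145351 = ("saint patricks day green irish beer spel132_bw clip "
                   "art holidays st patricks day "
                   "handle drink celebrate clip art holidays st patricks day")

terms_short = len(keywords_124966.split())  # whitespace tokenization
terms_long = len(keywords_145351.split())

norm_short = length_norm(terms_short)
norm_long = length_norm(terms_long)
print(terms_short, norm_short)
print(terms_long, norm_long)
```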