Posted to solr-user@lucene.apache.org by John Nielsen <jn...@mcb.dk> on 2014/04/08 11:03:00 UTC

Strange relevance scoring

Hi,

We are seeing a strange phenomenon with our Solr setup which I have been
unable to answer.

My Google-fu is clearly not up to the task, so I am trying here.

It appears that if I do a free-text search for a single word, say "modellering",
on a text field, the score is massively boosted if the first word of the
text field is a hit.

For instance, if there is only one occurrence of the word "modellering" in
the text field and that occurrence is the first word of the text, then that
document gets a higher relevancy than if the word "modellering" occurs 5
times in the text and the first word of the text is any other word.

Is this normal behavior? Is special attention paid to the first word in a
text field? I would think that the latter case would get the highest score.


-- 
Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk

Re: Strange relevance scoring

Posted by Aman Tandon <am...@gmail.com>.

Yes David, you should use omitNorms="true"; it also improves performance.

Thanks
Aman Tandon


On Tue, Apr 8, 2014 at 5:36 PM, Ahmet Arslan <io...@yahoo.com> wrote:

> Hi David,
>
> omitNorms="true" will cause additional performance gains too.
> https://wiki.apache.org/solr/SolrPerformanceFactors#indexed_fields
>
> To globally disable length norm, one can create a custom similarity and
> register it as a default similarity though.

Re: Strange relevance scoring

Posted by Ahmet Arslan <io...@yahoo.com>.

Hi David,

omitNorms="true" will cause additional performance gains too. https://wiki.apache.org/solr/SolrPerformanceFactors#indexed_fields

To globally disable length norm, one can create a custom similarity and register it as a default similarity though. 



On Tuesday, April 8, 2014 2:59 PM, David Santamauro <da...@gmail.com> wrote:

Is there any general setting that removes this "punishment" or must
omitNorms=true be part of every field definition?




Re: Strange relevance scoring

Posted by David Santamauro <da...@gmail.com>.

Is there any general setting that removes this "punishment" or must
omitNorms=true be part of every field definition?


On 4/8/2014 7:04 AM, Ahmet Arslan wrote:
> Hi,
>
> The length norm is computed for every document at index time. I think it is 1/sqrt(number of terms in the field). Please see section 6, norm(t,d), at
>
> https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
>
> If you don't care about length normalisation, you can set omitNorms=true in field declarations. http://wiki.apache.org/solr/SchemaXml#Common_field_options

Re: Strange relevance scoring

Posted by Ahmet Arslan <io...@yahoo.com>.

Hi,

The length norm is computed for every document at index time. I think it is 1/sqrt(number of terms in the field). Please see section 6, norm(t,d), at

https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
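
To make the formula concrete, here is a minimal Python sketch of the length norm as described in that Javadoc for the default TF-IDF similarity. The function name `length_norm` and the sample field lengths are illustrative only, not part of Lucene or Solr. Rejecting empty fields is a choice made for this sketch. Lucene also stores the norm in a single byte, so the value used at search time is only an approximation of these numbers.

```python
import math


def length_norm(num_terms: int, boost: float = 1.0) -> float:
    """Length norm of a field: boost / sqrt(number of terms in the field).

    Sketch of the default TF-IDF formula; names here are hypothetical.
    """
    if num_terms < 1:
        # Sketch-only choice: refuse empty fields instead of dividing by zero.
        raise ValueError("num_terms must be at least 1")
    return boost / math.sqrt(num_terms)


if __name__ == "__main__":
    # Illustrative field lengths; the norm shrinks as the field gets longer.
    for n in (1, 4, 25, 100):
        print(f"{n:>4} terms -> norm {length_norm(n):.2f}")
```

There is no threshold at which normalisation "starts": under this formula every additional term lowers the norm, so a one-word field gets the highest possible value.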


If you don't care about length normalisation, you can set omitNorms=true in field declarations. http://wiki.apache.org/solr/SchemaXml#Common_field_options



On Tuesday, April 8, 2014 1:57 PM, John Nielsen <jn...@mcb.dk> wrote:
Hi,

I couldn't find any occurrence of SpanFirstQuery in either the schema.xml
or solrconfig.xml files.

This is the query I used with debug=results.
http://pastebin.com/bWzUkjKz

And here is the answer.
http://pastebin.com/nCXFcuky

I am not sure what I am supposed to be looking for.




Re: Strange relevance scoring

Posted by John Nielsen <jn...@mcb.dk>.

Hi,

I couldn't find any occurrence of SpanFirstQuery in either the schema.xml
or solrconfig.xml files.

This is the query I used with debug=results.
http://pastebin.com/bWzUkjKz

And here is the answer.
http://pastebin.com/nCXFcuky

I am not sure what I am supposed to be looking for.



On Tue, Apr 8, 2014 at 11:34 AM, Markus Jelsma
<ma...@openindex.io> wrote:

> Hi - the thing you describe is possible when your setup uses
> SpanFirstQuery. But to be sure what's going on, you should post the debug
> output.

RE: Strange relevance scoring

Posted by Markus Jelsma <ma...@openindex.io>.

Hi - the thing you describe is possible when your setup uses SpanFirstQuery. But to be sure what's going on, you should post the debug output.
 

Re: Strange relevance scoring

Posted by John Nielsen <jn...@mcb.dk>.

Interesting.

Most of the text fields are single word fields or close to it, but on some
of the documents, long text appears.

How long does a text need to be before hitting length normalization?


On Tue, Apr 8, 2014 at 11:36 AM, Ahmet Arslan <io...@yahoo.com> wrote:

> Hi Nielsen,
>
> There is no special attention paid to the first word. You are probably hitting
> length normalisation.
> Lucene/Solr punishes long documents and favours short ones.
> Is the document in which the word appears 5 times the longer one?

Re: Strange relevance scoring

Posted by Ahmet Arslan <io...@yahoo.com>.

Hi Nielsen,

There is no special attention paid to the first word. You are probably hitting length normalisation.
Lucene/Solr punishes long documents and favours short ones.
Is the document in which the word appears 5 times the longer one?
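
As a rough illustration of why the short document can win, here is a small Python sketch of the per-document part of the classic TF-IDF score for a single-term query: tf = sqrt(frequency) multiplied by the length norm 1/sqrt(number of terms). The idf and query norm are left out because they are the same for both documents in such a query. The function names and the 50-term document length are assumptions made for the example, not values from the actual index.

```python
import math


def field_norm(num_terms: int, omit_norms: bool = False) -> float:
    """Length norm 1/sqrt(num_terms); taken as 1.0 when norms are omitted."""
    return 1.0 if omit_norms else 1.0 / math.sqrt(num_terms)


def relative_score(freq: int, num_terms: int, omit_norms: bool = False) -> float:
    """Document-dependent part of the score: sqrt(term frequency) * norm."""
    return math.sqrt(freq) * field_norm(num_terms, omit_norms)


if __name__ == "__main__":
    # Hypothetical documents: a one-word field, and a 50-word field
    # that contains the search term five times.
    docs = {"short (1 hit in 1 term)": (1, 1),
            "long (5 hits in 50 terms)": (5, 50)}
    for omit in (False, True):
        print(f"omitNorms={str(omit).lower()}")
        for name, (freq, num_terms) in docs.items():
            score = relative_score(freq, num_terms, omit)
            print(f"  {name}: {score:.3f}")
```

With norms enabled, the one-word field scores higher even though the term occurs only once. With norms omitted, the ordering flips and the document with five occurrences ranks first.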


