Posted to solr-user@lucene.apache.org by John Nielsen <jn...@mcb.dk> on 2014/04/08 11:03:00 UTC
Strange relevance scoring
Hi,
We are seeing a strange phenomenon with our Solr setup which I have been
unable to explain.
My Google-fu is clearly not up to the task, so I am trying here.
It appears that if I do a free-text search for a single word, say "modellering",
on a text field, the score is massively boosted if the first word of the
text field is a hit.
For instance, if there is only one occurrence of the word "modellering" in
the text field and that occurrence is the first word of the text, then that
document gets a higher relevancy than if the word "modellering" occurs 5
times in the text and the first word of the text is any other word.
Is this normal behavior? Is special attention paid to the first word in a
text field? I would think that the latter case would get the highest score.
--
Med venlig hilsen / Best regards
*John Nielsen*
Programmer
*MCB A/S*
Enghaven 15
DK-7500 Holstebro
Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk
Re: Strange relevance scoring
Posted by Aman Tandon <am...@gmail.com>.
Yes, David, you must use omitNorms="true" for great performance.
Thanks
Aman Tandon
Re: Strange relevance scoring
Posted by Ahmet Arslan <io...@yahoo.com>.
Hi David,
omitNorms="true" will also give additional performance gains. https://wiki.apache.org/solr/SolrPerformanceFactors#indexed_fields
To disable the length norm globally, one can create a custom similarity and register it as the default similarity, though.
Re: Strange relevance scoring
Posted by David Santamauro <da...@gmail.com>.
Is there any general setting that removes this "punishment", or must
omitNorms=true be part of every field definition?
Re: Strange relevance scoring
Posted by Ahmet Arslan <io...@yahoo.com>.
Hi,
The length norm is computed for every document at index time. I think it is 1/sqrt(number of terms). Please see section 6, norm(t,d), at
https://lucene.apache.org/core/4_7_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
If you don't care about length normalisation, you can set omitNorms=true in field declarations. http://wiki.apache.org/solr/SchemaXml#Common_field_options
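As a rough illustration of why a one-word field can outrank a longer field with five matches, here is a minimal sketch of the two factors that differ between the documents under the classic TF-IDF similarity: tf = sqrt(freq) and the length norm 1/sqrt(number of terms). The field lengths (1 term and 100 terms) are made-up values, the function names are hypothetical, and idf and the query norm are left out because they are the same for both documents in a single-term query. Lucene also stores the norm in a lossy compact form, so real explain output will not match these numbers exactly.

```python
import math


def length_norm(num_terms: int) -> float:
    """Length norm as described above: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def tf(freq: int) -> float:
    """Term-frequency factor of the classic TF-IDF similarity: sqrt(freq)."""
    return math.sqrt(freq)


def relative_score(freq: int, num_terms: int) -> float:
    """Product of the per-document factors (idf and query norm omitted)."""
    return tf(freq) * length_norm(num_terms)


# Assumed example: a one-word field that matches once, versus a
# 100-term field in which the word occurs five times.
short_doc = relative_score(freq=1, num_terms=1)
long_doc = relative_score(freq=5, num_terms=100)

print(f"short field: {short_doc:.4f}")
print(f"long field:  {long_doc:.4f}")
```

Under these assumptions the one-word field scores higher, even though the longer field contains the word five times. There is no length threshold: the norm applies to every field, and the penalty simply grows with the number of terms.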
Re: Strange relevance scoring
Posted by John Nielsen <jn...@mcb.dk>.
Hi,
I couldn't find any occurrence of SpanFirstQuery in either the schema.xml
or solrconfig.xml files.
This is the query I used with debug=results.
http://pastebin.com/bWzUkjKz
And here is the answer.
http://pastebin.com/nCXFcuky
I am not sure what I am supposed to be looking for.
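For readers who want to reproduce this kind of debug request, a minimal sketch of building the query URL is shown below. The host, core name (collection1), and field name (text) are assumptions and not taken from the pastebin links; only the debug=results parameter and the search term come from this thread.

```python
from urllib.parse import urlencode


def build_debug_url(base_url: str, field: str, term: str) -> str:
    """Build a Solr select URL that asks for score explanations."""
    params = {
        "q": f"{field}:{term}",   # single-term query on one field
        "fl": "id,score",         # return the score next to each document id
        "debug": "results",       # include per-document score explanations
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host, core, and field names.
url = build_debug_url("http://localhost:8983/solr/collection1", "text", "modellering")
print(url)
```

In the explain section of the response, the entries to compare between two documents are typically the tf, idf, and fieldNorm lines; a large difference in fieldNorm points to length normalisation rather than term position.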
RE: Strange relevance scoring
Posted by Markus Jelsma <ma...@openindex.io>.
Hi - the thing you describe is possible when your setup uses SpanFirstQuery. But to be sure what's going on, you should post the debug output.
Re: Strange relevance scoring
Posted by John Nielsen <jn...@mcb.dk>.
Interesting.
Most of the text fields are single word fields or close to it, but on some
of the documents, long text appears.
How long does a text need to be before hitting length normalization?
Re: Strange relevance scoring
Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Nielsen,
There is no special attention paid to the first word. You are probably hitting length normalisation.
Lucene/Solr punishes long documents and favours short ones.
Is the other document (the one where the word appears 5 times) longer?