Posted to solr-user@lucene.apache.org by "Hodder, Rick" <RH...@navg.com> on 2018/07/20 15:41:19 UTC

SOLR 7.1 ClassicSimilarityFactory Problem

I am using SOLR 7.1 with ClassicSimilarityFactory.
I have data in my core with a field called CompanyName, which is copied into an indexed field, IDX_CompanyName:

<field name="IDX_CompanyName" type="text_general" indexed="true" stored="false" multiValued="true" />
<field name="CompanyName" type="string" indexed="true" stored="true"/>
<copyField source="CompanyName" dest="IDX_CompanyName"/>

Here are a few of the 900,000 rows in the core

Cityview
Citadel
CivicVentures
Clutch City Sports
Clutch City Sports & Entertainment
Clutch City Sports & Entertainment
Clutch City Sports & Entertainment


If I search for IDX_CompanyName:(clutch AND city) with fl=*,score and a maxrows of 750 (and likewise at 1500), I get the following results:

CompanyName      Score
Cityview         5.874983
Citadel          5.3502507
CivicVentures    4.7278214
<other rows, but no clutch city>

If I search for IDX_CompanyName:(clutch AND city) with a maxrows of 5000, I get the following results:

CompanyName                           Score
Cityview                              5.874983
Citadel                               5.3502507
CivicVentures                         4.7278214
Clutch City Sports & Entertainment    3.6542892
Clutch City Sports & Entertainment    3.6542892
Clutch City Sports & Entertainment    3.6542892

I've tried looking at the debug query output to figure out what it's doing, and I'm confused by what it is saying.

The debug info for Cityview is

<str name="366640">
5.874983 = sum of:
  1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc IDX_CompanyName:clutch) in 16639) [ClassicSimilarity], result of:
    1.9583277 = fieldWeight in 16639, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        166407.0 = docFreq
        433880.0 = docCount
      1.0 = fieldNorm(doc=16639)
  3.9166553 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci IDX_CompanyName:cit IDX_CompanyName:city) in 16639) [ClassicSimilarity], result of:
    3.9166553 = fieldWeight in 16639, product of:
      2.0 = tf(freq=4.0), with freq of:
        4.0 = termFreq=4.0
      1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        166407.0 = docFreq
        433880.0 = docCount
      1.0 = fieldNorm(doc=16639)
</str>

The debug info for Clutch City Sports & Entertainment is

<str name="409550">
3.6542892 = sum of:
  1.9583277 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:cl IDX_CompanyName:clu IDX_CompanyName:clut IDX_CompanyName:clutc IDX_CompanyName:clutch) in 9549) [ClassicSimilarity], result of:
    1.9583277 = fieldWeight in 9549, product of:
      2.828427 = tf(freq=8.0), with freq of:
        8.0 = termFreq=8.0
      1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        166407.0 = docFreq
        433880.0 = docCount
      0.35355338 = fieldNorm(doc=9549)
  1.6959615 = weight(Synonym(IDX_CompanyName:c IDX_CompanyName:ci IDX_CompanyName:cit IDX_CompanyName:city) in 9549) [ClassicSimilarity], result of:
    1.6959615 = fieldWeight in 9549, product of:
      2.4494898 = tf(freq=6.0), with freq of:
        6.0 = termFreq=6.0
      1.9583277 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        166407.0 = docFreq
        433880.0 = docCount
      0.35355338 = fieldNorm(doc=9549)
</str>

Why would something with 2 hits score lower? Why does the max rows influence this?

How might I fix this?

This didn't use to happen in SOLR 4.10 (I know it's an older version, but...)


Thanks,

Rick Hodder
Information Technology
Navigators Management Company, Inc.
83 Wooster Heights Road, 2nd Floor
Danbury, CT  06810
(475) 329-6251



Re: SOLR 7.1 ClassicSimilarityFactory Problem

Posted by Erick Erickson <er...@gmail.com>.
Why do you think you need to "fix" anything here?

FieldNorm here is significantly different. On a quick scan (and you're
right, trying to understand it all at a glance is daunting) your
fieldNorm is lowering the score of the second doc. Basically the
"two hits" are in a longer field so their weight is less. Which is
part of the basic function of scoring.
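
To make the arithmetic concrete, here is a sketch that plugs the numbers from the two explain outputs into the formula they print (tf x idf x fieldNorm, summed over the two query clauses). It assumes tf is the square root of freq, which matches every tf/freq pair shown, and that "log" is the natural logarithm, which reproduces the printed idf. Values are rounded.

```latex
\begin{align*}
\text{tf} &= \sqrt{\text{freq}}, \qquad
\text{idf} = \ln\!\left(\frac{433880 + 1}{166407 + 1}\right) + 1 \approx 1.9583 \\[4pt]
\text{Cityview:}\quad
  & \sqrt{1}\cdot 1.9583 \cdot 1.0 \;+\; \sqrt{4}\cdot 1.9583 \cdot 1.0
    \approx 1.9583 + 3.9167 = 5.8750 \\[4pt]
\text{Clutch City Sports \& Entertainment:}\quad
  & \sqrt{8}\cdot 1.9583 \cdot 0.35355 \;+\; \sqrt{6}\cdot 1.9583 \cdot 0.35355
    \approx 1.9583 + 1.6960 = 3.6543
\end{align*}
```

The fieldNorm of 0.35355338 is numerically 1/sqrt(8), so it cancels the tf of sqrt(8) on the first clause entirely. If the norm were 1.0 for that document (which is what I would expect with norms omitted, though that is an assumption here and not something the output shows), the same sum would come to roughly (2.8284 + 2.4495) x 1.9583, or about 10.34, which would rank above Cityview.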

Plus it looks like you've n-grammed the field, which is further
confusing the issue.

I don't see what rows is changing, please point it out. You're getting
the exact same score for the reported documents, it's just that
as you add more rows you get information for more docs as far as
I can tell.

You can try omitting norms and/or creating a non-ngrammed field.
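
A minimal schema sketch of those two options, under some assumptions: omitNorms is a standard Solr field property, but the text_plain type and the IDX_CompanyName_plain field below are hypothetical names made up for illustration, and the analyzer chain is a guess at a reasonable non-n-grammed setup, not your actual text_general definition. Either change needs a full reindex before it affects scores.

```xml
<!-- Option 1: keep the existing field but turn off length normalization,
     so a longer company name is no longer penalized through fieldNorm. -->
<field name="IDX_CompanyName" type="text_general" indexed="true"
       stored="false" multiValued="true" omitNorms="true"/>

<!-- Option 2 (hypothetical names): a companion field with no n-gram
     filter, so only whole words such as "clutch" and "city" are indexed. -->
<fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="IDX_CompanyName_plain" type="text_plain" indexed="true"
       stored="false" multiValued="true"/>
<copyField source="CompanyName" dest="IDX_CompanyName_plain"/>
```

With the second option, the query would target IDX_CompanyName_plain instead of (or in addition to) the n-grammed field.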

As for why it's different from 4.x, no clue. Perhaps the Lucene
folks can weigh in.

Best,
Erick
