Posted to solr-user@lucene.apache.org by Aaron Daubman <da...@gmail.com> on 2012/07/19 06:10:11 UTC

Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Greetings,

I've been digging in to this for two days now and have come up short -
hopefully there is some simple answer I am just not seeing:

I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
identically as possible (given deprecations) and indexing the same document.

For most queries the results are very close (scores agree to within three
significant digits, almost identical positions in results).

However, for certain documents, the scores are very different (causing
these docs to be ranked +/- 25 positions different or more in the results)

In looking at debugQuery output, it seems like this is due to fieldNorm
values being lower for the 3.6.0 instance than the 1.4.1.

(note that for most docs, the fieldNorms are identical)

I have taken the field values for the example below and run them
through /admin/analysis.jsp on each solr instance. Even for the problematic
docs/fields, the results are almost identical. For the example below, the
t_tag values for the problematic doc:
1.4.1: 162 values
3.6.0: 164 values

Note that 1/sqrt(162) = 0.07857 ~= fieldNorm for 1.4.1;
however, (1/0.0625)^2 = 256, which is nowhere near 164.

Here is a particular example from 1.4.1:
1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.3750753 = idf(docFreq=27619, maxDocs=2194294)
   0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:
1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.388126 = idf(docFreq=27740, maxDocs=2232857)
   0.0625 = fieldNorm(field=t_tag, doc=1977957)
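
The two explain trees above can be recomputed by hand. This is a minimal sketch, assuming the classic DefaultSimilarity formulas tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the printed values; the helper name field_weight is made up for illustration.

```python
import math

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    """Recompute fieldWeight = tf * idf * fieldNorm from the explain inputs."""
    tf = math.sqrt(term_freq)                        # assumed tf formula
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))  # assumed idf formula
    return tf * idf * field_norm

old = field_weight(15, 27619, 2194294, 0.078125)  # numbers from the 1.4.1 explain
new = field_weight(15, 27740, 2232857, 0.0625)    # numbers from the 3.6.0 explain
print(f"1.4.1: {old:.4f}  3.6.0: {new:.4f}  ratio: {new / old:.3f}")
```

tf is identical in both trees and idf differs by roughly a quarter of a percent, so almost all of the roughly 20% drop in fieldWeight comes from fieldNorm going from 0.078125 to 0.0625.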


Here is the 1.4.1 config for the t_tag field and text type:
    <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100">
          <analyzer>
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.StandardFilterFactory"/>
              <filter class="solr.ISOLatin1AccentFilterFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
              <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
          </analyzer>
      </fieldtype>
<dynamicField name="t_*" type="text" indexed="true" stored="true"
required="false" multiValued="true" termVectors="true"/>


And 3.6.0 schema config for the t_tag field and text type:
        <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory"
words="stopwords.txt" ignoreCase="true"/>
                <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
                <filter class="solr.PorterStemFilterFactory"/>
            </analyzer>
        </fieldtype>
        <field name="t_tag" type="text" indexed="true" stored="true"
required="false" multiValued="true"/>

At first I got distracted by this change between versions:
LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This
means that terms with a position increment gap of zero do not affect the
norms calculation by default.
However, this doesn't appear to be causing the issue because, according to
analysis.jsp, there is no overlap for t_tag...
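
For reference, a small sketch of what discounting overlaps means for the term count that feeds the norm. It assumes an "overlap" is any token with a position increment of zero; the function name and the token stream are hypothetical.

```python
def field_length(position_increments, discount_overlaps):
    """Count the terms used for the length norm from per-token position increments."""
    total = len(position_increments)
    overlaps = sum(1 for inc in position_increments if inc == 0)
    return total - overlaps if discount_overlaps else total

# Hypothetical token stream: the third token is stacked on the second,
# as a synonym or word-delimiter style filter might produce.
increments = [1, 1, 0, 1]
print(field_length(increments, discount_overlaps=False))  # counts every token
print(field_length(increments, discount_overlaps=True))   # skips the stacked token
```

When no token has a zero increment, as analysis.jsp reports for t_tag, both settings give the same count, which is consistent with this change not being the cause.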

Can you point me to where these fieldNorm differences are coming from and
why they'd only be happening for a select few documents for which the content
doesn't stand out?

Thank you,
     Aaron

Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Posted by Aaron Daubman <da...@gmail.com>.

Robert,

> So this is lossy: basically you can think of there being only 256
> possible values. So when you increased the number of terms only
> slightly by changing your analysis, this happened to bump you over the
> edge rounding you up to the next value.
>
> more information:
> http://lucene.apache.org/core/3_6_0/scoring.html
>
> http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html



Thanks - this was extremely helpful! I had read both sources before but
didn't grasp the magnitude of the lossiness until your pointer and mention of
the edge case.
Just to help out anybody else who might run into this, I hacked together a
little harness to demonstrate:
---
fieldLength: 160, computeNorm: 0.07905694, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 161, computeNorm: 0.07881104, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 162, computeNorm: 0.07856742, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 163, computeNorm: 0.07832605, floatToByte315: 109,
byte315ToFloat: 0.078125
fieldLength: 164, computeNorm: 0.07808688, floatToByte315: 108,
byte315ToFloat: 0.0625
fieldLength: 165, computeNorm: 0.077849895, floatToByte315: 108,
byte315ToFloat: 0.0625
fieldLength: 166, computeNorm: 0.07761505, floatToByte315: 108,
byte315ToFloat: 0.0625
---
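
The harness itself was not posted. A rough equivalent might look like the sketch below, which re-implements the one-byte packing in plain Python rather than calling Lucene. The bit layout is an assumption based on the SmallFloat scheme and is only checked here against the byte values listed above; the function names mirror the labels in that output.

```python
import math
import struct

def float_to_byte315(f):
    """Pack a float into one byte by truncation (illustrative re-implementation)."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]  # raw float32 bits
    small = bits >> 21            # exponent plus the top two mantissa bits
    zero = (63 - 15) << 3
    if small <= zero:
        return 0 if bits <= 0 else 1  # underflow: zero or smallest positive code
    if small >= zero + 0x100:
        return 255                    # overflow: largest code (unsigned here)
    return small - zero

def byte315_to_float(b):
    """Expand the byte back into the float that scoring will see."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack("<f", struct.pack("<i", bits))[0]

def compute_norm(field_length):
    """Length norm without boosts: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(field_length)

for n in range(160, 167):
    norm = compute_norm(n)
    b = float_to_byte315(norm)
    print(f"fieldLength: {n}, computeNorm: {norm:.8f}, "
          f"floatToByte315: {b}, byte315ToFloat: {byte315_to_float(b)}")
```

Because the packing truncates instead of rounding to nearest, 1/sqrt(163) still lands on 0.078125 while 1/sqrt(164), only slightly smaller, drops a full step to 0.0625.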

So my takeaway is that these significantly different scores are caused by:
1) a field whose length sits right on an encoding boundary, so the two
analyzer chains (162 vs. 164 terms) land on opposite sides of it
2) the fact that we might be matching 50+ query values against a field with
150+ values, so the otherwise typically insignificant change in fieldNorm
is applied to every matching term and repeatedly impacts the overall score

Thanks again,
     Aaron

Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman <da...@gmail.com> wrote:

> Apologies if I didn't clearly state my goal/concern: I am not looking for
> the exact same scoring - I am looking to explain scoring differences.
>  Deprecated components will eventually go away, time moves on, etc...
> etc... I would like to be able to run current code, and should be able to -
> the part that is sticking is being able to *explain* the difference in
> results.
>

OK: I totally missed that, sorry!

To explain why you see such a large difference:

The difference is that these length normalizations are computed at
index time and fit inside a *single byte* by default. This is to keep
RAM usage low for many documents and many fields with norms (since it's
#fieldsWithNorms * #documents in bytes in RAM).
So this is lossy: basically you can think of there being only 256
possible values. So when you increased the number of terms only
slightly by changing your analysis, this happened to bump you over the
edge rounding you up to the next value.
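
To make the coarseness concrete, here is a sketch that lists which field lengths share a stored norm around the values seen in this thread. It assumes the default encoding keeps four representable values per power of two and truncates downward; the list of norms is derived from that assumption, not read from Lucene.

```python
import math

# Neighbouring stored norms, one byte step apart (assumed grid).
norms = [0.125, 0.109375, 0.09375, 0.078125, 0.0625]

def max_length(norm):
    """Largest term count n for which 1/sqrt(n) is still >= norm."""
    return math.floor(1.0 / (norm * norm))

def length_ranges(values):
    """Yield (norm, shortest, longest) for every value after the first."""
    for higher, norm in zip(values, values[1:]):
        yield norm, max_length(higher) + 1, max_length(norm)

for norm, shortest, longest in length_ranges(norms):
    print(f"fieldNorm {norm}: {shortest} to {longest} terms")
```

Under these assumptions every field of 114 to 163 terms stores 0.078125 and every field of 164 to 256 terms stores 0.0625. A two-term change therefore leaves most documents untouched and only moves those sitting next to a boundary, such as 162 versus 164.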

more information:
http://lucene.apache.org/core/3_6_0/scoring.html
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

By the way, if you don't like this:
1. If you can still live with a single byte, maybe plug your own
Similarity class into 3.6, overriding decodeNormValue/encodeNormValue.
For example, you could use a different SmallFloat configuration that
has less range but more precision for your use case (if your docs are
all short or whatever).
2. Otherwise, if you feel you need more than a single byte, check out
4.0-ALPHA: you aren't limited to a single byte there.
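
As a rough illustration of option 1, the sketch below compares the default packing with a hypothetical alternative that trades range for precision. It is a plain-Python re-implementation of a generalised one-byte float, not Lucene code; the parameter pairs (3, 15) and (5, 2) and all function names are assumptions chosen for the example.

```python
import math
import struct

def float_to_byte(f, mantissa_bits, zero_exp):
    """Generalised one-byte float packing (illustrative re-implementation)."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]  # raw float32 bits
    small = bits >> (24 - mantissa_bits)
    zero = (63 - zero_exp) << mantissa_bits
    if small <= zero:
        return 0 if bits <= 0 else 1  # underflow: zero or smallest positive code
    if small >= zero + 0x100:
        return 255                    # overflow: largest code
    return small - zero

def byte_to_float(b, mantissa_bits, zero_exp):
    """Inverse of float_to_byte for the same parameters."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - mantissa_bits)) + ((63 - zero_exp) << 24)
    return struct.unpack("<f", struct.pack("<i", bits))[0]

def stored_norm(field_length, mantissa_bits, zero_exp):
    """Norm as it would be seen at search time after the byte round trip."""
    norm = 1.0 / math.sqrt(field_length)
    code = float_to_byte(norm, mantissa_bits, zero_exp)
    return byte_to_float(code, mantissa_bits, zero_exp)

for n in (162, 164, 2000):
    print(n, stored_norm(n, 3, 15), stored_norm(n, 5, 2))
```

With the finer setting, 162 and 164 terms still fall into different buckets, but the step between them is about 5% instead of 20%. The cost is range: in this configuration every field longer than roughly 800 terms collapses onto the same smallest positive value, so it only suits short fields.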

-- 
lucidimagination.com

Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Posted by Aaron Daubman <da...@gmail.com>.

Robert,

> > I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
> > identically as possible (given deprecations) and indexing the same document.
>
> Why did you do this? If you want the exact same scoring, use the exact
> same analysis.
> This means specifying luceneMatchVersion = 2.9, and the exact same
> analysis components (even if deprecated).
>
> > I have taken the field values for the example below and run them
> > through /admin/analysis.jsp on each solr instance. Even for the problematic
> > docs/fields, the results are almost identical. For the example below, the
> > t_tag values for the problematic doc:
> > 1.4.1: 162 values
> > 3.6.0: 164 values
> >
>
> This is why: you changed your analysis.
>

Apologies if I didn't clearly state my goal/concern: I am not looking for
the exact same scoring - I am looking to explain scoring differences.
 Deprecated components will eventually go away, time moves on, etc...
etc... I would like to be able to run current code, and should be able to -
the part that is sticking is being able to *explain* the difference in
results.

As you can see from my email, after running the different analysis on the
input, the output does not demonstrate (in any way that I can see) why the
fieldNorm values would be so different. Even with the different analysis,
the results are almost identical - which *should* result in an almost
identical fieldNorm???

Again, the desire is not to be the same, it is to understand the difference.

Thanks,
     Aaron

Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Jul 19, 2012 at 12:10 AM, Aaron Daubman <da...@gmail.com> wrote:
> Greetings,
>
> I've been digging in to this for two days now and have come up short -
> hopefully there is some simple answer I am just not seeing:
>
> I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
> identically as possible (given deprecations) and indexing the same document.

Why did you do this? If you want the exact same scoring, use the exact
same analysis.
This means specifying luceneMatchVersion = 2.9, and the exact same
analysis components (even if deprecated).

> I have taken the field values for the example below and run them
> through /admin/analysis.jsp on each solr instance. Even for the problematic
> docs/fields, the results are almost identical. For the example below, the
> t_tag values for the problematic doc:
> 1.4.1: 162 values
> 3.6.0: 164 values
>

This is why: you changed your analysis.

-- 
lucidimagination.com