Posted to solr-user@lucene.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2015/05/14 23:14:52 UTC

Re: [MASSMAIL]Re: High fieldNorm values causing really odd results

Hi Hoss,

First of all, thank you for your reply.

Sorry for leaving the Solr version out of my previous email. I'm using Solr 4.10.3 running on CentOS 7, with the following JRE: Oracle Corporation OpenJDK 64-Bit Server VM (1.7.0_75 24.75-b04)

These are the relevant portions of my schema.xml:

        <!-- Generic text field type -->
        <fieldType name="text" class="solr.TextField" sortMissingLast="true">
            <analyzer>
                <charFilter class="solr.HTMLStripCharFilterFactory"/>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>

        <field name="title" type="text" stored="true" indexed="true" multiValued="true"/>

In this particular case I'm not using any special features, just a typical text field. I'm using the default similarity class provided by Solr, so this is a pretty straightforward setup :)

Regards,

----- Original Message -----
From: "Chris Hostetter" <ho...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thursday, May 14, 2015 4:08:36 PM
Subject: [MASSMAIL]Re: High fieldNorm values causing really odd results


:       {
:          "match":true,
:          "value":655360,
:          "description":"fieldNorm(doc=5316)"
:       }
	...
: This match is in the "title" field, which has 119669 total terms (which 
: isn't such big number) and the total document count in this index is 

that smells like a bug -- by the looks of it an overflow bug?

can you please provide some details on the version of solr you are using, 
and the specifics of your schema: what field type, what similarity 
configuration you have (if any) etc...


-Hoss
http://www.lucidworks.com/

Re: [MASSMAIL]Re: High fieldNorm values causing really odd results

Posted by Jorge Luis Betancourt González <jl...@uci.cu>.
From what I'm seeing, I've defined a boost field in my docs. This field is declared as float, which has the following fieldType:

 <fieldType name="float" class="solr.TrieFloatField" precisionStep="6"/>

Is a field named boost used by default to boost a document? I couldn't find any reference to this behaviour in the docs, only to using a boost attribute at the doc/field level.

Is this the intended behaviour?

Regards,
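
As background to the question above: in Lucene 4.x, DefaultSimilarity is understood to compute a field's index-time norm as the field boost multiplied by 1/sqrt(number of terms), so any index-time boost ends up inside fieldNorm. The Python sketch below is only that arithmetic. The function names are hypothetical, the 1000-term figure is the rough upper bound quoted in this thread rather than a measured value, and the lossy single-byte encoding that Lucene applies before storing the norm is ignored.

```python
import math


def length_norm(num_terms, boost=1.0):
    """Unquantized norm, assuming DefaultSimilarity: boost / sqrt(num_terms)."""
    return boost / math.sqrt(num_terms)


def implied_boost(norm, num_terms):
    """Boost that would be needed to reach `norm` for a field of `num_terms` terms."""
    return norm * math.sqrt(num_terms)


if __name__ == "__main__":
    # Assumption: roughly 1000 terms in the field and no boost.
    print(length_norm(1000))
    # Boost that the reported fieldNorm of 655360 would imply under this formula.
    print(implied_boost(655360, 1000))
```

Under this formula, a field with fewer than 1000 terms and no boost has a norm well below 1, so a fieldNorm of 655360 would need a combined boost in the tens of millions. The sketch does not show where such a boost would come from.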

----- Original Message -----
From: "Jorge Luis Betancourt González" <jl...@uci.cu>
To: solr-user@lucene.apache.org
Sent: Thursday, May 14, 2015 11:49:18 PM
Subject: Re: [MASSMAIL]Re: High fieldNorm values causing really odd results

Regarding the experiment, sorry if I explained myself badly: the indexed document doesn't have 119669 terms. It has far fewer (fewer than 1000 terms; I don't have the exact number at hand). Instead, 119669 is the number of distinct terms reported by Luke (the Top-terms total in the admin interface) for the title field.

This index was built from scratch using 4.10.3, if I remember correctly. Perhaps part of the data was indexed using 4.10.2, but we updated our box quite some time ago and this problem didn't appear until recently. The strangest part is that this was working fine until a week or so ago. The only odd thing I found is that the root partition on our Solr box ran out of space. Basically, we have Solr deployed in Tomcat, which is installed on the root partition, but the cores and all Solr-related data are stored on a separate partition mounted at /opt with plenty of space to grow. Could this be the cause of this behavior?

We're thinking of rebuilding our index, but would love to avoid it if possible and, more importantly, find the root cause of this issue (if that is possible at all).

As I said before, I'm very grateful for your responses,

----- Original Message -----
From: "Chris Hostetter" <ho...@fucit.org>
To: solr-user@lucene.apache.org
Sent: Thursday, May 14, 2015 7:11:08 PM
Subject: Re: [MASSMAIL]Re: High fieldNorm values causing really odd results


: Sorry for leaving the Solr version out in my previous email, I'm using 
: Solr 4.10.3 running on Centos7, with the following JRE: Oracle 
: Corporation OpenJDK 64-Bit Server VM (1.7.0_75 24.75-b04)

I can't reproduce using Solr 4.10.3 (or 4.10.4 - misread your email the 
first time)

Are you certain you didn't *build* this index with a different Similarity 
configured? or did you perhaps build it with an older version of Solr that 
might have had a bug in it?

Here's what i tried...

applied this patch to the example configs based on the fieldType you 
specified...

hossman@tray:~/lucene/lucene_solr_4_10_3_tag$ svn diff
Index: solr/example/solr/collection1/conf/schema.xml
===================================================================
--- solr/example/solr/collection1/conf/schema.xml	(revision 1679472)
+++ solr/example/solr/collection1/conf/schema.xml	(working copy)
@@ -46,6 +46,21 @@
 -->
 
 <schema name="example" version="1.5">
+
+        <fieldType name="hoss_type" class="solr.TextField" sortMissingLast="true">
+            <analyzer>
+                <charFilter class="solr.HTMLStripCharFilterFactory"/>
+                <tokenizer class="solr.StandardTokenizerFactory"/>
+                <filter class="solr.ASCIIFoldingFilterFactory"/>
+                <filter class="solr.StopFilterFactory"
+                    ignoreCase="true" words="stopwords.txt"/>
+                <filter class="solr.LowerCaseFilterFactory"/>
+                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+            </analyzer>
+        </fieldType>
+
+        <field name="hoss_test" type="hoss_type" stored="true" indexed="true" multiValued="true"/>
+  
   <!-- attribute "name" is the name of this schema and is only used for display purposes.
        version="x.y" is Solr's version number for the schema syntax and 
        semantics.  It should not normally be changed by applications.

...started up "java -jar start.jar" and then wrote & ran this script to 
generate a doc with the number of unique terms in my field that you mentioned & indexed it...

hossman@tray:~/tmp$ cat make-big-field.pl
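#!/usr/bin/perl
# Emit a Solr XML <add> command for one doc whose hoss_test field
# holds 119669 distinct terms (term1 .. term119669).

print qq{<add><doc><field name="id">hoss</field><field name="hoss_test">\n};
for (1..119669) {
    print "term${_} ";
}
print qq{</field></doc></add>\n};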
hossman@tray:~/tmp$ perl make-big-field.pl > tmp.xml
hossman@tray:~/tmp$ curl -X POST -H 'Content-Type: application/xml' --data-binary @tmp.xml "http://localhost:8983/solr/collection1/update?commit=true"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">877</int></lst>
</response>


Then confirmed i got a very small fieldNorm when querying against this 
field...

hossman@tray:~/tmp$ curl 'http://localhost:8983/solr/collection1/select?q=hoss_test:term1&debug=results&wt=json&indent=true&fl=id&omitHeader=true'
{
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"hoss"}]
  },
  "debug":{
    "explain":{
      "hoss":"\n7.491524E-4 = (MATCH) weight(hoss_test:term1 in 0) [DefaultSimilarity], result of:\n  7.491524E-4 = fieldWeight in 0, product of:\n    1.0 = tf(freq=1.0), with freq of:\n      1.0 = termFreq=1.0\n    0.30685282 = idf(docFreq=1, maxDocs=1)\n    0.0024414062 = fieldNorm(doc=0)\n"}}}


-Hoss
http://www.lucidworks.com/
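
The numbers in the explain output above can be reproduced by hand. The following Python sketch assumes DefaultSimilarity's formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), norm = boost / sqrt(numTerms)) and a port, from memory, of the one-byte norm encoding in Lucene's SmallFloat (floatToByte315 / byte315ToFloat). The Python function names are hypothetical, and the bit constants should be checked against the Lucene 4.10 source before relying on them.

```python
import math
import struct

# Assumed constants of the "315" byte format (zero exponent point 15).
ZERO_EXP_BASE = (63 - 15) << 3


def _float_bits(value):
    """Raw IEEE-754 float32 bits of `value`, as a signed 32-bit int."""
    return struct.unpack(">i", struct.pack(">f", value))[0]


def _bits_float(bits):
    """Float32 value whose raw bits are `bits`."""
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def encode_norm(value):
    """Lossy float -> byte (0..255) encoding; values are rounded down."""
    bits = _float_bits(value)
    small = bits >> 21
    if small <= ZERO_EXP_BASE:
        # Zero or negative maps to 0; tiny positives round up to 1.
        return 0 if bits <= 0 else 1
    if small >= ZERO_EXP_BASE + 0x100:
        return 255  # overflow saturates at the largest byte
    return small - ZERO_EXP_BASE


def decode_norm(byte):
    """Byte (0..255) -> float, inverse of encode_norm on representable values."""
    if byte == 0:
        return 0.0
    return _bits_float((byte << 21) + ((63 - 15) << 24))


def idf(doc_freq, max_docs):
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, num_terms, boost=1.0):
    """tf * idf * fieldNorm, with the norm passed through the byte encoding."""
    tf = math.sqrt(freq)
    norm = decode_norm(encode_norm(boost / math.sqrt(num_terms)))
    return tf * idf(doc_freq, max_docs) * norm


if __name__ == "__main__":
    print(decode_norm(encode_norm(1.0 / math.sqrt(119669))))
    print(field_weight(freq=1, doc_freq=1, max_docs=1, num_terms=119669))
```

Under these assumptions, 119669 terms give 1/sqrt(119669) ≈ 0.00289, which the byte encoding rounds down to 0.00244140625, matching the fieldNorm in the output above. The same decoder yields exactly 655360.0 for byte value 201, so the value from the original report is at least representable in this encoding.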
