
Posted to solr-user@lucene.apache.org by Frederico Azeiteiro <Fr...@cision.com> on 2012/11/26 15:06:29 UTC

Search differences between solr 1.4.0 and 3.6.1

Hi,

 

While upgrading our Solr installation to 3.6.1, I noticed some differences in
results when using search strings that combine letters and numbers.

For a text field defined as:

<analyzer type="index">
<http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-ISOLatin1Accent.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
          catenateNumbers="1" catenateWords="1" generateNumberParts="0"
          generateWordParts="1" stemEnglishPossessive="0"/>
</analyzer>

<analyzer type="query">
<http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" ignoreCase="true"
          expand="true" synonyms="synonyms.txt"/>
  <filter class="solr.WordDelimiterFilterFactory"
          protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
          catenateNumbers="0" catenateWords="0" generateNumberParts="0"
          generateWordParts="1"/>
</analyzer>

 

Searching for string GAMES12 returns a lot of results on 3.6.1 that are
not returned on 1.4.0.

 

It looks like WordDelimiterFilterFactory behaves differently in 3.6.1:
the numeric part of the keyword is ignored and the search is
performed using only GAMES.

 

Analysis output for 1.4.0:

org.apache.solr.analysis.WordDelimiterFilterFactory
{protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0,
catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}

term position     1      2
term text         GAMES  12
term type         word   word
source start,end  0,5    5,7
payload

 

And for 3.6.1:

 

org.apache.solr.analysis.WordDelimiterFilterFactory
{protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0,
catenateWords=0, luceneMatchVersion=LUCENE_36, generateWordParts=1,
catenateAll=0, catenateNumbers=0}

position        1
term text       GAMES
startOffset     0
endOffset       5
type            word
positionLength  1

 

 

Is this something that can be modified/fixed to return the same results?

 

Thank you.

 

Regards,

Frederico

RE: Search differences between solr 1.4.0 and 3.6.1

Posted by Frederico Azeiteiro <Fr...@cision.com>.

Sorry, ignore the "<http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml>".
Somehow that text appeared when I copy/pasted the XML from IE and I did not notice, but that is not part of the schema... :)

Still can't figure this thing out...

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Wednesday, November 28, 2012 12:52
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

Well, I get the same results in 1.4 and 3.6. The only difference is I didn't put <http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml>
in.

In both cases the 12 is missing from the query analysis but is in the index analysis, due to the catenateNumbers being 1 in one case and
0 in the other.

So I'm guessing there's something else going on that you're overlooking, but I don't have any good clue....

Best
Erick



Re: Search differences between solr 1.4.0 and 3.6.1

Posted by Erick Erickson <er...@gmail.com>.

Well, I get the same results in 1.4 and 3.6. The only difference is I
didn't put
<http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml>
in.

In both cases the 12 is missing from the query analysis but is in the
index analysis, due to the catenateNumbers being 1 in one case and
0 in the other.

So I'm guessing there's something else going on that you're overlooking,
but I don't have any good clue....

Best
Erick


On Wed, Nov 28, 2012 at 4:34 AM, Frederico Azeiteiro <
Frederico.Azeiteiro@cision.com> wrote:

> I just reloaded both indexes to make sure that all definitions are
> loaded.
> In the Analysis tool I can see differences, even though the fields are
> defined the same way:
>
> Query Analyser for 3.6.1
> org.apache.solr.analysis.WordDelimiterFilterFactory
> {protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0,
> catenateWords=0, luceneMatchVersion=LUCENE_36, generateWordParts=1,
> catenateAll=0, catenateNumbers=0}
> term text: GAMES
>
> Query Analyser for 1.4.0
> org.apache.solr.analysis.WordDelimiterFilterFactory
> {protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0,
> catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
> term text: GAMES | 12
>
> The "12" is lost at query time in 3.6.1.
> The only difference I can see in the field definition is
> "luceneMatchVersion=LUCENE_36"... Could it cause this issue?
>
> Thank you.
> Frederico

RE: Search differences between solr 1.4.0 and 3.6.1

Posted by Frederico Azeiteiro <Fr...@cision.com>.

I just reloaded both indexes to make sure that all definitions are loaded.
In the Analysis tool I can see differences, even though the fields are defined the same way:

Query Analyser for 3.6.1
org.apache.solr.analysis.WordDelimiterFilterFactory {protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0, catenateWords=0, luceneMatchVersion=LUCENE_36, generateWordParts=1, catenateAll=0, catenateNumbers=0}
term text: GAMES

Query Analyser for 1.4.0
org.apache.solr.analysis.WordDelimiterFilterFactory {protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
term text: GAMES | 12
 
The "12" is lost at query time in 3.6.1.
The only difference I can see in the field definition is "luceneMatchVersion=LUCENE_36"... Could it cause this issue?

Thank you.
Frederico

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Tuesday, November 27, 2012 12:26
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

Using the definition you provided, I don't get the same output. Are you sure you are doing what you think? The generateNumberParts=0 keeps the '12'
from making it through the filter in 1.4 and 3.6 so I suspect you're not quite doing something the same way in both.

Perhaps looking at index tokenization in one and query in the other?

Best
Erick



Re: Search differences between solr 1.4.0 and 3.6.1

Posted by Erick Erickson <er...@gmail.com>.

Using the definition you provided, I don't get the same output. Are you
sure you are doing what you think? The generateNumberParts=0 keeps the '12'
from making it through the filter in 1.4 and 3.6 so I suspect you're not
quite doing something the same way in both.

Perhaps looking at index tokenization in one and query in the other?

Best
Erick



Re: Search differences between solr 1.4.0 and 3.6.1

Posted by Jack Krupansky <ja...@basetechnology.com>.

One change was to change the default for autoGeneratePhraseQueries from true 
to false. That means that now RoC would match Ro OR C rather than "Ro C" 
(phrase).

Simply add autoGeneratePhraseQueries=true to your field type - no need to 
re-index.

-- Jack Krupansky
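
For illustration, a minimal sketch of where the attribute suggested above would go in schema.xml. The field type name "text" is an assumption (the thread does not give the real name), and the analyzer bodies are elided as comments, so this is a fragment to adapt rather than a complete definition:

```xml
<!-- "text" is a hypothetical name; use the name of your own field type -->
<fieldType name="text" class="solr.TextField"
           autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <!-- existing index analyzer chain goes here, unchanged -->
  </analyzer>
  <analyzer type="query">
    <!-- existing query analyzer chain goes here, unchanged -->
  </analyzer>
</fieldType>
```

With this set, a query token that the analyzer splits into several parts (RoC into Ro and C, per the message above) should be searched as a phrase rather than as separate OR'ed terms.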

-----Original Message----- 
From: Frederico Azeiteiro
Sent: Wednesday, November 28, 2012 12:31 PM
To: solr-user@lucene.apache.org
Subject: RE: Search differences between solr 1.4.0 and 3.6.1

Also, I'm having issues with searching for "RoC". It returns thousands of
matches on 3.6.1 against just a few on Solr 1.4.0.
Looking at the analysis I see no differences...

Should I add "RoC" to the protected keywords, or can I tweak something in the
schema to achieve exact "RoC" matches?



RE: Search differences between solr 1.4.0 and 3.6.1

Posted by Frederico Azeiteiro <Fr...@cision.com>.

Also, I'm having issues with searching for "RoC". It returns thousands of matches on 3.6.1 against just a few on Solr 1.4.0.
Looking at the analysis I see no differences...

Should I add "RoC" to the protected keywords, or can I tweak something in the schema to achieve exact "RoC" matches?


-----Original Message-----
From: Frederico Azeiteiro [mailto:Frederico.Azeiteiro@cision.com]
Sent: Wednesday, November 28, 2012 17:19
To: solr-user@lucene.apache.org
Subject: RE: Search differences between solr 1.4.0 and 3.6.1

Ok, I'll test that and let you know.

Is there some test I can easily do to confirm that it was really a side effect of the bug?

____________________________________________
Frederico Azeiteiro
Developer
 



RE: Search differences between solr 1.4.0 and 3.6.1

Posted by Frederico Azeiteiro <Fr...@cision.com>.

Ok, I'll test that and let you know.

Is there some test I can easily do to confirm that it was really a side effect of the bug?

____________________________________________
Frederico Azeiteiro
Developer
 


-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com]
Sent: Wednesday, November 28, 2012 13:39
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

You need to add the generateNumberParts=1 attribute - assuming you actually want the number generated.

The fact that your schema worked in 1.4 was probably simply a side effect of this bug:
https://issues.apache.org/jira/browse/SOLR-1706
"wrong tokens output from WordDelimiterFilter depending upon options"

-- Jack Krupansky


Re: Search differences between solr 1.4.0 and 3.6.1

Posted by Jack Krupansky <ja...@basetechnology.com>.

You need to add the generateNumberParts=1 attribute - assuming you actually 
want the number generated.

The fact that your schema worked in 1.4 was probably simply a side effect of 
this bug:
https://issues.apache.org/jira/browse/SOLR-1706
"wrong tokens output from WordDelimiterFilter depending upon options"

-- Jack Krupansky
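
As a sketch of the change suggested above, the query analyzer from the original post could look as follows with generateNumberParts switched to "1". This assumes that only the query analyzer needs the change and that the stray URL line from the original paste is left out. Whether the index analyzer should be changed as well is not settled in the thread.

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" ignoreCase="true"
          expand="true" synonyms="synonyms.txt"/>
  <!-- generateNumberParts changed from "0" to "1" so that the numeric
       part of a token such as GAMES12 is emitted as its own term -->
  <filter class="solr.WordDelimiterFilterFactory"
          protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
          catenateNumbers="0" catenateWords="0" generateNumberParts="1"
          generateWordParts="1"/>
</analyzer>
```

After reloading the core, the Analysis page for this field type should show both GAMES and 12 in the query analysis for GAMES12, which is an easy way to check the effect before running real searches.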
