Posted to solr-user@lucene.apache.org by "EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)" <ex...@us.bosch.com> on 2014/07/09 18:17:22 UTC

Lower/UpperCase Issue

I have a situation here: when I search with "BALANCER", the results come back in a different order than when I search with "Balancer". For "BALANCER", the documents with upper case are first in the list; for "Balancer", they are in a different order.

I am confused by this behavior. Has anyone had the same issue, or am I missing something?

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  </analyzer>
</fieldType>

Example queries:

http://localhost:8983/solr/Test/select?q=*%3A*&fq=Name%3ABALANCER&wt=json&indent=true

http://localhost:8983/solr/Test/select?q=*%3A*&fq=Name%3ABalancer&wt=json&indent=true

Thanks

Ravi

RE: Lower/UpperCase Issue

Posted by "EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)" <ex...@us.bosch.com>.

Thanks Erick & Shawn. I will analyze with your input and share the outcome.

Thanks

Ravi

Re: Lower/UpperCase Issue

Posted by Erick Erickson <er...@gmail.com>.

Side note: putting LowerCaseFilter in front of
WordDelimiterFilterFactory is usually a poor
choice. One of the purposes of WDFF is that
it breaks lower->upper case transitions into
separate tokens, and it cannot do that once the
case has been flattened.

NOTE: This is _not_ germane to your problem,
IMO.

But it _is_ an indication that you might want to
spend some time with the admin/analysis page
to understand the effects of the filters on various
inputs.

Also, add &debug=all to your query to see exactly
why things were returned in the order they were. In
this case, as Shawn says, all documents will be scored
the same.

Actually, I'd be very interested in the output of
adding &debug=all to the two queries. Theoretically,
since all the scores are the same, I'd expect the
results to be consistently ordered unless the filter
query is parsing Balancer and BALANCER differently.
I'm going to guess that the parsing of the fq clause is
different somehow, but that's only a guess.

Best,
Erick

Re: Lower/UpperCase Issue

Posted by Shawn Heisey <so...@elyograg.org>.

On 7/9/2014 2:02 PM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions) wrote:
> Here is the schema part.
>
> <field name="Name" type="text_general" indexed="true" stored="true" />

Your query is *:*, which is a constant score query.  You also have a
filter, which does not affect scoring.

Since there is no score difference between different documents with your
query results, the lack of a sort parameter means that you will most
likely get the results in the order that Lucene returns them, which is
completely indeterminate.

There's probably some minute difference between the two queries at the
Lucene level, possibly because the stemmer behaves differently with
different case or just because the internal matching happens
differently, which makes Lucene return the results in a different order.

If you want to be absolutely in control of your result order when the
query results in a constant score, you'll need to specify a sort parameter.
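
As a rough sketch of what specifying a sort could look like: besides adding a
sort parameter to each request URL, a default sort can be set on the request
handler in solrconfig.xml. The handler name "/select" and the field name "id"
below are assumptions for illustration (the thread does not show the
uniqueKey field); any single-valued, sortable field would serve as the
tie-breaker.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Hypothetical tie-breaker: assumes a sortable unique key field named "id".
         Documents with equal scores are then ordered by id, not by internal Lucene order. -->
    <str name="sort">score desc, id asc</str>
  </lst>
</requestHandler>
```

With something like this in place, both the BALANCER and the Balancer query
should return matching documents in the same order, provided both filters
match the same set of documents.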

Thanks,
Shawn

RE: Lower/UpperCase Issue

Posted by "EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)" <ex...@us.bosch.com>.

Here is the schema part.

<field name="Name" type="text_general" indexed="true" stored="true" />


Re: Lower/UpperCase Issue

Posted by Shawn Heisey <so...@elyograg.org>.

What is the full field definition for Name?  You've included a fieldType
here, but that's only half the picture.

Thanks,
Shawn

Re: Lower/UpperCase Issue

Posted by Jack Krupansky <ja...@basetechnology.com>.

Ahmet is correct: the porter stemmer assumes that your input is lower case, 
so be sure to place the lower case filter before stemming.

BTW, this is the kind of detail that I have in my e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

You could also find this detail down at the level of the Lucene Javadoc, but 
IMHO it's inappropriate to expect Solr users to have to dive down into 
Lucene Javadoc.

See:
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html
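
For illustration only, here is one way the fieldType from the original post
might be rearranged so that lowercasing happens before stemming. This is an
untested sketch: it keeps the same filters and attributes as the original,
moves LowerCaseFilterFactory ahead of PorterStemFilterFactory in both
analyzers, and drops the duplicate lowercase filter from the index analyzer.
The relative order of the other filters is left as posted, including
WordDelimiterFilterFactory after lowercasing, which Erick flags separately.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first, so the stemmer only ever sees lowercase tokens -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Same change on the query side: lowercase before stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  </analyzer>
</fieldType>
```

A change to the index analyzer only affects documents indexed afterwards, so
existing documents would need to be reindexed. The admin analysis page is the
place to check that BALANCER and Balancer now produce the same token.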

-- Jack Krupansky

RE: Lower/UpperCase Issue

Posted by "EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)" <ex...@us.bosch.com>.

Do I need to use a different algorithm instead of Porter stemming? Can you suggest anything you have in mind?

Re: Lower/UpperCase Issue

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi,

The Analysis admin page will tell you the truth. Just a guess: the Porter stem filter could be "case sensitive", and that may cause the difference.
I am pretty sure Porter stemming algorithms are designed to work on lowercase input.

By the way, you have two lowercase filters defined in the index analyzer.

Ahmet


