Posted to solr-user@lucene.apache.org by Preeti Bhat <pr...@shoregrp.com> on 2017/06/28 13:25:32 UTC

Using asterisk (*) with Unicode characters.

Hi All,

I have a requirement where the user can enter either a Unicode or an ASCII character as input and expects the same result.

For example, MöllerGruppen AS and MollerGruppen AS should return the same result.

I am able to get this done using <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>, but for some reason when I search for MöllerGruppen* I get the message below.

"metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"analyzer returned too many terms for multiTerm term: MöllerGruppen",
    "code":400}}

It works for MollerGruppen* though.

Could someone please advise on this?

Below is the fieldType definition for this field.

<fieldType name="string_ci" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="0" catenateWords="1" splitOnNumerics="0" stemEnglishPossessive="0" preserveOriginal="1"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="0" catenateWords="1" splitOnNumerics="0" stemEnglishPossessive="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>



Thanks and Regards,
Preeti



NOTICE TO RECIPIENTS: This communication may contain confidential and/or privileged information. If you are not the intended recipient (or have received this communication in error) please notify the sender and it-support@shoregrp.com immediately, and destroy this communication. Any unauthorized copying, disclosure or distribution of the material in this communication is strictly forbidden. Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Finally, the recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.

RE: Using asterisk (*) with Unicode characters.

Posted by Preeti Bhat <pr...@shoregrp.com>.

Thanks Erick, it's working now as expected.

Thanks and Regards,
Preeti Bhat

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Wednesday, June 28, 2017 9:20 PM
To: solr-user
Subject: Re: Using asterisk (*) with Unicode characters.

There's a long blog on wildcards here:
https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/

The gist is that if the analysis chain splits a wildcard term into more than one token, wildcards are impossible to get right. So any "MultiTermAware" filter will barf if it ends up emitting more than one token for a wildcard term. Filters that are _not_ "MultiTermAware" are simply skipped when a wildcard term is analyzed.
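
To make the one-token constraint concrete, here is a rough sketch of a chain declared explicitly for wildcard terms. It assumes your Solr version supports an analyzer with type="multiterm" inside a fieldType. The choice of KeywordTokenizerFactory and of these two filters is an assumption for illustration, not something taken from your schema. The point is only that every component takes one token in and gives one token out.

```xml
<!-- Hypothetical sketch: an explicit analyzer for wildcard (multi-term) queries. -->
<!-- It would sit inside the fieldType, next to the index and query analyzers. -->
<analyzer type="multiterm">
  <!-- Treats the whole wildcard term as a single token -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <!-- One token in, one token out -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- preserveOriginal="false" so only the folded form is emitted -->
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
</analyzer>
```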

That leaves the question of why your query chain seems to emit two tokens for MöllerGruppen but not MollerGruppen. I think it's because you have preserveOriginal set to true in the query analysis chain here:
 <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>

So this entry emits both MöllerGruppen and MollerGruppen for the input MöllerGruppen, but only one token for the input MollerGruppen, since MollerGruppen doesn't need any folding. Emitting two tokens violates the constraint imposed by ASCIIFoldingFilterFactory being "MultiTermAware", so it barfs.

You do not need to set preserveOriginal="true" in your _query_ chain, since your indexing chain puts both the folded and unfolded versions in the index at the same position.

So I think if you set preserveOriginal to false (again, in the _query_ analysis chain; leave it true in the index analysis chain) you'll be OK. Your queries will also be somewhat faster.
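
As a sketch of that change, assuming the rest of the string_ci fieldType stays exactly as posted, the query analyzer would look roughly like this. Only the preserveOriginal value on the ASCIIFoldingFilterFactory line differs from the original.

```xml
<!-- Sketch: query-side analyzer only; the index-side analyzer keeps preserveOriginal="true". -->
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Changed: emit only the folded token, so a wildcard term yields a single token -->
  <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="0" catenateWords="1" splitOnNumerics="0" stemEnglishPossessive="0" preserveOriginal="1"/>
</analyzer>
```

Since the index analyzer is untouched, this should not by itself require reindexing. The schema still has to be reloaded for the change to take effect.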

Best,
Erick

