Posted to dev@lucene.apache.org by "Cassandra Targett (JIRA)" <ji...@apache.org> on 2017/03/17 16:35:42 UTC

[jira] [Updated] (SOLR-10314) Spellcheck with SnowballPorterFilterFactory and Synonyms doesn't work well

     [ https://issues.apache.org/jira/browse/SOLR-10314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cassandra Targett updated SOLR-10314:
-------------------------------------
    Description: 
As noted in SOLR-10252, the default spellcheck configuration in the data_driven_schema_configs (and basic_configs) uses the {{\_text_}} field as the default field for spellcheck. This field has the {{text_general}} field type.

If I use this default spellcheck configuration, but modify the {{text_general}} field type to use the SnowballPorterFilterFactory (with language=German in this case) and have synonyms in my analysis chain, queries to the {{/spell}} request handler fail when there are 2 or more terms that are each preceded by a {{+}} operator.

Note that the default spellcheck configuration also enables spellcheck.collate; if I disable that, I do not get any error. I also do not get an error if I use only 1 term, even if it is spelled "correctly". If at least one of the terms is spelled incorrectly, I do not get an error either.

So, in summary, there's a pretty specific list of variables at work here:

# {{/spell}} request handler
# 2 or more terms, all spelled correctly (that is, all terms exist in the index)
# all terms required with {{+}}
# synonyms (there is a big list in this case, which I cannot share...see SOLR-10252 for an example of the parsed query to see how big the list can get)
# SnowballPorterFilter
# spellcheck.collate=true
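
As a rough illustration of the kind of request the list above describes, here is a minimal sketch that only builds the request URL (it does not contact a server). The host and port are taken from the error message in this report; the collection name {{spelltest3}} is inferred from the core name and is an assumption, and the two query terms are hypothetical placeholders for words that exist in the index.

```python
from urllib.parse import urlencode

# Assumption: host/port come from the error message in this report and the
# collection name is inferred from the core name "spelltest3_shard1_replica2".
BASE_URL = "http://localhost:7574/solr/spelltest3"


def spell_url(terms, collate=True, required=True):
    """Build a /spell request URL for the given query terms."""
    prefix = "+" if required else ""
    # Each term gets the "+" (required) operator when requested.
    q = " ".join(prefix + term for term in terms)
    params = {
        "q": q,
        "spellcheck": "true",
        "spellcheck.collate": "true" if collate else "false",
    }
    # urlencode percent-encodes "+" as %2B and encodes spaces as "+".
    return BASE_URL + "/spell?" + urlencode(params)


# Hypothetical terms; in the report both terms must exist in the index.
print(spell_url(["haus", "garten"]))
```

Calling {{spell_url(["haus", "garten"], collate=False)}} gives the variant with collation disabled, which is reported above as not producing the error.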

The error returned is: 
{code}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:7574/solr/spelltest3_shard1_replica2: String index out of range: -1
{code}

I ran several experiments and found that if synonyms are removed from the field type (and thus from the query analysis chain), the query succeeds with collations enabled. So the cause is not SnowballPorterFilter by itself, but its combination with {{+}}, synonyms, and collation.
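
The observations in this report can be condensed into a small predicate. This is only a sketch of what was observed, not a description of Solr's internals: the {{Scenario}} class and its field names are hypothetical, and it assumes the {{/spell}} handler and the SnowballPorterFilter are in use throughout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scenario:
    """One experiment from the report (hypothetical model, not a Solr API)."""
    terms_in_index: List[bool]     # one flag per query term: True if it exists in the index
    all_required: bool             # every term is prefixed with "+"
    synonyms_in_query_chain: bool  # SynonymFilterFactory is in the query analyzer
    collate: bool                  # spellcheck.collate=true


def reported_to_fail(s: Scenario) -> bool:
    """Return True when the report says the request ends in the error."""
    return (
        len(s.terms_in_index) >= 2      # 2 or more terms
        and all(s.terms_in_index)       # every term is spelled correctly
        and s.all_required              # all terms required with "+"
        and s.synonyms_in_query_chain   # synonyms present at query time
        and s.collate                   # collation enabled
    )


failing = Scenario([True, True], True, True, True)
no_synonyms = Scenario([True, True], True, False, True)
print(reported_to_fail(failing), reported_to_fail(no_synonyms))
```

Turning off any single condition (one term only, a misspelled term, no synonyms, or collation disabled) makes the predicate false, matching the cases reported as working.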

The field type definition is:

{code}
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>
{code}

This problem was found with 5.5.2, but I verified it still exists in 6.4 and 6.5.

  was:
As noted in SOLR-10252, the default spellcheck configuration in the data_driven_schema_configs (and basic_configs) uses the {{\_text_}} field as the default field for spellcheck. This field is {{text_general}} field type.

If I use this default configuration for spellcheck, but modify the {{text_general}} field to use the SnowballPorterFilterFactory (with language=German in this case), and have synonyms in my analysis chain, queries to the {{/spell}} request handler will fail when there are 2 or more terms which are both preceded with a {{+}} operator. 

Note that the default spellcheck configuration also enables spellcheck.collation - if I disable that, I do not get any error. I also do not get an error if I use only 1 term, even if it is spelled "correctly". If at least one of the terms is spelled incorrectly, that also does not give an error.

So, in summary, there's a pretty specific list of variables at work here:

# {{/spell}} request handler
# 2 or more terms, both spelled correctly (or, both terms exist in the index)
# all terms required with {{+}}
# synonyms (there is a big list in this case, which I cannot share...see SOLR-10252 for an example of the parsed query to see how big the list can get)
# SnowballPorterFilter
# spellcheck.collation=true

The error returned is: 
{code}
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:7574/solr/spelltest3_shard1_replica2: String index out of range: -1
{code}

I made several experiments and found that if synonyms are removed from the field type (and thus the query analysis chain), the query is successful with collations enabled. So it's not SnowballPorterFilter by itself, but with {{+}} and synonyms and collation.

The field type definition is:

{code}
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>
{code}

This problem was found with 5.5.2, but I verified it still exists in 6.4 and 6.5.


> Spellcheck with SnowballPorterFilterFactory and Synonyms doesn't work well
> --------------------------------------------------------------------------
>
>                 Key: SOLR-10314
>                 URL: https://issues.apache.org/jira/browse/SOLR-10314
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: spellchecker
>            Reporter: Cassandra Targett
>             Fix For: 5.5, 6.4
>


