Posted to dev@lucene.apache.org by "Shawn Heisey (JIRA)" <ji...@apache.org> on 2015/07/29 15:36:04 UTC

[jira] [Created] (SOLR-7848) Strictly enforce charFilter/tokenizer/filter order in fieldType definitions

Shawn Heisey created SOLR-7848:
----------------------------------

             Summary: Strictly enforce charFilter/tokenizer/filter order in fieldType definitions
                 Key: SOLR-7848
                 URL: https://issues.apache.org/jira/browse/SOLR-7848
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 5.2.1
            Reporter: Shawn Heisey
            Priority: Minor


Currently you can define a fieldType with the components specified backwards:

{noformat}
    <fieldType name="icu_test" class="solr.TextField">
      <analyzer>
        <filter class="solr.LowerCaseFilterFactory"/>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
      </analyzer>
    </fieldType>
{noformat}

This will work (just tested in 5.2.1), but the components run in exactly the opposite order from the one in which they are defined: the charFilter first, then the tokenizer, then the filter.

The MoinMoin wiki page for Analyzers, Tokenizers, and TokenFilters, in the section for HTMLStripCharFilterFactory, states that charFilter definitions must come before the tokenizer.  This bit of documentation is wrong, because Solr accepts the definitions in any order.

The easiest fix would be to correct the wiki page, but if the order in the config can be detected, we could emit a warning in 5.x when the order is wrong and fail to start the core in 6.0.
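
To illustrate the kind of order detection I have in mind, here is a rough sketch in Python rather than Solr's Java.  It is not taken from the Solr code base: the names find_order_problems and EXPECTED_RANK are made up for this example, and it assumes the fieldType definitions can be parsed as ordinary XML.

```python
import xml.etree.ElementTree as ET

# Position each analysis component is expected to hold inside <analyzer>.
# (Hypothetical helper table, not part of Solr.)
EXPECTED_RANK = {"charFilter": 0, "tokenizer": 1, "filter": 2}


def find_order_problems(schema_xml):
    """Return one warning string per <analyzer> whose children are out of order."""
    root = ET.fromstring(schema_xml)
    problems = []
    for field_type in root.iter("fieldType"):
        for analyzer in field_type.iter("analyzer"):
            # Element names in the order they were declared, ignoring anything
            # that is not a charFilter, tokenizer, or filter.
            declared = [child.tag for child in analyzer if child.tag in EXPECTED_RANK]
            ranks = [EXPECTED_RANK[tag] for tag in declared]
            if ranks != sorted(ranks):
                problems.append(
                    "fieldType '%s': components declared as %s; expected "
                    "charFilter, then tokenizer, then filter"
                    % (field_type.get("name"), ", ".join(declared))
                )
    return problems


if __name__ == "__main__":
    backwards = """
    <schema>
      <fieldType name="icu_test" class="solr.TextField">
        <analyzer>
          <filter class="solr.LowerCaseFilterFactory"/>
          <tokenizer class="solr.ICUTokenizerFactory"/>
          <charFilter class="solr.HTMLStripCharFilterFactory"/>
        </analyzer>
      </fieldType>
    </schema>
    """
    for warning in find_order_problems(backwards):
        print("WARN:", warning)
```

In 5.x the returned strings would be logged as warnings, and in 6.0 a non-empty list would keep the core from starting.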

When I was first building my schema, back in the 1.4 days, I was thoroughly confused and caught off guard when I tried to use PatternReplaceCharFilterFactory.  I found that it was being executed before tokenization, even though I had defined it AFTER the tokenizer.  I did eventually figure out my mistake and switched to PatternReplaceFilterFactory.  If the correct order had been enforced, or the incorrect order had caused a warning in the log, I would have figured it out a lot sooner.


