Posted to solr-user@lucene.apache.org by Stephan Damson <st...@bayer.com> on 2019/02/26 07:18:59 UTC

SOLR Tokenizer “solr.SimplePatternSplitTokenizerFactory” splits at unexpected characters

Hi!

I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. The pattern used is actually from an example in the SOLR documentation and I do not understand where I made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during indexing the input gets split into the tokens "ope", "a" and "ive"; that is, the tokenizer splits at the characters "r" and "t", not at the expected whitespace characters (TAB, CR). Just to be sure, I also tried more than one backslash in the pattern (e.g. \t and \\t), but this did not change how the input is tokenized during indexing.
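The observed tokens are consistent with the pattern's \t and \r being read as the literal characters "t" and "r" rather than as tab and carriage return. A quick sanity check of that hypothesis (using Python's re module purely as a stand-in, not the actual Lucene engine):

```python
import re

# Hypothesis: the escapes in "[ \t\r\n]+" are taken literally, so the
# effective split set is space, backslash, "t", "r" and "n".
literal_pattern = re.compile(r"[ \\trn]+")

tokens = [t for t in literal_pattern.split("operative") if t]
print(tokens)  # -> ['ope', 'a', 'ive']
```

This reproduces exactly the "ope" / "a" / "ive" split seen in the analyzer, which suggests the pattern's backslash escapes are not being interpreted as whitespace.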

What am I missing?
SOLR version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Many thanks in advance for any help you can provide!

Re: SOLR Tokenizer “solr.SimplePatternSplitTokenizerFactory” splits at unexpected characters

Posted by Shawn Heisey <ap...@elyograg.org>.
On 2/26/2019 12:18 AM, Stephan Damson wrote:
> If we take the example input "operative", the analyzer shows that during indexing the input gets split into the tokens "ope", "a" and "ive"; that is, the tokenizer splits at the characters "r" and "t", not at the expected whitespace characters (TAB, CR). Just to be sure, I also tried more than one backslash in the pattern (e.g. \t and \\t), but this did not change how the input is tokenized during indexing.

I tried your fieldType on 7.5.0 and I see the same problem.  I couldn't 
get it working no matter what I tried.

I then tested it on 7.7.0 and it works properly in that version.
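
If upgrading isn't immediately an option, one workaround worth trying (an untested sketch, on the assumption that the pattern attribute is being handed to the tokenizer without interpreting backslash escapes) is to put the actual whitespace characters into the pattern via XML character references instead of \t-style escapes:

```xml
<!-- &#9; = TAB, &#10; = LF, &#13; = CR; the XML parser substitutes the
     real characters before the pattern reaches the tokenizer. -->
<tokenizer class="solr.SimplePatternSplitTokenizerFactory"
           pattern="[ &#9;&#10;&#13;]+"/>
```

That sidesteps the escape-handling question entirely, since the tokenizer then receives literal whitespace characters in its character class.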

Thanks,
Shawn