Posted to oak-issues@jackrabbit.apache.org by "Vikas Saurabh (JIRA)" <ji...@apache.org> on 2015/11/17 17:01:10 UTC

[jira] [Created] (OAK-3648) Use StandardTokenizer instead of ClassicTokenizer in OakAnalyzer

Vikas Saurabh created OAK-3648:
----------------------------------

             Summary: Use StandardTokenizer instead of ClassicTokenizer in OakAnalyzer
                 Key: OAK-3648
                 URL: https://issues.apache.org/jira/browse/OAK-3648
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: lucene
            Reporter: Vikas Saurabh
            Assignee: Vikas Saurabh


This is related to OAK-3276, where the intent was to use {{StandardAnalyzer}} by default (instead of {{OakAnalyzer}}). As discussed there, we need a specific word delimiter, which isn't possible with StandardAnalyzer, so we should instead switch over to StandardTokenizer within OakAnalyzer itself.

A few motivations to do that:
* Better unicode support
* ClassicTokenizer is the legacy (pre-Lucene-3.1) implementation of StandardTokenizer

One of the key differences between the classic and standard tokenizers is how they delimit words (StandardTokenizer follows the Unicode text segmentation rules of UAX#29)... but that difference is largely neutralized because we apply our own WordDelimiterFilter on top.
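As a rough illustration of the proposed change, the analyzer would swap its tokenizer while keeping the rest of the filter chain. This is only a sketch against the Lucene 4.x API, not OakAnalyzer's actual source; the class name {{SketchAnalyzer}} and the exact WordDelimiterFilter flags shown here are assumptions for illustration:

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/**
 * Illustrative analyzer: StandardTokenizer (UAX#29 word segmentation)
 * instead of ClassicTokenizer, followed by the same kind of filter
 * chain OakAnalyzer uses. Flags and ordering are assumptions.
 */
public class SketchAnalyzer extends Analyzer {

    private static final Version MATCH_VERSION = Version.LUCENE_47;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The one-line change under discussion: StandardTokenizer here,
        // where ClassicTokenizer was used before.
        Tokenizer source = new StandardTokenizer(MATCH_VERSION, reader);
        TokenStream filtered = new LowerCaseFilter(MATCH_VERSION, source);
        // Custom word delimiting stays in our hands regardless of the
        // tokenizer, which is why the tokenizer swap is low-risk.
        filtered = new WordDelimiterFilter(filtered,
                WordDelimiterFilter.GENERATE_WORD_PARTS
                        | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE,
                null);
        return new TokenStreamComponents(source, filtered);
    }
}
```

Since the WordDelimiterFilter runs after either tokenizer, most word-splitting behavior stays the same; the tokenizer swap mainly improves handling of non-Latin scripts per UAX#29.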



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)