Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2014/12/09 19:09:12 UTC

[jira] [Resolved] (LUCENE-6103) StandardTokenizer doesn't tokenize word:word

     [ https://issues.apache.org/jira/browse/LUCENE-6103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe resolved LUCENE-6103.
--------------------------------
    Resolution: Not a Problem
      Assignee: Steve Rowe

StandardTokenizer implements [the word boundary rules in Unicode UAX#29|http://www.unicode.org/reports/tr29/#Word_Boundaries].

The ASCII colon (and other colon-like characters) is included in the set of characters matched by the [{{WordBreak:MidLetter}}|http://www.unicode.org/reports/tr29/#MidLetter] property value, which appears in [rules WB6 and WB7|http://www.unicode.org/reports/tr29/#WB6]. These rules forbid a word break on either side of a colon that sits between two letters.

To get the behavior you want, you could customize the JFlex grammar used to generate StandardTokenizer, removing the colon from the grammar's {{MidLetter}} definition.
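
To make the effect of WB6/WB7 concrete, here is a minimal, self-contained sketch. It is not Lucene's code and not the JFlex grammar: {{MidLetterSketch}} and {{tokenize}} are hypothetical names, and the logic is a deliberate simplification that only handles letters and a caller-supplied MidLetter set (no digits, no supplementary characters, none of the other UAX#29 rules). It only illustrates why a colon between letters does not split a token, and what changes once the colon is no longer treated as MidLetter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy illustration of UAX#29 rules WB6/WB7; NOT Lucene's StandardTokenizer. */
final class MidLetterSketch {

    /**
     * Splits text into runs of letters. A character from midLetter stays inside
     * a token only when it has a letter directly on both sides, which is the
     * situation in which WB6/WB7 forbid a word break.
     */
    static List<String> tokenize(String text, Set<Character> midLetter) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean letterBefore = i > 0 && Character.isLetter(text.charAt(i - 1));
            boolean letterAfter = i + 1 < text.length() && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetter(c) || (midLetter.contains(c) && letterBefore && letterAfter)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "word word:word";
        // Colon treated as MidLetter: no break around it.
        System.out.println(tokenize(text, Set.of(':')));  // [word, word:word]
        // Colon removed from the MidLetter set: it now separates tokens.
        System.out.println(tokenize(text, Set.of()));     // [word, word, word]
    }
}
```

The second call corresponds, in spirit only, to a grammar whose {{MidLetter}} definition no longer contains the colon.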

Another alternative is ICUTokenizer, which allows runtime per-orthographic-script specification of word-break rules - check out the factory javadocs: http://lucene.apache.org/core/4_9_0/analyzers-icu/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerFactory.html 






> StandardTokenizer doesn't tokenize word:word
> --------------------------------------------
>
>                 Key: LUCENE-6103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6103
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.9
>            Reporter: Itamar Syn-Hershko
>            Assignee: Steve Rowe
>
> StandardTokenizer (and, as a result, most default analyzers) will not tokenize word:word and will preserve it as one token. This can be easily seen using Elasticsearch's analyze API:
> localhost:9200/_analyze?tokenizer=standard&text=word%20word:word
> If this is the intended behavior, then why? I can't really see the logic behind it.
> If not, I'll be happy to join in the effort of fixing this.


