Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2010/04/23 11:45:49 UTC

[jira] Commented: (LUCENE-2414) add icu-based tokenizer for unicode text segmentation

    [ https://issues.apache.org/jira/browse/LUCENE-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860199#action_12860199 ] 

Uwe Schindler commented on LUCENE-2414:
---------------------------------------

One more comment:
The abstract ICUTokenizerConfig is public, but only has package-private methods, so nobody outside the package can ever implement their own. Abstract classes are different from interfaces, whose methods are always public (that is the nature of interfaces).
I would make the methods public; otherwise the whole configuration makes no sense.

I would also rename getTokenizer() to getBreakIterator().
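The two suggestions above could be sketched roughly as follows. This is a hypothetical shape, not the patch's actual API: the JDK's java.text.BreakIterator stands in for ICU's BreakIterator, and the int script parameter and defaultConfig() helper are illustrative.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Sketch of the suggested change: an abstract config whose methods are
// public, so users outside the package can implement their own, and whose
// factory method is named getBreakIterator() rather than getTokenizer().
// java.text.BreakIterator is a stand-in for ICU's BreakIterator here.
public abstract class ICUTokenizerConfigSketch {
    // public, not package-private: subclasses in any package can override it
    public abstract BreakIterator getBreakIterator(int script);

    // A trivial default for illustration: word breaking for every script.
    public static ICUTokenizerConfigSketch defaultConfig() {
        return new ICUTokenizerConfigSketch() {
            @Override
            public BreakIterator getBreakIterator(int script) {
                return BreakIterator.getWordInstance(Locale.ROOT);
            }
        };
    }

    public static void main(String[] args) {
        BreakIterator bi = defaultConfig().getBreakIterator(0);
        bi.setText("hello world");
        int count = 0;
        for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
            count++; // counts word and whitespace segments alike
        }
        System.out.println(count);
    }
}
```

With the methods public, a user can subclass the config from any package and return a rule-based or dictionary-based iterator per script.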

> add icu-based tokenizer for unicode text segmentation
> -----------------------------------------------------
>
>                 Key: LUCENE-2414
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2414
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2414.patch, LUCENE-2414.patch, LUCENE-2414.patch
>
>
> I pulled out the last part of LUCENE-1488, the tokenizer itself and cleaned it up some.
> The idea is simple:
> * First step is to divide text into writing system boundaries (scripts)
> * You supply an ICUTokenizerConfig (or just use the default) which lets you tailor segmentation on a per-writing system basis.
> * This tailoring can be any BreakIterator, so rule-based or dictionary-based or your own.
> The default implementation (if you do not customize) is just to do UAX#29, but with tailorings for stuff with no clear word division:
> * Thai (uses dictionary-based word breaking)
> * Khmer, Myanmar, Lao (uses custom rules for syllabification)
> Additionally, as more of an example, I have a tailoring for Hebrew that treats punctuation specially. (People have asked before
> for ways to make StandardAnalyzer treat dashes differently, etc.)
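The first step described above (dividing text into writing-system boundaries) can be sketched roughly like this. The JDK's Character.UnicodeScript stands in for ICU's UScript, and the class and method names are illustrative; the real tokenizer's boundary logic (e.g. handling of COMMON/INHERITED characters) is more involved than this naive split-on-script-change.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split text into runs where each run belongs to a
// single writing system, so a per-script BreakIterator can then be applied
// to each run. Uses Character.UnicodeScript as a stand-in for ICU's UScript.
public class ScriptRunSketch {
    public static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int runStart = 0;
        Character.UnicodeScript runScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            if (runScript == null) {
                runScript = s;                 // first code point starts a run
            } else if (s != runScript) {
                runs.add(text.substring(runStart, i)); // script changed: close run
                runStart = i;
                runScript = s;
            }
            i += Character.charCount(cp);      // advance by code point
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart)); // close the final run
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin followed by Thai: two script runs
        System.out.println(scriptRuns("testทดสอบ"));
    }
}
```

Each run would then be handed to the BreakIterator that the ICUTokenizerConfig supplies for that run's script, which is what makes per-writing-system tailoring (Thai dictionary breaking, Khmer/Myanmar/Lao syllable rules) possible.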

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

