Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/11/08 21:09:06 UTC

[jira] Created: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
---------------------------------------------------------------------------

                 Key: LUCENE-2747
                 URL: https://issues.apache.org/jira/browse/LUCENE-2747
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
             Fix For: 3.1, 4.0


As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization.  Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0).  The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters.  The language-specific analyzers should switch to StandardTokenizer in 3.1.

Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer.  

For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary.  Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer.
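For illustration only, such a char filter might look roughly like the sketch below (the class name is hypothetical and this is not the actual patch; it assumes the 3.x CharStream/CharFilter API, where only the array-based read needs overriding and offsets are unchanged because the replacement is the same length):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.CharStream;

/** Sketch: maps ZWNJ (U+200C) to a plain space so that UAX#29 sees a word boundary. */
public final class ZwnjToSpaceCharFilter extends CharFilter {

  public ZwnjToSpaceCharFilter(CharStream in) {
    super(in);
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    final int numRead = input.read(cbuf, off, len);
    for (int i = off; i < off + numRead; i++) {
      if (cbuf[i] == '\u200C') { // zero-width non-joiner
        cbuf[i] = ' ';
      }
    }
    return numRead;
  }
}
{code}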


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930137#action_12930137 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

DM, thanks, I see exactly where you are coming from.

I see your point: previously it was much easier to take something like SimpleAnalyzer and 'adapt' it to a given language based on things like Unicode properties.
In fact, that's exactly what we did in the cases here (Arabic, Persian, Hindi, etc.).

But now we can actually tokenize "correctly" for more languages with JFlex, thanks to its improved Unicode support, and it's superior to these previous hacks :)

To try to answer some of your questions (all my opinion):

bq. Is there a point to having SimpleAnalyzer

I guess so; a lot of people can use this if they have English-only content and are probably happy to discard numbers, etc. It's not a big loss to me if it goes, though.

bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or Solr for analyzer components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: there, UAX29Tokenizer is in fact in Lucene core.

bq. Would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?

Personally, I would prefer that we move towards a factory model where these supplied "language analyzers" are actually xml/json/properties snippets.
In other words, they are just example configurations that build your analyzer, like Solr does.
This is nice, because then you don't have to write code to customize how your analyzer works.

I think we have been making slow steps towards this, just doing basic things like moving stopword lists to .txt files.
But I think the next step would be LUCENE-2510, where we have factories/config attribute parsers for all these analysis components already written.

Then we could have support for declarative analyzer specification via xml/json/.properties/whatever, and move all these analyzers to that.
I still think you should be able to code up your own analyzer, but in my opinion the declarative approach is much easier and preferred for the ones we supply.

Also, I think this would solve a lot of analyzer backwards-compatibility problems, because then our supplied analyzers are really just example configuration files,
and we can change our examples however we want... someone can keep using their old config file (and hopefully the old analysis module jar file!) to guarantee
the exact same behavior if they want.

Finally, most of the benefits of ICUTokenizer are actually in the UAX#29 support... the tokenizers are pretty close, with some minor differences:
* the JFlex-based implementation is faster, and better in my opinion.
* the ICU-based implementation allows tailoring, and supplies tailored tokenization for several complex scripts (JFlex doesn't have this... yet)
* the ICU-based implementation works with all of Unicode; at the moment JFlex is limited to the Basic Multilingual Plane.

In my opinion the last 2 points will probably be resolved eventually... I could see our ICUTokenizer becoming obsolete down the road,
displaced by better JFlex support, though that would probably need hooks into ICU for the complex script support (so we get it for free from ICU).



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965066#action_12965066 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

OK, I'd like to move forward with this if there are no objections. It improves these analyzers by improving their multilingual
capabilities, allowing them to support numbers etc., and it deprecates/removes useless custom code (only unreleased functionality is removed).

We also add CharFilter support to the ReusableAnalyzerBase, which was missing before.

DM: we can open a follow-up issue to improve the docs for the simplistic tokenizers as discussed here.



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930580#action_12930580 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. I meant o.a.l.analysis.core. I'd expect the premier analyzers to be in core.

I guess the package doesn't make a big difference to me; all the analyzers are in one place and treated the same.
It's true we mixed the "core" analysis stuff with contrib and Solr, but if there are warts with the contrib stuff,
we should be able to clean it up (I think this has been happening).

bq. I guess I meant: Shouldn't the SimpleAnalyzer just be constructed the same as StandardAnalyzer, with the addition of a Filter that pitches tokens that are not needed?

I don't think so; this seems like a trap. Lots of people use StandardTokenizer without any filter... I don't think
it should emit trash (punctuation). If you want to do the other stuff, you can use ClassicTokenizer, or even
WhitespaceTokenizer, and filter to your heart's content.
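For example, a do-it-yourself chain along those lines might look like the following sketch (class name and sample text are illustrative only; it assumes the 3.1 analysis APIs and Version constant):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceChainDemo {
  public static void main(String[] args) throws Exception {
    // whitespace tokenization, then whatever filters you like
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_31,
        new StringReader("Filter to your heart's content"));
    ts = new LowerCaseFilter(Version.LUCENE_31, ts);
    ts = new StopFilter(Version.LUCENE_31, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // filter, heart's, content
    }
    ts.end();
    ts.close();
  }
}
{code}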



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930073#action_12930073 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

Simon: I agree with both those points... we should change the method signature.

Also, I called it charStream (this is what Solr's analyzer calls it), but this is slightly confusing since the API is all Reader-based.
Alternatively we could give this a different name, wrapReader or something... not sure, I didn't have any better ideas than charStream.




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930075#action_12930075 ] 

Simon Willnauer commented on LUCENE-2747:
-----------------------------------------

bq. ...Alternatively we could give this a different name, wrapReader or something... not sure, I didn't have any better ideas than charStream.
wrapReader seems too specific; what about initReader?
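For illustration, if the hook ends up as an initReader(Reader) override on ReusableAnalyzerBase (naming per the suggestion above; this is only a sketch under that assumption, not the committed patch, using the 3.x MappingCharFilter to map ZWNJ to a space):

{code:java}
import java.io.Reader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

/** Sketch: wraps the incoming Reader with a char filter before the tokenizer sees any text. */
public final class ZwnjAwareAnalyzer extends ReusableAnalyzerBase {

  private static final NormalizeCharMap ZWNJ_MAP = new NormalizeCharMap();
  static {
    ZWNJ_MAP.add("\u200C", " "); // zero-width non-joiner -> space
  }

  @Override
  protected Reader initReader(Reader reader) {
    return new MappingCharFilter(ZWNJ_MAP, CharReader.get(reader));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    return new TokenStreamComponents(new StandardTokenizer(Version.LUCENE_31, reader));
  }
}
{code}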


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930670#action_12930670 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

Robert/Steven, I'm sorry. I fat-fingered the last post. I really need to take more care.
    s/Standard/Simple/;

That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer that produced WORD tokens instead of NUM or ALPHANUM, it would be the same type of token stream, just alpha words.

Regarding compatibility: I think the results for English would be nearly, if not entirely, identical. Western European languages would be only slightly off from identical. But for other languages it would be an improvement.

At this point, I'm content with what you guys are doing with non-English texts. Great job.



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929738#action_12929738 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. CharFilter must at least also implement read() to read one char. 

That's incorrect:
only read(char[] cbuf, int off, int len) is abstract in Reader.
CharStream extends Reader, but only adds correctOffset().
CharFilter extends CharStream, but only delegates read(char[] cbuf, int off, int len).

So implementing read() would only add useless code duplication here.
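For reference, java.io.Reader already implements the single-char read() on top of the array variant, roughly like this (paraphrased from the JDK, not our code), so a CharFilter that only overrides the array-based read still supports single-char reads:

{code:java}
// java.io.Reader, paraphrased: the no-arg read() delegates to the
// abstract read(char[], int, int), which is the one CharFilter overrides.
public int read() throws IOException {
  char[] cb = new char[1];
  if (read(cb, 0, 1) == -1) {
    return -1;
  }
  return cb[0];
}
{code}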


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930090#action_12930090 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer.

DM, can you elaborate here? 

Are you speaking of the existing StandardAnalyzer in previous releases, which doesn't properly deal with tokenizing diacritics, etc.?
That is the reason these "special" tokenizers exist: to work around those bugs.
But StandardTokenizer now handles this stuff fine, and they are obsolete.

I'm confused, though, how SimpleAnalyzer would ever be any better in previous releases, since it would barf on these diacritics too:
it only emits tokens that are runs of Character.isLetter().

Or is there something else I'm missing here?
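To make that concrete, a quick check with plain JDK calls (illustrative only, not from the patch): a run-of-Character.isLetter() tokenizer treats a combining diacritic as a non-letter and breaks on it.

{code:java}
public class CombiningMarkCheck {
  public static void main(String[] args) {
    // 'é' written as base letter + combining acute accent (decomposed form)
    String decomposed = "e\u0301";
    System.out.println(Character.isLetter(decomposed.charAt(0))); // true  ('e')
    System.out.println(Character.isLetter(decomposed.charAt(1))); // false (U+0301, a combining mark)
    // so a tokenizer emitting runs of isLetter() chars would emit "e" and break at the accent
  }
}
{code}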



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929737#action_12929737 ] 

Uwe Schindler commented on LUCENE-2747:
---------------------------------------

CharFilter must at least also implement read() to read one char. This is not needed by most tokenizers, but without it, it's incomplete. It does *not* need to implement the char[] variant without offset and count.

Otherwise it looks fine for now in ReusableAnalyzerBase.


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930598#action_12930598 ] 

Steven Rowe commented on LUCENE-2747:
-------------------------------------

bq. How about implementing StandardTokenizer using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?

DM, ALPHANUM is the only "word" token type - there is no ALPHA or WORD type - from the UAX29Tokenizer javadocs:

{quote}
Tokens produced are of the following types:

* <ALPHANUM>: A sequence of alphabetic and numeric characters
* <NUM>: A number
* <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
* <IDEOGRAPHIC>: A single CJKV ideographic character
* <HIRAGANA>: A single hiragana character
{quote}

So I guess you think that StandardAnalyzer should exclude tokens that have numeric characters in them?


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930024#action_12930024 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. You removed TestHindiFilters.testTokenizer(), TestIndicTokenizer.testBasics() and TestIndicTokenizer.testFormat(), but these would be useful in TestStandardAnalyzer and TestUAX29Tokenizer, wouldn't they?

Oh, I just deleted everything associated with that tokenizer...

bq. You did not remove ArabicLetterTokenizer and IndicTokenizer, presumably so that they can be used with Lucene 4.0+ when the supplied Version is less than 3.1 - good catch, I had forgotten this requirement - but when can we actually get rid of these? Since they will be staying, shouldn't their tests remain too, but using Version.LUCENE_30 instead of TEST_VERSION_CURRENT?

I removed the IndicTokenizer (unreleased) and deleted its tests,
but I kept and deprecated the Arabic one, since we have released it.



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930681#action_12930681 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer that produced WORD tokens instead of NUM or ALPHANUM, it would be the same type of token stream, just alpha words.

OK, now I understand you, and yes, I agree... My question is: should we even bother fixing it? Would anyone who actually cares about Unicode really want only some hacked subset of UAX#29?

These simple ones like SimpleAnalyzer, WhitespaceAnalyzer, and StopAnalyzer are all really bad for Unicode text in different ways, though Simple/Stop are bigger offenders, I think, because they will separate a base character from its combining characters (in my opinion, this should always be avoided) and, worse, they will break on these.

But people using them are probably happy? E.g. you can do like Solr: use WhitespaceAnalyzer and follow through with something like WordDelimiterFilter, and it's mostly OK, depending upon options, except for cases like CJK where it's a death trap.

Personally, I just don't use these things since I know the problems, but we could document "this is simplistic and won't work well for many languages" and keep them around for people that don't care?

And yeah, I suppose it's confusing that these really "simple" ones are in the .core package, but to me the package is meaningless; I was just trying to keep the analyzers arranged in some kind of order (e.g. pattern-based analysis in the .pattern package, etc.).

We could just as well call the package .basic or .simple or something else; it's just a name.



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930578#action_12930578 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

{quote}
bq. Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

In trunk (the 4.x codeline) there is no core, contrib, or Solr for analyzer components any more; they are all combined into modules/analysis.
In branch_3x (the 3.x codeline) we did not make this rather disruptive refactor: there, UAX29Tokenizer is in fact in Lucene core.
{quote}

I meant o.a.l.analysis.core. I'd expect the *premier* analyzers to be in core.

{quote}
bq. Is there a point to having SimpleAnalyzer

I guess so; a lot of people can use this if they have English-only content and are probably happy to discard numbers, etc. It's not a big loss to me if it goes, though.
{quote}

I guess I meant: Shouldn't the SimpleAnalyzer just be constructed the same as StandardAnalyzer, with the addition of a Filter that pitches tokens that are not needed?
With the suggestion in LUCENE-2167 to use UAX29Tokenizer for StandardAnalyzer, effectively deprecating EMAIL and URL and possibly adding some kind of PUNCTUATION (so that URLs/emails/acronyms... can be reconstructed, if someone desires), the StandardAnalyzer is about as simple as one could get while properly handling non-English/non-Western languages. It just creates ALPHANUM, NUM, and PUNCTUATION (if added) tokens that SimpleAnalyzer does not care about.
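As a rough sketch of that "filter that pitches tokens" idea (a hypothetical class, not part of this patch; it keys off StandardTokenizer's type strings and, for brevity, does not adjust position increments for the dropped tokens):

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Sketch: keeps only <ALPHANUM> tokens, pitching NUM, IDEOGRAPHIC, etc. */
public final class AlphanumOnlyFilter extends TokenFilter {

  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public AlphanumOnlyFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if ("<ALPHANUM>".equals(typeAtt.type())) {
        return true; // keep word tokens
      }
      // otherwise skip this token and look at the next one
    }
    return false;
  }
}
{code}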



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930448#action_12930448 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

Robert, I think we are on the same wavelength. Thanks.

I like the idea of declarative analyzers, too.

Regarding the "last 2 points" has anyone given input to the JFlex team on these needs?


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929826#action_12929826 ] 

Uwe Schindler commented on LUCENE-2747:
---------------------------------------

Ay, ay "Code Dup Policeman".

From a perf standpoint, for real FilterReaders in java.io that would be a no-go, but here it's fine, as Tokenizers always buffer. Also, java.io's FilterReader is different and delegates this method, but CharFilter does not.


[jira] Resolved: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2747.
---------------------------------

    Resolution: Fixed
      Assignee: Robert Muir

Committed revision 1043114, 1043181 (3x)


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930119#action_12930119 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

bq. DM, can you elaborate here?

I was a bit trigger-happy with the comment. I should have looked at the code rather than the JIRA comments alone. The old StandardAnalyzer had a kitchen-sink approach to tokenization, trying to do too much with *modern* constructs, e.g. URLs, email addresses, acronyms.... It and SimpleAnalyzer would produce about the same stream on "old" English and some other texts, but the StandardAnalyzer was much slower. (I don't remember how slow, but it was obvious.)

Both of these were weak when it came to non-English/non-Western texts. Thus I could take the language-specific tokenizers, lists of stop words, and stemmers and create variations of the SimpleAnalyzer that properly handled a particular language. (I created my own analyzers because I wanted to make stop words and stemming optional.)

In looking at the code in trunk (I should have done that before making my comment), I see that UAX29Tokenizer is duplicated in StandardAnalyzer's JFlex grammar and that ClassicAnalyzer uses the old JFlex grammar. Also, the new StandardAnalyzer does a lot less.

If I understand the suggestion of this and the other 2 issues, StandardAnalyzer will no longer handle modern constructs. As I see it, this is what SimpleAnalyzer should be: based on UAX#29 and doing little else. Thus my confusion. Is there a point to having SimpleAnalyzer? Shouldn't UAX29Tokenizer be moved to core? (What is core anyway?)

And if I understand where this is going: Would there be a way to plug in ICUTokenizer as a replacement for UAX29Tokenizer into StandardTokenizer, such that all Analyzers using StandardTokenizer would get the alternate implementation?


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930470#action_12930470 ] 

Steven Rowe commented on LUCENE-2747:
-------------------------------------

DM, I'm a committer on the JFlex team.  About the second-to-last point: when Robert said "(JFlex doesn't have this... yet)" he meant the JFlex-based implementation in Lucene, not JFlex itself.

About the last point, JFlex is shooting for level 1 compliance with [UTS#18 Unicode Regular Expressions|http://unicode.org/reports/tr18/], which requires conforming implementations to "handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF."


[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929834#action_12929834 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

Wait, that's an interesting point: is there any advantage to actually using "real" FilterReaders for this API?
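
(For context, a minimal sketch of what such a "real" {{java.io.FilterReader}} could look like for the ZWNJ-to-space mapping discussed in this issue -- purely illustrative, not part of any patch here, and the class name is made up.)

{code}
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical example: a plain java.io.FilterReader that maps ZERO WIDTH NON-JOINER
// (U+200C) to a space before tokenization.  Because the replacement is one-to-one,
// character offsets are preserved; a Lucene CharFilter would additionally be able to
// correct offsets for mappings that change the text length.
public final class ZwnjToSpaceReader extends FilterReader {
  public ZwnjToSpaceReader(Reader in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int c = super.read();
    return c == '\u200C' ? ' ' : c;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = super.read(cbuf, off, len);
    for (int i = off; i < off + n; i++) {
      if (cbuf[i] == '\u200C') {
        cbuf[i] = ' ';
      }
    }
    return n;
  }
}
{code}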




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930674#action_12930674 ] 

Steven Rowe commented on LUCENE-2747:
-------------------------------------

bq. That is, SimpleAnalyzer is not appropriate for many languages. If it were based upon a variation of UAX29Tokenizer, but didn't handle NUM or ALPHANUM, but WORD instead, it would be the same type of token stream, just alpha words.

DM, if you want, I could help you convert UAX29Tokenizer in this way - I hang out on IRC #lucene and #lucene-dev a lot of the time.  I think you'd want to call it something other than SimpleAnalyzer, though - maybe UAX29AlphabeticTokenizer?



[jira] Updated: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2747:
--------------------------------

    Attachment: LUCENE-2747.patch

Here's an updated patch.

In reality the previous patch was a problem: because initReader() lived in the TokenStreamComponents,
it caused code duplication in any Analyzer, which had to specify its CharFilter twice: once in
createComponents for the initial Reader, and once in the TokenStreamComponents implementation for reset(Reader).

So I moved this to just be a method of ReusableAnalyzerBase.

Also, I didn't apply the 'throws IOException'. After re-thinking, there is no need for it:
none of our CharFilters, for example, throw IOExceptions in their ctors, and even the
Analyzer.tokenStream method cannot throw IOException.

We shouldn't add 'throws X' just because some arbitrary user class MIGHT throw it;
they might throw SQLException or InvalidMidiDataException too.
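
(To make the difference concrete, here is a rough sketch of how an analyzer might use such a hook -- the method name is taken from this discussion, its exact signature in the patch may differ, and {{ZwnjToSpaceCharFilter}} is a hypothetical placeholder for whatever char filtering the analyzer needs; 3.x-style package names are assumed for the imports.)

{code}
import java.io.Reader;

import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Rough sketch only: the CharFilter is declared once on the analyzer instead of in
// both createComponents(Reader) and reset(Reader).  ZwnjToSpaceCharFilter is hypothetical.
public final class MyPersianLikeAnalyzer extends ReusableAnalyzerBase {

  @Override
  protected Reader initReader(Reader reader) {
    // applied both to the initial Reader and to Readers supplied on reuse
    return new ZwnjToSpaceCharFilter(reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_31, reader);
    return new TokenStreamComponents(source);
  }
}
{code}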



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929936#action_12929936 ] 

Steven Rowe commented on LUCENE-2747:
-------------------------------------

Robert, your patch looks good - I have a couple of questions:

* You removed {{TestHindiFilters.testTokenizer()}}, {{TestIndicTokenizer.testBasics()}} and {{TestIndicTokenizer.testFormat()}}, but these would be useful in {{TestStandardAnalyzer}} and {{TestUAX29Tokenizer}}, wouldn't they?
* You did not remove {{ArabicLetterTokenizer}} and {{IndicTokenizer}}, presumably so that they can be used with Lucene 4.0+ when the supplied {{Version}} is less than 3.1 -- good catch, I had forgotten this requirement -- but when can we actually get rid of these?  Since they will be staying, shouldn't their tests remain too, but using {{Version.LUCENE_30}} instead of {{TEST_VERSION_CURRENT}}?
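
(For anyone following along, the kind of version check being discussed would look roughly like this inside an analyzer's createComponents -- a sketch only, not the committed patch, and the deprecated tokenizer's constructor form is assumed.)

{code}
// Sketch of the backwards-compatibility pattern under discussion: pick the tokenizer
// based on the matchVersion the user supplied, so pre-3.1 indexes keep tokenizing
// the same way.
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  final Tokenizer source;
  if (matchVersion.onOrAfter(Version.LUCENE_31)) {
    source = new StandardTokenizer(matchVersion, reader);
  } else {
    source = new ArabicLetterTokenizer(matchVersion, reader); // ctor form assumed
  }
  return new TokenStreamComponents(source);
}
{code}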



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930601#action_12930601 ] 

Robert Muir commented on LUCENE-2747:
-------------------------------------

bq. Robert, Let me ask another way. How about implementing StandardTokenizer using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?

I think, as I already commented, that it would be best if StandardTokenizer just implemented the UAX#29 algorithm.
It's a great default, and I don't think we should try to alter it.




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929934#action_12929934 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too.

Then I'd be happy.



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930690#action_12930690 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

Robert, I think
* "core" is a bad name that needs to be changed. It is misleading.
* Documentation should be improved along the lines you suggest.

You mention a few broken Analyzers (and by implication related tokenizers and filters). I've a question about LowerCaseFilter: Isn't it bad as well?




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929945#action_12929945 ] 

Steven Rowe commented on LUCENE-2747:
-------------------------------------

bq. I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too.

{{UAX29Tokenizer}} could be combined with {{LowerCaseFilter}} to provide that, no?

Robert is arguing in the reopened LUCENE-2167 for {{StandardTokenizer}} to be stripped down so that it only implements the UAX#29 rules (i.e., dropping URL+email recognition).  If that comes to pass, {{StandardAnalyzer}} would just be UAX#29 + lowercasing + stopword removal (with English stopwords by default, but those can be overridden in the ctor) -- would that make you happy?
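
(To make the suggestion concrete, a minimal sketch of that combination -- the {{UAX29Tokenizer}} constructor is assumed here and may differ from what the branch actually exposes.)

{code}
// Minimal sketch of "UAX#29 word breaking + lowercasing", i.e. a SimpleAnalyzer-like
// analyzer without the letters-only restriction.  UAX29Tokenizer's ctor is assumed.
public final class UAX29LowercaseAnalyzer extends ReusableAnalyzerBase {
  private final Version matchVersion;

  public UAX29LowercaseAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new UAX29Tokenizer(reader); // ctor assumed
    return new TokenStreamComponents(source, new LowerCaseFilter(matchVersion, source));
  }
}
{code}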



[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "DM Smith (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930585#action_12930585 ] 

DM Smith commented on LUCENE-2747:
----------------------------------

Robert, Let me ask another way. How about implementing StandardTokenizer using jflex to be UAX29Tokenizer minus NUM and ALPHANUM?



[jira] Updated: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2747:
--------------------------------

    Attachment: LUCENE-2747.patch

Here's a quick stab at a patch.

I had to add at least minimal support for CharFilters to ReusableAnalyzerBase,
since it doesn't have any today.

Maybe there is a better way to do it, though.




[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930072#action_12930072 ] 

Simon Willnauer commented on LUCENE-2747:
-----------------------------------------

I looked at the patch briefly, and the charStream(Reader) extension looks good to me, though I would make it protected and have it throw an IOException. Since this API is public and folks will use it in the wild, we need to make sure we don't have to add the exception later, and that people creating Readers don't have to play tricks just because the interface declares no IOException. About making it protected: do we need to call it in a non-protected context? Maybe I'm missing something...
{code}
public Reader charStream(Reader reader) {
  return reader;
}

// should be

protected Reader charStream(Reader reader) throws IOException {
  return reader;
}
{code}

