Posted to dev@lucene.apache.org by "javi (JIRA)" <ji...@apache.org> on 2010/04/21 16:31:49 UTC

[jira] Created: (LUCENE-2407) make CharTokenizer.MAX_WORD_LEN parametrizable

make CharTokenizer.MAX_WORD_LEN parametrizable
----------------------------------------------

                 Key: LUCENE-2407
                 URL: https://issues.apache.org/jira/browse/LUCENE-2407
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 3.0.1
            Reporter: javi
            Priority: Minor
             Fix For: 3.1


As discussed here http://n3.nabble.com/are-long-words-split-into-up-to-256-long-tokens-tp739914p739914.html, it would be nice to be able to parametrize that value.
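
To make the request concrete, here is a small sketch of the current behavior, assuming the Lucene 3.0.x API (WhitespaceTokenizer stands in for any CharTokenizer subclass): a single 300-character word comes out as two tokens, because CharTokenizer's hard-coded MAX_WORD_LEN of 255 flushes the term buffer mid-word.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class LongTokenDemo {
  public static void main(String[] args) throws Exception {
    // Build a single "word" of 300 characters, longer than CharTokenizer's
    // hard-coded 255-char term buffer.
    StringBuilder word = new StringBuilder();
    for (int i = 0; i < 300; i++) {
      word.append('a');
    }

    WhitespaceTokenizer tok = new WhitespaceTokenizer(new StringReader(word.toString()));
    TermAttribute term = tok.addAttribute(TermAttribute.class);
    while (tok.incrementToken()) {
      // Prints 255 and then 45: the word is silently split in two.
      System.out.println(term.termLength());
    }
    tok.end();
    tok.close();
  }
}
{code}

If the limit were a constructor argument (or at least a protected field), callers could raise it instead of getting the silent split.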

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2407) make CharTokenizer.MAX_WORD_LEN parametrizable

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859373#action_12859373 ] 

Uwe Schindler commented on LUCENE-2407:
---------------------------------------

This is also a problem for some Asian languages. If ThaiAnalyzer used CharTokenizer, very long passages could get lost, as ThaiWordFilter would not receive the complete string (Thai is not tokenized by the tokenizer, but later in the filter).

This also applies to StandardTokenizer; maybe we should set a good default when analyzing Thai text (ThaiAnalyzer should initialize StandardTokenizer with a large/infinite value).
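
For reference, a minimal sketch of that suggestion, assuming the Lucene 3.0.x API: StandardTokenizer already exposes setMaxTokenLength(), and ThaiWordFilter(TokenStream) is taken to be the 3.0.x contrib-analyzers constructor.

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.th.ThaiWordFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ThaiMaxTokenLengthSketch {
  public static void main(String[] args) throws Exception {
    String text = "..."; // a long, unsegmented Thai passage

    StandardTokenizer tokenizer =
        new StandardTokenizer(Version.LUCENE_30, new StringReader(text));
    // Raise the per-token limit: with the default of 255 chars the tokenizer
    // skips longer runs entirely, so ThaiWordFilter never sees them.
    tokenizer.setMaxTokenLength(Integer.MAX_VALUE);

    // ThaiWordFilter then segments the long run into individual Thai words.
    TokenStream stream = new ThaiWordFilter(tokenizer);
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
      System.out.println(term.term());
    }
    stream.end();
    stream.close();
  }
}
{code}

Baking a call like this into ThaiAnalyzer's tokenStream() would give the large/infinite default out of the box.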

> make CharTokenizer.MAX_WORD_LEN parametrizable
> ----------------------------------------------
>
>                 Key: LUCENE-2407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2407
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.0.1
>            Reporter: javi
>            Priority: Minor
>             Fix For: 3.1
>
>
> As discussed here http://n3.nabble.com/are-long-words-split-into-up-to-256-long-tokens-tp739914p739914.html, it would be nice to be able to parametrize that value.
