Posted to dev@lucene.apache.org by "javi (JIRA)" <ji...@apache.org> on 2010/04/21 16:31:49 UTC
[jira] Created: (LUCENE-2407) make CharTokenizer.MAX_WORD_LEN parametrizable
make CharTokenizer.MAX_WORD_LEN parametrizable
----------------------------------------------
Key: LUCENE-2407
URL: https://issues.apache.org/jira/browse/LUCENE-2407
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 3.0.1
Reporter: javi
Priority: Minor
Fix For: 3.1
as discussed here http://n3.nabble.com/are-long-words-split-into-up-to-256-long-tokens-tp739914p739914.html it would be nice to be able to parametrize that value.
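To make the request concrete, below is a small self-contained sketch (plain Java, not Lucene's actual CharTokenizer source) of the behavior described in the linked thread: with a hard-coded limit a long run of letters is cut into multiple tokens, while a constructor-style parameter would let callers raise the limit. Class, method, and field names here are illustrative only.

    // Illustrative sketch; not Lucene code. Shows how a fixed word-length
    // limit splits a long word and how a configurable limit would avoid it.
    import java.util.ArrayList;
    import java.util.List;

    public class MaxWordLenDemo {

        // Splits a run of letters into tokens of at most maxWordLen characters,
        // mimicking the buffer-full behavior described in the thread.
        static List<String> tokenize(String word, int maxWordLen) {
            List<String> tokens = new ArrayList<String>();
            for (int start = 0; start < word.length(); start += maxWordLen) {
                int end = Math.min(start + maxWordLen, word.length());
                tokens.add(word.substring(start, end));
            }
            return tokens;
        }

        public static void main(String[] args) {
            // A 600-character "word" becomes three tokens with a fixed
            // 255-character limit, but stays whole with a larger limit.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 600; i++) sb.append('a');
            String longWord = sb.toString();

            System.out.println("limit 255  -> " + tokenize(longWord, 255).size() + " tokens");
            System.out.println("limit 1024 -> " + tokenize(longWord, 1024).size() + " tokens");
        }
    }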
[jira] Commented: (LUCENE-2407) make CharTokenizer.MAX_WORD_LEN parametrizable
Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859373#action_12859373 ]
Uwe Schindler commented on LUCENE-2407:
---------------------------------------
This is also a problem for some Asian languages. If ThaiAnalyzer used CharTokenizer, very long passages could get lost, as ThaiWordFilter would not receive the complete string (Thai is not segmented by the tokenizer, but later in the filter).
This also applies to StandardTokenizer; maybe we should set a good default when analyzing Thai text (ThaiAnalyzer should init StandardTokenizer with a large/infinite value).
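A rough sketch of the kind of setup described above, assuming StandardTokenizer's setMaxTokenLength and the contrib ThaiWordFilter; constructor signatures differ between Lucene versions, so treat this as illustrative rather than the shipped ThaiAnalyzer code.

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.th.ThaiWordFilter;
    import org.apache.lucene.util.Version;

    // Sketch: keep long Thai passages intact for ThaiWordFilter by raising
    // StandardTokenizer's max token length. Not the actual ThaiAnalyzer.
    public class LongTokenThaiAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_30, reader);
            // Very large limit so an unsegmented Thai passage is not dropped
            // at the default maximum before ThaiWordFilter can segment it.
            tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
            return new ThaiWordFilter(tokenizer);
        }
    }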
> make CharTokenizer.MAX_WORD_LEN parametrizable
> ----------------------------------------------
>
> Key: LUCENE-2407
> URL: https://issues.apache.org/jira/browse/LUCENE-2407
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.0.1
> Reporter: javi
> Priority: Minor
> Fix For: 3.1
>
>
> as discussed here http://n3.nabble.com/are-long-words-split-into-up-to-256-long-tokens-tp739914p739914.html it would be nice to be able to parametrize that value.