Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2017/03/06 18:18:33 UTC

[jira] [Commented] (LUCENE-7705) Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length

    [ https://issues.apache.org/jira/browse/LUCENE-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897762#comment-15897762 ] 

Erick Erickson commented on LUCENE-7705:
----------------------------------------

Patch looks good. I'm going to hold off on committing this until we figure out SOLR-10229 (controlling schema proliferation). The additional schema you added here is currently about the only way to test Solr schemas, so that's perfectly appropriate. I'd just like to use this issue as a test case for what it would take to construct schemas inside the tests themselves, rather than have each new case like this require yet another schema that we then have to maintain.

But if SOLR-10229 takes very long, I'll just commit this one and we can work out the rest later.
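
For what it's worth, a rough sketch of what "constructing schemas inside the tests" could look like if it went through the SolrJ Schema API against a mutable (managed) schema. This is only an illustration under those assumptions (the field type name is made up), not a statement of how SOLR-10229 will actually do it:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.schema.AnalyzerDefinition;
    import org.apache.solr.client.solrj.request.schema.FieldTypeDefinition;
    import org.apache.solr.client.solrj.request.schema.SchemaRequest;

    public class SchemaInTestSketch {
      // Adds a whitespace-tokenized text field type from test code instead of
      // shipping yet another schema.xml variant with the test resources.
      static void addWhitespaceFieldType(SolrClient client) throws Exception {
        Map<String, Object> tokenizer = new HashMap<>();
        tokenizer.put("class", "solr.WhitespaceTokenizerFactory");

        AnalyzerDefinition analyzer = new AnalyzerDefinition();
        analyzer.setTokenizer(tokenizer);

        Map<String, Object> attrs = new HashMap<>();
        attrs.put("name", "text_ws_test");   // made-up field type name
        attrs.put("class", "solr.TextField");

        FieldTypeDefinition fieldType = new FieldTypeDefinition();
        fieldType.setAttributes(attrs);
        fieldType.setAnalyzer(analyzer);

        new SchemaRequest.AddFieldType(fieldType).process(client);
      }
    }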

> Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7705
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Amrit Sarkar
>            Assignee: Erick Erickson
>            Priority: Minor
>         Attachments: LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch, LUCENE-7705.patch
>
>
> SOLR-10186
> [~erickerickson]: Is there a good reason that we hard-code a 256-character limit for CharTokenizer? Changing this limit currently requires copying/pasting incrementToken into a new class, since incrementToken is final.
> KeywordTokenizer's default (also 256) can easily be changed, but doing so requires code rather than a schema setting.
> For KeywordTokenizer this is Solr-only. For the CharTokenizer-derived classes (WhitespaceTokenizer, UnicodeWhitespaceTokenizer, and LetterTokenizer) and their factories, it would take adding a constructor to the base class in Lucene and using it in the factories (see the sketch after this quoted description).
> Any objections?
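
To make the above concrete, here is a small sketch of the current state and the requested change. The Java calls below exist today; the maxTokenLen schema attribute is only a guess at what a configurable limit might look like, and the attached patch may spell it differently:

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class MaxTokenLengthSketch {
      // KeywordTokenizer's limit can already be raised, but only from Java code:
      Tokenizer keyword = new KeywordTokenizer(1024); // default is KeywordTokenizer.DEFAULT_BUFFER_SIZE (256)

      // CharTokenizer subclasses offer no such knob; the hard-coded 256-character
      // limit is enforced inside the final incrementToken(), so raising it means
      // copying that method into a new class.
      Tokenizer whitespace = new WhitespaceTokenizer();

      // What the issue asks for, schema-side (attribute name is hypothetical):
      //   <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLen="512"/>
      //   <tokenizer class="solr.KeywordTokenizerFactory"    maxTokenLen="512"/>
    }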



