
Posted to solr-dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2007/04/24 02:34:15 UTC

[jira] Commented: (SOLR-211) regex split() Tokenizer

    [ https://issues.apache.org/jira/browse/SOLR-211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491106 ] 

Hoss Man commented on SOLR-211:
-------------------------------

some quick comments based on a cursory reading of the patch...

1) RegexSplitTokenizerFactory.init should probably compile the regex into a Pattern that can be reused more than once ... i think String.split recompiles the regex on every call.
2) i don't think the offset stuff will work properly ... the length of the regex string is not the same as the length of the text it matches when splitting (ie: \p{javaWhitespace} is 18 characters long but matches a single character) ... we would probably need to use the Matcher API and iterate over the individual matches.
3) in the vein of like things having like names, we may want to call this the PatternSplitTokenizer and name its init param "pattern" (to match PatternReplaceFilter)
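
to make points 1 and 2 concrete, here is a rough sketch of the Matcher-based approach. It is not the code from the patch and does not use any Solr or Lucene classes: the class name PatternSplitSketch, the Token holder and the split() helper are all made up for illustration. The Pattern is compiled once by the caller, and token offsets are taken from Matcher.start()/Matcher.end() rather than from the length of the regex string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; names here are assumptions, not part of SOLR-211.
final class PatternSplitSketch {

    /** A token's text plus its start (inclusive) and end (exclusive) offsets in the input. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits input on each match of a precompiled delimiter pattern, keeping real offsets. */
    static List<Token> split(Pattern delimiter, String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int tokenStart = 0;
        while (m.find()) {
            // The delimiter occupies [m.start(), m.end()) in the input,
            // regardless of how long the regex source string is.
            if (m.start() > tokenStart) {
                tokens.add(new Token(input.substring(tokenStart, m.start()), tokenStart, m.start()));
            }
            tokenStart = m.end();
        }
        // Trailing text after the last delimiter, if any.
        if (tokenStart < input.length()) {
            tokens.add(new Token(input.substring(tokenStart), tokenStart, input.length()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Compiled once and reusable across many inputs (point 1).
        Pattern whitespace = Pattern.compile("\\p{javaWhitespace}+");
        for (Token t : split(whitespace, "foo  bar baz")) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

note that this sketch drops empty tokens, which is a choice made here and not necessarily what String.split or the patch does. For the input "foo  bar baz" it prints foo [0,3), bar [5,8) and baz [9,12).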

> regex split() Tokenizer
> -----------------------
>
>                 Key: SOLR-211
>                 URL: https://issues.apache.org/jira/browse/SOLR-211
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Ryan McKinley
>         Attachments: SOLR-211-RegexSplitTokenizer.patch
>
>
> A TokenizerFactory that makes tokens from:
>   string.split( regex );
