Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2016/10/03 21:12:20 UTC

[jira] [Commented] (LUCENE-7465) Add a PatternTokenizer that uses Lucene's RegExp implementation

    [ https://issues.apache.org/jira/browse/LUCENE-7465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543441#comment-15543441 ] 

Adrien Grand commented on LUCENE-7465:
--------------------------------------

I like the separate factory idea better: it makes it easier to evolve the two implementations separately, e.g. if we decide to deprecate PatternTokenizer or move it to sandbox.

> Add a PatternTokenizer that uses Lucene's RegExp implementation
> ---------------------------------------------------------------
>
>                 Key: LUCENE-7465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7465
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: master (7.0), 6.3
>
>         Attachments: LUCENE-7465.patch, LUCENE-7465.patch
>
>
> I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's RegExp impl instead of the JDK's:
>   * Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted the user discovers it up front instead of later on when a "lucky" document arrives
>   * It processes the incoming characters as a stream, only pulling 128 characters at a time, vs the existing {{PatternTokenizer}} which currently reads the entire string up front (this has caused heap problems in the past)
>   * It should be fast.
> I named it {{SimplePatternTokenizer}}, and it still needs a factory and improved tests, but I think it's otherwise close.
> It currently does not take a {{group}} parameter because Lucene's RegExps don't yet implement sub group capture.  I think we could add that at some point, but it's a bit tricky.
> This doesn't even have group=-1 support (like String.split) ... I think if we did that we should maybe name it differently ({{SimplePatternSplitTokenizer}}?).
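
To make the streaming point in the quoted description concrete, here is a minimal, JDK-only sketch of a tokenizer that pulls 128 characters at a time from a Reader and walks a DFA over them. It is not the code from the attached patch and does not use Lucene's RegExp classes: the class name StreamingDfaTokenizerSketch, the hand-built DFA, and the pattern [a-z0-9]+ are all assumptions made for illustration. A real implementation would compile the DFA from the user's pattern up front and would need longest-match bookkeeping that this trivial pattern does not require.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the SimplePatternTokenizer from the patch.
final class StreamingDfaTokenizerSketch {

    // Chunk size taken from the issue description ("only pulling 128 characters at a time").
    private static final int CHUNK_SIZE = 128;

    private static final int START = 0;     // not accepting
    private static final int IN_TOKEN = 1;  // accepting
    private static final int DEAD = -1;     // no transition

    // Hand-built DFA for the assumed pattern [a-z0-9]+.
    // For this simple pattern the transition is the same from START and IN_TOKEN.
    private static int step(int state, char c) {
        boolean tokenChar = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
        return tokenChar ? IN_TOKEN : DEAD;
    }

    static List<String> tokenize(Reader input) throws IOException {
        List<String> tokens = new ArrayList<>();
        char[] buffer = new char[CHUNK_SIZE];
        StringBuilder current = new StringBuilder();
        int state = START;
        int read;
        // Only one chunk of input is buffered at any time; the partially built
        // token and the DFA state carry over across chunk boundaries.
        while ((read = input.read(buffer, 0, CHUNK_SIZE)) != -1) {
            for (int i = 0; i < read; i++) {
                int next = step(state, buffer[i]);
                if (next == DEAD) {
                    if (state == IN_TOKEN) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    state = START;
                } else {
                    current.append(buffer[i]);
                    state = next;
                }
            }
        }
        // Emit a token that runs up to the end of the input.
        if (state == IN_TOKEN) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // prints [foo, bar42, baz]
        System.out.println(tokenize(new StringReader("foo bar42, baz")));
    }
}
```

Because the DFA state and the pending token survive from one chunk to the next, a token longer than 128 characters is still emitted whole, while the input buffer stays at a fixed size regardless of document length.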



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
