Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/07/19 07:22:51 UTC
[jira] Issue Comment Edited: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889760#action_12889760 ] 

Steven Rowe edited comment on LUCENE-2167 at 7/19/10 1:20 AM:
--------------------------------------------------------------

{quote}
bq. I think any perf issues are resolved. I also think the DFA size is more manageable after our previous changes, and arguably OK now (I'll defer to your judgement on whether we need to attack this more, though).

To address the DFA size I want to try your previous suggestion of a simpler IPv6 regex in the JFlex grammar, then full validation in the action via a java.util.regex NFA. You've previously said that you thought returning a new type like INVALID_URL would be fine, but I'd prefer not to do that. Instead, if the action-based validation fails, I want to try backing out and taking an alternate path.
{quote}

The attached {{StandardTokenizerImpl.jflex}} is the result of my attempt, which appears to be successful - tests all pass.

However, the resultant .class file size is even larger than before: 67,947 bytes.

I give up: I think we should go with the full-blown IPv6 regex as part of the DFA.
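
For readers without the attachment, the two-stage idea discussed above (a loose IPv6 pattern in the grammar, then strict validation in the action via java.util.regex) can be sketched roughly as follows. This is not the attached grammar: the class name, the method names and both patterns are hypothetical, the loose pattern only stands in for the simplified JFlex rule, and the strict pattern is a simplified assumption that omits IPv4-embedded forms.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of "loose match first, strict validation second".
// Not the code from StandardTokenizerImpl.jflex.
final class Ipv6TwoStageSketch {

    // Stage 1: stand-in for a simplified grammar rule.
    // Accepts any run of hex digits and colons with at least two colons.
    static final Pattern LOOSE =
        Pattern.compile("[0-9A-Fa-f]*(?::[0-9A-Fa-f]*){2,}");

    // One 16-bit group: one to four hex digits.
    private static final String H = "[0-9A-Fa-f]{1,4}";

    // Stage 2: strict check, as would run inside the rule's action.
    // Covers the full eight-group form and every placement of a single "::".
    // Assumption: IPv4-embedded forms are deliberately left out.
    static final Pattern STRICT = Pattern.compile(
          "(?:" + H + ":){7}" + H
        + "|(?:" + H + ":){1,7}:"
        + "|(?:" + H + ":){1,6}:" + H
        + "|(?:" + H + ":){1,5}(?::" + H + "){1,2}"
        + "|(?:" + H + ":){1,4}(?::" + H + "){1,3}"
        + "|(?:" + H + ":){1,3}(?::" + H + "){1,4}"
        + "|(?:" + H + ":){1,2}(?::" + H + "){1,5}"
        + "|" + H + ":(?::" + H + "){1,6}"
        + "|:(?:(?::" + H + "){1,7}|:)");

    // What the simplified grammar rule would accept.
    static boolean looksLikeIpv6(String s) {
        return LOOSE.matcher(s).matches();
    }

    // What survives the action-based validation.
    static boolean isValidIpv6(String s) {
        return looksLikeIpv6(s) && STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "2001:db8::1", "::1", "1:2:3:4:5:6:7:8:9", "2001:db8:::1" };
        for (String s : samples) {
            System.out.println(s + " loose=" + looksLikeIpv6(s) + " strict=" + isValidIpv6(s));
        }
    }
}
```

In a scanner, a candidate that passes the loose rule but fails the strict check would be the case where the action has to back out and try another rule, which is the part that proved awkward here.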

> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Robert Muir
>            Priority: Minor
>         Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric handling, such as the acronym/company/apostrophe rules, can stay with that EuropeanTokenizer, and it could be used by the European analyzers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

