Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/08/04 22:04:27 UTC

[jira] [Created] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

port url+email tokenizer to standardtokenizerinterface (or similar)
-------------------------------------------------------------------

                 Key: LUCENE-3361
                 URL: https://issues.apache.org/jira/browse/LUCENE-3361
             Project: Lucene - Java
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 3.3
            Reporter: Robert Muir


We should do this so that we can fix the LUCENE-3358 bug there and preserve backwards compatibility.
We also want this mechanism anyway, for upgrading to new Unicode versions in the future.

We can regenerate the new TLD list for 3.4, but we should ensure the existing one is used for the urlemail33 or whatever,
so that it's exactly the same.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080354#comment-13080354 ] 

Steven Rowe commented on LUCENE-3361:
-------------------------------------

The {{jflex}} target depends on the {{clean-jflex}} target, which deletes all {{src/.../standard/*.java}} files whose contents match regex {{/generated.*by.*JFlex/}}.  Your patch leaves intact the first line of {{UAX29URLEmailTokenizer.java}}, which matches the regex in a comment.  As a result, running {{ant jflex}} deletes {{UAX29URLEmailTokenizer.java}}, and since it's no longer generated by JFlex, compilation fails.

When I remove this JFlex comment line from {{UAX29URLEmailTokenizer.java}}, {{ant jflex}} works, everything compiles, and all tests succeed.  +1 to commit after removing this line.
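
To illustrate the failure mode, here is a minimal Java sketch of a content check that uses the regex quoted above. It is not the Ant target itself: the class name JFlexMarkerCheck, the helper looksGenerated, and the header wording used in main are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the Lucene build: it mimics a content
// check like the one clean-jflex applies when choosing files to delete.
class JFlexMarkerCheck {
    // Same regex as quoted in the comment above.
    static final Pattern GENERATED = Pattern.compile("generated.*by.*JFlex");

    // Returns true if any single line of the source text contains the marker.
    static boolean looksGenerated(String source) {
        for (String line : source.split("\n")) {
            if (GENERATED.matcher(line).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed header wording; the real first line of the file may differ.
        String leftover = "/* The following code was generated by JFlex */\n"
            + "public final class UAX29URLEmailTokenizer {}";
        String cleaned = "public final class UAX29URLEmailTokenizer {}";

        System.out.println(looksGenerated(leftover)); // hand-written file still flagged
        System.out.println(looksGenerated(cleaned));  // no marker, so not flagged
    }
}
```

A hand-written file that keeps the old header comment is indistinguishable from a generated one under this kind of check, which is why removing the comment line is enough.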


[jira] [Commented] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080356#comment-13080356 ] 

Steven Rowe commented on LUCENE-3361:
-------------------------------------

One other minor issue: {{ant clean-jflex}} doesn't remove the JFlex-generated {{*.java}} files under the new directory {{src/.../standard/std31/}}.  

To include them, on line #92 in {{modules/analysis/common/build.xml}}, change {{includes="\*.java"}} to {{includes="\*\*/\*.java"}}.
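
The difference between the two include patterns can be sketched in Java as follows. This is a hand-written emulation of how the two patterns behave relative to a base directory, not Ant's own matcher, and the class, method, and file names are hypothetical.

```java
// Hypothetical emulation of two Ant include patterns; not Ant's own matcher.
class IncludePatternSketch {
    // Emulates "*.java": only files directly in the base directory.
    static boolean matchesTopLevelOnly(String relativePath) {
        return relativePath.endsWith(".java") && !relativePath.contains("/");
    }

    // Emulates "**/*.java": files at any depth, including the base directory.
    static boolean matchesAnyDepth(String relativePath) {
        return relativePath.endsWith(".java");
    }

    public static void main(String[] args) {
        // Illustrative paths relative to the standard/ directory.
        String[] paths = {
            "GeneratedScanner.java",
            "std31/GeneratedScanner31.java"
        };
        for (String p : paths) {
            System.out.println(p
                + " top-level-only=" + matchesTopLevelOnly(p)
                + " any-depth=" + matchesAnyDepth(p));
        }
    }
}
```

Only the any-depth variant picks up files under a subdirectory such as std31/, which is the behavior the suggested change is after.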


[jira] [Resolved] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3361.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.4
         Assignee: Robert Muir


[jira] [Commented] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079685#comment-13079685 ] 

Robert Muir commented on LUCENE-3361:
-------------------------------------

By the way, the patch is for trunk but has all the deprecations, including API ones. These can be removed in trunk immediately after porting back,
but I would prefer to do this as a separate step, just so I don't forget anything.


[jira] [Commented] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080672#comment-13080672 ] 

Robert Muir commented on LUCENE-3361:
-------------------------------------

Good catch, thanks for reviewing and finding these issues!


[jira] [Updated] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-3361:
--------------------------------

    Comment: was deleted

(was: The {{jflex}} target depends on the {{clean-jflex}} target, which deletes all {{src/.../standard/*.java}} files whose contents match regex {{/generated.*by.*JFlex/}}.  Your patch leaves intact the first line of {{UAX29URLEmailTokenizer.java}}, which matches the regex in a comment.  As a result, running {{ant jflex}} deletes {{UAX29URLEmailTokenizer.java}}, and since it's no longer generated by JFlex, compilation fails.

When I remove this JFlex comment line from {{UAX29URLEmailTokenizer.java}}, {{ant jflex}} works, everything compiles, and all tests succeed.  +1 to commit after removing this line.)


[jira] [Updated] (LUCENE-3361) port url+email tokenizer to standardtokenizerinterface (or similar)

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3361:
--------------------------------

    Attachment: LUCENE-3361.patch

Attached is a patch. Before applying it, you must move UAX29URLEmailTokenizer.jflex to UAX29URLEmailTokenizerImpl.jflex.

* Ports this tokenizer over to StandardTokenizerInterface
* Fixes the LUCENE-3358 bug
* Regenerates TLDs for trunk only
* Adds a backwards-compatible 3.1 version with the bug and old TLDs, plus some basic tests
* Adds new ctors that require a version and deprecates the version-less ones
* Deprecates the InputStream ctor that uses the default charset
* Reorganizes constants like StandardTokenizer and deprecates the old ones
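
As a rough illustration of the version-based selection the list above describes, here is a self-contained Java sketch. Every name in it (VersionedScannerSketch, MatchVersion, Scanner, CurrentScanner, Scanner31) is a hypothetical stand-in rather than Lucene's actual API, and the choice to have the deprecated version-less constructor fall back to the oldest behavior is an assumption, not something taken from the patch.

```java
// All names below are hypothetical stand-ins, not Lucene's actual classes.
class VersionedScannerSketch {
    // Declaration order doubles as release order.
    enum MatchVersion {
        V31, V34;

        boolean onOrAfter(MatchVersion other) {
            return compareTo(other) >= 0;
        }
    }

    // Common interface so the tokenizer can swap scanner implementations.
    interface Scanner {
        String grammar();
    }

    // Stand-in for the regenerated scanner with the fix and new TLD list.
    static final class CurrentScanner implements Scanner {
        public String grammar() { return "current"; }
    }

    // Stand-in for the frozen 3.1 scanner kept for backwards compatibility.
    static final class Scanner31 implements Scanner {
        public String grammar() { return "std31"; }
    }

    private final Scanner scanner;

    // New-style constructor: the caller must state which behavior it wants.
    VersionedScannerSketch(MatchVersion matchVersion) {
        this.scanner = matchVersion.onOrAfter(MatchVersion.V34)
            ? new CurrentScanner()
            : new Scanner31();
    }

    // Version-less constructor kept only for compatibility.
    // Assumption: it falls back to the oldest behavior.
    @Deprecated
    VersionedScannerSketch() {
        this(MatchVersion.V31);
    }

    String grammar() {
        return scanner.grammar();
    }

    public static void main(String[] args) {
        System.out.println(new VersionedScannerSketch(MatchVersion.V34).grammar());
        System.out.println(new VersionedScannerSketch(MatchVersion.V31).grammar());
    }
}
```

Keeping the old scanner behind a shared interface means existing indexes can keep tokenizing exactly as before, while new callers opt in to the fixed grammar by passing a newer version.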

