You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/11/15 19:26:13 UTC

[jira] Created: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
---------------------------------------------------------------

                 Key: LUCENE-2763
                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
             Fix For: 3.1, 4.0


Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).

UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).

For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2763:
--------------------------------

    Attachment: LUCENE-2763.patch

Updated patch to fix {{solr/CHANGES.txt}}, {{lucene/CHANGES.txt}}, and {{analysis/standard/READ_BEFORE_REGENERATING.txt}}.

I will commit later today if there are no objections.

> Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2763.patch, LUCENE-2763.patch
>
>
> Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).
> UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).
> For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2763:
--------------------------------

    Attachment: LUCENE-2763.patch

Patch to perform the switch on trunk.

> Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2763.patch
>
>
> Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).
> UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).
> For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved LUCENE-2763.
---------------------------------

       Resolution: Fixed
    Lucene Fields:   (was: [New])

Committed to trunk revision 1043071, branch_3x revision 1043180

> Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2763.patch, LUCENE-2763.patch, LUCENE-2763.patch
>
>
> Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).
> UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).
> For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe reassigned LUCENE-2763:
-----------------------------------

    Assignee: Steven Rowe

> Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>
> Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).
> UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).
> For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966943#action_12966943 ] 

Robert Muir commented on LUCENE-2763:
-------------------------------------

+1, looks good to me.


> Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2763.patch
>
>
> Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).
> UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).
> For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2763:
--------------------------------

    Attachment: LUCENE-2763.patch

Final patch, with URL and E-mail tokenization tests added to Solr's {{TestUAX29URLEmailTokenizerFactory}}.

I will commit shortly.

> Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2763
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2763
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2763.patch, LUCENE-2763.patch, LUCENE-2763.patch
>
>
> Currently, in addition to implementing the UAX#29 word boundary rules, StandardTokenizer recognizes email adresses and URLs, but doesn't provide a way to turn this behavior off and/or provide overlapping tokens with the components (username from email address, hostname from URL, etc.).
> UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer should be renamed to something like UAX29TokenizerPlusPlus (or something like that).
> For rationale, see [the discussion at the reopened LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org