Posted to dev@lucene.apache.org by "Tom Burton-West (JIRA)" <ji...@apache.org> on 2010/11/01 18:32:24 UTC

[jira] Created: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Create Solr FilterFactory for Lucene StandardTokenizer with  UAX#29 support
---------------------------------------------------------------------------

                 Key: SOLR-2211
                 URL: https://issues.apache.org/jira/browse/SOLR-2211
             Project: Solr
          Issue Type: New Feature
    Affects Versions: 3.1
            Reporter: Tom Burton-West
            Priority: Minor


The Lucene 3.x StandardTokenizer with UAX#29 support improves tokenization of non-English text. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved Unicode processing without necessarily including the IP address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful.
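
To make the idea of word-boundary tokenization concrete, here is a minimal sketch. It does not use Lucene at all: it assumes the JDK's java.text.BreakIterator as a stand-in, whose word rules are similar in spirit to UAX#29 but are not guaranteed to match Lucene's tokenizer. The class name WordBreakSketch and the "keep segments that contain a letter or digit" filter are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: the JDK BreakIterator stands in for Lucene's UAX#29 tokenizer.
final class WordBreakSketch {

    /** Splits text at word boundaries and keeps segments that contain a letter or digit. */
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Punctuation and whitespace segments carry no letters or digits, so they are dropped.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, w\u00f6rld 123"));
    }
}
```

Nothing in this sketch special-cases IP addresses or email addresses; segmentation is driven purely by the boundary rules, which is the behaviour the request above is after.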



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Assigned: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned SOLR-2211:
---------------------------------

    Assignee: Robert Muir




[jira] Resolved: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-2211.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.1

Committed revisions 1032776 and 1032779 (3x).

Thanks, Tom!




[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927069#action_12927069 ] 

Robert Muir commented on SOLR-2211:
-----------------------------------

Sounds great. This one has no external dependencies, so it can just live with the rest of the factories.

I'll start on the Ant/build-system work for SOLR-2210.





[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929775#action_12929775 ] 

Robert Muir commented on SOLR-2211:
-----------------------------------

Thanks Tom, looks great. I'll commit soon.




[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929849#action_12929849 ] 

Robert Muir commented on SOLR-2211:
-----------------------------------

Great, I look forward to the results.

By the way, on SOLR-2210 I also added the ICU filters. You could consider replacing LowerCaseFilterFactory with ICUNormalizer2FilterFactory (just use the defaults).
In addition to better lowercasing (e.g. ß -> ss), this would also bring the normalization advantages described in UAX #15 (http://unicode.org/reports/tr15/).

Alternatively, if you are already using both LowerCaseFilterFactory and ASCIIFoldingFilterFactory, you can replace both with ICUFoldingFilterFactory,
which goes further and also incorporates the character foldings of UTR #30 (http://www.unicode.org/reports/tr30/tr30-4.html).
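
As a rough illustration of what these two replacements buy you, the sketch below approximates them with JDK-only calls. It is not the ICU implementation: it assumes that the normalizer defaults amount to compatibility normalization plus case folding, and it imitates case folding with an upper-then-lower round trip, which differs from true Unicode case folding for some characters (dotless i, for example). The class and method names are made up for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

// Illustration only: JDK approximations, not the ICU filters discussed above.
final class FoldingSketch {

    /** Approximates NFKC normalization followed by case folding. */
    static String normalizeAndFoldCase(String s) {
        String nfkc = Normalizer.normalize(s, Normalizer.Form.NFKC);
        // Upper-then-lower expands U+00DF (sharp s) to "ss"; this is only an approximation of case folding.
        String folded = nfkc.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        // Normalize again because case mapping can produce sequences that are not in NFKC.
        return Normalizer.normalize(folded, Normalizer.Form.NFKC);
    }

    /** Approximates accent folding: decompose, then drop combining marks. */
    static String stripAccents(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(normalizeAndFoldCase("Stra\u00dfe"));
        System.out.println(normalizeAndFoldCase("\uFB01sh"));
        System.out.println(stripAccents("caf\u00e9"));
    }
}
```

Plain lowercasing leaves both the sharp s and the U+FB01 ligature untouched, so terms such as "Straße" and "strasse" would index differently; the normalizing variant maps them to the same term.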





[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Tom Burton-West (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929823#action_12929823 ] 

Tom Burton-West commented on SOLR-2211:
---------------------------------------

Thanks for all your help, Robert. We will be testing this and the ICUTokenizer tomorrow against a few thousand documents to see how it affects our unique term counts. I'll post results to the list once I have something interesting to report.




[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927052#action_12927052 ] 

Robert Muir commented on SOLR-2211:
-----------------------------------

Tom, for this one we just want to wrap org.apache.lucene.analysis.standard.UAX29Tokenizer. Care to make a patch?





[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Tom Burton-West (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927067#action_12927067 ] 

Tom Burton-West commented on SOLR-2211:
---------------------------------------

Sure, I'll give it a try. I've got a large Monday morning backlog on my to-do list today, so it will probably be towards the middle of the week.




[jira] Updated: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support

Posted by "Tom Burton-West (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Burton-West updated SOLR-2211:
----------------------------------

    Attachment: SOLR-2211.patch

Patch implements Solr UAX29TokenizerFactory and TestUAX29TokenizerFactory.  

Tom
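
For readers unfamiliar with the factory pattern involved, the following is a minimal, self-contained sketch of the shape such a wrapper takes. It is not the attached patch and does not use the Lucene or Solr classes: SketchTokenizer, SketchTokenizerFactory and the BreakIterator-based tokenizer are hypothetical stand-ins invented for this example. The point is only that the factory holds no per-document state and builds a fresh tokenizer for each Reader it is handed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-ins; these are NOT the Lucene or Solr types.
interface SketchTokenizer {
    /** Returns the next token, or null when the input is exhausted. */
    String next();
}

interface SketchTokenizerFactory {
    SketchTokenizer create(Reader input);
}

/** Stand-in tokenizer: reads the whole input, then iterates word-boundary segments. */
final class SketchWordTokenizer implements SketchTokenizer {
    private final Iterator<String> tokens;

    SketchWordTokenizer(Reader input) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            for (int n = input.read(buf); n != -1; n = input.read(buf)) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String text = sb.toString();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<String> found = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Keep only segments that contain at least one letter or digit.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                found.add(segment);
            }
        }
        tokens = found.iterator();
    }

    @Override
    public String next() {
        return tokens.hasNext() ? tokens.next() : null;
    }
}

/** The factory only constructs the tokenizer; all tokenizing logic lives in the wrapped class. */
final class SketchWordTokenizerFactory implements SketchTokenizerFactory {
    @Override
    public SketchTokenizer create(Reader input) {
        return new SketchWordTokenizer(input);
    }

    public static void main(String[] args) {
        SketchTokenizer t = new SketchWordTokenizerFactory().create(new StringReader("one two"));
        for (String tok = t.next(); tok != null; tok = t.next()) {
            System.out.println(tok);
        }
    }
}
```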

