Posted to solr-dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/11/18 07:00:42 UTC

[jira] Created: (SOLR-1571) unicode collation support

unicode collation support
-------------------------

                 Key: SOLR-1571
                 URL: https://issues.apache.org/jira/browse/SOLR-1571
             Project: Solr
          Issue Type: New Feature
          Components: Analysis
            Reporter: Robert Muir
            Priority: Minor
         Attachments: SOLR-1571.patch

This patch adds support for Unicode collation (searching and sorting).
Unicode collation is helpful in a search engine: for many languages you want things to match or sort differently.
You might even want to use copyField and support different sort orders/matching schemes if you need to support multiple languages.

This is simply a factory for Lucene's CollationKeyFilter, which indexes binary collation keys in a special format that preserves binary sort order.
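
To illustrate the idea (a minimal sketch using only the JDK, not code from the patch): a java.text.CollationKey can be turned into a byte array whose plain unsigned byte-by-byte order matches the collator's order. The class and method names below are made up for the example, and the extra step of turning those bytes into an indexable term is not reproduced here.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeyOrderDemo {

    /** Compares two byte arrays as unsigned bytes, the way a binary sort would. */
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    /** Returns -1, 0 or 1 from a purely binary comparison of the two collation keys. */
    public static int keyOrder(Collator collator, String x, String y) {
        byte[] kx = collator.getCollationKey(x).toByteArray();
        byte[] ky = collator.getCollationKey(y).toByteArray();
        return Integer.signum(compareUnsigned(kx, ky));
    }

    public static void main(String[] args) {
        // The locale is named explicitly; nothing depends on the JVM default.
        Collator collator = Collator.getInstance(new Locale("en", "US"));

        System.out.println("String.compareTo : " + Integer.signum("apple".compareTo("Banana")));
        System.out.println("Collator.compare : " + Integer.signum(collator.compare("apple", "Banana")));
        System.out.println("binary key order : " + keyOrder(collator, "apple", "Banana"));
    }
}
```

The point of the sketch is that once the keys exist, no Collator is needed at sort time: comparing the stored bytes is enough.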

I've added support for creating a Collator in two ways:
* system collator from a Locale spec (language + country + variant)
* tailored collator from custom rules in a text file

There is deliberately no option to use the "default" locale of the JVM (I consider this a bit dangerous).
In this patch, it is mandatory to define the locale explicitly for a system collator.
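
The two ways of creating a Collator can be sketched with the plain JDK API as follows. This is only an illustration under assumptions, not the factory from the patch: the class name, method names and the sample rule string are hypothetical, and the rules are passed in as a string rather than read from a file.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollatorFactoryDemo {

    /** System collator for an explicitly named locale; never falls back to the JVM default. */
    public static Collator systemCollator(String language, String country, String variant) {
        if (language == null || language.isEmpty()) {
            throw new IllegalArgumentException("a language must be specified explicitly");
        }
        Locale locale = new Locale(
                language,
                country == null ? "" : country,
                variant == null ? "" : variant);
        return Collator.getInstance(locale);
    }

    /** Tailored collator built from rule text, e.g. the contents of a rules file. */
    public static Collator tailoredCollator(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        Collator english = systemCollator("en", "US", "");
        System.out.println("en_US, apple vs Banana: " + english.compare("apple", "Banana"));

        // Hypothetical tailoring that reverses the order of three letters.
        Collator reversed = tailoredCollator("< c < b < a");
        System.out.println("tailored, c vs a: " + reversed.compare("c", "a"));
    }
}
```

Rejecting a missing language outright, instead of calling Collator.getInstance() with no arguments, mirrors the choice above: the sort order should not change just because the server's default locale does.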

The required lucene-collation-2.9.1.jar is only 12KB.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1571) unicode collation support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1571:
------------------------------

    Attachment: SOLR-1571.patch

initial patch.


[jira] Resolved: (SOLR-1571) unicode collation support

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-1571.
-----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.5

Committed revision 885338.

Thanks Robert!


[jira] Assigned: (SOLR-1571) unicode collation support

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reassigned SOLR-1571:
-------------------------------------------

    Assignee: Shalin Shekhar Mangar


[jira] Commented: (SOLR-1571) unicode collation support

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783448#action_12783448 ] 

Shalin Shekhar Mangar commented on SOLR-1571:
---------------------------------------------

I tried the patch. All tests pass.

You know more about this topic than I do so if you feel ICUCollationFilter should be a separate issue, that is fine with me. As far as this patch is concerned, it is well baked and I'd be happy to commit it.


[jira] Commented: (SOLR-1571) unicode collation support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783466#action_12783466 ] 

Robert Muir commented on SOLR-1571:
-----------------------------------

Shalin, yes, I think the ICUCollationFilter is much better (faster indexing, smaller index, more languages), but it should be a separate factory, imo.
I figured I would start with the JDK impl: since there is no external dependency, it's the simplest.

The ICU impl has slightly different options and behavior, and I don't much like doing something fancy like detecting which impl to use with reflection either: if the ICU jar file were no longer on the classpath, or its version changed, things could suddenly and silently stop working correctly.



[jira] Issue Comment Edited: (SOLR-1571) unicode collation support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781053#action_12781053 ] 

Robert Muir edited comment on SOLR-1571 at 11/21/09 9:13 PM:
-------------------------------------------------------------

Hi, I wonder if anyone has any comments on this.

I know this is an invisible/covert JIRA issue right now :)

I am especially curious whether the approach is sound, particularly with regard to using the ICUCollationFilter instead.
In my opinion, that should be a separate integration, even though it will index significantly faster and with much smaller keys.
The reason is that its keys are not compatible with the JDK collation keys, and it has different properties: for example, Collator is thread-safe in the JDK, but not in ICU.
Because of this, I decided to stick with the JDK impl initially.



[jira] Commented: (SOLR-1571) unicode collation support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781053#action_12781053 ] 

Robert Muir commented on SOLR-1571:
-----------------------------------

Hi, I wonder if anyone has any comments on this.

I know this is an invisible/covert JIRA issue right now :)

I am especially curious whether the approach is sound, particularly with regard to using the ICUCollationFilter instead.
In my opinion, that should be a separate integration, even though it will index significantly faster and with much smaller keys.
The reason is that its keys are not compatible with the JDK collation keys, and it has different properties: for example, Collator is thread-safe in the JDK, but not in ICU.
Because of this, I decided to stick with the JDK impl initially.



[jira] Commented: (SOLR-1571) unicode collation support

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783537#action_12783537 ] 

Shalin Shekhar Mangar commented on SOLR-1571:
---------------------------------------------

{quote}
Shalin, yes, I think the ICUCollationFilter is much better (faster indexing, smaller index, more languages), but it should be a separate factory, imo.
I figured I would start with the JDK impl: since there is no external dependency, it's the simplest.
{quote}

Sure, sounds good. I'll commit this soon.
