Posted to solr-user@lucene.apache.org by Joel Karlsson <87...@gmail.com> on 2008/12/08 23:37:57 UTC

Sorting on text-fields with international characters

Hello,

Is there any way to get Solr to sort properly on a text field containing
international (in my case Swedish) letters? It doesn't sort å, ä and ö in the
proper order. Also, is there any way to get Solr to sort, e.g., á, à or â
together with the "regular" a's?

Thanks in advance! // Joel

Re: Sorting on text-fields with international characters

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hello Joel,

Using MappingCharFilter with mapping-ISOLatin1Accent.txt on your sort 
field can solve your problem:

<fieldType name="title_sort" class="solr.TextField" omitNorms="true">
    <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

CharFilter is in trunk (Solr 1.4). If you use Solr 1.3, you can
download a patch for Solr 1.3:

https://issues.apache.org/jira/browse/SOLR-822

Koji



RE: Sorting on text-fields with international characters

Posted by Lance Norskog <go...@gmail.com>.
> Also, is there any way to get Solr to sort, e.g., á, à or â together with
> the "regular" a's?

The ISOLatin1 accent filter "downconverts" these variants to the plain ASCII
letter a. It does this in the indexed terms, not in the stored data. This
solves the Bjork/Björk problem: you can type either and find records for both.
Sorting reads the indexed terms and sorts on them, so it treats all four a
letters as the same letter. The variants will be sorted in with the ASCII
letter, but within that band their relative order in the output is not
defined.
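
As a rough illustration of that effect, here is a minimal Java sketch. It
is not Solr's filter: the fold() helper is a hypothetical stand-in that
strips combining marks with java.text.Normalizer, and it does not cover
everything a Latin-1 accent mapping would. It only shows that once the
variants fold to the same key, a sort on that key cannot tell them apart.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustration only: a stand-in for accent folding, not Solr's own filter.
final class AccentFoldingDemo {

    // Hypothetical helper: decompose, drop combining marks, lower-case.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Sort on the folded key only, the way a sort over folded terms would.
    static List<String> sortByFoldedKey(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing(AccentFoldingDemo::fold));
        return sorted;
    }

    public static void main(String[] args) {
        // U+00E1 a-acute, U+00E0 a-grave, U+00E2 a-circumflex all fold to "a".
        List<String> input = Arrays.asList("b", "\u00e1", "a", "\u00e0", "\u00e2");
        System.out.println(sortByFoldedKey(input));
    }
}
```

The comparator sees only the folded key, so the a variants tie with each
other and land in front of "b". List.sort is stable, so in this sketch ties
simply keep their input order; nothing in the key itself orders them.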


RE: Sorting on text-fields with international characters

Posted by "Feak, Todd" <To...@smss.sony.com>.
One option is to add an additional field for sorting. Create a copy of the field you want to sort on and modify the data you insert there so that it will sort the way you want it to.

-ToddFeak

RE: Sorting on text-fields with international characters

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Joel,

On 12/08/2008 at 5:37 PM, Joel Karlsson wrote:
> Is there any way to get Solr to sort properly on a text field containing
> international, in my case Swedish, letters?  It doesn't sort å, ä and ö
> in the proper order.

I wrote a Lucene patch that stores CollationKeys generated by a user-specified Collator as index terms: <https://issues.apache.org/jira/browse/LUCENE-1435> - note that this patch depends on another Lucene patch I wrote to convert arbitrary byte sequences into indexable String terms: <https://issues.apache.org/jira/browse/LUCENE-1434>.  There are two versions of the filter/analyzer in the patch: one that uses Java's built-in Collator, and another that depends on ICU4J for collation.

I haven't written a Solr factory to hook these in, but theoretically :) it would be fairly simple to do so.  That would allow you to copyField from an indexed-as-is field to one that has CollationKeyAnalyzer or ICUCollationKeyAnalyzer in its analyzer chain, and then include a sort param over the collation key field in your query.
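
To give a feel for what a collation key buys you, here is a minimal sketch
that uses only Java's built-in Collator, outside Lucene and Solr. It does
not reproduce the patch: the class and method names are made up for the
example, and the step that turns key bytes into indexable terms
(LUCENE-1434) is left out. It assumes the JVM ships Swedish collation data.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: sorts strings by their Swedish collation keys.
final class SwedishCollationKeyDemo {

    static Collator swedishCollator() {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("sv-SE"));
        // Handle both precomposed and decomposed accented input.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator;
    }

    // Generate one CollationKey per value and sort on the keys.
    static List<String> sortSwedish(List<String> values) {
        Collator collator = swedishCollator();
        List<CollationKey> keys = new ArrayList<>();
        for (String value : values) {
            keys.add(collator.getCollationKey(value));
        }
        Collections.sort(keys); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Names starting with O-umlaut, Z, A-umlaut, A and A-ring.
        List<String> names = Arrays.asList(
                "\u00d6rebro", "Zorn", "\u00c4ngelholm", "Adam", "\u00c5re");
        System.out.println(sortSwedish(names));
    }
}
```

CollationKey.toByteArray() exposes the raw key bytes; those bytes are what
would have to be stored as index terms, which is why the key-generating
algorithm must stay fixed between indexing runs.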

Vote for the patch if you'd like to see it included in Lucene.

Caveats: 

1. Mike McCandless posted to the LUCENE-1435 issue <https://issues.apache.org/jira/browse/LUCENE-1435?focusedCommentId=12646525#action_12646525> that the approach taken is not ideal, and that the Lucene index should directly handle collation.  (See his other comments on the issue for more info.)

2. CollationKeys are fragile: to remain comparable, you must ensure that the algorithm used to generate them remains constant.  The implementation can differ by JVM vendor and/or version, so the only safe thing to do is to fix the JVM vendor and version.  If you change the JVM, you should re-index.

> Also, is there any way to get Solr to sort, e.g., á, à or â together with the "regular" a's?

Assuming you can use the approach outlined above, check out RuleBasedCollator (Java 1.4.2: <http://java.sun.com/j2se/1.4.2/docs/api/java/text/RuleBasedCollator.html>; ICU4J: <http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedCollator.html>) - you can write your own collation rules to handle this situation.
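
A minimal sketch of such rules with java.text.RuleBasedCollator follows.
The rule string is an assumption made up for this example, not taken from
any shipped locale: it covers only a-z, three accented a's and the three
Swedish letters, and characters outside the rules are not handled. In the
rule syntax, "<" marks a primary difference, ";" a secondary one and ","
a tertiary one.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Illustration only: hand-written rules, not a complete Swedish tailoring.
final class CustomRulesDemo {

    // a..z are primary letters; accented a's are secondary variants of a;
    // a-ring, a-umlaut and o-umlaut are separate letters after z.
    static String buildRules() {
        StringBuilder rules = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.append("< ").append(c).append(", ")
                 .append(Character.toUpperCase(c)).append(' ');
            if (c == 'a') {
                // U+00E1 a-acute, U+00E0 a-grave, U+00E2 a-circumflex
                rules.append("; \u00e1, \u00c1 ; \u00e0, \u00c0 ; \u00e2, \u00c2 ");
            }
        }
        // U+00E5 a-ring, U+00E4 a-umlaut, U+00F6 o-umlaut
        rules.append("< \u00e5, \u00c5 < \u00e4, \u00c4 < \u00f6, \u00d6");
        return rules.toString();
    }

    static RuleBasedCollator newCollator() {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(buildRules());
            // PRIMARY strength ignores secondary (accent) and tertiary
            // (case) differences, so the accented a's compare equal to a.
            collator.setStrength(Collator.PRIMARY);
            return collator;
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator collator = newCollator();
        System.out.println(collator.compare("a", "\u00e1")); // a vs a-acute
        System.out.println(collator.compare("z", "\u00e5")); // z vs a-ring
    }
}
```

With these rules the accented a's sort together with plain a, while å, ä
and ö stay distinct letters after z. Raising the strength to SECONDARY
would make the accents count again as tie-breakers.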

Steve