Posted to solr-user@lucene.apache.org by Michael Sokolov <so...@ifactory.com> on 2011/11/23 22:39:38 UTC

trouble with CollationKeyFilter

I'm using CollationKeyFilter to sort my documents using the Unicode 
root collation, and my documents do appear to be getting sorted 
correctly, but I'm getting weird results when performing range filtering 
on the sort key field.  For example:

ifp_sortkey_ls:["youth culture" TO "youth culture"]

and

ifp_sortkey_ls:{"youth culture" TO "youth culture"}

both return 0 hits

but

ifp_sortkey_ls:"youth culture"

returns 1 hit

It seems as if any query using the ifp_sortkey_ls:[A TO B] syntax 
treats the terms A and B as greater than every document whose sort key 
starts with an A-Z character, but less than a few documents whose sort 
keys start with Greek letters.

the analysis chain for ifp_sortkey_ls is:

<fieldType name="sortkey" stored="false" indexed="true"
           class="solr.TextField" positionIncrementGap="100"
           omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.CollationKeyFilterFactory"
            language=""
            strength="primary"/>
  </analyzer>
</fieldType>

Does anyone have any idea what might be going on here?

Re: trouble with CollationKeyFilter

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov <so...@ifactory.com> wrote:
> Thanks for confirming that, and laying out the options, Robert.
>

FYI: Erick committed the multiterm stuff, so I opened an issue for
this: https://issues.apache.org/jira/browse/SOLR-2919
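For readers on 3.x, here is a rough sketch of what the field type might
look like once the filter can be used in a multiterm chain. It assumes
the type="multiterm" analyzer element from the multiterm work
(SOLR-2438) and assumes that SOLR-2919 makes CollationKeyFilterFactory
usable there; neither is confirmed for a released version in this
thread, so check it against your Solr version. The LowerCaseFilter is
left out, following the advice elsewhere in this thread.

```xml
<fieldType name="sortkey" class="solr.TextField" stored="false" indexed="true"
           omitNorms="true" omitTermFreqAndPositions="true">
  <!-- Used at index time and for ordinary single-term queries -->
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.CollationKeyFilterFactory" language="" strength="primary"/>
  </analyzer>
  <!-- Assumption: applied to range endpoints so they also become collation keys -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CollationKeyFilterFactory" language="" strength="primary"/>
  </analyzer>
</fieldType>
```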

-- 
lucidimagination.com

Re: trouble with CollationKeyFilter

Posted by Robert Muir <rc...@gmail.com>.

On Sat, Nov 26, 2011 at 8:43 PM, Michael Sokolov <so...@ifactory.com> wrote:
> That's great news!  We can't really track trunk, but it looks like this is
> targeted for 3.6, right? As a short-term alternative, I was considering
> using ICUFoldingFilter; this won't preserve some of the finer distinctions,
> but will at least sort the accented characters in with their unaccented kin,
> which is 90% of what we need. Does that make sense?  It should index regular
> characters then, and not ICU collation keys, I think?
>

Yes, it should be pretty easy to make the range queries work for these.

As far as doing things with filters as an alternative: it depends on
what you need. Doing this in the analyzer is pretty inflexible, because
it's just a token filter and the terms are still compared in binary
order at the end of the day, so the order might not make sense for some
languages.

Because of this, if you are picky about sorting, it's also difficult or
impossible to do things like sort lowercase values first (for when you
care about case), ignore punctuation (so U.S.A. = USA), or sort numerics
correctly (so FOOBAR-10 sorts after FOOBAR-9), etc. Though the factory
in Solr doesn't yet expose these options either :)

Also, looking at your configuration, the LowerCaseFilter is not
needed, since you are using primary strength.
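
In other words, the original chain could probably be trimmed to
something like the following. This is just the configuration from the
first message with the LowerCaseFilter line removed; at primary strength
the collator is expected to ignore case differences on its own, so the
resulting keys should not change.

```xml
<fieldType name="sortkey" stored="false" indexed="true"
           class="solr.TextField" positionIncrementGap="100"
           omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- No LowerCaseFilter: primary strength already treats "A" and "a" alike -->
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.CollationKeyFilterFactory" language="" strength="primary"/>
  </analyzer>
</fieldType>
```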

-- 
lucidimagination.com

Re: trouble with CollationKeyFilter

Posted by Michael Sokolov <so...@ifactory.com>.

That's great news!  We can't really track trunk, but it looks like this 
is targeted for 3.6, right? As a short-term alternative, I was 
considering using ICUFoldingFilter; this won't preserve some of the 
finer distinctions, but will at least sort the accented characters in 
with their unaccented kin, which is 90% of what we need. Does that make 
sense?  It should index regular characters then, and not ICU collation 
keys, I think?
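
A minimal sketch of that short-term alternative, assuming the ICU
analysis contrib (which provides solr.ICUFoldingFilterFactory) is on the
classpath. The field type name sortkey_folded is made up for
illustration. Terms indexed this way are ordinary folded text rather
than collation keys, so range endpoints would presumably only line up
with the index if they are folded the same way before querying.

```xml
<fieldType name="sortkey_folded" class="solr.TextField" stored="false" indexed="true"
           omitNorms="true" omitTermFreqAndPositions="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Folds case and strips accents, so accented letters sort with their plain forms -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```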

-Mike

On 11/25/2011 8:34 PM, Erick Erickson wrote:
> It's checked in, SOLR-2438. Although it's getting some surgery so you
> can expect it to morph a bit.
>
> Erick

Re: trouble with CollationKeyFilter

Posted by Erick Erickson <er...@gmail.com>.

It's checked in, SOLR-2438. Although it's getting some surgery so you
can expect it to morph a bit.

Erick

On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov <so...@ifactory.com> wrote:
> Thanks for confirming that, and laying out the options, Robert.
>
> -Mike

Re: trouble with CollationKeyFilter

Posted by Michael Sokolov <so...@ifactory.com>.

Thanks for confirming that, and laying out the options, Robert.

-Mike

On 11/23/2011 9:03 PM, Robert Muir wrote:
> hi,
>
> locale sensitive range queries don't work with these filters, only sort,
> although erick erickson has a patch that will enable this (the lowercasing
> wildcards patch, then you could add this filter to your multiterm chain).
>
> separately locale range queries and sort both work easily on trunk (with
> binary terms)... just use collationfield or icucollationfield if you are
> able to use trunk...
>
> otherwise for 3.x I think that patch is pretty close any day now, so we can
> add an example for localized range queries that makes use of it.

Re: trouble with CollationKeyFilter

Posted by Robert Muir <rc...@gmail.com>.

Hi,

Locale-sensitive range queries don't work with these filters, only
sorting does, although Erick Erickson has a patch that will enable this
(the lowercasing wildcards patch; with it you could add this filter to
your multiterm chain).

Separately, locale-sensitive range queries and sorting both work easily
on trunk (with binary terms): just use CollationField or
ICUCollationField if you are able to use trunk.

Otherwise, for 3.x I think that patch is pretty close, any day now, so
we can add an example for localized range queries that makes use of it.
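
A sketch of the trunk approach, assuming the field types are registered
as solr.CollationField and solr.ICUCollationField and accept locale and
strength attributes similar to the filter factories. The type names
below are made up, and the attribute names should be checked against the
trunk example schema before relying on them.

```xml
<!-- JDK collator; an empty language is assumed to select the root locale -->
<fieldType name="collated_root" class="solr.CollationField"
           language="" strength="primary"/>

<!-- ICU collator; assumed to need the ICU analysis contrib on the classpath -->
<fieldType name="collated_root_icu" class="solr.ICUCollationField"
           locale="" strength="primary"/>

<!-- The sort key field then uses one of the collated types directly -->
<field name="ifp_sortkey_ls" type="collated_root_icu" indexed="true" stored="false"/>
```

With a field type like this there is no analysis chain to keep in sync,
which is presumably why range queries and sorting both work without
extra configuration.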
