Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2008/11/01 05:55:44 UTC

[jira] Created: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
----------------------------------------------------------------------------------------------

                 Key: LUCENE-1435
                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
             Project: Lucene - Java
          Issue Type: New Feature
    Affects Versions: 2.4
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 2.9


Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.

This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.
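
A minimal sketch of the idea, using only java.text so that it runs without Lucene on the classpath. Hex encoding is used here purely as a stand-in for IndexableBinaryStringTools, and the names CollationKeyEncodingSketch and toIndexTerm are hypothetical, not taken from the patch:

```java
import java.text.Collator;
import java.util.Locale;

class CollationKeyEncodingSketch {

    // Turn a token into an order-preserving String term.
    // Hex is only a stand-in for IndexableBinaryStringTools.
    static String toIndexTerm(Collator collator, String token) {
        byte[] key = collator.getCollationKey(token).toByteArray();
        StringBuilder term = new StringBuilder(key.length * 2);
        for (byte b : key) {
            term.append(String.format("%02x", b & 0xFF));
        }
        return term.toString();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String a = toIndexTerm(collator, "apple");
        String b = toIndexTerm(collator, "Banana");
        // Plain String order puts "Banana" before "apple"; the encoded keys
        // follow the collator instead.
        System.out.println("raw:      " + Integer.signum("apple".compareTo("Banana")));
        System.out.println("encoded:  " + Integer.signum(a.compareTo(b)));
        System.out.println("collator: " + Integer.signum(collator.compare("apple", "Banana")));
    }
}
```

The point of the sketch is only that comparing the encoded strings gives the same answer as Collator.compare, which is what lets an ordinary sorted terms dictionary serve collated range queries.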


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646619#action_12646619 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

Hi Mike,

bq. Could we, alternatively, push this change into DocumentsWriter, such that on writing a segment it uses a per-field Collator (FieldInfo would be extended to record this) to sort the terms dict?

Are you suggesting to not store collation keys in the index?

bq. I haven't fully thought through the tradeoffs... but it seems like this'd be simpler to use? Ie rather than putting a CollationKeyFilter in your analyzer chain, and then doing the reverse of this for all searches at search time, you simply set the Collator on the fields (at indexing & searching time, since I agree we should for now not try to serialize into the index which field has which Collator)?

The query-time process in this patch is not the reverse - it is exactly the same.  The String-encoded collation keys stored in the index are compared directly with those from query terms.  Neither the String-encoding nor the CollationKey needs to be reversed.

bq. I guess there is a performance cost to using the Collator to do live binary search (during searching) and sorting (during indexing) vs doing Unicode String comparisons but in practice at search time this is probably a tiny part of the net cost of searching?

In the current code base, for range searching on a collated field, every single term has to be collated with the search term.  This patch allows skipTo to function when using collation, potentially providing a significant speedup.
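
A rough sketch of why this helps, assuming a sorted in-memory map as a stand-in for the terms dictionary and hex as a stand-in for the IndexableBinaryStringTools encoding (CollatedRangeSketch, encode and range are hypothetical names). The range bounds go through exactly the same conversion as the indexed terms, and the lookup seeks to the lower bound instead of collating every term:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

class CollatedRangeSketch {

    // Same order-preserving stand-in encoding at index time and at query time.
    static String encode(Collator collator, String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : collator.getCollationKey(text).toByteArray()) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Inclusive range lookup over the sorted keys.
    static List<String> range(TreeMap<String, String> dict, Collator collator,
                              String lower, String upper) {
        return new ArrayList<>(
            dict.subMap(encode(collator, lower), true,
                        encode(collator, upper), true).values());
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // Encoded key -> original text (the original is kept only for display).
        TreeMap<String, String> dict = new TreeMap<>();
        for (String term : new String[] {"apple", "Banana", "cherry", "date", "Elderberry"}) {
            dict.put(encode(collator, term), term);
        }
        System.out.println(range(dict, collator, "b", "d"));
    }
}
```

With the English collator, the range from "b" to "d" should select Banana and cherry, which a plain String.compareTo range over the raw terms would not.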




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646525#action_12646525 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

Could we, alternatively, push this change into DocumentsWriter, such that on writing a segment it uses a per-field Collator (FieldInfo would be extended to record this) to sort the terms dict?

I haven't fully thought through the tradeoffs... but it seems like this'd be simpler to use?  Ie rather than putting a CollationKeyFilter in your analyzer chain, and then doing the reverse of this for all searches at search time, you simply set the Collator on the fields (at indexing & searching time, since I agree we should for now not try to serialize into the index which field has which Collator)?

I guess there is a performance cost to using the Collator to do live binary search (during searching) and sorting (during indexing) vs doing Unicode String comparisons but in practice at search time this is probably a tiny part of the net cost of searching?




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683167#action_12683167 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

Steven, I'm hitting compilation errors, eg:

{code}
    [javac] /tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:42: package org.apache.lucene.queryParser.analyzing does not exist
    [javac] import org.apache.lucene.queryParser.analyzing.AnalyzingQueryParser;
    [javac]                                               ^
    [javac] /tango/mike/src/lucene.collation/contrib/collation/src/test/org/apache/lucene/collation/CollationTestBase.java:89: cannot find symbol
{code}

What is AnalyzingQueryParser?




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644600#action_12644600 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

Three problems I can think of off the top of my head with attempting an automatically managed solution to the problem of CollationKey comparability:

# There doesn't seem to be any way of ascertaining the RuleBasedCollator version, so one would have to store the exact JVM version and Locale used to generate the Collator, and the strength used, and then fail any range or sort operations if the indexed CollationKeys were produced with ones different from the current ones.
# Lucene doesn't have an index-level per-field place to store arbitrary information.
# Other implementations of java.text.Collator, besides RuleBasedCollator, are certainly possible.

So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality :).

Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?
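
One way a user might manage this outside the index is sketched below. Everything here is hypothetical (the CollatorFingerprint class, the property names, and the idea of a sidecar Properties file); it only records the inputs named in the first point above and refuses to proceed if they no longer match:

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Properties;

class CollatorFingerprint {

    // Capture the settings that influence which CollationKeys get generated.
    // The property names are made up for this sketch.
    static Properties describe(Locale locale, Collator collator) {
        Properties p = new Properties();
        p.setProperty("java.version", System.getProperty("java.version"));
        p.setProperty("locale", locale.toLanguageTag());
        p.setProperty("strength", Integer.toString(collator.getStrength()));
        p.setProperty("decomposition", Integer.toString(collator.getDecomposition()));
        return p;
    }

    // True only if the stored fingerprint matches the current environment.
    static boolean matches(Properties stored, Locale locale, Collator collator) {
        return stored.equals(describe(locale, collator));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        Properties stored = describe(Locale.GERMAN, collator); // saved next to the index
        System.out.println("same settings: " + matches(stored, Locale.GERMAN, collator));

        collator.setStrength(Collator.PRIMARY);                // settings drifted
        System.out.println("after change:  " + matches(stored, Locale.GERMAN, collator));
    }
}
```

This does not cover the third point: a custom Collator implementation would need its own notion of version.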




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644526#action_12644526 ] 

Hoss Man commented on LUCENE-1435:
----------------------------------

The one worry I have about an approach like this comes from the fine print of the CollationKey docs...

bq. You can only compare CollationKeys generated from the same Collator object.

"Same" tends to have a very specific meaning in Java documentation: it's usually used to indicate reference equality (i.e. "==", not .equals) ...

bq. The equals method for class Object implements the most discriminating possible equivalence relation on objects; that is, for any non-null reference values x and y, this method returns true if and only if x and y refer to the same object (x == y has the value true).

So the question becomes: did they really mean "same Collator", or did they mean "a Collator with the same rules"?

Is it safe to persist a CollationKey from a Collator A and then compare it with a CollationKey from another Collator B where A.equals(B) but A != B (because A and B are from different JVM instances)?
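
A small experiment along these lines, as a sketch only: it uses a cloned Collator inside a single JVM, so it says nothing about different JVM versions or vendors, which is the harder half of the question. The class and method names are hypothetical:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorIdentitySketch {

    // Do two collators produce byte-identical keys for the same text?
    static boolean sameKeyBytes(Collator a, Collator b, String text) {
        return Arrays.equals(a.getCollationKey(text).toByteArray(),
                             b.getCollationKey(text).toByteArray());
    }

    public static void main(String[] args) {
        Collator a = Collator.getInstance(Locale.FRENCH);
        Collator b = (Collator) a.clone();   // equal configuration, distinct object

        System.out.println("a == b:         " + (a == b));
        System.out.println("a.equals(b):    " + a.equals(b));
        System.out.println("same key bytes: " + sameKeyBytes(a, b, "c\u00f4te"));
    }
}
```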




[jira] Updated: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1435:
--------------------------------

    Attachment: LUCENE-1435.patch

New patch that compiles.

I'm not sure how this ever worked previously - I must somehow have had lucene-misc-X.jar on the classpath or something.

Anyway, the build.xml in this patch, cribbing from contrib/benchmark/build.xml, first builds contrib/miscellaneous, then adds build/contrib/miscellaneous/classes/java/ to the classpath, so that AnalyzingQueryParser can be linked against.

Everything now compiles, and all contrib tests pass.




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646699#action_12646699 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

bq. IndexableBinaryStringTools (LUCENE-1434) implements a base-8000h encoding: the lower 15 bits of each character have 1-7/8 bytes packed into them. It's radically different from the original byte array, at least in terms of looking at it with a text viewer like Luke. And I don't think CollationKeys themselves are intended for human consumption.

Oh OK.  So having done this term conversion, you can't really look at / use the resulting terms in the index for human consumption (you'd have to store stuff yourself).

bq. Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(),

But we could just fix that to pay attention to the Collator for that field, if it has one, right?  (Or with flexible indexing I think the impl really should own this method, ie, it should be abstract in TermEnum).

I think the external approach is fine for starters... I just think long-term it may make sense to have core Lucene respect the Collator, but it really is an invasive change.  We should wait until we make progress on flexible indexing at which point such a change should be far less costly.




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646667#action_12646667 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------


bq. Are you suggesting to not store collation keys in the index?

Right, I'm proposing storing the original Strings, but sorted
according to Collator.compare (for that one field), in the Terms dict.

bq. The query-time process in this patch is not the reverse - it is exactly the same.

OK got it.  Where/how would you implement the query time conversion of
terms?

And wouldn't there be times when you also want to reverse the
encoding?  EG if you enum all terms for presentation (maybe as part of
faceted search for example)?

bq. In the current code base, for range searching on a collated field, every single term has to be collated with the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.

Both the original proposed approach (external-to-indexing) and this
internal-to-indexing approach would solve this, right?  Ie, in both
cases the terms have been sorted according to the Collator, but in the
internal-to-indexing case it's the original String term stored in the
terms dict.
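
A toy illustration of the internal-to-indexing idea, assuming a TreeSet ordered by the Collator as a stand-in for a per-field terms dictionary (this is not how DocumentsWriter works, and the class name is hypothetical). The original strings are stored as-is, and lookups use Collator.compare:

```java
import java.text.Collator;
import java.util.Locale;
import java.util.TreeSet;

class CollatorSortedTermsSketch {

    // Keep the original strings, ordered by Collator.compare
    // rather than String.compareTo. Note that a TreeSet collapses
    // strings the collator considers equal into a single entry.
    static TreeSet<String> build(Collator collator, String... terms) {
        TreeSet<String> dict = new TreeSet<>(collator);
        for (String t : terms) {
            dict.add(t);
        }
        return dict;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        TreeSet<String> dict = build(collator, "cherry", "Banana", "apple");

        System.out.println(dict);               // iteration follows collator order
        System.out.println(dict.ceiling("b"));  // seek: first term >= "b" per the collator
    }
}
```

The terms stay human-readable, at the cost of calling Collator.compare on every comparison during the seek.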

Here are some pros of internal-to-indexing:

  - You don't have to convert every single term visited during
    analysis first to a CollationKey then ByteBuffer then encoded
    binary string.  Indexing throughput should be faster?  (Though,
    when writing the segment you do need to sort using
    Collator.compare, which I guess could be slow).

  - Real terms are stored in the index -- tools like Luke can look at
    the index and see normal looking terms.  Though... I don't have a
    sense of what the encoded term would look like -- maybe it's not
    that different from the original in practice?

  - Querying would just work without term conversion

And some cons:

  - It's obviously a more invasive change to Lucene (and probably
    should go after the flex-indexing changes).  The
    external-to-indexing approach is nicely externalized.

  - Performance -- the binary search of the terms index would be
    slower using Collator.compare instead of String.compareTo (though
    I would expect this to be minimal in practice).

I'm sure there are many pros/cons I'm missing...





[jira] Resolved: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1435.
----------------------------------------

    Resolution: Fixed

Thanks Steven!




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646717#action_12646717 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

{quote}
bq. Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(),

But we could just fix that to pay attention to the Collator for that field, if it has one, right? (Or with flexible indexing I think the impl really should own this method, ie, it should be abstract in TermEnum).
{quote}

Um, yes.  :) 

bq. I think the external approach is fine for starters... I just think long-term it may make sense to have core Lucene respect the Collator, but it really is an invasive change. We should wait until we make progress on flexible indexing at which point such a change should be far less costly.

Now that I understand it, I too think the internal-to-indexing approach is cleaner/easier to use/better long-term.  This patch is an attempt to improve on the performance of the range collation facilities introduced in LUCENE-1279.  So I guess the question is whether it's worth putting in another less-than-optimal workaround.




[jira] Assigned: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1435:
------------------------------------------

    Assignee: Michael McCandless




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683155#action_12683155 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

I think we should commit this to contrib/collation as an "external" way to get faster range filters on fields that require a custom Collator; at some future point we can consider allowing a given field to sort its terms in some custom way.

Marvin: does KS/Lucy give control over sort order of the terms in a field?




[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646679#action_12646679 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

bq. And wouldn't there be times when you also want to reverse the encoding? EG if you enum all terms for presentation (maybe as part of faceted search for example)?

AFAIK, CollationKey generation is a one-way operation.  If the original terms are required for presentation, they can be stored, right?

{quote}
Here are some pros of internal-to-indexing:
      [...]
    - Real terms are stored in the index - tools like Luke can look at
      the index and see normal looking terms. Though... I don't have a
      sense of what the encoded term would look like - maybe it's not
      that different from the original in practice?
{quote}

IndexableBinaryStringTools (LUCENE-1434) implements a base-8000h encoding: the lower 15 bits of each character have 1-7/8 bytes packed into them.  It's radically different from the original byte array, at least in terms of looking at it with a text viewer like Luke.  And I don't think CollationKeys themselves are intended for human consumption.
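
To give a feel for what "15 bits per character" means, here is a simplified bit-packing sketch. It is not the IndexableBinaryStringTools code from LUCENE-1434: the class and method names are hypothetical, and the decoder is simply told the original byte count rather than recovering it from the encoded string:

```java
import java.util.Arrays;

class FifteenBitPackingSketch {

    // Pack bytes, most significant bit first, into chars that each carry 15 bits.
    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // pending bits, oldest in the high positions
        int bits = 0;   // number of pending bits (always < 15 between bytes)
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            if (bits >= 15) {
                out.append((char) ((buffer >>> (bits - 15)) & 0x7FFF));
                bits -= 15;
                buffer &= (1 << bits) - 1; // keep only the leftover bits
            }
        }
        if (bits > 0) { // zero-pad the final partial char
            out.append((char) ((buffer << (15 - bits)) & 0x7FFF));
        }
        return out.toString();
    }

    // Reverse of encode; byteLength is the original array length.
    static byte[] decode(String s, int byteLength) {
        byte[] out = new byte[byteLength];
        int buffer = 0, bits = 0, pos = 0;
        for (int i = 0; i < s.length() && pos < byteLength; i++) {
            buffer = (buffer << 15) | (s.charAt(i) & 0x7FFF);
            bits += 15;
            while (bits >= 8 && pos < byteLength) {
                out[pos++] = (byte) (buffer >>> (bits - 8));
                bits -= 8;
                buffer &= (1 << bits) - 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = new byte[15];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 17 - 100);
        }
        String encoded = encode(data);
        // 15 bytes = 120 bits = 8 chars of 15 bits each
        System.out.println(data.length + " bytes -> " + encoded.length() + " chars");
        System.out.println("round trip ok: " + Arrays.equals(data, decode(encoded, data.length)));
    }
}
```

Every emitted char stays below 0x8000, and the characters bear no visible relation to the original text, which is why such terms are not useful to look at in a viewer like Luke.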

{quote}
bq. In the current code base, for range searching on a collated field, every single term has to be collated with the search term. This patch allows skipTo to function when using collation, potentially providing a significant speedup.

Both the original proposed approach (external-to-indexing) and this
internal-to-indexing approach would solve this, right? Ie, in both
cases the terms have been sorted according to the Collator, but in the
internal-to-indexing case it's the original String term stored in the
terms dict.
{quote}

Perhaps I'm missing something, but o.a.l.index.TermEnum.skipTo(Term) compares the target term using String.compareTo(), so regardless of the index term dictionary ordering, skipTo() won't necessarily stop at the correct location, right?  From TermEnum.java:

{code:java}
  public boolean skipTo(Term target) throws IOException {
    do {
      if (!next())
        return false;
    } while (target.compareTo(term()) > 0);
    return true;
  }
{code}

and here's o.a.l.index.Term.compareTo(Term):

{code:java}
  public final int compareTo(Term other) {
    if (field == other.field)  // fields are interned
      return text.compareTo(other.text);
    else
      return field.compareTo(other.field);
  }
{code}
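The concern can be reproduced without Lucene. The sketch below is an assumption-laden stand-in: the hypothetical skipTo method mimics the loop shape quoted above over an in-memory array, and the "terms dictionary" is just an array sorted with java.text.Collator.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class SkipToOrderingDemo {

    /**
     * Returns the index of the first term that is >= target according to
     * String.compareTo, or -1 if there is none.
     */
    static int skipTo(String[] terms, String target) {
        for (int i = 0; i < terms.length; i++) {
            if (target.compareTo(terms[i]) <= 0) {
                return i;
            }
        }
        return -1;
    }

    /** Returns a copy of the terms sorted by the given collator. */
    static String[] collatedOrder(String[] terms, Collator collator) {
        String[] sorted = terms.clone();
        Arrays.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        Collator english = Collator.getInstance(Locale.ENGLISH);
        String[] dict = collatedOrder(new String[] {"cherry", "Banana", "apple"}, english);
        System.out.println(Arrays.toString(dict));
        // The code-unit comparison stops at "apple" because 'B' < 'a',
        // although "Banana" sits at index 1 in collated order.
        System.out.println(skipTo(dict, "Banana"));
    }
}
```

With an English collator the collated order is apple, Banana, cherry, yet the String.compareTo-based loop stops at index 0 when asked for "Banana". Comparing encoded collation keys instead of raw strings avoids the mismatch, since the stored terms and the comparison then agree.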


> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683174#action_12683174 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

It's in contrib/miscellaneous/

I used AnalyzingQueryParser in the tests to allow CollationKeyFilter to be applied to the terms in the range query - the standard QueryParser doesn't analyze range terms.

From:

http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html

bq. Overrides Lucene's default QueryParser so that Fuzzy-, Prefix-, Range-, and WildcardQuerys are also passed through the given analyzer, but wild card characters (like *) don't get removed from the search terms. 

This is a (test-only) cross-contrib dependency.  I'm not sure why I didn't have trouble with compilation - I haven't looked at this in months.  I'll take a look later on tonight.

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644541#action_12644541 ] 

Robert Muir commented on LUCENE-1435:
-------------------------------------

At least in ICU, it's not completely safe.  If the JVM instances differ in version (after an upgrade, etc.), then it would be a shame to find your sorts all busted.

When comparing keys, it is important to know that both keys were generated by the same algorithms and weightings. Otherwise, identical strings with keys generated on two different dates, for example, might compare as unequal. Sort keys can be affected by new versions of ICU or its data tables, new sort key formats, or changes to the Collator.

http://www.icu-project.org/userguide/Collate_ServiceArchitecture.html
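The same point can be illustrated with the JDK's java.text.Collator rather than ICU (an assumption made here to keep the example dependency-free; the helper names are hypothetical). Keys from one collator configuration compare consistently as raw bytes, but changing a setting such as strength changes the keys themselves.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationKeyBytesDemo {

    /** Lexicographic comparison treating bytes as unsigned; a shorter prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns the raw sort-key bytes the collator produces for s. */
    static byte[] keyBytes(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        Collator tertiary = Collator.getInstance(Locale.ENGLISH);
        Collator primary = Collator.getInstance(Locale.ENGLISH);
        primary.setStrength(Collator.PRIMARY);

        // Same configuration: byte order agrees with Collator.compare.
        System.out.println(compareUnsigned(keyBytes(tertiary, "apple"), keyBytes(tertiary, "Banana")) < 0);
        // Different weightings: "apple" and "Apple" are equal at PRIMARY
        // strength but distinct at TERTIARY strength.
        System.out.println(primary.compare("apple", "Apple") == 0);
        System.out.println(tertiary.compare("apple", "Apple") != 0);
    }
}
```

Keys written to an index under one configuration therefore should not be compared against keys generated at query time under another.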

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644782#action_12644782 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

Hoss wrote:

{quote}
bq. So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality.

bq. Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?

I agree with you on both points ... this is really just an extension of warning people to use compatible analyzers when indexing/querying. 
{quote}

I will add warnings about this issue to the javadocs.

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Updated: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1435:
--------------------------------

    Attachment: LUCENE-1435.patch

Removed the accidentally included IndexableBinaryStringTools and its test from the patch (see LUCENE-1434 for these).

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644773#action_12644773 ] 

Hoss Man commented on LUCENE-1435:
----------------------------------

bq. So, it seems to me, either the user of this functionality has to manage the versioning external to the Lucene index, or they can't use the functionality.

bq. Would strong warnings in the javadocs be enough to allow people to take appropriate precautions?

I agree with you on both points ... this is really just an extension of warning people to use compatible analyzers when indexing/querying. 

(I only brought it up in my first comment because I know very little about the internals of *any* Collator implementations out there, and I wasn't sure if *all* implementations produce keys that are only comparable between "same" instances. As long as there are *some* implementations of Collator that produce keys which can be compared between "equivalent" instances, then this feature certainly seems useful.)
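One way a user might manage that versioning outside the index is to record a description of the collator at index time and refuse to query if it no longer matches. The sketch below is only an assumption about what such a check could look like: the class name, the fingerprint format and the choice of fields are all hypothetical, and the list of settings that influence generated keys is not claimed to be exhaustive.

```java
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

final class CollatorFingerprint {

    /** Describes settings that plausibly affect generated keys (assumed, non-exhaustive). */
    static String of(Locale locale, Collator collator) {
        String rules = (collator instanceof RuleBasedCollator)
                ? Integer.toHexString(((RuleBasedCollator) collator).getRules().hashCode())
                : "unknown-rules";
        return String.join("|",
                System.getProperty("java.version"),
                locale.toString(),
                "strength=" + collator.getStrength(),
                "decomposition=" + collator.getDecomposition(),
                "rules=" + rules);
    }

    /** Throws if the query-time collator differs from what was recorded at index time. */
    static void check(String recorded, String current) {
        if (!recorded.equals(current)) {
            throw new IllegalStateException(
                    "Collator changed since indexing: " + recorded + " vs " + current);
        }
    }
}
```

The recorded string would be stored next to the index (for example in a side file) when indexing, then passed to check together with a freshly computed fingerprint before running collated range queries or sorts.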

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644780#action_12644780 ] 

Steven Rowe commented on LUCENE-1435:
-------------------------------------

Robert Muir wrote:

bq. One alternative is that the ICU implementation has versioning specifically for this purpose. 

I'll look into using RegexQuery as a model here (it enables use of either java.util.regex or Jakarta Regexp, defaulting to java.util.regex), and try to add CollatorCapable/CollatorCapabilities, so that ICU's Collator implementation will be usable.

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683182#action_12683182 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

OK, thanks for the pointer -- I learn something new every day!

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683388#action_12683388 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

Super, thanks Steven.  I plan to commit soon.

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Updated: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1435:
--------------------------------

    Attachment: LUCENE-1435.patch

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644604#action_12644604 ] 

Robert Muir commented on LUCENE-1435:
-------------------------------------

One alternative is that the ICU implementation has versioning specifically for this purpose.

The version information of Collator is a 32-bit integer. If a new version of ICU has changes affecting the content of collation elements, the version information will be changed. In that case, using the new version of the ICU collator will require regenerating any saved or stored sort keys. However, since ICU 1.8.1, it is possible to build your program so that it uses more than one version of ICU. Therefore, you could use the current version for the features you need and use the older version for collation.

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Updated: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1435:
--------------------------------

    Attachment: LUCENE-1435.patch

Modifications in this patch:

# Added dependency on ICU4J 4.0
# Introduced ICUCollationKeyFilter, which uses ICU collation to produce the collation keys
# Added Analyzer versions of the Filters, creating IndexableBinaryStringTools-encoded collation keys from the single token produced by KeywordTokenizer.
# Centralized testing to a base class, which the four test classes extend, to avoid duplication
# Moved from contrib/analyzers/o/a/l/analysis/miscellaneous/ to a new contrib package: contrib/collation, because it doesn't make sense to add a dependency to the entire contrib/analyzers package just for ICUCollationKeyFilter/Analyzer

The external ICU4J dependency, which should be checked into contrib/collation/lib/, can be downloaded here: [http://download.icu-project.org/files/icu4j/4.0/icu4j-4_0.jar].  The license for this jar is included in the patch at contrib/collation/lib/ICU-LICENSE.txt.


> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.



[jira] Commented: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649967#action_12649967 ] 

Michael McCandless commented on LUCENE-1435:
--------------------------------------------

Another use-case for allowing per-field custom sorting of Terms would be simpler numeric RangeQuery.  Ie, right now you have to zero-pad numbers to trick Lucene into sorting them numerically (which causes challenges for BigDecimal, being discussed now on java-user).  But if you could have Lucene sort by the number then numeric range queries would be straightforward.
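For readers unfamiliar with the zero-padding trick mentioned above, here is a small sketch of why it is needed when terms are ordered by String.compareTo. The helper name and the fixed width are illustrative assumptions, and the sketch handles only non-negative long values.

```java
final class ZeroPadDemo {

    /** Left-pads a non-negative number with zeros so String order matches numeric order. */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String s = Long.toString(value);
        if (s.length() > width) {
            throw new IllegalArgumentException("value is wider than " + width + " digits");
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because '9' > '1'.
        System.out.println("9".compareTo("10") > 0);
        // Padded to a common width, the order is numeric again.
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0);
    }
}
```

The fixed width is what makes arbitrary-precision types awkward: every value must fit the width chosen up front.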

> CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1435
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1435
>             Project: Lucene - Java
>          Issue Type: New Feature
>    Affects Versions: 2.4
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need collation for proper ordering.
