
Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2010/12/04 17:59:11 UTC

[jira] Created: (LUCENE-2798) Randomize collation testing

Randomize collation testing
---------------------------

                 Key: LUCENE-2798
                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
             Project: Lucene - Java
          Issue Type: Test
          Components: contrib/*
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 3.1, 4.0


Robert Muir noted on the #lucene IRC channel today that Lucene's indexed collation key tests are currently fragile (for example, the tests had to be revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of Unicode 6.0 collation changes) and that their coverage is trivial (only 5 locales are tested, and no collator options are exercised).  This affects both the JDK implementation in {{modules/analysis/common/}} and the ICU implementation under {{modules/icu/}}.

The key thing to test is that the order of the indexed terms is the same as that provided by the Collator itself.  Instead of the current set of static tests, this could be achieved via indexing randomly generated terms' collation keys (and collator options) and then comparing the index terms' order to the order provided by the Collator over the original terms.

Since different terms may produce the same collation key, however, the order of indexed terms is inherently unstable.  When performing runtime collation, the Collator addresses the sort stability issue by adding a secondary sort over the normalized original terms.  In order to directly compare Collator's sort with Lucene's collation key sort, a secondary sort will need to be applied to Lucene's indexed terms as well. Robert has suggested indexing the original terms in addition to their collation keys, then using a Sort over the original terms as the secondary sort.
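The comparison described above can be sketched with the JDK alone, without Lucene. This is only an illustration under assumptions: the class and method names are hypothetical, the "indexed" order is approximated by an unsigned byte-wise sort over {{CollationKey.toByteArray()}}, and {{String.compareTo}} stands in for the secondary sort over the original terms.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the test attached to this issue.
final class CollationOrderSketch {

  // Unsigned lexicographic comparison of two byte arrays.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  // Expected order: Collator.compare, ties broken on the original terms.
  static List<String> expectedOrder(Collator c, List<String> terms) {
    List<String> out = new ArrayList<>(terms);
    out.sort((x, y) -> {
      int r = c.compare(x, y);
      return r != 0 ? r : x.compareTo(y);
    });
    return out;
  }

  // Stand-in for index order: collation key bytes, ties broken on the original terms.
  static List<String> keyOrder(Collator c, List<String> terms) {
    List<String> out = new ArrayList<>(terms);
    out.sort((x, y) -> {
      int r = compareBytes(c.getCollationKey(x).toByteArray(),
                           c.getCollationKey(y).toByteArray());
      return r != 0 ? r : x.compareTo(y);
    });
    return out;
  }

  public static void main(String[] args) {
    Collator c = Collator.getInstance(Locale.ENGLISH);
    List<String> terms = Arrays.asList("banana", "apple", "Apple", "cherry");
    System.out.println(expectedOrder(c, terms).equals(keyOrder(c, terms)));
  }
}
```

A randomized test would replace the fixed term list with generated terms and a randomly chosen locale, strength, and decomposition mode.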

Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and trunk uses UTF-8 order, so the implemented secondary sort will need to respect that.

From #lucene:
{quote}
rmuir__: so i think we have to on 3.x, sort the 'expected list' with Collator.compare, if thats equal, then as a tiebreak use String.compareTo
rmuir__: and in the index sort on the collated field, followed by the original term
rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the tiebreak for the expected list
rmuir__: instead compare codepoints (iterating character.codepointAt, or comparing .getBytes("UTF-8"))
{quote}
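The difference between the two tiebreaks can be illustrated with a small sketch. The helper name {{compareCodePoints}} is hypothetical, and the sketch assumes well-formed UTF-16 input (no unpaired surrogates). For such input, code point order matches UTF-8 byte order, while {{String.compareTo}} compares UTF-16 code units and therefore sorts supplementary characters before U+E000..U+FFFF.

```java
// Hypothetical sketch of a code point order tiebreak.
final class TieBreakSketch {

  // Compares two strings by Unicode code point rather than by UTF-16 code unit.
  static int compareCodePoints(String a, String b) {
    int i = 0;
    int j = 0;
    while (i < a.length() && j < b.length()) {
      int ca = a.codePointAt(i);
      int cb = b.codePointAt(j);
      if (ca != cb) {
        return ca < cb ? -1 : 1;
      }
      i += Character.charCount(ca);
      j += Character.charCount(cb);
    }
    // One string is a prefix of the other: the shorter one sorts first.
    return (a.length() - i) - (b.length() - j);
  }

  public static void main(String[] args) {
    String bmp = "\uFFFF";                 // U+FFFF, a single code unit
    String supplementary = "\uD800\uDC00"; // U+10000, a surrogate pair
    System.out.println("UTF-16 order:     " + Integer.signum(bmp.compareTo(supplementary)));
    System.out.println("code point order: " + compareCodePoints(bmp, supplementary));
  }
}
```

In this example the two orders disagree: U+FFFF sorts after U+10000 by UTF-16 code unit, but before it by code point.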


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018353#comment-13018353 ] 

Steven Rowe commented on LUCENE-2798:
-------------------------------------

bq. it may be the use of _TestUtil.randomUnicodeString here.

It may, but the first above-listed seed produces this mismatch (strings are printed out as arrays of codepoints):

{noformat}
java.lang.AssertionError: -----------
Indexed string #45: [141]
 Sorted string #45: [141]
-----------
Indexed string #46: [32]
 Sorted string #46: [28, 777]
-----------
Indexed string #47: [28, 777]
 Sorted string #47: [32]

Collator strength: SECONDARY  Collator decomposition: CANONICAL_DECOMPOSITION
{noformat}

#46 and #47 include neither supplementary chars nor problematic BMP chars.

I wrote a test including just [32] and [28,777] as indexed strings, and the same mismatch occurs for random locales, regardless of collator decomposition, and for all collator strengths except PRIMARY.
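A reduced check of this kind might look like the following sketch. It is not the test described above: the class and method names are hypothetical, it assumes the printed code points are decimal (so [32] is a space and [28, 777] is U+001C followed by U+0309), and it exercises only the JDK {{Collator}} against its own collation keys, not Lucene's indexing path, so it may or may not reproduce the reported mismatch.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch: does Collator.compare agree with CollationKey.compareTo?
final class KeyVsCompareSketch {

  // True if direct comparison and key comparison have the same sign.
  static boolean agree(Collator c, String a, String b) {
    int direct = Integer.signum(c.compare(a, b));
    int viaKeys = Integer.signum(c.getCollationKey(a).compareTo(c.getCollationKey(b)));
    return direct == viaKeys;
  }

  public static void main(String[] args) {
    String a = new String(new int[] {32}, 0, 1);      // assumed decimal code point
    String b = new String(new int[] {28, 777}, 0, 2); // assumed decimal code points
    int[] strengths = {
      Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL
    };
    for (int strength : strengths) {
      Collator c = Collator.getInstance(Locale.ENGLISH);
      c.setStrength(strength);
      c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
      System.out.println("strength " + strength + ": agree = " + agree(c, a, b));
    }
  }
}
```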




[jira] Updated: (LUCENE-2798) Randomize indexed collation key testing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2798:
--------------------------------

    Fix Version/s:     (was: 3.1)



[jira] Updated: (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2798:
--------------------------------

    Summary: Randomize indexed collation key testing  (was: Randomize collation testing)



[jira] Commented: (LUCENE-2798) Randomize indexed collation key testing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966933#action_12966933 ] 

Robert Muir commented on LUCENE-2798:
-------------------------------------

Steven, before working too hard on the jdk collation tests, i just had this idea:

Are we sure we shouldn't deprecate the jdk collation functionality (remove in trunk) and only offer ICU?

I was just thinking that the JDK Collator integration is basically a RAM trap due to its awful key size, etc:
http://site.icu-project.org/charts/collation-icu4j-sun





[jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018283#comment-13018283 ] 

Robert Muir commented on LUCENE-2798:
-------------------------------------

just a glance: 

it may be the use of _TestUtil.randomUnicodeString here.
it is not just avoiding supplementaries, but also avoiding things like U+FFFF

bottom line: there are serious bugs in this stuff, and even my current "testThreadSafe" i think is not completely avoiding them (I seem to trigger an OOM from the jre impl every few days)

I've thought about @Ignore'ing our current testThreadSafe for this reason... I don't like dancing around known bugs in a test like this, it makes the test stupid. Somehow this stuff needs to get fixed in ICU/OpenJDK.
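For illustration, a term generator that sidesteps the problematic inputs mentioned above could look like the sketch below. This is a hypothetical helper, not Lucene's {{_TestUtil}}: the name {{randomSafeBmpString}} and the exact exclusion set (surrogates, U+FFFE, U+FFFF, and the noncharacter block U+FDD0..U+FDEF) are assumptions chosen for the example. Passing a seeded {{Random}} keeps a failing run reproducible.

```java
import java.util.Random;

// Hypothetical sketch of a restricted random term generator.
final class SafeRandomTerms {

  // Returns a random string of at most maxLen BMP chars, skipping
  // surrogates and noncharacters such as U+FFFF.
  static String randomSafeBmpString(Random random, int maxLen) {
    int len = random.nextInt(maxLen + 1);
    StringBuilder sb = new StringBuilder(len);
    while (sb.length() < len) {
      char ch = (char) random.nextInt(0x10000);
      boolean noncharacter = ch >= 0xFFFE || (ch >= 0xFDD0 && ch <= 0xFDEF);
      if (Character.isSurrogate(ch) || noncharacter) {
        continue; // draw again
      }
      sb.append(ch);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The same seed always yields the same term.
    String first = randomSafeBmpString(new Random(42L), 20);
    String second = randomSafeBmpString(new Random(42L), 20);
    System.out.println(first.equals(second));
  }
}
```

Excluding these characters hides the underlying collator bugs rather than fixing them, which is the trade-off raised in the comment above.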




[jira] [Updated] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2798:
--------------------------------

    Attachment: LUCENE-2798.patch

Work in progress: a JDK-only, Analyzer-only test, {{TestCollationKeyAnalyzer.testRandomizedCollationKeySort()}}.

The test succeeds most of the times I run it, but sometimes fails, e.g. for these seeds:

* 3253903552510972177:-5236779063463918718
* 1469913545269555695:-7929666046197505961

Robert, would you please take a look at the code and see if you can figure out why the test fails?



[jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018383#comment-13018383 ] 

Robert Muir commented on LUCENE-2798:
-------------------------------------

also i don't see any check that preflex codec isn't in use for this test?





[jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018386#comment-13018386 ] 

Steven Rowe commented on LUCENE-2798:
-------------------------------------

bq. also i don't see any check that preflex codec isn't in use for this test?

{{TestCollationKeyAnalyzer.setUp()}} handles it:
{code:java}
  @Override
  public void setUp() throws Exception {
    super.setUp();
    assumeFalse("preflex format only supports UTF-8 encoded bytes", "PreFlex".equals(CodecProvider.getDefault().getDefaultFieldCodec()));
  }
{code}

And in practice, the test gets skipped 25% of the time as a result of this.




[jira] Updated: (LUCENE-2798) Randomize collation testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2798:
--------------------------------

    Component/s:     (was: contrib/*)
                 Analysis
       Assignee: Steven Rowe

> Randomize collation testing
> ---------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0


[jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018365#comment-13018365 ] 

Robert Muir commented on LUCENE-2798:
-------------------------------------

{quote}
I wrote a test including just [32] and [28,777] as indexed strings, and the same mismatch occurs for random locales, regardless of collator decomposition, and for all collator strengths except PRIMARY.
{quote}

Without looking too hard (are these hex values?): in your debugging, it would be useful to print the sort key as well. Are the sort keys the same?

But FYI, the bugs I found in collation somehow corrupted the internal state of RuleBasedCollator, so the exact strings you are looking at might simply be a symptom.


> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2798.patch


[jira] [Commented] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018374#comment-13018374 ] 

Steven Rowe commented on LUCENE-2798:
-------------------------------------

bq. Without looking too hard (are these hex values?) 

No, it's just the output from Arrays.toString(int[]), which outputs decimal.

bq. in your debugging it would be useful to print the sort key as well.

Agreed. Here's the output:

{quote}
java.lang.AssertionError: -----------
Indexed string #0: [32]
Indexed collation key: [0, 0, 0, 119, 0, 0]
 Sorted string #0: [28, 777]
Sorted collation key: [0, 0, 0, -101, 0, 0]
-----------
Indexed string #1: [28, 777]
Indexed collation key: [0, 0, 0, -101, 0, 0]
 Sorted string #1: [32]
Sorted collation key: [0, 0, 0, 119, 0, 0]

Collator strength: SECONDARY  Collator decomposition: NO_DECOMPOSITION
{quote}

(Again, this is Arrays.toString() output, this time for the byte arrays from the collation keys. That is obviously not ideal, since the bytes are printed as signed integers.)
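
As a debugging aid, the key bytes could instead be printed as unsigned hex, which makes them easier to compare by eye. The sketch below is only an illustration and is not part of the attached patch: the class name KeyBytesSketch and both method names are hypothetical, and the unsigned byte-wise comparison is an assumption about how the keys would be ordered as binary terms.

```java
// Hypothetical debugging helper, for illustration only.
final class KeyBytesSketch {

  /** Formats each byte as a two-digit unsigned hex value, e.g. -101 becomes 9B. */
  static String toUnsignedHex(byte[] key) {
    StringBuilder sb = new StringBuilder("[");
    for (int i = 0; i < key.length; i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", key[i] & 0xFF));
    }
    return sb.append(']').toString();
  }

  /** Compares two keys byte by byte, treating every byte as unsigned (0..255). */
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  public static void main(String[] args) {
    // Key bytes copied from the assertion output above.
    byte[] keyFor32 = {0, 0, 0, 119, 0, 0};
    byte[] keyFor28And777 = {0, 0, 0, -101, 0, 0};
    System.out.println(toUnsignedHex(keyFor32));
    System.out.println(toUnsignedHex(keyFor28And777));
    System.out.println(Integer.signum(compareUnsigned(keyFor32, keyFor28And777)));
  }
}
```

Read as unsigned values, 119 is 0x77 and -101 is 0x9B, so the key for [32] precedes the key for [28, 777], which matches the indexed order shown in the output above.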

bq. Are the sort keys the same?

No.

> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2798.patch


[jira] [Updated] (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-2798:
--------------------------------

    Attachment: LUCENE-2798.patch

Added two-term collation sort test; added collation key debug printing.

> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2798.patch, LUCENE-2798.patch


[jira] Commented: (LUCENE-2798) Randomize indexed collation key testing

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969395#action_12969395 ] 

Steven Rowe commented on LUCENE-2798:
-------------------------------------

{quote}
Are we sure we shouldn't deprecate the jdk collation functionality (remove in trunk) and only offer ICU?

I was just thinking that the JDK Collator integration is basically a RAM trap due to its awful keysize, etc:
http://site.icu-project.org/charts/collation-icu4j-sun
{quote}

I don't like this idea, because it removes the choice.

If there were some way to perform deprecation without eventual removal, I'd be okay with it.  The issue, as I see it, is documentation.  Here is an excerpt from the current class-level javadoc for {{CollationKeyFilter}}:

{quote}
The <code>ICUCollationKeyFilter</code> in the icu package of Lucene's contrib area uses ICU4J's Collator, which makes its version available, thus allowing collation to be versioned independently from the JVM.  ICUCollationKeyFilter is also significantly faster and generates significantly shorter keys than CollationKeyFilter.  See http://site.icu-project.org/charts/collation-icu4j-sun for key generation timing and key length comparisons between ICU4J and java.text.Collator over several languages.
{quote}

So an attempt is already being made to inform potential victims of the choice they're making - it even links to the same web page you mentioned.

Maybe if we move the JDK variant out of core and into a module, rather than on trunk, it would at least send a message that it's on par with the ICU variant.


> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 3.1, 4.0
