Posted to dev@lucene.apache.org by "Steven Rowe (JIRA)" <ji...@apache.org> on 2009/06/28 06:48:47 UTC

[jira] Created: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
---------------------------------------------------------------------------------------

                 Key: LUCENE-1719
                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
    Affects Versions: 2.4.1
            Reporter: Steven Rowe
            Priority: Trivial
             Fix For: 2.9


contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k-word lists in 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), with Collators at the default strength, on a Windows Vista 64-bit machine.  The analysis pipeline consisted of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination and subtracted that time from both of the collation key analysis chains' times.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|
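
As a worked example of the calculation, the values in the rightmost column can be reproduced from the three timing columns as (JVM-WST) / (ICU-WST). The following is a minimal Java sketch of that arithmetic only; the class and method names are hypothetical and are not taken from the benchmark code, which is not included in this issue.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original benchmark.
class IcuSpeedup {
    // Ratio of filter-only times: (JVM - WST) / (ICU - WST).
    static double multiplier(long jvmMs, long icuMs, long wstMs) {
        return (double) (jvmMs - wstMs) / (icuMs - wstMs);
    }

    public static void main(String[] args) {
        // 1.4.2_17 (32 bit), English: java.text=522, ICU4J=212, WhitespaceTokenizer=13
        System.out.println(String.format(Locale.ROOT, "%.1fx", multiplier(522, 212, 13))); // prints 2.6x
    }
}
```

For the first row (522, 212, 13) this is 509 / 199, which rounds to 2.6x.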


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724986#action_12724986 ] 

Robert Muir commented on LUCENE-1719:
-------------------------------------

steven, no, thank you for running the calculations!

yeah i think the sort key length is worth mentioning. in practice i wonder how much it helps lucene at runtime, maybe for things like SORT at least it would improve runtime performance by some small amount.




[jira] Resolved: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer resolved LUCENE-1719.
-------------------------------------

    Resolution: Fixed

I committed your patch and removed the last "NB:" in ICUCollationKeyFilter.java for consistency.

Thanks Steven!

> Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Steven Rowe
>            Assignee: Simon Willnauer
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1719.patch, LUCENE-1719.patch
>
>



[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725023#action_12725023 ] 

Steven Rowe commented on LUCENE-1719:
-------------------------------------

bq. [...] i searched lucene source code for java.text.Collator and found some uses of it (the incremental facility). I wonder if in the future we could find a way to allow usage of com.ibm.icu.text.Collator in these spots.

+1

I guess the way to go would be to make the implementation pluggable.
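
One possible shape for such pluggability, as a rough sketch only: code that needs a locale-sensitive ordering could accept a java.util.Comparator rather than a java.text.Collator. java.text.Collator implements Comparator<Object>, and my understanding is that com.ibm.icu.text.Collator does as well, so either could be passed in. PluggableRange below is a hypothetical class made up for illustration, not an existing or proposed Lucene API:

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical sketch; not an existing Lucene class.
class PluggableRange {
    private final Comparator<? super String> order;
    private final String lower;
    private final String upper;

    // Any Comparator over Strings can be plugged in, including a Collator.
    PluggableRange(Comparator<? super String> order, String lower, String upper) {
        this.order = order;
        this.lower = lower;
        this.upper = upper;
    }

    // Inclusive range check under the supplied ordering.
    boolean contains(String term) {
        return order.compare(lower, term) <= 0 && order.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        Comparator<Object> jdkCollator = Collator.getInstance(Locale.ENGLISH);
        PluggableRange range = new PluggableRange(jdkCollator, "apple", "cherry");
        System.out.println(range.contains("banana")); // prints true
        System.out.println(range.contains("zebra"));  // prints false
    }
}
```

The range check itself never names a collation library, so swapping implementations would only touch the code that constructs the comparator.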




[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724974#action_12724974 ] 

Steven Rowe commented on LUCENE-1719:
-------------------------------------

Cool! Thanks for the link, Robert.

Key comparison under Lucene when using *CollationKeyAnalyzer will utilize neither ICU4J's nor the java.text incremental collation facilities - the base-8000h-String-encoded raw collation keys will be directly compared (and sorted) as Strings.  So key generation time and, as you point out, key length are the appropriate measures here.
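
To illustrate what is being measured, here is a minimal sketch using only the JDK's java.text.Collator: it generates the raw collation key bytes for a term, and compares two raw keys byte by byte (unsigned), which per the CollationKey.toByteArray() javadoc gives the same ordering as comparing the CollationKeys themselves. The class and method names are hypothetical, and the step that encodes the key bytes as a String for indexing is deliberately left out:

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of contrib/collation.
class RawKeyDemo {
    // Generate the raw collation key bytes for a term (the key generation step).
    static byte[] rawKey(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    // Unsigned lexicographic comparison of two raw keys; no Collator is needed here.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        byte[] apple = rawKey(collator, "apple");
        byte[] banana = rawKey(collator, "banana");
        System.out.println("apple key length:  " + apple.length + " bytes");
        System.out.println("banana key length: " + banana.length + " bytes");
        System.out.println("apple sorts first: " + (compareRaw(apple, banana) < 0));
    }
}
```

Once the keys exist, ordering is a plain byte comparison, which is why key generation time and key length are the costs that matter.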

I'll post a patch shortly that includes your ICU4J link, and mentions the key length aspect as well.  I'll also remove specific numbers from the javadoc notes - people can follow the ICU4J link if they're interested.




[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1719:
--------------------------------

    Description: 
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|156%|
|1.4.2_17 (32 bit)|French|716|243|14|207%|
|1.4.2_17 (32 bit)|German|669|264|16|163%|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
|1.5.0_15 (32 bit)|English|604|176|16|268%|
|1.5.0_15 (32 bit)|French|817|209|17|317%|
|1.5.0_15 (32 bit)|German|799|225|20|280%|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
|1.5.0_15 (64 bit)|English|431|89|10|433%|
|1.5.0_15 (64 bit)|French|562|112|11|446%|
|1.5.0_15 (64 bit)|German|567|116|13|438%|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
|1.6.0_13 (64 bit)|English|162|81|9|113%|
|1.6.0_13 (64 bit)|French|192|92|10|122%|
|1.6.0_13 (64 bit)|German|204|99|14|124%|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|


  was:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


        Summary: Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter  (was: Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter)

Edited title to reflect addition of key length concerns, and switched performance improvement column to be percentage improvements rather than multipliers.
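
For reference, the percentage figures relate to the earlier multipliers in a simple way: (JVM-ICU) / (ICU-WST) equals (JVM-WST) / (ICU-WST) minus 1, so a 2.6x multiplier corresponds to roughly a 156% improvement. A minimal Java sketch of both calculations, with hypothetical class and method names chosen only for illustration:

```java
// Hypothetical helper for illustration; not part of the original benchmark.
class IcuImprovement {
    // Multiplier form: (JVM - WST) / (ICU - WST).
    static double multiplier(long jvmMs, long icuMs, long wstMs) {
        return (double) (jvmMs - wstMs) / (icuMs - wstMs);
    }

    // Percentage form: (JVM - ICU) / (ICU - WST), expressed as a percent.
    static double percent(long jvmMs, long icuMs, long wstMs) {
        return 100.0 * (jvmMs - icuMs) / (icuMs - wstMs);
    }

    public static void main(String[] args) {
        // 1.4.2_17 (32 bit), English: java.text=522, ICU4J=212, WhitespaceTokenizer=13
        System.out.println(Math.round(percent(522, 212, 13)) + "%"); // prints 156%
        // The two forms differ by exactly 1 (i.e. 100 percentage points).
        System.out.println(Math.round((multiplier(522, 212, 13) - 1.0) * 100.0) + "%"); // prints 156%
    }
}
```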




[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724996#action_12724996 ] 

Robert Muir commented on LUCENE-1719:
-------------------------------------

steven, another note i thought i would mention.

along these same lines i searched lucene source code for java.text.Collator and found some uses of it (the incremental facility). I wonder if in the future we could find a way to allow usage of com.ibm.icu.text.Collator in these spots.

this could give some healthy performance improvements. I found it in:

QueryParser (for localized RangeQuery)
RangeQuery/RangeFilter/RangeTermEnum/ConstantScoreRangeQuery
FieldComparator/FieldSortedHitQueue/FieldDocSortedHitQueue
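
To make the "incremental facility" concrete: it is the direct Collator.compare(String, String) call, as opposed to precomputing collation keys. A minimal sketch using only the JDK, where the class name and sample words are made up for illustration and nothing here is Lucene code; since java.text.Collator implements java.util.Comparator, it can be handed straight to a sort:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; not Lucene code.
class IncrementalCompareDemo {
    // Sort a copy of the terms, calling collator.compare() for each comparison.
    static String[] sortWith(Collator collator, String[] terms) {
        String[] copy = terms.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        String[] terms = {"Zoo", "apple", "Banana"};

        String[] byCodeUnit = terms.clone();
        Arrays.sort(byCodeUnit); // plain String.compareTo order
        System.out.println(Arrays.toString(byCodeUnit)); // prints [Banana, Zoo, apple]

        String[] localized = sortWith(Collator.getInstance(Locale.ENGLISH), terms);
        System.out.println(Arrays.toString(localized)); // prints [apple, Banana, Zoo]
    }
}
```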






[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1719:
--------------------------------

    Attachment: LUCENE-1719.patch

Patch containing notes to add to collation key filter/analyzer classes' javadocs.

> Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Steven Rowe
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.
> I timed the operation of these two classes with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k-word lists in 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), with Collators at the default strength, on a Windows Vista 64-bit machine.  The analysis pipeline consisted of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination, and then subtracted it from both of the collation key analysis chains' times.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|2.6x|
> |1.4.2_17 (32 bit)|French|716|243|14|3.1x|
> |1.4.2_17 (32 bit)|German|669|264|16|2.6x|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
> |1.5.0_15 (32 bit)|English|604|176|16|3.7x|
> |1.5.0_15 (32 bit)|French|817|209|17|4.2x|
> |1.5.0_15 (32 bit)|German|799|225|20|3.8x|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
> |1.5.0_15 (64 bit)|English|431|89|10|5.3x|
> |1.5.0_15 (64 bit)|French|562|112|11|5.5x|
> |1.5.0_15 (64 bit)|German|567|116|13|5.4x|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
> |1.6.0_13 (64 bit)|English|162|81|9|2.1x|
> |1.6.0_13 (64 bit)|French|192|92|10|2.2x|
> |1.6.0_13 (64 bit)|German|204|99|14|2.2x|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|



[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724923#action_12724923 ] 

Steven Rowe commented on LUCENE-1719:
-------------------------------------

I also tested ICU4J version 4.2 (released 6 weeks ago), and the timings were nearly identical to those from ICU4J version 4.0 (the one that's in contrib/collation/lib/).

The timings given in the table above were produced without the "-server" JVM option.  I separately tested all combinations with "-server": there was no difference for the 32-bit JVMs, and the 64-bit JVMs were roughly 3-4% faster.  My impression (I didn't actually calculate it) is that although the best times of 5 runs were better for the 64-bit JVMs with "-server", the average times were slightly worse.  In any case, the performance improvement of the ICU4J implementation over the java.text.Collator implementation was basically unaffected by the "-server" option.
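
For readers who want to reproduce the arithmetic, here is a minimal sketch of the best-of-N timing and of how the "ICU4J Improvement" column follows from the three timing columns. It is not the harness that produced the table: the class name, method names, and the tiny word list are hypothetical, it uses Java 8 syntax rather than anything available on the JVMs tested, and it times only java.text.Collator key generation as a stand-in workload. The multiples in the table match (JVM-WST) / (ICU-WST), and the percentages match (JVM-ICU) / (ICU-WST), which is the same quantity minus one, expressed as a percentage.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollationTimingSketch {

    /** Runs the task the given number of times and returns the smallest elapsed time in nanoseconds. */
    static long bestOf(int runs, Runnable task) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Speedup as a multiple, after subtracting tokenizer time: (JVM - WST) / (ICU - WST). */
    static double speedup(double jvm, double icu, double wst) {
        return (jvm - wst) / (icu - wst);
    }

    /** Improvement as a percentage: (JVM - ICU) / (ICU - WST) * 100. */
    static double improvementPercent(double jvm, double icu, double wst) {
        return 100.0 * (jvm - icu) / (icu - wst);
    }

    public static void main(String[] args) {
        // Row "1.4.2_17 (32 bit), English": java.text=522, ICU4J=212, WhitespaceTokenizer=13
        System.out.printf(Locale.ROOT, "%.1fx%n", speedup(522, 212, 13));             // prints 2.6x
        System.out.printf(Locale.ROOT, "%.0f%%%n", improvementPercent(522, 212, 13)); // prints 156%

        // Hypothetical stand-in workload: generate collation keys for a few words.
        List<String> words = Arrays.asList("peach", "péché", "pêche", "sin");
        Collator collator = Collator.getInstance(Locale.FRENCH);
        long nanos = bestOf(5, () -> {
            for (String w : words) {
                collator.getCollationKey(w);
            }
        });
        System.out.println("best of 5: " + nanos + " ns");
    }
}
```

Taking the best of several runs, as in the table, reduces noise from JIT warm-up and garbage collection, at the cost of hiding the variance that the averages would show.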




[jira] Assigned: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer reassigned LUCENE-1719:
---------------------------------------

    Assignee: Simon Willnauer

> Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1719
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1719
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.4.1
>            Reporter: Steven Rowe
>            Assignee: Simon Willnauer
>            Priority: Trivial
>             Fix For: 2.9
>
>         Attachments: LUCENE-1719.patch, LUCENE-1719.patch
>
>
> contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.
> My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.
> I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-ICU) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:
> ||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
> |1.4.2_17 (32 bit)|English|522|212|13|156%|
> |1.4.2_17 (32 bit)|French|716|243|14|207%|
> |1.4.2_17 (32 bit)|German|669|264|16|163%|
> |1.4.2_17 (32 bit)|Ukrainian|931|474|25|102%|
> |1.5.0_15 (32 bit)|English|604|176|16|268%|
> |1.5.0_15 (32 bit)|French|817|209|17|317%|
> |1.5.0_15 (32 bit)|German|799|225|20|280%|
> |1.5.0_15 (32 bit)|Ukrainian|1029|436|26|145%|
> |1.5.0_15 (64 bit)|English|431|89|10|433%|
> |1.5.0_15 (64 bit)|French|562|112|11|446%|
> |1.5.0_15 (64 bit)|German|567|116|13|438%|
> |1.5.0_15 (64 bit)|Ukrainian|734|281|21|174%|
> |1.6.0_13 (64 bit)|English|162|81|9|113%|
> |1.6.0_13 (64 bit)|French|192|92|10|122%|
> |1.6.0_13 (64 bit)|German|204|99|14|124%|
> |1.6.0_13 (64 bit)|Ukrainian|273|202|21|39%|



[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725648#action_12725648 ] 

Simon Willnauer commented on LUCENE-1719:
-----------------------------------------

Steven, patch looks good to me. I will look at it again in a day or two.

simon



[jira] Commented: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724941#action_12724941 ] 

Robert Muir commented on LUCENE-1719:
-------------------------------------

steven, you are correct. 

i should have clarified: the gain is not as large when generating keys, but there are still huge gains for runtime comparison. see recent numbers here for a few languages:

http://site.icu-project.org/charts/collation-icu4j-sun

but you should also mention that key size is smaller too!  (smaller term dictionary)
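
To make the distinction between the two operations concrete, here is a small sketch of key generation versus runtime comparison, plus a way to inspect key size. It uses only java.text.Collator so that it runs without extra jars; the class and method names are hypothetical. ICU4J's com.ibm.icu.text.Collator offers analogous getCollationKey and compare methods, so the same measurement could presumably be repeated with it on the classpath to compare key lengths, but that part is not shown here.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeySketch {

    /** Runtime comparison: the collator examines both strings on every call. */
    static int compareDirect(Collator collator, String a, String b) {
        return Integer.signum(collator.compare(a, b));
    }

    /** Key-based comparison: generate sort keys once, then compare the keys. */
    static int compareViaKeys(Collator collator, String a, String b) {
        CollationKey keyA = collator.getCollationKey(a);
        CollationKey keyB = collator.getCollationKey(b);
        return Integer.signum(keyA.compareTo(keyB));
    }

    /** Size in bytes of the binary sort key; this is what a collation key filter would index. */
    static int keyLength(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray().length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.GERMAN);
        String[] words = {"Apfel", "Äpfel", "Zebra", "apfel"};
        for (String a : words) {
            for (String b : words) {
                // Both routes must agree on the ordering of every pair.
                if (compareDirect(collator, a, b) != compareViaKeys(collator, a, b)) {
                    throw new IllegalStateException("mismatch for " + a + " / " + b);
                }
            }
            System.out.println(a + ": key length " + keyLength(collator, a) + " bytes");
        }
    }
}
```

The collation key filters do the key-generation half at index time, which is why the indexing speedup measured in this issue is smaller than the runtime-comparison numbers on the ICU chart, and why shorter keys translate directly into a smaller term dictionary.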



[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's speed advantage over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1719:
--------------------------------

      Description: 
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


  was:
contrib/collation's ICUCollationKeyFilter, which uses ICU4J collation, is faster than CollationKeyFilter, the JVM-provided java.text.Collator implementation in the same package.  The javadocs of these classes should be modified to add a note to this effect.

My curiosity was piqued by [Robert Muir's comment|https://issues.apache.org/jira/browse/LUCENE-1581?focusedCommentId=12720300#action_12720300] on LUCENE-1581, in which he states that ICUCollationKeyFilter is up to 30x faster than CollationKeyFilter.

I timed the operation of these two classes, with Sun JVM versions 1.4.2/32-bit, 1.5.0/32- and 64-bit, and 1.6.0/64-bit, using 90k word lists of 4 languages (taken from the corresponding Debian wordlist packages and truncated to the first 90k words after a fixed random shuffling), using Collators at the default strength, on a Windows Vista 64-bit machine.  I used an analysis pipeline consisting of WhitespaceTokenizer chained to the collation key filter, so to isolate the time taken by the collation key filters, I also timed WhitespaceTokenizer operating alone for each combination, and then subtracted it from both of the collation key analysis chains' times.  The rightmost column represents the performance advantage of the ICU4J implementation (ICU) over the java.text.Collator implementation (JVM), after discounting the WhitespaceTokenizer time (WST): (JVM-WST) / (ICU-WST). The best times out of 5 runs for each combination, in milliseconds, are as follows:

||Sun JVM||Language||java.text||ICU4J||WhitespaceTokenizer||ICU4J Improvement||
|1.4.2_17 (32 bit)|English|522|212|13|2.6x|
|1.4.2_17 (32 bit)|French|716|243|14|3.1x|
|1.4.2_17 (32 bit)|German|669|264|16|2.6x|
|1.4.2_17 (32 bit)|Ukrainian|931|474|25|2.0x|
|1.5.0_15 (32 bit)|English|604|176|16|3.7x|
|1.5.0_15 (32 bit)|French|817|209|17|4.2x|
|1.5.0_15 (32 bit)|German|799|225|20|3.8x|
|1.5.0_15 (32 bit)|Ukrainian|1029|436|26|2.4x|
|1.5.0_15 (64 bit)|English|431|89|10|5.3x|
|1.5.0_15 (64 bit)|French|562|112|11|5.5x|
|1.5.0_15 (64 bit)|German|567|116|13|5.4x|
|1.5.0_15 (64 bit)|Ukrainian|734|281|21|2.7x|
|1.6.0_13 (64 bit)|English|162|81|9|2.1x|
|1.6.0_13 (64 bit)|French|192|92|10|2.2x|
|1.6.0_13 (64 bit)|German|204|99|14|2.2x|
|1.6.0_13 (64 bit)|Ukrainian|273|202|21|1.4x|


    Lucene Fields: [New, Patch Available]  (was: [New])



[jira] Updated: (LUCENE-1719) Add javadoc notes about ICUCollationKeyFilter's advantages over CollationKeyFilter

Posted by "Steven Rowe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1719:
--------------------------------

    Attachment: LUCENE-1719.patch

Updated patch including information about ICU4J's shorter key length; adding a link to the ICU4J documentation's comparison of ICU4J and java.text.Collator key generation time and key length; and removing specific performance numbers.
