You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Shay Banon (Created) (JIRA)" <ji...@apache.org> on 2011/12/17 22:29:30 UTC

[jira] [Created] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
--------------------------------------------------------------------------------

                 Key: LUCENE-3654
                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/index, core/search
            Reporter: Shay Banon


Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171787#comment-13171787 ] 

Robert Muir commented on LUCENE-3654:
-------------------------------------

where is the hotspot that requires this?

If there is many comparisons being performed somewhere, we should use a better datastructure instead.

My vote: -1

We have everything to lose and little to gain.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ] 

Uwe Schindler edited comment on LUCENE-3654 at 12/21/11 10:09 AM:
------------------------------------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods.

*EDIT*
You also have to copy offset, length and the actual byte[] reference to a local variable at the beginning and before the bounds checks (because otherwise another thread could change the *public* non-final fields in BytesRef and cause SIGSEGV). BytesRef is a user-visible class so it must be 100% safe against all usage-violations.

Based on this additional overhead, the whole comparator makes no sense except for terms with a size of 200 bytes. But Lucene terms are in 99% of all cases shorter.

If you want to use this comparator, just subclass Lucene40Codec and return it as term comparator, this can be completely outside Lucene. You can even use Guava.
                
      was (Author: thetaphi):
    The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods.

*EDIT*
You also have to copy offset, length and the actual byte[] reference to a local variable at the beginning and before the bounds checks (because otherwise another thread could change the *public* npon-final fields in BytesRef and cause OOM). BytesRef is a user-visible class so it must be 100% safe against all usage-violations.

Based on this additional overhead, the whole comparator makes no sense except for terms with a size of 200 bytes. But Lucene terms are in 99% of all cases shorter.

If you want to use this comparator, just subclass Lucene40Codec and return it as term comparator, this can be completely outside Lucene. You can even use Guava.
                  
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Simon Willnauer (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171695#comment-13171695 ] 

Simon Willnauer commented on LUCENE-3654:
-----------------------------------------

Shay, thanks for doing this. Somebody mentioned this on the list earlier. I will take your patch and remove all the formatting if you don't mind. I think this makes lots of sense especially if you have really long references.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173976#comment-13173976 ] 

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

bq. can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?

We can put the whole comparator to contrib and BytesRef can have a static setter to change the default impl. Or we use SPI for it (contrib exports it in META-INF) :-)
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Shay Banon (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shay Banon updated LUCENE-3654:
-------------------------------

    Attachment: LUCENE-3654.patch

Patch attached, all tests seem to pass.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171860#comment-13171860 ] 

Robert Muir commented on LUCENE-3654:
-------------------------------------

Its not like mmapdirectory at all, there we are basically doing what 
we must as a workaround to one of the top bugs at bugs.sun.com to reduce
number of open files and transient disk usage. It seems like a reasonable
calculated risk, but its still terrible.

I know you want to discuss this, but this is so ridiculous that I'm not
even going to have the conversation. I'm not going to change my vote here.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173954#comment-13173954 ] 

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

I agree here, but before doing this, I want some non-micro-benchmarks to show the effect. If there is no real effect, don't do it. Inside Lucene the comparator is not so often used (mostly only in indexer/BytesRefHash) and in TermRangeQuery. The other use cases are asserts all over the place, but they don't count.

I would agree to the patch if the class would be renamed to something like UnsignedBytesComparator and the part importing sun.misc.Unsafe to be outside the main compilation unit. So if somebody compiles with a strange JVM like Harmony (although its dead) and sun.misc.Unsafe is not available, the build succeeds. The code in BytesRef is using reflection to load the comparator implementation, so all is fine, it would just get ClassNotFoundEx and fallback to the Java one. I could help in doing the ANT magic.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Shay Banon (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173843#comment-13173843 ] 

Shay Banon commented on LUCENE-3654:
------------------------------------

ok, so Mr Muir is -1, if people/devs still want to see it in Lucene, tell me and I can try and work on making it more Committable based on Uwe comments.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171750#comment-13171750 ] 

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

Nice, I remember Jason brought this up!

My problem is only: sun.misc.Unsafe is undocumented, so it might not be available on compile time. Can we handle that somehow without reflection, maybe in a separate file where we dont fail if the javac ant task fails? I know, Unsafe is always available, but I dont want to rely on it - maybe someone wants to compile our code with SomeStrangeJDK™.
The other thing is the inner class holder. Can we rename it to not contain "lexicographical", maybe UnsignedByteComparatorHolder? Lucene works on byte[] that can be anything, even binary terms.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Simon Willnauer (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173972#comment-13173972 ] 

Simon Willnauer commented on LUCENE-3654:
-----------------------------------------

can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171716#comment-13171716 ] 

Jason Rutherglen commented on LUCENE-3654:
------------------------------------------

Nice, I mentioned this on the dev list over a month ago (originally it was mentioned on the Hadoop list), nice to see it get into Lucene, though am curious where the speed improvement will be for Lucene.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173971#comment-13173971 ] 

Robert Muir commented on LUCENE-3654:
-------------------------------------

Here's an example, since so much of the lucene codebase has bugs with bytesref offsets, i figure its a good example:
{noformat}
  public void testOops() {
    BytesRef b = new BytesRef("abcdefghijklmnop");
    b.offset = -545454544; // some bug, integer overflows and goes negative or other problem
    System.out.println(b.compareTo(new BytesRef("abcdefghijklmnop")));
  }
{noformat}

With this patch, this gives me a SIGSEGV:
{noformat}
junit-sequential:
    [junit] Testsuite: org.apache.lucene.util.TestBytesRef
    [junit] #
    [junit] # A fatal error has been detected by the Java Runtime Environment:
    [junit] #
    [junit] #  SIGSEGV (0xb) at pc=0x00007f386e7dcf64, pid=6093, tid=139880338200320
    [junit] #
    [junit] # JRE version: 6.0_24-b07
    [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops)
    [junit] # Problematic frame:
    [junit] # V  [libjvm.so+0x76ef64]
    [junit] #
{noformat}

                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171802#comment-13171802 ] 

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

What do we have to lose, it's like with MMapDirectory, a few tests on startup and setting a reference to the improved comparator and done? Indeed we need compile-time checking (import of the Unsafe class), but if there are incompatible changes we will see on runtime linking and automatically fall back to the Java-Only comparator.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Jason Rutherglen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173853#comment-13173853 ] 

Jason Rutherglen commented on LUCENE-3654:
------------------------------------------

+1 There are 3 other MAJOR Apache projects that have already integrated this efficiency.  It's completely silly not to use it.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ] 

Uwe Schindler commented on LUCENE-3654:
---------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Shay Banon (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171699#comment-13171699 ] 

Shay Banon commented on LUCENE-3654:
------------------------------------

Cool. 

p.s. I formatted it using the IDEA formatting template provided in Lucene :)
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173962#comment-13173962 ] 

Robert Muir commented on LUCENE-3654:
-------------------------------------

the reason I am -1, I don't want JVM crashes.

This is lucene java, users can expect not to have JVM crashes because of bytesref 
bugs in lucene (this class is used all over the place), they shoudl get AIOOBE and NPE
and other things.

So all is not fine just because it has a fallback. 

Convincing that there is performance win is a waste of time, this method is not a hotspot.
Convincing me that nobody will get jvm crashes is going to be difficult.

                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Uwe Schindler (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173973#comment-13173973 ] 

Uwe Schindler edited comment on LUCENE-3654 at 12/21/11 10:08 AM:
------------------------------------------------------------------

The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods.

*EDIT*
You also have to copy offset, length and the actual byte[] reference to a local variable at the beginning and before the bounds checks (because otherwise another thread could change the *public* npon-final fields in BytesRef and cause OOM). BytesRef is a user-visible class so it must be 100% safe against all usage-violations.

Based on this additional overhead, the whole comparator makes no sense except for terms with a size of 200 bytes. But Lucene terms are in 99% of all cases shorter.

If you want to use this comparator, just subclass Lucene40Codec and return it as term comparator, this can be completely outside Lucene. You can even use Guava.
                
      was (Author: thetaphi):
    The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods.
                  
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173977#comment-13173977 ] 

Robert Muir commented on LUCENE-3654:
-------------------------------------

Sorry, i'm totally against the change, even with safety checks. 

I think this will hurt the reputation of project, and i think it will be a nightmare for developers too (Sorry, i dont want to debug avoidable jvm crashes).
And I don't want to see Lucene start using unsafe everywhere. This is lucene-java, things like bounds checking are part of the language.


                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3654) Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

Posted by "Dawid Weiss (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173941#comment-13173941 ] 

Dawid Weiss commented on LUCENE-3654:
-------------------------------------

There's been an interesting discussion about the common use of Unsafe on hotspot mailing list recently (can't recall the thread now though). Some people even wanted Unsafe to become part of standard library (not unsafe accesses -- the lock checking part, but nonetheless). This guy wrote an entire off-heap collections library on top of Unsafe:

http://www.ohloh.net/p/java-huge-collections

I think using Unsafe with a fallback is fine, especially in small-scope methods that are used frequently and can be thoroughly tested. BytesRef is such an example to me.

This said, it would certainly help to convince Robert and others if you ran benchmarks alongside with and without Unsafe and show how much there is to gain, Shay.
                
> Optimize BytesRef comparator to use Unsafe long based comparison (when possible)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3654
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index, core/search
>            Reporter: Shay Banon
>         Attachments: LUCENE-3654.patch
>
>
> Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org