You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Alan Chaney <al...@mechnicality.com> on 2013/04/01 16:54:02 UTC

Inconsistencies in comparisons using KeyComparator

Hi

I need to write some code that sorts row keys identically to HBase.

I looked at the KeyValue.KeyComparator code, and it seems that, by 
default, HBase elects to use the 'Unsafe' comparator as the basis of its 
comparison, with a fall-back to to the PureJavaComparer should Unsafe 
not be available (for example, in tests.)

However, I'm finding that the sort order from a call to 
KeyValue.KeyComparator appears to be inconsistent between the two forms.

As an example, comparing:

(first param) (second param)
0000000000000000ffffffffffffffffffffffffffffffff616c1b to 
0000000000000000ffffffffffffffffffffffffffffffff61741b

gives 1 for the default (presumably, Unsafe) call, and -1 using the 
PureJavaComparator.

I would actually expect it to be a -ve number, based on the difference 
of 6c to 74 in the 3rd from last byte above.

Similarly

000000000000000000000000000000000000000000000000616c1b to 
0000000000000000000000000000000000000000000000061741b

gives > 0 instead of < 0. The PureJavaComparator does a byte-by-byte 
comparison by

Is this expected? From the definition of lexicographical compare that I 
found, I don't think so. There's no issue of signed comparison here, 
because 0x6c and 0x74 are still +ve byte values.

Regards

Alan



Re: Inconsistencies in comparisons using KeyComparator

Posted by Ted Yu <yu...@gmail.com>.
Looking at
http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/share/classes/sun/misc/Unsafe.java,
looks like Unsafe is provided by openjdk as well.

I guess this issue, though disturbing, wouldn't show up.


On Mon, Apr 1, 2013 at 10:04 AM, Alan Chaney <al...@mechnicality.com> wrote:

>
> On 4/1/2013 9:42 AM, Stack wrote:
>
>> That is an interesting (disturbing) find Alan.  Hopefully the fallback is
>> rare.  Did you have a technique for making the compare fallback to pure
>> java compare?
>>
>> Thank you,
>> St.Ack
>>
>
> I agree its disturbing! I based my findings on reading the source code for
> 0.92.1  (the CDH4.1.2 distro).
>
> It seems to me that, from org.apache.hadoop.hbase.**KeyValue$KVComparator
> the KeyComparator calls KeyComparator.compareRows which in turn calls
>
> Bytes.compareTo(left, loffset, llength, righ, roffset, rlength) which in
> turn calls Bytes.compareTo which calls LexicographicalCompareHolder.**
> BEST_COMPARER
>
> which appears to be implemented thus:
>
>   static class LexicographicalComparerHolder {
>     static final String UNSAFE_COMPARER_NAME =
>         LexicographicalComparerHolder.**class.getName() +
> "$UnsafeComparer";
>
>     static final Comparer<byte[]> BEST_COMPARER = getBestComparer();
>     /**
>      * Returns the Unsafe-using Comparer, or falls back to the pure-Java
>      * implementation if unable to do so.
>      */
>     static Comparer<byte[]> getBestComparer() {
>       try {
>         Class<?> theClass = Class.forName(UNSAFE_COMPARER_**NAME);
> ...
>     }
>
>     enum PureJavaComparer implements Comparer<byte[]> {
>       INSTANCE;
>
>       @Override
>       public int compareTo(byte[] buffer1, int offset1, int length1,
>    ...
>       }
>     }
>
> So, it looks like to me that Unsafe is the default. However, its not
> really very easy to debug this, except by invoking the
> KeyValue.KeyComparator and seeing what you get, which is what I did. Either
> I'm doing something very stupid (extremely plausible) or there is a bit of
> an issue here. I was hoping that someone would point out my error!
>
> I've got some unit tests that appear to show the difference.
>
> Thanks
>
> Alan
>
>
>
>
>>
>> On Mon, Apr 1, 2013 at 7:54 AM, Alan Chaney <al...@mechnicality.com>
>> wrote:
>>
>>  Hi
>>>
>>> I need to write some code that sorts row keys identically to HBase.
>>>
>>> I looked at the KeyValue.KeyComparator code, and it seems that, by
>>> default, HBase elects to use the 'Unsafe' comparator as the basis of its
>>> comparison, with a fall-back to to the PureJavaComparer should Unsafe not
>>> be available (for example, in tests.)
>>>
>>> However, I'm finding that the sort order from a call to
>>> KeyValue.KeyComparator appears to be inconsistent between the two forms.
>>>
>>> As an example, comparing:
>>>
>>> (first param) (second param)
>>> 0000000000000000ffffffffffffff****ffffffffffffffffff616c1b to
>>> 0000000000000000ffffffffffffff****ffffffffffffffffff61741b
>>>
>>> gives 1 for the default (presumably, Unsafe) call, and -1 using the
>>> PureJavaComparator.
>>>
>>> I would actually expect it to be a -ve number, based on the difference of
>>> 6c to 74 in the 3rd from last byte above.
>>>
>>> Similarly
>>>
>>> 000000000000000000000000000000****000000000000000000616c1b to
>>> 000000000000000000000000000000****0000000000000000061741b
>>>
>>> gives > 0 instead of < 0. The PureJavaComparator does a byte-by-byte
>>> comparison by
>>>
>>> Is this expected? From the definition of lexicographical compare that I
>>> found, I don't think so. There's no issue of signed comparison here,
>>> because 0x6c and 0x74 are still +ve byte values.
>>>
>>> Regards
>>>
>>> Alan
>>>
>>>
>>>
>>>
>

Re: Inconsistencies in comparisons using KeyComparator

Posted by Alan Chaney <al...@mechnicality.com>.
On 4/1/2013 9:42 AM, Stack wrote:
> That is an interesting (disturbing) find Alan.  Hopefully the fallback is
> rare.  Did you have a technique for making the compare fallback to pure
> java compare?
>
> Thank you,
> St.Ack

I agree its disturbing! I based my findings on reading the source code 
for 0.92.1  (the CDH4.1.2 distro).

It seems to me that, from org.apache.hadoop.hbase.KeyValue$KVComparator 
the KeyComparator calls KeyComparator.compareRows which in turn calls

Bytes.compareTo(left, loffset, llength, righ, roffset, rlength) which in 
turn calls Bytes.compareTo which calls 
LexicographicalCompareHolder.BEST_COMPARER

which appears to be implemented thus:

   static class LexicographicalComparerHolder {
     static final String UNSAFE_COMPARER_NAME =
         LexicographicalComparerHolder.class.getName() + "$UnsafeComparer";

     static final Comparer<byte[]> BEST_COMPARER = getBestComparer();
     /**
      * Returns the Unsafe-using Comparer, or falls back to the pure-Java
      * implementation if unable to do so.
      */
     static Comparer<byte[]> getBestComparer() {
       try {
         Class<?> theClass = Class.forName(UNSAFE_COMPARER_NAME);
...
     }

     enum PureJavaComparer implements Comparer<byte[]> {
       INSTANCE;

       @Override
       public int compareTo(byte[] buffer1, int offset1, int length1,
    ...
       }
     }

So, it looks like to me that Unsafe is the default. However, its not 
really very easy to debug this, except by invoking the 
KeyValue.KeyComparator and seeing what you get, which is what I did. 
Either I'm doing something very stupid (extremely plausible) or there is 
a bit of an issue here. I was hoping that someone would point out my error!

I've got some unit tests that appear to show the difference.

Thanks

Alan


>
>
> On Mon, Apr 1, 2013 at 7:54 AM, Alan Chaney <al...@mechnicality.com> wrote:
>
>> Hi
>>
>> I need to write some code that sorts row keys identically to HBase.
>>
>> I looked at the KeyValue.KeyComparator code, and it seems that, by
>> default, HBase elects to use the 'Unsafe' comparator as the basis of its
>> comparison, with a fall-back to to the PureJavaComparer should Unsafe not
>> be available (for example, in tests.)
>>
>> However, I'm finding that the sort order from a call to
>> KeyValue.KeyComparator appears to be inconsistent between the two forms.
>>
>> As an example, comparing:
>>
>> (first param) (second param)
>> 0000000000000000ffffffffffffff**ffffffffffffffffff616c1b to
>> 0000000000000000ffffffffffffff**ffffffffffffffffff61741b
>>
>> gives 1 for the default (presumably, Unsafe) call, and -1 using the
>> PureJavaComparator.
>>
>> I would actually expect it to be a -ve number, based on the difference of
>> 6c to 74 in the 3rd from last byte above.
>>
>> Similarly
>>
>> 000000000000000000000000000000**000000000000000000616c1b to
>> 000000000000000000000000000000**0000000000000000061741b
>>
>> gives > 0 instead of < 0. The PureJavaComparator does a byte-by-byte
>> comparison by
>>
>> Is this expected? From the definition of lexicographical compare that I
>> found, I don't think so. There's no issue of signed comparison here,
>> because 0x6c and 0x74 are still +ve byte values.
>>
>> Regards
>>
>> Alan
>>
>>
>>


Re: Inconsistencies in comparisons using KeyComparator

Posted by Stack <st...@duboce.net>.
That is an interesting (disturbing) find Alan.  Hopefully the fallback is
rare.  Did you have a technique for making the compare fallback to pure
java compare?

Thank you,
St.Ack


On Mon, Apr 1, 2013 at 7:54 AM, Alan Chaney <al...@mechnicality.com> wrote:

> Hi
>
> I need to write some code that sorts row keys identically to HBase.
>
> I looked at the KeyValue.KeyComparator code, and it seems that, by
> default, HBase elects to use the 'Unsafe' comparator as the basis of its
> comparison, with a fall-back to to the PureJavaComparer should Unsafe not
> be available (for example, in tests.)
>
> However, I'm finding that the sort order from a call to
> KeyValue.KeyComparator appears to be inconsistent between the two forms.
>
> As an example, comparing:
>
> (first param) (second param)
> 0000000000000000ffffffffffffff**ffffffffffffffffff616c1b to
> 0000000000000000ffffffffffffff**ffffffffffffffffff61741b
>
> gives 1 for the default (presumably, Unsafe) call, and -1 using the
> PureJavaComparator.
>
> I would actually expect it to be a -ve number, based on the difference of
> 6c to 74 in the 3rd from last byte above.
>
> Similarly
>
> 000000000000000000000000000000**000000000000000000616c1b to
> 000000000000000000000000000000**0000000000000000061741b
>
> gives > 0 instead of < 0. The PureJavaComparator does a byte-by-byte
> comparison by
>
> Is this expected? From the definition of lexicographical compare that I
> found, I don't think so. There's no issue of signed comparison here,
> because 0x6c and 0x74 are still +ve byte values.
>
> Regards
>
> Alan
>
>
>