You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@iceberg.apache.org by Piotr Findeisen <pi...@starburstdata.com> on 2021/07/16 22:48:53 UTC

string bucketing compatibility issue

Hi,

It was discovered by @Mateusz Gajewski
<ma...@starburstdata.com> that
Iceberg bucketing transformation for string isn't regular Murmur3 32-bit
hash.

Upon closer investigation we found out that the code:

https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239

is affected by Guava issue https://github.com/google/guava/issues/5648 that
causes wrong results for input containing surrogate pairs (Unicode
codepooints outside of Basic Multilingual Plane).

Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
the proposed fix), this can cause incorrect query results, since bucketing
function definition will effectively change.

This is mostly FYI, unless we can do something more about it.

Best
PF

Re: string bucketing compatibility issue

Posted by Ryan Blue <bl...@tabular.io>.
Thanks, Piotr! I've added this to the agenda for our next sync. I think the
main question is whether we think users have hit this problem or not. If we
don't think that it is something users have probably hit, then we can just
fix the problem. If we think someone has data stored with the incorrect
bucket values, we'll have to update Iceberg to account for it.

It would be great to hear from other people on this list that may be able
to comment about how many codepoints are affected and whether those are
expected to be used in these cases. For example, a handful of emoji that
probably aren't used in bucketing is not a big concern, but if certain
languages are heavily affected then this is likely a problem we should
address.

On Mon, Jul 19, 2021 at 2:37 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> I've filed https://github.com/apache/iceberg/issues/2837 for this as well.
>
> Best
> PF
>
>
>
> On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> It was discovered by @Mateusz Gajewski
>> <ma...@starburstdata.com> that Iceberg bucketing
>> transformation for string isn't regular Murmur3 32-bit hash.
>>
>> Upon closer investigation we found out that the code:
>>
>>
>> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
>>
>> is affected by Guava issue https://github.com/google/guava/issues/5648
>> that causes wrong results for input containing surrogate pairs (Unicode
>> codepooints outside of Basic Multilingual Plane).
>>
>> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
>> the proposed fix), this can cause incorrect query results, since bucketing
>> function definition will effectively change.
>>
>> This is mostly FYI, unless we can do something more about it.
>>
>> Best
>> PF
>>
>>

-- 
Ryan Blue
Tabular

Re: string bucketing compatibility issue

Posted by Piotr Findeisen <pi...@starburstdata.com>.
Hi,

I've filed https://github.com/apache/iceberg/issues/2837 for this as well.

Best
PF



On Sat, Jul 17, 2021 at 12:48 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> It was discovered by @Mateusz Gajewski
> <ma...@starburstdata.com> that Iceberg bucketing
> transformation for string isn't regular Murmur3 32-bit hash.
>
> Upon closer investigation we found out that the code:
>
>
> https://github.com/apache/iceberg/blob/0c50b2074cd5dad59bbcb4b4599ec3ae11a34b49/api/src/main/java/org/apache/iceberg/transforms/Bucket.java#L239
>
> is affected by Guava issue https://github.com/google/guava/issues/5648
> that causes wrong results for input containing surrogate pairs (Unicode
> codepooints outside of Basic Multilingual Plane).
>
> Assuming it's indeed a bug and it gets fixed (I posted a PR to Guava with
> the proposed fix), this can cause incorrect query results, since bucketing
> function definition will effectively change.
>
> This is mostly FYI, unless we can do something more about it.
>
> Best
> PF
>
>