
Posted to dev@iceberg.apache.org by Piotr Findeisen <pi...@starburstdata.com> on 2021/09/13 11:30:02 UTC

Re: [DISCUSS] UUID type

Hi,

It seems we converged here that UUID should remain included.
I read this as consensus having been reached, but that may be subjective. Did we
objectively reach consensus on this?

From the Iceberg project's perspective there isn't anything to do, as UUID
already *is* part of the spec (
https://iceberg.apache.org/spec/#schemas-and-data-types).
The Trino Iceberg PR adding support for UUID,
https://github.com/trinodb/trino/pull/8747, has been pending merge while this
conversation has been ongoing.

Best,
PF



On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:

> Hi Ryan and all,
>
> That sounds like a reasonable reason to leave IP address types out. In my
> experience, dedicated IP address types are mostly found in logging tools
> and other things for sysadmins / DevOps etc.
>
> When querying data with IP addresses, I’ve seen it done quite a lot (e.g.
> for security reasons), but the addresses are usually stored as strings or
> manipulated in a UDF. They’re not commonly supported types.
>
> I would also draw the line at UUID types.
>
> - Kyle Bendickson
>
> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>
> Jacques, you make some good points here. I think my argument about
> usability leading to performance issues is a stronger argument for engines
> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
> chooses to use a string in an engine that doesn't have a UUID type.
>
> Another thing to consider is cross-engine support. If Iceberg removes
> UUID, then Trino would probably translate to fixed[16]. That results in a
> table that's difficult to query in other engines, where people would
> probably choose to store the data as a string. On the other hand, if
> Iceberg keeps the UUID type then integrations would simply translate to the
> UUID string representation before passing data to the other engines.
> While the engines would be using 36-byte values in join keys, the user
> experience issue is fixed and the data is more compact on disk and in
> Iceberg's bounds metadata.
>
> While having a UUID type in Iceberg can't really help engines that don't
> support UUID take advantage of the type at runtime, it does seem slightly
> better to have the UUID type in general since at least one engine supports
> it and it provides the expected user experience with a compact
> representation.
>
> IPv4 addresses are a good thing to think about as well, since most of the
> same arguments apply. If we keep the UUID type, should we also add IPv4 or
> IPv6 types? I would probably draw the line at UUID because it helps in
> joins, which are an important operation. IPv4 representations aren't that
> big of an inconvenience unless you need to do IP manipulation, which is
> typically in a UDF and not the query engine. And you can always keep both
> representations in a table fairly inexpensively. Does this sound like a
> valid rationale for having UUID but not IP types?
>
> Ryan
>
> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
> wrote:
>
>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>> type. Which engines are you thinking of that have a native UUID type
>> besides the Presto derivatives and support Iceberg?
>>
>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>> All the user experience things that you are describing as important
>> (compact storage, friendly display, ddl, clean literals) are possible
>> without it being a first class type in Iceberg using a trino specific
>> property.
>>
>> I don't really have a strong opinion about UUID. In general, type bloat
>> is probably just a part of this kind of project. Generally, CHAR(X) and
>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>> the engines but not Iceberg--especially when we start talking about views.
>>
>> Some of this argues for physical vs logical type abstraction. (Something
>> that was always challenging in Parquet but also helped to resolve how these
>> types are managed in engines that don't support them.)
>>
>> thanks,
>> Jacques
>>
>> PS: Funny aside, the bloat on an ip address is actually worse than a
>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>> UUID 36/16 => 125% bloat.
>>
>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> I don't think this is just a problem in Trino.
>>>
>>> If there is no UUID type, then a user must choose between a 36-byte
>>> string and a 16-byte binary. That's not a good choice to force people into.
>>> If someone chooses binary, then it's harder to work with rows and construct
>>> queries even though there is a standard representation for UUIDs. To avoid
>>> the user headache, people will probably choose to store values as strings.
>>> Using a string would mean that more than half the value is needlessly
>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>> entire value. And since engines don't know what's in the string, the full
>>> value must be used in comparison, which is extra work and extra space.
>>>
>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>> one case where you could argue that it doesn't matter very much that they
>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>> for ID columns because you can generate them without coordination (unlike
>>> an incrementing ID) and that's a concern because the use as an ID makes
>>> them likely to be join keys.
>>>
>>> If we want the values to be stored as 16-byte fixed, then we need to
>>> make it easy to get the expected string representation in and out, just
>>> like we do with date/time types. I don't think that's specific to any
>>> engine.
>>>
>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <ja...@gmail.com>
>>> wrote:
>>>
>>>> I think points 1&2 don't really apply since a fixed width binary
>>>> already covers those properties.
>>>>
>>>> It seems like this isn't really a concern of iceberg but rather a
>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>> binary type. That way you still have the desired ux without exposing those
>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>> imo.
>>>>
>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I agree with Ryan that it takes some precautions before one can
>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>> special to UUIDs at all.
>>>>> After all, this is just a primitive type, which is commonly used for
>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>
>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>> The compact representation in the file and the compact representation in
>>>>> memory in the query engine are the ones mentioned above.
>>>>>
>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
>>>>> needing to cast to varchar.
>>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>>> a compact representation.
>>>>>
>>>>> Thus I think it would be good to have them.
>>>>>
>>>>> Best
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>
>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>> as many opportunities to take advantage of UUIDs as I thought. My original
>>>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>>>> that a UUID field has a high NDV. But that's not necessarily the case
>>>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>>> invest in support for UUID.
>>>>>>
>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>> ensure lots of partition split locations (this is really important for
>>>>>> Spark).
>>>>>>
>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36 byte UTF-8
>>>>>> strings that are more than twice as large, or even worse UTF-16 strings
>>>>>> that are more than 4x as large. Since UUIDs are likely to be used in joins, this
>>>>>> could really help engines as long as they can keep the values as
>>>>>> fixed-width binary.
>>>>>>
>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>> compact representation for UUIDs rather than using the string
>>>>>> representation. But that will require investing in the type and building
>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>> I think it may be worth keeping and investing in.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the end.
>>>>>>> I think it is more about user experience, whether the conversion is done at
>>>>>>> the user side or Iceberg and engine side. Many people just store UUID as a
>>>>>>> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
>>>>>>> Iceberg can optimize this common use case internally for users. There might
>>>>>>> be some other benefits I overlooked, but maybe the complication introduced
>>>>>>> by this type does not really justify the slightly better user experience. I
>>>>>>> am also on the fence about it.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>
>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>> besides the fact that it exists elsewhere? Is there some kind of optimization Iceberg or an
>>>>>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>>>>>> binary seems to cover the cases I see in terms of actual functionality in
>>>>>>>> the iceberg libraries or engines…
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> One conversation I came across regarding UUID deprecation
>>>>>>>>> was from https://github.com/apache/iceberg/pull/1611
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yan
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Joshua,
>>>>>>>>>>
>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>
>>>>>>>>>> If we keep the type, it would be good to standardize the handling
>>>>>>>>>> in every file format.
>>>>>>>>>>
>>>>>>>>>> Thanks, Peter
>>>>>>>>>>
>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>
>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>>>> UUID in Iceberg. To summarize…
>>>>>>>>>>>
>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>> supported
>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>
>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>
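
Editor's note: the size figures quoted in the thread above (a 16-byte fixed
value versus a 36-byte UUID string, and Jacques's IPv4 aside) can be checked
with a short sketch. This is only an illustration using Python's standard
uuid module; the UUID literal and the helper name overhead_percent are
arbitrary choices made for this note, not anything from Iceberg or Trino.

```python
import uuid

# Arbitrary example value, chosen only for illustration.
value = uuid.UUID("f79c3e09-677c-4bbd-a479-3f349cb785e7")

as_fixed = value.bytes                     # raw 128-bit value, like fixed[16]
as_utf8 = str(value).encode("utf-8")       # canonical 8-4-4-4-12 text form
as_utf16 = str(value).encode("utf-16-le")  # two bytes per character, no BOM


def overhead_percent(encoded_len, raw_len):
    """Extra bytes of the encoded form, as a percentage of the raw size."""
    return (encoded_len - raw_len) / raw_len * 100


print(len(as_fixed), len(as_utf8), len(as_utf16))
print(overhead_percent(len(as_utf8), len(as_fixed)))  # UUID as a string
print(overhead_percent(15, 4))                        # longest dotted IPv4 string

# The binary form loses nothing: it converts back to the same UUID.
assert uuid.UUID(bytes=as_fixed) == value
```

Under these assumptions the string form is 2.25 times the size of the binary
form (125% overhead), and a UTF-16 copy is 4.5 times the size, which matches
the rough figures used in the discussion.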

Re: [DISCUSS] UUID type

Posted by Jacques Nadeau <ja...@gmail.com>.

I already added it to Substrait because of Iceberg lazy consensus :D


On Fri, Sep 17, 2021 at 2:05 PM Ryan Blue <bl...@tabular.io> wrote:

> Let's move forward with it. I'm not hearing much dissent after saying the
> general trend is to keep UUID. So let's call it lazy consensus.
>
> Ryan
>
> On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi Ryan,
>>
>> Please advise whatever feels more appropriate from your perspective.
>> From my perspective, we could just go ahead and merge Trino Iceberg
>> support for UUID, since this is just fulfilling the spec as it is defined
>> today.
>>
>> Best
>> PF
>>
>>
>> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> I don't think we necessarily reached consensus, but I think the general
>>> trend toward the end was to keep support for UUID. Should we start a vote
>>> to validate consensus?
>>>
>>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com>
>>> wrote:
>>>
>>>> Just following up on Piotr's message here.
>>>>
>>>> Have we converged? I think most people would assume that silence is a
>>>> vote for the status-quo.
>>>>
>>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <
>>>> piotr@starburstdata.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It seems we converged here that UUID should remain included.
>>>>> I read this as a consensus reached, but it may be subjective. Did we
>>>>> objectively reach consensus on this?
>>>>>
>>>>> From Iceberg project perspective there isn't anything to do, as UUID
>>>>> already *is* part of the spec (
>>>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>>>> Trino Iceberg PR adding support for UUID
>>>>> https://github.com/trinodb/trino/pull/8747 was pending merge while
>>>>> this conversation has been ongoing.
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ryan and all,
>>>>>>
>>>>>> That sounds like a reasonable reason to leave IP address types out.
>>>>>> In my experience, dedicated IP address types are mostly found in logging
>>>>>> tools and other things for sysadmins / DevOps etc.
>>>>>>
>>>>>> When querying data with IP addresses, I’ve seen it done quite a lot
>>>>>> (e.g. for security reasons), but the addresses are usually stored as strings
>>>>>> or manipulated in a UDF. They’re not commonly supported types.
>>>>>>
>>>>>> I would also draw the line at UUID types.
>>>>>>
>>>>>> - Kyle Bendickson
>>>>>>
>>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>> Jacques, you make some good points here. I think my argument about
>>>>>> usability leading to performance issues is a stronger argument for engines
>>>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>>>
>>>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>>>> table that's difficult to query in other engines, where people would
>>>>>> probably choose to store the data as a string. On the other hand, if
>>>>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>>>>> UUID string representation before passing data to the other engines.
>>>>>> While the engines would be using 36-byte values in join keys, the user
>>>>>> experience issue is fixed and the data is more compact on disk and in
>>>>>> Iceberg's bounds metadata.
>>>>>>
>>>>>> While having a UUID type in Iceberg can't really help engines that
>>>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>>>> slightly better to have the UUID type in general since at least one engine
>>>>>> supports it and it provides the expected user experience with a compact
>>>>>> representation.
>>>>>>
>>>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>>>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>>>>> joins, which are an important operation. IPv4 representations aren't that
>>>>>> big of an inconvenience unless you need to do IP manipulation, which is
>>>>>> typically in a UDF and not the query engine. And you can always keep both
>>>>>> representations in a table fairly inexpensively. Does this sound like a
>>>>>> valid rationale for having UUID but not IP types?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <
>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>
>>>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a
>>>>>>> native type. Which engines are you thinking of that have a native UUID type
>>>>>>> besides the Presto derivatives and support Iceberg?
>>>>>>>
>>>>>>> I agree that Trino should expose a UUID type on top of Iceberg
>>>>>>> tables. All the user experience things that you are describing as important
>>>>>>> (compact storage, friendly display, ddl, clean literals) are possible
>>>>>>> without it being a first class type in Iceberg using a trino specific
>>>>>>> property.
>>>>>>>
>>>>>>> I don't really have a strong opinion about UUID. In general, type
>>>>>>> bloat is probably just a part of this kind of project. Generally, CHAR(X)
>>>>>>> and VARCHAR(X) feel like much bigger concerns given that they exist in all
>>>>>>> of the engines but not Iceberg--especially when we start talking about
>>>>>>> views.
>>>>>>>
>>>>>>> Some of this argues for physical vs logical type abstraction.
>>>>>>> (Something that was always challenging in Parquet but also helped to
>>>>>>> resolve how these types are managed in engines that don't support them.)
>>>>>>>
>>>>>>> thanks,
>>>>>>> Jacques
>>>>>>>
>>>>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>>>>> UUID 36/16 => 125% bloat.
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I don't think this is just a problem in Trino.
>>>>>>>>
>>>>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>>>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>>>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>>>>>> the user headache, people will probably choose to store values as strings.
>>>>>>>> Using a string would mean that more than half the value is needlessly
>>>>>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>>>>>> entire value. And since engines don't know what's in the string, the full
>>>>>>>> value must be used in comparison, which is extra work and extra space.
>>>>>>>>
>>>>>>>> Inflated values may not be a problem in some cases. IPv4 addresses
>>>>>>>> are one case where you could argue that it doesn't matter very much that
>>>>>>>> they are typically stored as strings. But I expect the use of UUIDs to be
>>>>>>>> common for ID columns because you can generate them without coordination
>>>>>>>> (unlike an incrementing ID) and that's a concern because the use as an ID
>>>>>>>> makes them likely to be join keys.
>>>>>>>>
>>>>>>>> If we want the values to be stored as 16-byte fixed, then we need
>>>>>>>> to make it easy to get the expected string representation in and out, just
>>>>>>>> like we do with date/time types. I don't think that's specific to any
>>>>>>>> engine.
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>>>>> already covers those properties.
>>>>>>>>>
>>>>>>>>> It seems like this isn't really a concern of iceberg but rather a
>>>>>>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>>>>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>>>>>>> binary type. That way you still have the desired ux without exposing those
>>>>>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>>>>>> imo.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>>>>>>> special to UUIDs at all.
>>>>>>>>>> After all, this is just a primitive type, which is commonly used
>>>>>>>>>> for certain things, but "commonly" doesn't mean "always".
>>>>>>>>>>
>>>>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>>>>> The compact representation in the file and the compact
>>>>>>>>>> representation in memory in the query engine are the ones mentioned above.
>>>>>>>>>>
>>>>>>>>>> The third layer is usability. Seeing a UUID column, I know
>>>>>>>>>> what values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>>>>> without needing to cast to varchar.
>>>>>>>>>> It also removes the temptation of casting uuid to varbinary to
>>>>>>>>>> achieve a compact representation.
>>>>>>>>>>
>>>>>>>>>> Thus I think it would be good to have them.
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The original reason why I added UUID to the spec was that I
>>>>>>>>>>> thought there would be opportunities to take advantage of UUIDs as unique
>>>>>>>>>>> values and to optimize the use of UUIDs. I was thinking about
>>>>>>>>>>> auto-increment ID fields and how we might do something similar in Iceberg.
>>>>>>>>>>>
>>>>>>>>>>> The reason we have thought about removing UUID is that there
>>>>>>>>>>> aren't as many opportunities to take advantage of UUIDs as I thought. My
>>>>>>>>>>> original assumption was that we could do things like bucket on UUID fields
>>>>>>>>>>> or assume that a UUID field has a high NDV. But that's not necessarily the
>>>>>>>>>>> case when a UUID field is a foreign key, only when it is used as an
>>>>>>>>>>> identifier or primary key. Before Jack added tracking for row identifier
>>>>>>>>>>> fields, we couldn't know that a UUID was unique in a table. As a result, we
>>>>>>>>>>> didn't invest in support for UUID.
>>>>>>>>>>>
>>>>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can
>>>>>>>>>>> do some of these things with the row identifier fields. Engines can assume
>>>>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>>>>> Spark).
>>>>>>>>>>>
>>>>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as 36 byte
>>>>>>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>>>>>>> strings that are more than 4x as large. Since UUIDs are likely to be used in joins,
>>>>>>>>>>> this could really help engines as long as they can keep the values as
>>>>>>>>>>> fixed-width binary.
>>>>>>>>>>>
>>>>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>>>>> representation. But that will require investing in the type and building
>>>>>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the
>>>>>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What specific arguments are there for it being a first class
>>>>>>>>>>>>> type besides the fact that it exists elsewhere? Is there some kind of optimization Iceberg or
>>>>>>>>>>>>> an engine could do if it was typed versus just a bucket of bits? Fixed
>>>>>>>>>>>>> width binary seems to cover the cases I see in terms of actual
>>>>>>>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> One conversation I came across regarding UUID
>>>>>>>>>>>>>> deprecation was from
>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>>>>> joshthoward@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but
>>>>>>>>>>>>>>>> there seems to have been some discussion about removing it? I could not
>>>>>>>>>>>>>>>> find the original discussion, but a reference to the discussion can be
>>>>>>>>>>>>>>>> found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers
>>>>>>>>>>>>>>>> are supported
>>>>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Josh Howard
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
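
Editor's note: Ryan's remark in the quoted thread that a string column loses
more than half of each UUID in Iceberg's lower/upper bounds can be sketched
as follows. This is a rough illustration only. It assumes the default metrics
mode truncates string bounds to 16 characters; the constant TRUNCATE_WIDTH and
the helper truncated_lower_bound are hypothetical names for this note and are
not Iceberg APIs. Upper bounds need extra handling that is not modelled here.

```python
import uuid

# Assumption: string bounds are truncated to 16 characters by default.
TRUNCATE_WIDTH = 16


def truncated_lower_bound(text, width=TRUNCATE_WIDTH):
    """Keep only the leading `width` characters, as a truncated lower bound would."""
    return text[:width]


# Arbitrary example value, chosen only for illustration.
value = uuid.UUID("f79c3e09-677c-4bbd-a479-3f349cb785e7")

bound = truncated_lower_bound(str(value))
kept_hex_digits = len(bound.replace("-", ""))  # hyphens carry no information

print(bound)
print(kept_hex_digits, "of 32 hex digits survive in the string bound")

# A 16-byte binary value is no longer than the truncation width, so it is kept whole.
print(len(value.bytes) <= TRUNCATE_WIDTH)
```

Under that assumption, only 14 of the 32 hex digits survive in the string
bound, while the 16-byte binary form fits within the same width without
losing anything.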

Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

Let's move forward with it. I'm not hearing much dissent after saying the
general trend is to keep UUID. So let's call it lazy consensus.

Ryan

On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi Ryan,
>
> Please advise whatever feels more appropriate from your perspective.
> From my perspective, we could just go ahead and merge Trino Iceberg
> support for UUID, since this is just fulfilling the spec as it is defined
> today.
>
> Best
> PF
>
>
> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> I don't think we necessarily reached consensus, but I think the general
>> trend toward the end was to keep support for UUID. Should we start a vote
>> to validate consensus?
>>
>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com>
>> wrote:
>>
>>> Just following up on Piotr's message here.
>>>
>>> Have we converged? I think most people would assume that silence is a
>>> vote for the status-quo.
>>>
>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems we converged here that UUID should remain included.
>>>> I read this as a consensus reached, but it may be subjective. Did we
>>>> objectively reach consensus on this?
>>>>
>>>> From Iceberg project perspective there isn't anything to do, as UUID
>>>> already *is* part of the spec (
>>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>>> Trino Iceberg PR adding support for UUID
>>>> https://github.com/trinodb/trino/pull/8747 was pending merge while
>>>> this conversation has been ongoing.
>>>>
>>>> Best,
>>>> PF
>>>>
>>>>
>>>>
>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>>>
>>>>> Hi Ryan and all,
>>>>>
>>>>> That sounds like a reasonable reason to leave IP address types out. In
>>>>> my experience, dedicated IP address types are mostly found in logging tools
>>>>> and other things for sysadmins / DevOps etc.
>>>>>
>>>>> When querying data with IP addresses, I’ve seen it done quite a lot
>>>>> (e.g. for security reasons), but the addresses are usually stored as strings
>>>>> or manipulated in a UDF. They’re not commonly supported types.
>>>>>
>>>>> I would also draw the line at UUID types.
>>>>>
>>>>> - Kyle Bendickson
>>>>>
>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>> Jacques, you make some good points here. I think my argument about
>>>>> usability leading to performance issues is a stronger argument for engines
>>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>>
>>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>>> table that's difficult to query in other engines, where people would
>>>>> probably choose to store the data as a string. On the other hand, if
>>>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>>>> UUID string representation before passing data to the other engines.
>>>>> While the engines would be using 36-byte values in join keys, the user
>>>>> experience issue is fixed and the data is more compact on disk and in
>>>>> Iceberg's bounds metadata.
>>>>>
>>>>> While having a UUID type in Iceberg can't really help engines that
>>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>>> slightly better to have the UUID type in general since at least one engine
>>>>> supports it and it provides the expected user experience with a compact
>>>>> representation.
>>>>>
>>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>>>> joins, which are an important operation. IPv4 representations aren't that
>>>>> big of an inconvenience unless you need to do IP manipulation, which is
>>>>> typically in a UDF and not the query engine. And you can always keep both
>>>>> representations in a table fairly inexpensively. Does this sound like a
>>>>> valid rationale for having UUID but not IP types?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <
>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>
>>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a
>>>>>> native type. Which engines are you thinking of that have a native UUID type
>>>>>> besides the Presto derivatives and support Iceberg?
>>>>>>
>>>>>> I agree that Trino should expose a UUID type on top of Iceberg
>>>>>> tables. All the user experience things that you are describing as important
>>>>>> (compact storage, friendly display, ddl, clean literals) are possible
>>>>>> without it being a first class type in Iceberg using a trino specific
>>>>>> property.
>>>>>>
>>>>>> I don't really have a strong opinion about UUID. In general, type
>>>>>> bloat is probably just a part of this kind of project. Generally, CHAR(X)
>>>>>> and VARCHAR(X) feel like much bigger concerns given that they exist in all
>>>>>> of the engines but not Iceberg--especially when we start talking about
>>>>>> views.
>>>>>>
>>>>>> Some of this argues for physical vs logical type abstraction.
>>>>>> (Something that was always challenging in Parquet but also helped to
>>>>>> resolve how these types are managed in engines that don't support them.)
>>>>>>
>>>>>> thanks,
>>>>>> Jacques
>>>>>>
>>>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>>>> UUID 36/16 => 125% bloat.
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> I don't think this is just a problem in Trino.
>>>>>>>
>>>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>>>>> the user headache, people will probably choose to store values as strings.
>>>>>>> Using a string would mean that more than half the value is needlessly
>>>>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>>>>> entire value. And since engines don't know what's in the string, the full
>>>>>>> value must be used in comparison, which is extra work and extra space.
>>>>>>>
>>>>>>> Inflated values may not be a problem in some cases. IPv4 addresses
>>>>>>> are one case where you could argue that it doesn't matter very much that
>>>>>>> they are typically stored as strings. But I expect the use of UUIDs to be
>>>>>>> common for ID columns because you can generate them without coordination
>>>>>>> (unlike an incrementing ID) and that's a concern because the use as an ID
>>>>>>> makes them likely to be join keys.
>>>>>>>
>>>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>>>> make it easy to get the expected string representation in and out, just
>>>>>>> like we do with date/time types. I don't think that's specific to any
>>>>>>> engine.
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>>>> already covers those properties.
>>>>>>>>
>>>>>>>> It seems like this isn't really a concern of iceberg but rather a
>>>>>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>>>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>>>>>> binary type. That way you still have the desired ux without exposing those
>>>>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>>>>> imo.
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I agree with Ryan, that it takes some precautions before one can
>>>>>>>>> assume uniqueness of UUID values, and that this isn't anything special
>>>>>>>>> to UUIDs at all.
>>>>>>>>> After all, this is just a primitive type, which is commonly used
>>>>>>>>> for certain things, but "commonly" doesn't mean "always".
>>>>>>>>>
>>>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>>>> The compact representation in the file, and compact representation
>>>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>>>
>>>>>>>>> The third layer is the usability. Seeing a UUID column i know what
>>>>>>>>> values i can expect, so it's more descriptive than `id char(36)`.
>>>>>>>>> This also means i can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>>>> without need for casting to varchar.
>>>>>>>>> It also removes temptation of casting uuid to varbinary to achieve
>>>>>>>>> compact representation.
>>>>>>>>>
>>>>>>>>> Thus i think it would be good to have them.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> The original reason why I added UUID to the spec was that I
>>>>>>>>>> thought there would be opportunities to take advantage of UUIDs as unique
>>>>>>>>>> values and to optimize the use of UUIDs. I was thinking about
>>>>>>>>>> auto-increment ID fields and how we might do something similar in Iceberg.
>>>>>>>>>>
>>>>>>>>>> The reason we have thought about removing UUID is that there
>>>>>>>>>> aren't as many opportunities to take advantage of UUIDs as I thought. My
>>>>>>>>>> original assumption was that we could do things like bucket on UUID fields
>>>>>>>>>> or assume that a UUID field has a high NDV. But that's not necessarily the
>>>>>>>>>> case when a UUID field is a foreign key, only when it is used as an
>>>>>>>>>> identifier or primary key. Before Jack added tracking for row identifier
>>>>>>>>>> fields, we couldn't know that a UUID was unique in a table. As a result, we
>>>>>>>>>> didn't invest in support for UUID.
>>>>>>>>>>
>>>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can
>>>>>>>>>> do some of these things with the row identifier fields. Engines can assume
>>>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>>>> Spark).
>>>>>>>>>>
>>>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as 36 byte
>>>>>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>>>>>> Strings that are 4x as large. Since UUIDs are likely to be used in joins,
>>>>>>>>>> this could really help engines as long as they can keep the values as
>>>>>>>>>> fixed-width binary.
>>>>>>>>>>
>>>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>>>> representation. But that will require investing in the type and building
>>>>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the
>>>>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>>>>
>>>>>>>>>>> -Jack Ye
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What specific arguments are there for it being a first class
>>>>>>>>>>>> type besides it is elsewhere? Is there some kind of optimization iceberg or
>>>>>>>>>>>> an engine could do if it was typed versus just a bucket of bits? Fixed
>>>>>>>>>>>> width binary seems to cover the cases I see in terms of actual
>>>>>>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> One conversation I used to come across regarding UUID
>>>>>>>>>>>>> deprecation was from
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>>>> joshthoward@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but
>>>>>>>>>>>>>>> there seems to have been some discussion about removing it? I could not
>>>>>>>>>>>>>>> find the original discussion, but a reference to the discussion can be
>>>>>>>>>>>>>>> found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers
>>>>>>>>>>>>>>> are supported
>>>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>>
>>>
>>> --
>>> Josh Howard
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
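
The size figures traded in the quoted thread above (a 36-byte string versus
a 16-byte fixed binary, and the "125%" and "275%" bloat numbers) can be
checked with a short sketch. This is only an illustration: it uses Python's
standard uuid module rather than any Iceberg or Trino API, and the example
UUID value and the bloat_percent helper are hypothetical names chosen for
the sketch.

```python
import uuid


def bloat_percent(compact_bytes: int, text_bytes: int) -> float:
    """Extra size of the text form relative to the compact form, in percent."""
    return (text_bytes - compact_bytes) / compact_bytes * 100


# A fixed, arbitrary example value so the sketch is reproducible.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

as_fixed16 = u.bytes                 # compact 16-byte binary form
as_string = str(u).encode("utf-8")   # canonical hyphenated text form

uuid_bloat = bloat_percent(len(as_fixed16), len(as_string))

# Longest dotted-quad IPv4 text form versus its 4-byte binary form.
ipv4_bloat = bloat_percent(4, len("255.255.255.255".encode("utf-8")))

# Converting back from the binary form recovers the same UUID.
round_trip = uuid.UUID(bytes=as_fixed16)

print(len(as_fixed16), len(as_string), uuid_bloat, ipv4_bloat)
```

The round trip back through uuid.UUID(bytes=...) is the "string
representation in and out" step that the thread says a fixed[16] encoding
would need to make easy.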

Re: [DISCUSS] UUID type

Posted by Piotr Findeisen <pi...@starburstdata.com>.

Hi Ryan,

Please advise whatever feels more appropriate from your perspective.
From my perspective, we could just go ahead and merge Trino Iceberg support
for UUID, since this is just fulfilling the spec as it is defined today.

Best
PF


On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <bl...@tabular.io> wrote:

> I don't think we necessarily reached consensus, but I think the general
> trend toward the end was to keep support for UUID. Should we start a vote
> to validate consensus?
>
> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com>
> wrote:
>
>> Just following up on Piotr's message here.
>>
>> Have we converged? I think most people would assume that silence is a
>> vote for the status-quo.
>>
>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It seems we converged here that UUID should remain included.
>>> I read this as a consensus reached, but it may be subjective. Did we
>>> objectively reach consensus on this?
>>>
>>> From Iceberg project perspective there isn't anything to do, as UUID
>>> already *is* part of the spec (
>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>> Trino Iceberg PR adding support for UUID
>>> https://github.com/trinodb/trino/pull/8747 was pending merge while this
>>> conversation has been ongoing.
>>>
>>> Best,
>>> PF
>>>
>>>
>>>
>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>>
>>>> Hi Ryan and all,
>>>>
>>>> That sounds like a reasonable reason to leave IP address types out. In
>>>> my experience, dedicated IP address types are mostly found in logging tools
>>>> and other things for sysadmins / DevOps etc.
>>>>
>>>> When querying data with IP addresses, I’ve seen it done quite a lot (eg
>>>> security reasons) but usually stored as string or manipulated in a UDF.
>>>> They’re not commonly supported types.
>>>>
>>>> I would also draw the line at UUID types.
>>>>
>>>> - Kyle Bendickson
>>>>
>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>> Jacques, you make some good points here. I think my argument about
>>>> usability leading to performance issues is a stronger argument for engines
>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>
>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>> table that's difficult to query in other engines, where people would
>>>> probably choose to store the data as a string. On the other hand, if
>>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>>> UUID string representation before passing data to the other engines.
>>>> While the engines would be using 36-byte values in join keys, the user
>>>> experience issue is fixed and the data is more compact on disk and in
>>>> Iceberg's bounds metadata.
>>>>
>>>> While having a UUID type in Iceberg can't really help engines that
>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>> slightly better to have the UUID type in general since at least one engine
>>>> supports it and it provides the expected user experience with a compact
>>>> representation.
>>>>
>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>>> joins, which are an important operation. IPv4 representations aren't that
>>>> big of an inconvenience unless you need to do IP manipulation, which is
>>>> typically in a UDF and not the query engine. And you can always keep both
>>>> representations in a table fairly inexpensively. Does this sound like a
>>>> valid rationale for having UUID but not IP types?
>>>>
>>>> Ryan
>>>>
>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>>>> type. Which engines are you thinking of that have a native UUID type
>>>>> besides the Presto derivatives and support Iceberg?
>>>>>
>>>>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>>>>> All the user experience things that you are describing as important
>>>>> (compact storage, friendly display, ddl, clean literals) are possible
>>>>> without it being a first class type in Iceberg using a trino specific
>>>>> property.
>>>>>
>>>>> I don't really have a strong opinion about UUID. In general, type
>>>>> bloat is probably just a part of this kind of project. Generally, CHAR(X)
>>>>> and VARCHAR(X) feel like much bigger concerns given that they exist in all
>>>>> of the engines but not Iceberg--especially when we start talking about
>>>>> views.
>>>>>
>>>>> Some of this argues for physical vs logical type abstraction.
>>>>> (Something that was always challenging in Parquet but also helped to
>>>>> resolve how these types are managed in engines that don't support them.)
>>>>>
>>>>> thanks,
>>>>> Jacques
>>>>>
>>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>>> UUID 36/16 => 125% bloat.
>>>>>
>>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> I don't think this is just a problem in Trino.
>>>>>>
>>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>>>> the user headache, people will probably choose to store values as strings.
>>>>>> Using a string would mean that more than half the value is needlessly
>>>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>>>> entire value. And since engines don't know what's in the string, the full
>>>>>> value must be used in comparison, which is extra work and extra space.
>>>>>>
>>>>>> Inflated values may not be a problem in some cases. IPv4 addresses
>>>>>> are one case where you could argue that it doesn't matter very much that
>>>>>> they are typically stored as strings. But I expect the use of UUIDs to be
>>>>>> common for ID columns because you can generate them without coordination
>>>>>> (unlike an incrementing ID) and that's a concern because the use as an ID
>>>>>> makes them likely to be join keys.
>>>>>>
>>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>>> make it easy to get the expected string representation in and out, just
>>>>>> like we do with date/time types. I don't think that's specific to any
>>>>>> engine.
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>
>>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>>> already covers those properties.
>>>>>>>
>>>>>>> It seems like this isn't really a concern of iceberg but rather a
>>>>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>>>>> binary type. That way you still have the desired ux without exposing those
>>>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>>>> imo.
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I agree with Ryan, that it takes some precautions before one can
>>>>>>>> assume uniqueness of UUID values, and that this isn't anything special
>>>>>>>> to UUIDs at all.
>>>>>>>> After all, this is just a primitive type, which is commonly used
>>>>>>>> for certain things, but "commonly" doesn't mean "always".
>>>>>>>>
>>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>>> The compact representation in the file, and compact representation
>>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>>
>>>>>>>> The third layer is the usability. Seeing a UUID column i know what
>>>>>>>> values i can expect, so it's more descriptive than `id char(36)`.
>>>>>>>> This also means i can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>>> without need for casting to varchar.
>>>>>>>> It also removes temptation of casting uuid to varbinary to achieve
>>>>>>>> compact representation.
>>>>>>>>
>>>>>>>> Thus i think it would be good to have them.
>>>>>>>>
>>>>>>>> Best
>>>>>>>> PF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>
>>>>>>>>> The original reason why I added UUID to the spec was that I
>>>>>>>>> thought there would be opportunities to take advantage of UUIDs as unique
>>>>>>>>> values and to optimize the use of UUIDs. I was thinking about
>>>>>>>>> auto-increment ID fields and how we might do something similar in Iceberg.
>>>>>>>>>
>>>>>>>>> The reason we have thought about removing UUID is that there
>>>>>>>>> aren't as many opportunities to take advantage of UUIDs as I thought. My
>>>>>>>>> original assumption was that we could do things like bucket on UUID fields
>>>>>>>>> or assume that a UUID field has a high NDV. But that's not necessarily the
>>>>>>>>> case when a UUID field is a foreign key, only when it is used as an
>>>>>>>>> identifier or primary key. Before Jack added tracking for row identifier
>>>>>>>>> fields, we couldn't know that a UUID was unique in a table. As a result, we
>>>>>>>>> didn't invest in support for UUID.
>>>>>>>>>
>>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>>> Spark).
>>>>>>>>>
>>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as 36 byte
>>>>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>>>>> Strings that are 4x as large. Since UUIDs are likely to be used in joins,
>>>>>>>>> this could really help engines as long as they can keep the values as
>>>>>>>>> fixed-width binary.
>>>>>>>>>
>>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>>> representation. But that will require investing in the type and building
>>>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>>>
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the
>>>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>>>
>>>>>>>>>> -Jack Ye
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> What specific arguments are there for it being a first class
>>>>>>>>>>> type besides it is elsewhere? Is there some kind of optimization iceberg or
>>>>>>>>>>> an engine could do if it was typed versus just a bucket of bits? Fixed
>>>>>>>>>>> width binary seems to cover the cases I see in terms of actual
>>>>>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> One conversation I used to come across regarding UUID
>>>>>>>>>>>> deprecation was from
>>>>>>>>>>>> https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yan
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>>> joshthoward@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers
>>>>>>>>>>>>>> are supported
>>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Tabular
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>>
>>
>> --
>> Josh Howard
>>
>
>
> --
> Ryan Blue
> Tabular
>
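
The quoted remark that "more than half the value is needlessly discarded by
default in Iceberg lower/upper bounds" when a UUID is stored as a string can
be sketched as follows. This is an illustration under stated assumptions,
not Iceberg's implementation: it assumes string bounds are truncated to 16
characters by default, models only the lower bound as a plain prefix, and
the names TRUNCATE_LENGTH and truncated_lower_bound are hypothetical.

```python
import uuid

# Assumption: string bounds are truncated to 16 characters by default.
TRUNCATE_LENGTH = 16


def truncated_lower_bound(value: str, length: int = TRUNCATE_LENGTH) -> str:
    """Model a truncated lower bound as a simple prefix of the value."""
    return value[:length]


# A fixed, arbitrary example value so the sketch is reproducible.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
text = str(u)

kept = truncated_lower_bound(text)
discarded_chars = len(text) - len(kept)

# Each hex digit carries 4 bits; hyphens carry no information.
bits_kept_as_string = len(kept.replace("-", "")) * 4

# A 16-byte binary value fits in the same 16-unit budget untruncated.
bits_kept_as_binary = len(u.bytes[:TRUNCATE_LENGTH]) * 8

print(kept, discarded_chars, bits_kept_as_string, bits_kept_as_binary)
```

Under these assumptions the string bound keeps 16 of 36 characters, which is
56 of the UUID's 128 bits, while a 16-byte binary bound keeps the whole
value.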

Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

I don't think we necessarily reached consensus, but I think the general
trend toward the end was to keep support for UUID. Should we start a vote
to validate consensus?

On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com> wrote:

> Just following up on Piotr's message here.
>
> Have we converged? I think most people would assume that silence is a vote
> for the status-quo.
>
> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> It seems we converged here that UUID should remain included.
>> I read this as a consensus reached, but it may be subjective. Did we
>> objectively reach consensus on this?
>>
>> From Iceberg project perspective there isn't anything to do, as UUID
>> already *is* part of the spec (
>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>> Trino Iceberg PR adding support for UUID
>> https://github.com/trinodb/trino/pull/8747 was pending merge while this
>> conversation has been ongoing.
>>
>> Best,
>> PF
>>
>>
>>
>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>
>>> Hi Ryan and all,
>>>
>>> That sounds like a reasonable reason to leave IP address types out. In
>>> my experience, dedicated IP address types are mostly found in logging tools
>>> and other things for sysadmins / DevOps etc.
>>>
>>> When querying data with IP addresses, I’ve seen it done quite a lot (eg
>>> security reasons) but usually stored as string or manipulated in a UDF.
>>> They’re not commonly supported types.
>>>
>>> I would also draw the line at UUID types.
>>>
>>> - Kyle Bendickson
>>>
>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>
>>> Jacques, you make some good points here. I think my argument about
>>> usability leading to performance issues is a stronger argument for engines
>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>
>>> Another thing to consider is cross-engine support. If Iceberg removes
>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>> table that's difficult to query in other engines, where people would
>>> probably choose to store the data as a string. On the other hand, if
>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>> UUID string representation before passing data to the other engines.
>>> While the engines would be using 36-byte values in join keys, the user
>>> experience issue is fixed and the data is more compact on disk and in
>>> Iceberg's bounds metadata.
>>>
>>> While having a UUID type in Iceberg can't really help engines that don't
>>> support UUID take advantage of the type at runtime, it does seem slightly
>>> better to have the UUID type in general since at least one engine supports
>>> it and it provides the expected user experience with a compact
>>> representation.
>>>
>>> IPv4 addresses are a good thing to think about as well, since most of
>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>> joins, which are an important operation. IPv4 representations aren't that
>>> big of an inconvenience unless you need to do IP manipulation, which is
>>> typically in a UDF and not the query engine. And you can always keep both
>>> representations in a table fairly inexpensively. Does this sound like a
>>> valid rationale for having UUID but not IP types?
>>>
>>> Ryan
>>>
>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
>>> wrote:
>>>
>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>>> type. Which engines are you thinking of that have a native UUID type
>>>> besides the Presto derivatives and support Iceberg?
>>>>
>>>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>>>> All the user experience things that you are describing as important
>>>> (compact storage, friendly display, ddl, clean literals) are possible
>>>> without it being a first class type in Iceberg using a trino specific
>>>> property.
>>>>
>>>> I don't really have a strong opinion about UUID. In general, type bloat
>>>> is probably just a part of this kind of project. Generally, CHAR(X) and
>>>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>>>> the engines but not Iceberg--especially when we start talking about views.
>>>>
>>>> Some of this argues for physical vs logical type abstraction.
>>>> (Something that was always challenging in Parquet but also helped to
>>>> resolve how these types are managed in engines that don't support them.)
>>>>
>>>> thanks,
>>>> Jacques
>>>>
>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>> UUID 36/16 => 125% bloat.
>>>>
>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> I don't think this is just a problem in Trino.
>>>>>
>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>>> the user headache, people will probably choose to store values as strings.
>>>>> Using a string would mean that more than half the value is needlessly
>>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>>> entire value. And since engines don't know what's in the string, the full
>>>>> value must be used in comparison, which is extra work and extra space.
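
To illustrate the point about bounds, here is a rough Python sketch. It assumes string bounds are truncated to 16 characters, which I believe matches Iceberg's default truncate(16) metrics mode; treat that as an assumption. The helper truncated_lower_bound is hypothetical and is not Iceberg's API. It only shows how much of a UUID string would survive such truncation.

```python
import uuid

TRUNCATE_LENGTH = 16  # assumed truncation length for string lower/upper bounds


def truncated_lower_bound(value: str, length: int = TRUNCATE_LENGTH) -> str:
    """Hypothetical helper: keep only the first `length` characters of a string."""
    return value[:length]


u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
as_string = str(u)

kept = truncated_lower_bound(as_string)
discarded_chars = len(as_string) - len(kept)

# Hyphens carry no information, so count the hex digits that survive.
hex_digits_kept = sum(ch != "-" for ch in kept)

print(kept)                                     # 123e4567-e89b-12
print(discarded_chars, "of", len(as_string))    # 20 of 36
print(hex_digits_kept, "of 32 hex digits kept") # 14 of 32 hex digits kept
print(len(u.bytes), "bytes in the binary form") # 16 bytes in the binary form
```

Under that assumption, a string bound retains 14 of the 32 hex digits, while a 16-byte binary bound of the same length would hold the entire value.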
>>>>>
>>>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>>>> one case where you could argue that it doesn't matter very much that they
>>>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>>>> for ID columns because you can generate them without coordination (unlike
>>>>> an incrementing ID) and that's a concern because the use as an ID makes
>>>>> them likely to be join keys.
>>>>>
>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>> make it easy to get the expected string representation in and out, just
>>>>> like we do with date/time types. I don't think that's specific to any
>>>>> engine.
>>>>>
>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>
>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>> already covers those properties.
>>>>>>
>>>>>> It seems like this isn't really a concern of iceberg but rather a
>>>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>>>> binary type. That way you still have the desired ux without exposing those
>>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>>> imo.
>>>>>>
>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>> piotr@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>> assume uniqueness of UUID values, and that UUIDs shouldn't be treated
>>>>>>> as special in this respect.
>>>>>>> After all, this is just a primitive type, which is commonly used for
>>>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>>>
>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>> The compact representation in the file, and compact representation
>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>
>>>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>> without needing to cast to varchar.
>>>>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>>>>> a compact representation.
>>>>>>>
>>>>>>> Thus I think it would be good to have them.
>>>>>>>
>>>>>>> Best
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>>>
>>>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>>>> as many opportunities to take advantage of UUIDs as I thought. My original
>>>>>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>>>>>> that a UUID field has a high NDV. But that's not necessarily the case
>>>>>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>>>>> invest in support for UUID.
>>>>>>>>
>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>> Spark).
>>>>>>>>
>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as 36-byte
>>>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>>>> strings that are over 4x as large. Since UUIDs are likely to be used in joins,
>>>>>>>> this could really help engines as long as they can keep the values as
>>>>>>>> fixed-width binary.
>>>>>>>>
>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>> representation. But that will require investing in the type and building
>>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, I agree with Jacques that fixed binary is what it is in the
>>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or an
>>>>>>>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>>>>>>>> binary seems to cover the cases I see in terms of actual functionality in
>>>>>>>>>> the iceberg libraries or engines…
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> One conversation I came across regarding UUID
>>>>>>>>>>> deprecation was in https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>
>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>
>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>> joshthoward@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>
>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>>>> supported
>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>>
>
> --
> Josh Howard
>


-- 
Ryan Blue
Tabular

Re: [DISCUSS] UUID type

Posted by Joshua Howard <jo...@gmail.com>.

Just following up on Piotr's message here.

Have we converged? I think most people would assume that silence is a vote
for the status-quo.

On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> It seems we converged here that UUID should remain included.
> I read this as a consensus reached, but it may be subjective. Did we
> objectively reach consensus on this?
>
> From the Iceberg project's perspective there isn't anything to do, as UUID
> already *is* part of the spec (
> https://iceberg.apache.org/spec/#schemas-and-data-types).
> The Trino Iceberg PR adding support for UUID,
> https://github.com/trinodb/trino/pull/8747, has been pending merge while this
> conversation has been ongoing.
>
> Best,
> PF
>
>
>
> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>
>> Hi Ryan and all,
>>
>> That sounds like a reasonable reason to leave IP address types out. In my
>> experience, dedicated IP address types are mostly found in logging tools
>> and other things for sysadmins / DevOps etc.
>>
>> When querying data with IP addresses, I’ve seen it done quite a lot (eg
>> security reasons) but usually stored as string or manipulated in a UDF.
>> They’re not commonly supported types.
>>
>> I would also draw the line at UUID types.
>>
>> - Kyle Bendickson
>>
>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>
>> 
>> Jacques, you make some good points here. I think my argument about
>> usability leading to performance issues is a stronger argument for engines
>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>> chooses to use a string in an engine that doesn't have a UUID type.
>>
>> Another thing to consider is cross-engine support. If Iceberg removes
>> UUID, then Trino would probably translate to fixed[16]. That results in a
>> table that's difficult to query in other engines, where people would
>> probably choose to store the data as a string. On the other hand, if
>> Iceberg keeps the UUID type then integrations would simply translate to the
>> UUID string representation before passing data to the other engines.
>> While the engines would be using 36-byte values in join keys, the user
>> experience issue is fixed and the data is more compact on disk and in
>> Iceberg's bounds metadata.
>>
>> While having a UUID type in Iceberg can't really help engines that don't
>> support UUID take advantage of the type at runtime, it does seem slightly
>> better to have the UUID type in general since at least one engine supports
>> it and it provides the expected user experience with a compact
>> representation.
>>
>> IPv4 addresses are a good thing to think about as well, since most of the
>> same arguments apply. If we keep the UUID type, should we also add IPv4 or
>> IPv6 types? I would probably draw the line at UUID because it helps in
>> joins, which are an important operation. IPv4 representations aren't that
>> big of an inconvenience unless you need to do IP manipulation, which is
>> typically in a UDF and not the query engine. And you can always keep both
>> representations in a table fairly inexpensively. Does this sound like a
>> valid rationale for having UUID but not IP types?
>>
>> Ryan
>>
>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
>> wrote:
>>
>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>> type. Which engines are you thinking of that have a native UUID type
>>> besides the Presto derivatives and support Iceberg?
>>>
>>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>>> All the user experience things that you are describing as important
>>> (compact storage, friendly display, ddl, clean literals) are possible
>>> without it being a first class type in Iceberg using a trino specific
>>> property.
>>>
>>> I don't really have a strong opinion about UUID. In general, type bloat
>>> is probably just a part of this kind of project. Generally, CHAR(X) and
>>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>>> the engines but not Iceberg--especially when we start talking about views.
>>>
>>> Some of this argues for physical vs logical type abstraction. (Something
>>> that was always challenging in Parquet but also helped to resolve how these
>>> types are managed in engines that don't support them.)
>>>
>>> thanks,
>>> Jacques
>>>
>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>> UUID 36/16 => 125% bloat.
>>>
>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> I don't think this is just a problem in Trino.
>>>>
>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>> the user headache, people will probably choose to store values as strings.
>>>> Using a string would mean that more than half the value is needlessly
>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>> entire value. And since engines don't know what's in the string, the full
>>>> value must be used in comparison, which is extra work and extra space.
>>>>
>>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>>> one case where you could argue that it doesn't matter very much that they
>>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>>> for ID columns because you can generate them without coordination (unlike
>>>> an incrementing ID) and that's a concern because the use as an ID makes
>>>> them likely to be join keys.
>>>>
>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>> make it easy to get the expected string representation in and out, just
>>>> like we do with date/time types. I don't think that's specific to any
>>>> engine.
>>>>
>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>> already covers those properties.
>>>>>
>>>>> It seems like this isn't really a concern of iceberg but rather a
>>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>>> binary type. That way you still have the desired ux without exposing those
>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>> imo.
>>>>>
>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>> assume uniqueness of UUID values, and that UUIDs shouldn't be treated
>>>>>> as special in this respect.
>>>>>> After all, this is just a primitive type, which is commonly used for
>>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>>
>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>> The compact representation in the file, and compact representation in
>>>>>> memory in the query engine are the ones mentioned above.
>>>>>>
>>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
>>>>>> needing to cast to varchar.
>>>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>>>> a compact representation.
>>>>>>
>>>>>> Thus I think it would be good to have them.
>>>>>>
>>>>>> Best
>>>>>> PF
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>>
>>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>>> as many opportunities to take advantage of UUIDs as I thought. My original
>>>>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>>>>> that a UUID field has a high NDV. But that's not necessarily the case
>>>>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>>>> invest in support for UUID.
>>>>>>>
>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>> Spark).
>>>>>>>
>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8
>>>>>>> strings that are more than twice as large, or even worse UTF-16 strings
>>>>>>> that are over 4x as large. Since UUIDs are likely to be used in joins, this
>>>>>>> could really help engines as long as they can keep the values as
>>>>>>> fixed-width binary.
>>>>>>>
>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>> representation. But that will require investing in the type and building
>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, I agree with Jacques that fixed binary is what it is in the
>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>
>>>>>>>> -Jack Ye
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or an
>>>>>>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>>>>>>> binary seems to cover the cases I see in terms of actual functionality in
>>>>>>>>> the iceberg libraries or engines…
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> One conversation I came across regarding UUID deprecation
>>>>>>>>>> was in https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yan
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>
>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>
>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi.
>>>>>>>>>>>>
>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>
>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>>>>> UUID in Iceberg. To summarize…
>>>>>>>>>>>>
>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>>> supported
>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>>

-- 
Josh Howard