Posted to dev@iceberg.apache.org by Joshua Howard <jo...@gmail.com> on 2021/07/27 15:08:08 UTC

[DISCUSS] UUID type

Hi. 

UUID is a current data type according to the Iceberg spec (https://iceberg.apache.org/spec/#primitive-types), but there seems to have been some discussion about removing it? I could not find the original discussion, but a reference to the discussion can be found here (https://github.com/trinodb/trino/issues/6663). 

I generally agree with the consensus in the Trino issue to keep UUID in Iceberg. To summarize… 

- It makes sense to keep the type now that row identifiers are supported
- Some engines (Trino) have support for the UUID type
- Engines w/o support for UUID type can determine how to map

Does anyone want to remove the type? If so, why?
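
To make the mapping question in the last bullet concrete, here is a minimal sketch of the two representations an engine without a native UUID type could map to: the canonical 36-character string and the 16-byte binary value. It uses only the Python standard library, not Iceberg code, and the variable names are purely illustrative.

```python
import uuid

# A fixed example value so the output is reproducible.
u = uuid.UUID("f79c3e09-677c-4bbd-a479-3f349cb785e7")

as_string = str(u).encode("utf-8")  # canonical text form, 36 bytes in UTF-8
as_fixed16 = u.bytes                # big-endian binary form, 16 bytes

print(len(as_string), len(as_fixed16))  # prints "36 16"

# Both mappings round-trip back to the same value.
assert uuid.UUID(as_string.decode("utf-8")) == u
assert uuid.UUID(bytes=as_fixed16) == u
```

Either mapping is lossless; the difference is only in size and in how readable the stored value is.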

Re: [DISCUSS] UUID type

Posted by Russell Spitzer <ru...@gmail.com>.

Without time-based UUIDs as a special type, I think these aren't as useful, since the only comparison that works on a non-time-based UUID is equality. TimeUUIDs need another comparator (and type), since they are not lexicographically comparable, but then you can actually benefit from range predicates as well as equality. I think the biggest benefit Iceberg gives us is file pruning, and if we can't do that much better with a special UUID type, I think it may not be worth the complexity.

Storing a UUID as a string is pretty wasteful, but not something I think we should add a type to avoid.

So I'm at best +0 on UUIDs.
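
The point about time-based UUIDs not being lexicographically comparable can be illustrated with a small sketch. It assumes version 1 UUIDs built by hand with the Python standard library; the helper `v1_uuid` and the fixed clock sequence and node values are made up for the example and are not part of Iceberg or any engine.

```python
import uuid


def v1_uuid(timestamp: int) -> uuid.UUID:
    """Build a version 1 UUID from a 60-bit timestamp (hypothetical helper).

    The clock sequence and node are fixed, arbitrary values.
    """
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(
        fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0x0123456789AB)
    )


# Two timestamps that are one tick apart.
earlier = v1_uuid(0x1EC0000FFFFFFFF)
later = v1_uuid(0x1EC000100000000)

print(earlier)  # ffffffff-0000-11ec-8000-0123456789ab
print(later)    # 00000000-0001-11ec-8000-0123456789ab

# The embedded timestamps are ordered as expected...
print(earlier.time < later.time)    # True
# ...but byte order (and string order) is reversed, because the low
# 32 bits of the timestamp are stored first.
print(earlier.bytes < later.bytes)  # False
```

So min/max bounds computed on the raw bytes or on the string form would not support range predicates on the embedded time; that would need a comparator that reassembles the timestamp first.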

> On Jul 27, 2021, at 11:54 PM, Jack Ye <ye...@gmail.com> wrote:
> 
> Yes, I agree with Jacques that fixed binary is what it is in the end. I think it is more about user experience: whether the conversion is done on the user side or on the Iceberg and engine side. Many people just store a UUID as a 36-byte string instead of a 16-byte binary, so with an explicit UUID type, Iceberg can optimize this common use case internally for users. There might be some other benefits I overlooked, but maybe the complication introduced by this type does not really justify the slightly better user experience. I am also on the fence about it.
> 
> -Jack Ye
> 
> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <jacquesnadeau@gmail.com> wrote:
> What specific arguments are there for it being a first-class type, besides that it exists elsewhere? Is there some kind of optimization Iceberg or an engine could do if it was typed versus just a bucket of bits? Fixed-width binary seems to cover the cases I see in terms of actual functionality in the Iceberg libraries or engines…
> 
> 
> 
> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yyanyyyy@gmail.com> wrote:
> One conversation I came across regarding UUID deprecation was in https://github.com/apache/iceberg/pull/1611
> 
> Thanks,
> Yan
> 
> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
> Hi Joshua, 
> 
> I do not have a strong preference about the UUID type, but I would like to highlight that the type is handled inconsistently in Iceberg across file formats. (See: https://github.com/apache/iceberg/issues/1881)
> 
> If we keep the type, it would be good to standardize the handling in every file format. 
> 
> Thanks, Peter 
> 

Re: [DISCUSS] UUID type

Posted by parth brahmbhatt <br...@gmail.com>.

I am personally against a UUID type that does not guarantee, at the spec
level, that values are unique across something. Even if the spec could
guarantee that, it feels like we are trying to define a type for what should
be a constraint. I would rather remove support for UUID and let the engines do
coercion when needed, but invest in actually adding a constraint definition
framework at the spec level so we can define constraints like "Column x is
unique at partition level".

Thanks
Parth

On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
wrote:

> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
> type. Which engines are you thinking of that have a native UUID type
> besides the Presto derivatives and support Iceberg?
>
> I agree that Trino should expose a UUID type on top of Iceberg tables. All
> the user experience things that you are describing as important (compact
> storage, friendly display, DDL, clean literals) are possible without it
> being a first-class type in Iceberg, using a Trino-specific property.
>
> I don't really have a strong opinion about UUID. In general, type bloat is
> probably just a part of this kind of project. Generally, CHAR(X) and
> VARCHAR(X) feel like much bigger concerns given that they exist in all of
> the engines but not Iceberg--especially when we start talking about views.
>
> Some of this argues for physical vs logical type abstraction. (Something
> that was always challenging in Parquet but also helped to resolve how these
> types are managed in engines that don't support them.)
>
> thanks,
> Jacques
>
> PS: Funny aside, the bloat on an IP address is actually worse than a UUID,
> right? IPv4 = 4 bytes. IPv4 string = up to 15 bytes. 15/4 => 275% bloat. UUID
> 36/16 => 125% bloat.
>
> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> I don't think this is just a problem in Trino.
>>
>> If there is no UUID type, then a user must choose between a 36-byte
>> string and a 16-byte binary. That's not a good choice to force people into.
>> If someone chooses binary, then it's harder to work with rows and construct
>> queries even though there is a standard representation for UUIDs. To avoid
>> the user headache, people will probably choose to store values as strings.
>> Using a string would mean that more than half the value is needlessly
>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>> entire value. And since engines don't know what's in the string, the full
>> value must be used in comparison, which is extra work and extra space.
>>
>> Inflated values may not be a problem in some cases. IPv4 addresses are
>> one case where you could argue that it doesn't matter very much that they
>> are typically stored as strings. But I expect the use of UUIDs to be common
>> for ID columns because you can generate them without coordination (unlike
>> an incrementing ID) and that's a concern because the use as an ID makes
>> them likely to be join keys.
>>
>> If we want the values to be stored as 16-byte fixed, then we need to make
>> it easy to get the expected string representation in and out, just like we
>> do with date/time types. I don't think that's specific to any engine.
>>
>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <ja...@gmail.com>
>> wrote:
>>
>>> I think points 1&2 don't really apply since a fixed width binary already
>>> covers those properties.
>>>
>>> It seems like this isn't really a concern of Iceberg but rather a
>>> cosmetic layer that exists primarily (only?) in Trino. In that case I would
>>> be inclined to say that Trino should just use custom metadata and a fixed
>>> binary type. That way you still have the desired UX without exposing those
>>> extra concepts to Iceberg. It actually feels like better encapsulation,
>>> IMO.
>>>
>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I agree with Ryan that it takes some precautions before one can assume
>>>> uniqueness of UUID values, and that this isn't anything special to UUIDs
>>>> at all.
>>>> After all, this is just a primitive type, which is commonly used for
>>>> certain things, but "commonly" doesn't mean "always".
>>>>
>>>> The advantages of having a dedicated type are on 3 layers.
>>>> The compact representation in the file and the compact representation in
>>>> memory in the query engine are the ones mentioned above.
>>>>
>>>> The third layer is usability. Seeing a UUID column, I know what
>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
>>>> needing to cast to varchar.
>>>> It also removes the temptation to cast uuid to varbinary to achieve
>>>> a compact representation.
>>>>
>>>> Thus I think it would be good to have them.
>>>>
>>>> Best
>>>> PF
>>>>
>>>>
>>>>
>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> The original reason why I added UUID to the spec was that I thought
>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>> fields and how we might do something similar in Iceberg.
>>>>>
>>>>> The reason we have thought about removing UUID is that there aren't as
>>>>> many opportunities to take advantage of UUIDs as I thought. My original
>>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>>> that a UUID field has a high NDV. But that's not necessarily the case
>>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>> invest in support for UUID.
>>>>>
>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>> some of these things with the row identifier fields. Engines can assume
>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>> ensure lots of partition split locations (this is really important for
>>>>> Spark).
>>>>>
>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8
>>>>> strings that are more than twice as large, or even worse UTF-16 strings
>>>>> that are more than 4x as large. Since UUIDs are likely to be used in joins, this
>>>>> could really help engines as long as they can keep the values as
>>>>> fixed-width binary.
>>>>>
>>>>> I could go either way on this. I think it is valuable to have a
>>>>> compact representation for UUIDs rather than using the string
>>>>> representation. But that will require investing in the type and building
>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>> I think it may be worth keeping and investing in.
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Re: [DISCUSS] UUID type

Posted by Jacques Nadeau <ja...@gmail.com>.

I already added it to Substrait because of Iceberg lazy consensus :D


On Fri, Sep 17, 2021 at 2:05 PM Ryan Blue <bl...@tabular.io> wrote:

> Let's move forward with it. I'm not hearing much dissent after saying the
> general trend is to keep UUID. So let's call it lazy consensus.
>
> Ryan
>
> On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi Ryan,
>>
>> Please advise whatever feels more appropriate from your perspective.
>> From my perspective, we could just go ahead and merge Trino Iceberg
>> support for UUID, since this is just fulfilling the spec as it is defined
>> today.
>>
>> Best
>> PF
>>
>>
>> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> I don't think we necessarily reached consensus, but I think the general
>>> trend toward the end was to keep support for UUID. Should we start a vote
>>> to validate consensus?
>>>
>>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com>
>>> wrote:
>>>
>>>> Just following up on Piotr's message here.
>>>>
>>>> Have we converged? I think most people would assume that silence is a
>>>> vote for the status quo.
>>>>
>>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <
>>>> piotr@starburstdata.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It seems we converged here that UUID should remain included.
>>>>> I read this as consensus reached, but that may be subjective. Did we
>>>>> objectively reach consensus on this?
>>>>>
>>>>> From the Iceberg project's perspective there isn't anything to do, as UUID
>>>>> already *is* part of the spec (
>>>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>>>> The Trino Iceberg PR adding support for UUID,
>>>>> https://github.com/trinodb/trino/pull/8747, was pending merge while
>>>>> this conversation was ongoing.
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ryan and all,
>>>>>>
>>>>>> That sounds like a reasonable reason to leave IP address types out.
>>>>>> In my experience, dedicated IP address types are mostly found in logging
>>>>>> tools and other things for sysadmins / DevOps etc.
>>>>>>
>>>>>> When querying data with IP addresses, I’ve seen it done quite a lot
>>>>>> (e.g. for security reasons), but they are usually stored as strings or
>>>>>> manipulated in a UDF. They’re not commonly supported types.
>>>>>>
>>>>>> I would also draw the line at UUID types.
>>>>>>
>>>>>> - Kyle Bendickson
>>>>>>
>>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>
>>>>>> Jacques, you make some good points here. I think my argument about
>>>>>> usability leading to performance issues is a stronger argument for engines
>>>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>>>
>>>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>>>> table that's difficult to query in other engines, where people would
>>>>>> probably choose to store the data as a string. On the other hand, if
>>>>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>>>>> UUID string representation before passing data to the other engines.
>>>>>> While the engines would be using 36-byte values in join keys, the user
>>>>>> experience issue is fixed and the data is more compact on disk and in
>>>>>> Iceberg's bounds metadata.
>>>>>>
>>>>>> While having a UUID type in Iceberg can't really help engines that
>>>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>>>> slightly better to have the UUID type in general since at least one engine
>>>>>> supports it and it provides the expected user experience with a compact
>>>>>> representation.
>>>>>>
>>>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>>>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>>>>> joins, which are an important operation. IPv4 representations aren't that
>>>>>> big of an inconvenience unless you need to do IP manipulation, which is
>>>>>> typically in a UDF and not the query engine. And you can always keep both
>>>>>> representations in a table fairly inexpensively. Does this sound like a
>>>>>> valid rationale for having UUID but not IP types?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Josh Howard
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>

Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

Let's move forward with it. I'm not hearing much dissent after saying the
general trend is to keep UUID. So let's call it lazy consensus.

Ryan

On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi Ryan,
>
> Please advise whatever feels more appropriate from your perspective.
> From my perspective, we could just go ahead and merge Trino Iceberg
> support for UUID, since this is just fulfilling the spec as it is defined
> today.
>
> Best
> PF
>
>
> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <bl...@tabular.io> wrote:
>
>> I don't think we necessarily reached consensus, but I think the general
>> trend toward the end was to keep support for UUID. Should we start a vote
>> to validate consensus?
>>
>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com>
>> wrote:
>>
>>> Just following up on Piotr's message here.
>>>
>>> Have we converged? I think most people would assume that silence is a
>>> vote for the status quo.
>>>
>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems we converged here that UUID should remain included.
>>>> I read this as consensus reached, but that may be subjective. Did we
>>>> objectively reach consensus on this?
>>>>
>>>> From the Iceberg project's perspective there isn't anything to do, as UUID
>>>> already *is* part of the spec (
>>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>>> The Trino Iceberg PR adding support for UUID,
>>>> https://github.com/trinodb/trino/pull/8747, was pending merge while
>>>> this conversation was ongoing.
>>>>
>>>> Best,
>>>> PF
>>>>
>>>>
>>>>
>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>>>
>>>>> Hi Ryan and all,
>>>>>
>>>>> That sounds like a reasonable reason to leave IP address types out. In
>>>>> my experience, dedicated IP address types are mostly found in logging tools
>>>>> and other things for sysadmins / DevOps etc.
>>>>>
>>>>> When querying data with IP addresses, I’ve seen it done quite a lot
>>>>> (e.g. for security reasons), but usually stored as a string or manipulated in a UDF.
>>>>> They’re not commonly supported types.
>>>>>
>>>>> I would also draw the line at UUID types.
>>>>>
>>>>> - Kyle Bendickson
>>>>>
>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>> 
>>>>> Jacques, you make some good points here. I think my argument about
>>>>> usability leading to performance issues is a stronger argument for engines
>>>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>>>
>>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>>> table that's difficult to query in other engines, where people would
>>>>> probably choose to store the data as a string. On the other hand, if
>>>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>>>> UUID string representation before passing data to the other engines.
>>>>> While the engines would be using 36-byte values in join keys, the user
>>>>> experience issue is fixed and the data is more compact on disk and in
>>>>> Iceberg's bounds metadata.
>>>>>
>>>>> While having a UUID type in Iceberg can't really help engines that
>>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>>> slightly better to have the UUID type in general since at least one engine
>>>>> supports it and it provides the expected user experience with a compact
>>>>> representation.
>>>>>
>>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>>>> joins, which are an important operation. IPv4 representations aren't that
>>>>> big of an inconvenience unless you need to do IP manipulation, which is
>>>>> typically in a UDF and not the query engine. And you can always keep both
>>>>> representations in a table fairly inexpensively. Does this sound like a
>>>>> valid rationale for having UUID but not IP types?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <
>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>
>>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a
>>>>>> native type. Which engines are you thinking of that have a native UUID type
>>>>>> besides the Presto derivatives and support Iceberg?
>>>>>>
>>>>>> I agree that Trino should expose a UUID type on top of Iceberg
>>>>>> tables. All the user experience things that you are describing as important
>>>>>> (compact storage, friendly display, ddl, clean literals) are possible
>>>>>> without it being a first class type in Iceberg using a trino specific
>>>>>> property.
>>>>>>
>>>>>> I don't really have a strong opinion about UUID. In general, type
>>>>>> bloat is probably just a part of this kind of project. Generally, CHAR(X)
>>>>>> and VARCHAR(X) feel like much bigger concerns given that they exist in all
>>>>>> of the engines but not Iceberg--especially when we start talking about
>>>>>> views.
>>>>>>
>>>>>> Some of this argues for physical vs logical type abstraction.
>>>>>> (Something that was always challenging in Parquet but also helped to
>>>>>> resolve how these types are managed in engines that don't support them.)
>>>>>>
>>>>>> thanks,
>>>>>> Jacques
>>>>>>
>>>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>>>> UUID 36/16 => 125% bloat.
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>
>>>>>>> I don't think this is just a problem in Trino.
>>>>>>>
>>>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>>>>> the user headache, people will probably choose to store values as strings.
>>>>>>> Using a string would mean that more than half the value is needlessly
>>>>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>>>>> entire value. And since engines don't know what's in the string, the full
>>>>>>> value must be used in comparison, which is extra work and extra space.
>>>>>>>
>>>>>>> Inflated values may not be a problem in some cases. IPv4 addresses
>>>>>>> are one case where you could argue that it doesn't matter very much that
>>>>>>> they are typically stored as strings. But I expect the use of UUIDs to be
>>>>>>> common for ID columns because you can generate them without coordination
>>>>>>> (unlike an incrementing ID) and that's a concern because the use as an ID
>>>>>>> makes them likely to be join keys.
>>>>>>>
>>>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>>>> make it easy to get the expected string representation in and out, just
>>>>>>> like we do with date/time types. I don't think that's specific to any
>>>>>>> engine.
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>>>> already covers those properties.
>>>>>>>>
>>>>>>>> It seems like this isn't really a concern of iceberg but rather a
>>>>>>>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>>>>>>>> be inclined to say that trino should just use custom metadata and a fixed
>>>>>>>> binary type. That way you still have the desired ux without exposing those
>>>>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>>>>> imo.
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>>>> piotr@starburstdata.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>>>>>> special for UUIDs at all.
>>>>>>>>> After all, this is just a primitive type, which is commonly used
>>>>>>>>> for certain things, but "commonly" doesn't mean "always".
>>>>>>>>>
>>>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>>>> The compact representation in the file, and compact representation
>>>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>>>
>>>>>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>>>> without needing to cast to varchar.
>>>>>>>>> It also removes the temptation to cast uuid to varbinary to achieve
>>>>>>>>> a compact representation.
>>>>>>>>>
>>>>>>>>> Thus I think it would be good to have them.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> PF
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> The original reason why I added UUID to the spec was that I
>>>>>>>>>> thought there would be opportunities to take advantage of UUIDs as unique
>>>>>>>>>> values and to optimize the use of UUIDs. I was thinking about
>>>>>>>>>> auto-increment ID fields and how we might do something similar in Iceberg.
>>>>>>>>>>
>>>>>>>>>> The reason we have thought about removing UUID is that there
>>>>>>>>>> aren't as many opportunities to take advantage of UUIDs as I thought. My
>>>>>>>>>> original assumption was that we could do things like bucket on UUID fields
>>>>>>>>>> or assume that a UUID field has a high NDV. But that's not necessarily the
>>>>>>>>>> case when a UUID field is a foreign key, only when it is used as an
>>>>>>>>>> identifier or primary key. Before Jack added tracking for row identifier
>>>>>>>>>> fields, we couldn't know that a UUID was unique in a table. As a result, we
>>>>>>>>>> didn't invest in support for UUID.
>>>>>>>>>>
>>>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can
>>>>>>>>>> do some of these things with the row identifier fields. Engines can assume
>>>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>>>> Spark).
>>>>>>>>>>
>>>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as 36 byte
>>>>>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>>>>>> strings that are more than 4x as large. Since UUIDs are likely to be used in joins,
>>>>>>>>>> this could really help engines as long as they can keep the values as
>>>>>>>>>> fixed-width binary.
>>>>>>>>>>
>>>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>>>> representation. But that will require investing in the type and building
>>>>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the
>>>>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>>>>
>>>>>>>>>>> -Jack Ye
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What specific arguments are there for it being a first class
>>>>>>>>>>>> type besides it is elsewhere? Is there some kind of optimization iceberg or
>>>>>>>>>>>> an engine could do if it was typed versus just a bucket of bits? Fixed
>>>>>>>>>>>> width binary seems to cover the cases I see in terms of actual
>>>>>>>>>>>> functionality in the iceberg libraries or engines…
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> One conversation I used to come across regarding UUID
>>>>>>>>>>>>> deprecation was from
>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>>>> joshthoward@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but
>>>>>>>>>>>>>>> there seems to have been some discussion about removing it? I could not
>>>>>>>>>>>>>>> find the original discussion, but a reference to the discussion can be
>>>>>>>>>>>>>>> found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers
>>>>>>>>>>>>>>> are supported
>>>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>>
>>>
>>> --
>>> Josh Howard
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Ryan Blue
Tabular
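
The point made in this thread about lower/upper bounds (a string column loses more than half of a UUID when bounds are truncated, while a 16-byte value is kept whole) can be illustrated with a rough Python sketch. It assumes a truncation length of 16 for string bounds and uses a deliberately simplified truncation (a plain prefix); TRUNCATE_LENGTH, truncated_lower_bound and the sample value are hypothetical names for illustration, not Iceberg APIs.

```python
import uuid

# Assumption for illustration: string bounds are truncated to 16 characters.
TRUNCATE_LENGTH = 16

# Hypothetical sample value.
value = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")


def truncated_lower_bound(text, length=TRUNCATE_LENGTH):
    """Simplified lower bound: keep only the first `length` characters."""
    return text[:length]


string_bound = truncated_lower_bound(str(value))
kept_hex_digits = len(string_bound.replace("-", ""))

# A 16-byte binary value fits entirely, so no digits are dropped.
binary_bound = value.bytes

print(string_bound)                                     # 123e4567-e89b-12
print(kept_hex_digits, "of 32 hex digits kept")         # 14 of 32
print(len(binary_bound) * 2, "of 32 hex digits kept")   # 32 of 32
```

Under these assumptions the string bound keeps 14 of the 32 hex digits, so file pruning on a string-typed UUID column works from less than half of each value.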

Re: [DISCUSS] UUID type

Posted by Piotr Findeisen <pi...@starburstdata.com>.

Hi Ryan,

Please advise whatever feels more appropriate from your perspective.
From my perspective, we could just go ahead and merge Trino Iceberg support
for UUID, since this is just fulfilling the spec as it is defined today.

Best
PF



Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

I don't think we necessarily reached consensus, but I think the general
trend toward the end was to keep support for UUID. Should we start a vote
to validate consensus?

On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <jo...@gmail.com> wrote:

> Just following up on Piotr's message here.
>
> Have we converged? I think most people would assume that silence is a vote
> for the status-quo.
>
> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> It seems we converged here that UUID should remain included.
>> I read this as a consensus reached, but it may be subjective. Did we
>> objectively reach consensus on this?
>>
>> From the Iceberg project's perspective there isn't anything to do, as UUID
>> already *is* part of the spec (
>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>> Trino Iceberg PR adding support for UUID
>> https://github.com/trinodb/trino/pull/8747 was pending merge while this
>> conversation has been ongoing.
>>
>> Best,
>> PF
>>
>>
>>
>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:
>>
>>> Hi Ryan and all,
>>>
>>> That sounds like a reasonable reason to leave IP address types out. In
>>> my experience, dedicated IP address types are mostly found in logging tools
>>> and other things for sysadmins / DevOps etc.
>>>
>>> When querying data with IP addresses, I’ve seen it done quite a lot (e.g.
>>> for security reasons), but usually stored as a string or manipulated in a UDF.
>>> They’re not commonly supported types.
>>>
>>> I would also draw the line at UUID types.
>>>
>>> - Kyle Bendickson
>>>
>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>>>
>>> 
>>> Jacques, you make some good points here. I think my argument about
>>> usability leading to performance issues is a stronger argument for engines
>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>
>>> Another thing to consider is cross-engine support. If Iceberg removes
>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>> table that's difficult to query in other engines, where people would
>>> probably choose to store the data as a string. On the other hand, if
>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>> UUID string representation before passing data to the other engines.
>>> While the engines would be using 36-byte values in join keys, the user
>>> experience issue is fixed and the data is more compact on disk and in
>>> Iceberg's bounds metadata.
>>>
>>> While having a UUID type in Iceberg can't really help engines that don't
>>> support UUID take advantage of the type at runtime, it does seem slightly
>>> better to have the UUID type in general since at least one engine supports
>>> it and it provides the expected user experience with a compact
>>> representation.
>>>
>>> IPv4 addresses are a good thing to think about as well, since most of
>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>> joins, which are an important operation. IPv4 representations aren't that
>>> big of an inconvenience unless you need to do IP manipulation, which is
>>> typically in a UDF and not the query engine. And you can always keep both
>>> representations in a table fairly inexpensively. Does this sound like a
>>> valid rationale for having UUID but not IP types?
>>>
>>> Ryan
>>>
>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
>>> wrote:
>>>
>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>>> type. Which engines are you thinking of that have a native UUID type
>>>> besides the Presto derivatives and support Iceberg?
>>>>
>>>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>>>> All the user experience things that you are describing as important
>>>> (compact storage, friendly display, DDL, clean literals) are possible
>>>> without it being a first-class type in Iceberg, using a Trino-specific
>>>> property.
>>>>
>>>> I don't really have a strong opinion about UUID. In general, type bloat
>>>> is probably just a part of this kind of project. Generally, CHAR(X) and
>>>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>>>> the engines but not Iceberg--especially when we start talking about views.
>>>>
>>>> Some of this argues for physical vs logical type abstraction.
>>>> (Something that was always challenging in Parquet but also helped to
>>>> resolve how these types are managed in engines that don't support them.)
>>>>
>>>> thanks,
>>>> Jacques
>>>>
>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>> UUID 36/16 => 125% bloat.
>>>>
>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>
>>>>> I don't think this is just a problem in Trino.
>>>>>
>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>> string and a 16-byte binary. That's not a good choice to force people into.
>>>>> If someone chooses binary, then it's harder to work with rows and construct
>>>>> queries even though there is a standard representation for UUIDs. To avoid
>>>>> the user headache, people will probably choose to store values as strings.
>>>>> Using a string would mean that more than half the value is needlessly
>>>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>>>> entire value. And since engines don't know what's in the string, the full
>>>>> value must be used in comparison, which is extra work and extra space.
>>>>>
>>>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>>>> one case where you could argue that it doesn't matter very much that they
>>>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>>>> for ID columns because you can generate them without coordination (unlike
>>>>> an incrementing ID) and that's a concern because the use as an ID makes
>>>>> them likely to be join keys.
>>>>>
>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>> make it easy to get the expected string representation in and out, just
>>>>> like we do with date/time types. I don't think that's specific to any
>>>>> engine.
>>>>>
>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>
>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>> already covers those properties.
>>>>>>
>>>>>> It seems like this isn't really a concern of Iceberg but rather a
>>>>>> cosmetic layer that exists primarily (only?) in Trino. In that case I would
>>>>>> be inclined to say that Trino should just use custom metadata and a fixed
>>>>>> binary type. That way you still have the desired UX without exposing those
>>>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>>>> imo.
>>>>>>
>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>> piotr@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>>>> special for UUIDs at all.
>>>>>>> After all, this is just a primitive type, which is commonly used for
>>>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>>>
>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>> The compact representation in the file, and compact representation
>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>
>>>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>> without needing to cast to varchar.
>>>>>>> It also removes the temptation to cast uuid to varbinary to achieve
>>>>>>> a compact representation.
>>>>>>>
>>>>>>> Thus I think it would be good to have them.
>>>>>>>
>>>>>>> Best
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>>>
>>>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>>>
>>>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>>>> as many opportunities to take advantage of UUIDs as I thought. My original
>>>>>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>>>>>> that a UUID field has a high NDV. But that's not necessarily the case
>>>>>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>>>>> invest in support for UUID.
>>>>>>>>
>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>> Spark).
>>>>>>>>
>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as 36-byte
>>>>>>>> UTF-8 strings that are more than twice as large, or even worse UTF-16
>>>>>>>> strings that are more than 4x as large. Since UUIDs are likely to be used in joins,
>>>>>>>> this could really help engines as long as they can keep the values as
>>>>>>>> fixed-width binary.
>>>>>>>>
>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>> representation. But that will require investing in the type and building
>>>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>>>> I think it may be worth keeping and investing in.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the
>>>>>>>>> end. I think it is more about user experience, whether the conversion is
>>>>>>>>> done at the user side or Iceberg and engine side. Many people just store
>>>>>>>>> UUID as a 36 byte string instead of a 16 byte binary, so with an explicit
>>>>>>>>> UUID type, Iceberg can optimize this common use case internally for users.
>>>>>>>>> There might be some other benefits I overlooked, but maybe the complication
>>>>>>>>> introduced by this type does not really justify the slightly better user
>>>>>>>>> experience. I am also on the fence about it.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or an
>>>>>>>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>>>>>>>> binary seems to cover the cases I see in terms of actual functionality in
>>>>>>>>>> the iceberg libraries or engines…
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> One conversation I used to come across regarding UUID
>>>>>>>>>>> deprecation was from https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>
>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>
>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>> joshthoward@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>
>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>>>> supported
>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>>
>
> --
> Josh Howard
>


-- 
Ryan Blue
Tabular
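
A quick way to check the size figures quoted in this thread (16-byte binary versus 36-byte string, and the "bloat" percentages for UUIDs and IPv4 addresses) is the Python sketch below. It is only an illustration using Python's standard uuid module; the sample UUID and the bloat_percent helper are made up for this example and are not part of Iceberg or any engine.

```python
import uuid

# Arbitrary sample value (an assumption for illustration); any UUID has the same sizes.
value = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

binary_size = len(value.bytes)                     # what a fixed[16] column would hold
utf8_size = len(str(value).encode("utf-8"))        # canonical 8-4-4-4-12 text form
utf16_size = len(str(value).encode("utf-16-le"))   # 2 bytes per character, no BOM


def bloat_percent(text_size, compact_size):
    """Extra bytes of the text form, as a percentage of the compact form."""
    return (text_size - compact_size) / compact_size * 100


print("binary:", binary_size, "utf-8:", utf8_size, "utf-16:", utf16_size)
print("UUID bloat %:", bloat_percent(utf8_size, binary_size))
# Longest dotted-quad IPv4 string ("255.255.255.255") is 15 characters vs 4 raw bytes.
print("IPv4 bloat %:", bloat_percent(len("255.255.255.255"), 4))
```

The UTF-16 figure assumes two bytes per character with no byte-order mark, which is how ASCII-only text is laid out in that encoding.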

Re: [DISCUSS] UUID type

Posted by Joshua Howard <jo...@gmail.com>.

Just following up on Piotr's message here.

Have we converged? I think most people would assume that silence is a vote
for the status quo.

On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
wrote:

> Hi,
>
> It seems we converged here that UUID should remain included.
> I read this as a consensus reached, but it may be subjective. Did we
> objectively reach consensus on this?
>
> From the Iceberg project's perspective there isn't anything to do, as UUID
> already *is* part of the spec (
> https://iceberg.apache.org/spec/#schemas-and-data-types).
> Trino Iceberg PR adding support for UUID
> https://github.com/trinodb/trino/pull/8747 was pending merge while this
> conversation has been ongoing.
>
> Best,
> PF

-- 
Josh Howard

Re: [DISCUSS] UUID type

Posted by Piotr Findeisen <pi...@starburstdata.com>.

Hi,

It seems we converged here that UUID should remain included.
I read this as a consensus reached, but it may be subjective. Did we
objectively reach consensus on this?

From the Iceberg project's perspective there isn't anything to do, as UUID
already *is* part of the spec (
https://iceberg.apache.org/spec/#schemas-and-data-types).
Trino Iceberg PR adding support for UUID
https://github.com/trinodb/trino/pull/8747 was pending merge while this
conversation has been ongoing.

Best,
PF



On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kj...@gmail.com> wrote:

> Hi Ryan and all,
>
> That sounds like a reasonable reason to leave IP address types out. In my
> experience, dedicated IP address types are mostly found in logging tools
> and other things for sysadmins / DevOps etc.
>
> When querying data with IP addresses, I’ve seen it done quite a lot (e.g. for
> security reasons), but they are usually stored as strings or manipulated in a UDF.
> They’re not commonly supported types.
>
> I would also draw the line at UUID types.
>
> - Kyle Bendickson
>
> On Jul 30, 2021, at 3:15 PM, Ryan Blue <bl...@tabular.io> wrote:
>
> 
> Jacques, you make some good points here. I think my argument about
> usability leading to performance issues is a stronger argument for engines
> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
> chooses to use a string in an engine that doesn't have a UUID type.
>
> Another thing to consider is cross-engine support. If Iceberg removes
> UUID, then Trino would probably translate to fixed[16]. That results in a
> table that's difficult to query in other engines, where people would
> probably choose to store the data as a string. On the other hand, if
> Iceberg keeps the UUID type then integrations would simply translate to the
> UUID string representation before passing data to the other engines.
> While the engines would be using 36-byte values in join keys, the user
> experience issue is fixed and the data is more compact on disk and in
> Iceberg's bounds metadata.
>
> While having a UUID type in Iceberg can't really help engines that don't
> support UUID take advantage of the type at runtime, it does seem slightly
> better to have the UUID type in general since at least one engine supports
> it and it provides the expected user experience with a compact
> representation.
>
> IPv4 addresses are a good thing to think about as well, since most of the
> same arguments apply. If we keep the UUID type, should we also add IPv4 or
> IPv6 types? I would probably draw the line at UUID because it helps in
> joins, which are an important operation. IPv4 representations aren't that
> big of an inconvenience unless you need to do IP manipulation, which is
> typically in a UDF and not the query engine. And you can always keep both
> representations in a table fairly inexpensively. Does this sound like a
> valid rationale for having UUID but not IP types?
>
> Ryan
>
> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <ja...@gmail.com>
> wrote:
>
>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>> type. Which engines are you thinking of that have a native UUID type
>> besides the Presto derivatives and support Iceberg?
>>
>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>> All the user experience things that you are describing as important
>> (compact storage, friendly display, DDL, clean literals) are possible
>> without it being a first-class type in Iceberg, using a Trino-specific
>> property.
>>
>> I don't really have a strong opinion about UUID. In general, type bloat
>> is probably just a part of this kind of project. Generally, CHAR(X) and
>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>> the engines but not Iceberg--especially when we start talking about views.
>>
>> Some of this argues for physical vs logical type abstraction. (Something
>> that was always challenging in Parquet but also helped to resolve how these
>> types are managed in engines that don't support them.)
>>
>> thanks,
>> Jacques
>>
>> PS: Funny aside, the bloat on an ip address is actually worse than a
>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>> UUID 36/16 => 125% bloat.
>>
>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> I don't think this is just a problem in Trino.
>>>
>>> If there is no UUID type, then a user must choose between a 36-byte
>>> string and a 16-byte binary. That's not a good choice to force people into.
>>> If someone chooses binary, then it's harder to work with rows and construct
>>> queries even though there is a standard representation for UUIDs. To avoid
>>> the user headache, people will probably choose to store values as strings.
>>> Using a string would mean that more than half the value is needlessly
>>> discarded by default in Iceberg lower/upper bounds instead of keeping the
>>> entire value. And since engines don't know what's in the string, the full
>>> value must be used in comparison, which is extra work and extra space.
>>>
>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>> one case where you could argue that it doesn't matter very much that they
>>> are typically stored as strings. But I expect the use of UUIDs to be common
>>> for ID columns because you can generate them without coordination (unlike
>>> an incrementing ID) and that's a concern because the use as an ID makes
>>> them likely to be join keys.
>>>
>>> If we want the values to be stored as 16-byte fixed, then we need to
>>> make it easy to get the expected string representation in and out, just
>>> like we do with date/time types. I don't think that's specific to any
>>> engine.
>>>
>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <ja...@gmail.com>
>>> wrote:
>>>
>>>> I think points 1&2 don't really apply since a fixed width binary
>>>> already covers those properties.
>>>>
>>>> It seems like this isn't really a concern of Iceberg but rather a
>>>> cosmetic layer that exists primarily (only?) in Trino. In that case I would
>>>> be inclined to say that Trino should just use custom metadata and a fixed
>>>> binary type. That way you still have the desired UX without exposing those
>>>> extra concepts to Iceberg. It actually feels like better encapsulation
>>>> imo.
>>>>
>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I agree with Ryan that it takes some precautions before one can
>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>> special for UUIDs at all.
>>>>> After all, this is just a primitive type, which is commonly used for
>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>
>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>> The compact representation in the file, and compact representation in
>>>>> memory in the query engine are the ones mentioned above.
>>>>>
>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
>>>>> needing to cast to varchar.
>>>>> It also removes the temptation to cast uuid to varbinary to achieve
>>>>> a compact representation.
>>>>>
>>>>> Thus I think it would be good to have them.
>>>>>
>>>>> Best
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>>>
>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>>>> fields and how we might do something similar in Iceberg.
>>>>>>
>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>> as many opportunities to take advantage of UUIDs as I thought. My original
>>>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>>>> that a UUID field has a high NDV. But that's not necessarily the case
>>>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>>>> invest in support for UUID.
>>>>>>
>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>> ensure lots of partition split locations (this is really important for
>>>>>> Spark).
>>>>>>
>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>>>> valid: it is better to represent UUIDs as fixed[16] than as 36 byte UTF-8
>>>>>> strings that are more than twice as large, or even worse UTF-16 strings
>>>>>> that are 4x as large. Since UUIDs are likely to be used in joins, this
>>>>>> could really help engines as long as they can keep the values as
>>>>>> fixed-width binary.
>>>>>>
>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>> compact representation for UUIDs rather than using the string
>>>>>> representation. But that will require investing in the type and building
>>>>>> support in engines that won't take advantage of it. If Trino can use this,
>>>>>> I think it may be worth keeping and investing in.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the end.
>>>>>>> I think it is more about user experience, whether the conversion is done at
>>>>>>> the user side or Iceberg and engine side. Many people just store UUID as a
>>>>>>> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
>>>>>>> Iceberg can optimize this common use case internally for users. There might
>>>>>>> be some other benefits I overlooked, but maybe the complication introduced
>>>>>>> by this type does not really justify the slightly better user experience. I
>>>>>>> am also on the fence about it.
>>>>>>>
>>>>>>> -Jack Ye
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>>>
>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or an
>>>>>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>>>>>> binary seems to cover the cases I see in terms of actual functionality in
>>>>>>>> the iceberg libraries or engines…
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> One conversation I used to come across regarding UUID deprecation
>>>>>>>>> was from https://github.com/apache/iceberg/pull/1611
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yan
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Joshua,
>>>>>>>>>>
>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>> would like to highlight that the type is handled inconsistently in
>>>>>>>>>> Iceberg with different file formats. (See:
>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>
>>>>>>>>>> If we keep the type, it would be good to standardize the handling
>>>>>>>>>> in every file format.
>>>>>>>>>>
>>>>>>>>>> Thanks, Peter
>>>>>>>>>>
>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>
>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>>>> UUID in Iceberg. To summarize…
>>>>>>>>>>>
>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>> supported
>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>
>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>

Re: [DISCUSS] UUID type

Posted by Kyle B <kj...@gmail.com>.

Hi Ryan and all,

That sounds like a reasonable reason to leave IP address types out. In my experience, dedicated IP address types are mostly found in logging tools and other things for sysadmins / DevOps etc.

I've seen data with IP addresses queried quite a lot (e.g., for security reasons), but the addresses are usually stored as strings or manipulated in a UDF. Dedicated IP address types are not commonly supported.

I would also draw the line at UUID types.

- Kyle Bendickson


Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

Jacques, you make some good points here. I think my argument about
usability leading to performance issues is a stronger argument for engines
than for Iceberg. Still, there are inefficiencies in Iceberg if someone
chooses to use a string in an engine that doesn't have a UUID type.

Another thing to consider is cross-engine support. If Iceberg removes UUID,
then Trino would probably translate to fixed[16]. That results in a table
that's difficult to query in other engines, where people would probably
choose to store the data as a string. On the other hand, if Iceberg keeps
the UUID type then integrations would simply translate to the UUID string
representation before passing data to the other engines. While the engines
would be using 36-byte values in join keys, the user experience issue is
fixed and the data is more compact on disk and in Iceberg's bounds metadata.
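
The translation described above can be sketched in a few lines of Python.
This is only an illustration, not Iceberg or engine code: the helper names
fixed16_to_string and string_to_fixed16 are made up for this example, and it
assumes the 16 stored bytes are in big-endian order, which is what Python's
uuid module produces and expects.

```python
import uuid


def fixed16_to_string(raw: bytes) -> str:
    """Render a 16-byte fixed value as the canonical 36-character UUID string."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


def string_to_fixed16(text: str) -> bytes:
    """Parse a UUID string back into its compact 16-byte form."""
    return uuid.UUID(text).bytes


raw = bytes(range(16))               # hypothetical stored value
text = fixed16_to_string(raw)        # what a string-only engine would see
print(len(raw), len(text))           # 16 36
print(string_to_fixed16(text) == raw)  # True
```

The storage side keeps the 16-byte value; only the boundary with an engine
that lacks a UUID type pays for the 36-character form.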

While having a UUID type in Iceberg can't really help engines that don't
support UUID take advantage of the type at runtime, it does seem slightly
better to have the UUID type in general since at least one engine supports
it and it provides the expected user experience with a compact
representation.

IPv4 addresses are a good thing to think about as well, since most of the
same arguments apply. If we keep the UUID type, should we also add IPv4 or
IPv6 types? I would probably draw the line at UUID because it helps in
joins, which are an important operation. IPv4 representations aren't that
big of an inconvenience unless you need to do IP manipulation, which is
typically in a UDF and not the query engine. And you can always keep both
representations in a table fairly inexpensively. Does this sound like a
valid rationale for having UUID but not IP types?

Ryan


-- 
Ryan Blue
Tabular

Re: [DISCUSS] UUID type

Posted by Jacques Nadeau <ja...@gmail.com>.

It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
type. Which engines are you thinking of that have a native UUID type
besides the Presto derivatives and support Iceberg?

I agree that Trino should expose a UUID type on top of Iceberg tables. All
the user experience things that you are describing as important (compact
storage, friendly display, DDL, clean literals) are possible without it
being a first-class type in Iceberg, by using a Trino-specific property.

I don't really have a strong opinion about UUID. In general, type bloat is
probably just a part of this kind of project. Generally, CHAR(X) and
VARCHAR(X) feel like much bigger concerns given that they exist in all of
the engines but not Iceberg--especially when we start talking about views.

Some of this argues for physical vs logical type abstraction. (Something
that was always challenging in Parquet but also helped to resolve how these
types are managed in engines that don't support them.)

thanks,
Jacques

PS: Funny aside, the bloat on an IP address is actually worse than on a UUID,
right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat. UUID
36/16 => 125% bloat.
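
As a rough check of the figures above, here is a minimal Python sketch using
only the standard library. The helper name bloat_percent and the sample
values are assumptions chosen for illustration; "bloat" is taken to mean the
extra size of the text form relative to the binary form, and the IPv4 case
uses the longest possible dotted-quad string.

```python
import ipaddress
import uuid


def bloat_percent(text_len: int, binary_len: int) -> float:
    """Extra size of the text form relative to the binary form, in percent."""
    return (text_len - binary_len) / binary_len * 100


# Worst-case IPv4 string: four 3-digit octets plus three dots.
ip = ipaddress.IPv4Address("255.255.255.255")
ip_text_len = len(str(ip))        # 15 characters
ip_binary_len = len(ip.packed)    # 4 bytes

# Any UUID has the same lengths; this one is an arbitrary sample value.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
uuid_text_len = len(str(u))       # 36 characters
uuid_binary_len = len(u.bytes)    # 16 bytes

print(bloat_percent(ip_text_len, ip_binary_len))      # 275.0
print(bloat_percent(uuid_text_len, uuid_binary_len))  # 125.0
```

Shorter addresses such as 10.0.0.1 bloat less, so 275% is an upper bound for
IPv4, while the UUID figure is constant.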

On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <bl...@tabular.io> wrote:

> I don't think this is just a problem in Trino.
>
> If there is no UUID type, then a user must choose between a 36-byte string
> and a 16-byte binary. That's not a good choice to force people into. If
> someone chooses binary, then it's harder to work with rows and construct
> queries even though there is a standard representation for UUIDs. To avoid
> the user headache, people will probably choose to store values as strings.
> Using a string would mean that more than half the value is needlessly
> discarded by default in Iceberg lower/upper bounds instead of keeping the
> entire value. And since engines don't know what's in the string, the full
> value must be used in comparison, which is extra work and extra space.
>
> Inflated values may not be a problem in some cases. IPv4 addresses are one
> case where you could argue that it doesn't matter very much that they are
> typically stored as strings. But I expect the use of UUIDs to be common for
> ID columns because you can generate them without coordination (unlike an
> incrementing ID) and that's a concern because the use as an ID makes them
> likely to be join keys.
>
> If we want the values to be stored as 16-byte fixed, then we need to make
> it easy to get the expected string representation in and out, just like we
> do with date/time types. I don't think that's specific to any engine.
>
> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <ja...@gmail.com>
> wrote:
>
>> I think points 1&2 don't really apply since a fixed width binary already
>> covers those properties.
>>
>> It seems like this isn't really a concern of iceberg but rather a
>> cosmetic layer that exists primarily (only?) in trino. In that case I would
>> be inclined to say that trino should just use custom metadata and a fixed
>> binary type. That way you still have the desired ux without exposing those
>> extra concepts to Iceberg. It actually feels like better encapsulation
>> imo.
>>
>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I agree with Ryan that it takes some precautions before one can assume
>>> uniqueness of UUID values, and that this isn't special to UUIDs
>>> at all.
>>> After all, this is just a primitive type, which is commonly used for
>>> certain things, but "commonly" doesn't mean "always".
>>>
>>> The advantages of having a dedicated type are on 3 layers.
>>> The compact representation in the file, and compact representation in
>>> memory in the query engine are the ones mentioned above.
>>>
>>> The third layer is usability. Seeing a UUID column, I know what
>>> values I can expect, so it's more descriptive than `id char(36)`.
>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ... without
>>> the need for casting to varchar.
>>> It also removes the temptation of casting uuid to varbinary to achieve
>>> a compact representation.
>>>
>>> Thus I think it would be good to have them.
>>>
>>> Best
>>> PF
>>>
>>>
>>>
>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> The original reason why I added UUID to the spec was that I thought
>>>> there would be opportunities to take advantage of UUIDs as unique values
>>>> and to optimize the use of UUIDs. I was thinking about auto-increment ID
>>>> fields and how we might do something similar in Iceberg.
>>>>
>>>> The reason we have thought about removing UUID is that there aren't as
>>>> many opportunities to take advantage of UUIDs as I thought. My original
>>>> assumption was that we could do things like bucket on UUID fields or assume
>>>> that a UUID field has a high NDV. But that's not necessarily the case with
>>>> when a UUID field is a foreign key, only when it is used as an identifier
>>>> or primary key. Before Jack added tracking for row identifier fields, we
>>>> couldn't know that a UUID was unique in a table. As a result, we didn't
>>>> invest in support for UUID.
>>>>
>>>> Quick aside: Now that row identifier fields are tracked, we can do some
>>>> of these things with the row identifier fields. Engines can assume that the
>>>> tuple of row identifier fields is unique in a table for join estimation.
>>>> And engines can use row identifier fields in sort keys to ensure lots of
>>>> partition split locations (this is really important for Spark).
>>>>
>>>> Coming back to UUIDs, the second reason to have a UUID type is still
>>>> valid: it is better to represent UUIDs as fixed[16] than as 36 byte UTF-8
>>>> strings that are more than twice as large, or even worse UCS-16 Strings
>>>> that are 4x as large. Since UUIDs are likely to be used in joins, this
>>>> could really help engines as long as they can keep the values as
>>>> fixed-width binary.
>>>>
>>>> I could go either way on this. I think it is valuable to have a compact
>>>> representation for UUIDs rather than using the string representation. But
>>>> that will require investing in the type and building support in engines
>>>> that won't take advantage of it. If Trino can use this, I think it may be
>>>> worth keeping and investing in.
>>>>
>>>> Ryan
>>>>
>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <ye...@gmail.com> wrote:
>>>>
>>>>> Yes I agree with Jacques that fixed binary is what it is in the end. I
>>>>> think It is more about user experience, whether the conversion is done at
>>>>> the user side or Iceberg and engine side. Many people just store UUID as a
>>>>> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
>>>>> Iceberg can optimize this common use case internally for users. There might
>>>>> be some other benefits I overlooked, but maybe the complication introduced
>>>>> by this type does not really justify the slightly better user experience. I
>>>>> am also on the fence about it.
>>>>>
>>>>> -Jack Ye
>>>>>
>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>> jacquesnadeau@gmail.com> wrote:
>>>>>
>>>>>> What specific arguments are there for it being a first class type
>>>>>> besides it is elsewhere? Is there some kind of optimization iceberg or an
>>>>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>>>>> binary seems to cover the cases I see in terms of actual functionality in
>>>>>> the iceberg libraries or engines…
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com> wrote:
>>>>>>
>>>>>>> One conversation I used to come across regarding UUID deprecation
>>>>>>> was from https://github.com/apache/iceberg/pull/1611
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yan
>>>>>>>
>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Joshua,
>>>>>>>>
>>>>>>>> I do not have a strong preference about the UUID type, but I would
>>>>>>>> like the highlight, that the type is handled inconsistently in Iceberg with
>>>>>>>> different file formats. (See:
>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>
>>>>>>>> If we keep the type, it would be good to standardize the handling
>>>>>>>> in every file format.
>>>>>>>>
>>>>>>>> Thanks, Peter
>>>>>>>>
>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>>
>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>> seems to have been some discussion about removing it? I could not find the
>>>>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>
>>>>>>>>> I generally agree with the consensus in the Trino issue to keep
>>>>>>>>> UUID in Iceberg. To summarize…
>>>>>>>>>
>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>> supported
>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>
>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>

Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

I don't think this is just a problem in Trino.

If there is no UUID type, then a user must choose between a 36-byte string
and a 16-byte binary. That's not a good choice to force people into. If
someone chooses binary, then it's harder to work with rows and construct
queries even though there is a standard representation for UUIDs. To avoid
the user headache, people will probably choose to store values as strings.
Using a string would mean that more than half the value is needlessly
discarded by default in Iceberg lower/upper bounds instead of keeping the
entire value. And since engines don't know what's in the string, the full
value must be used in comparison, which is extra work and extra space.
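
To make the bounds point concrete, here is a rough Python sketch. It assumes that string lower/upper bounds are truncated to 16 characters by default; the constant TRUNCATE_LENGTH and the sample UUID are illustrative only and are not taken from the Iceberg code.

```python
import uuid

# Assumption: string column bounds are truncated to 16 characters by default.
TRUNCATE_LENGTH = 16

# Arbitrary sample value, used only for illustration.
value = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

as_string = str(value)    # canonical 36-character text form
as_binary = value.bytes   # 16-byte binary form

# What a truncated string bound would keep, and how much it drops.
kept = as_string[:TRUNCATE_LENGTH]
discarded = len(as_string) - len(kept)

print(len(as_string), len(as_binary))  # sizes of the two representations
print(kept, discarded)                 # prefix kept in the bound, characters dropped
```

Under that assumption the string bound keeps only a prefix of the value, while a 16-byte fixed value of the same length budget would fit in full.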

Inflated values may not be a problem in some cases. IPv4 addresses are one
case where you could argue that it doesn't matter very much that they are
typically stored as strings. But I expect the use of UUIDs to be common for
ID columns because you can generate them without coordination (unlike an
incrementing ID) and that's a concern because the use as an ID makes them
likely to be join keys.

If we want the values to be stored as 16-byte fixed, then we need to make
it easy to get the expected string representation in and out, just like we
do with date/time types. I don't think that's specific to any engine.
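
As a sketch of what "in and out" could look like, the two helpers below convert between the canonical string and a 16-byte value using Python's standard uuid module. The function names are hypothetical, and the big-endian byte order produced by UUID.bytes is assumed to be the one wanted for storage.

```python
import uuid


def uuid_string_to_fixed16(text: str) -> bytes:
    """Parse the canonical text form and return the 16-byte big-endian value."""
    return uuid.UUID(text).bytes


def fixed16_to_uuid_string(raw: bytes) -> str:
    """Render a 16-byte value in the canonical lowercase 8-4-4-4-12 text form."""
    # uuid.UUID raises ValueError if raw is not exactly 16 bytes long.
    return str(uuid.UUID(bytes=raw))


original = "123e4567-e89b-12d3-a456-426614174000"
stored = uuid_string_to_fixed16(original)
restored = fixed16_to_uuid_string(stored)
print(len(stored), restored == original)
```

The point is only that the conversion is mechanical and lossless, so it can live in the table format or the engine rather than in every user query.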

-- 
Ryan Blue
Tabular

Re: [DISCUSS] UUID type

Posted by Jacques Nadeau <ja...@gmail.com>.

I think points 1&2 don't really apply since a fixed width binary already
covers those properties.

It seems like this isn't really a concern of Iceberg but rather a cosmetic
layer that exists primarily (only?) in Trino. In that case I would be
inclined to say that Trino should just use custom metadata and a fixed
binary type. That way you still have the desired UX without exposing those
extra concepts to Iceberg. It actually feels like better encapsulation
imo.

Re: [DISCUSS] UUID type

Posted by Piotr Findeisen <pi...@starburstdata.com>.

Hi,

I agree with Ryan that it takes some precautions before one can assume
uniqueness of UUID values, and that UUIDs shouldn't be treated specially in
this respect.
After all, this is just a primitive type, which is commonly used for
certain things, but "commonly" doesn't mean "always".

The advantages of having a dedicated type are on 3 layers.
The compact representation in the file, and compact representation in
memory in the query engine are the ones mentioned above.

The third layer is the usability. Seeing a UUID column I know what values I
can expect, so it's more descriptive than `id char(36)`.
This also means I can CREATE TABLE ... AS SELECT uuid(), ... without the need
for casting to varchar.
It also removes the temptation to cast uuid to varbinary to achieve a compact
representation.

Thus I think it would be good to have them.

Best
PF



Re: [DISCUSS] UUID type

Posted by Ryan Blue <bl...@tabular.io>.

The original reason why I added UUID to the spec was that I thought there
would be opportunities to take advantage of UUIDs as unique values and to
optimize the use of UUIDs. I was thinking about auto-increment ID fields
and how we might do something similar in Iceberg.

The reason we have thought about removing UUID is that there aren't as many
opportunities to take advantage of UUIDs as I thought. My original
assumption was that we could do things like bucket on UUID fields or assume
that a UUID field has a high NDV. But that's not necessarily the case
when a UUID field is a foreign key, only when it is used as an identifier
or primary key. Before Jack added tracking for row identifier fields, we
couldn't know that a UUID was unique in a table. As a result, we didn't
invest in support for UUID.

Quick aside: Now that row identifier fields are tracked, we can do some of
these things with the row identifier fields. Engines can assume that the
tuple of row identifier fields is unique in a table for join estimation.
And engines can use row identifier fields in sort keys to ensure lots of
partition split locations (this is really important for Spark).

Coming back to UUIDs, the second reason to have a UUID type is still valid:
it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8 strings
that are more than twice as large, or even worse UTF-16 strings that are
more than 4x as large. Since UUIDs are likely to be used in joins, this could
really help engines as long as they can keep the values as fixed-width binary.
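
The size comparison can be checked with a few lines of Python. This is only an illustration: it assumes the 2-byte-per-character in-memory string is UTF-16 (encoded here without a byte-order mark), and the dictionary keys are just labels.

```python
import uuid

value = uuid.uuid4()   # any UUID works; the sizes do not depend on the value
text = str(value)      # canonical 36-character form

sizes = {
    "fixed[16]": len(value.bytes),
    "utf-8 string": len(text.encode("utf-8")),
    # utf-16-le is used so that no byte-order mark is counted.
    "utf-16 string": len(text.encode("utf-16-le")),
}

for name, size in sizes.items():
    # Ratio relative to the 16-byte binary representation.
    print(f"{name}: {size} bytes ({size / 16:.2f}x)")
```

For a join key, every comparison, hash, and shuffle touches these bytes, which is why the fixed-width form matters more for IDs than for columns that are rarely joined on.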

I could go either way on this. I think it is valuable to have a compact
representation for UUIDs rather than using the string representation. But
that will require investing in the type and building support in engines
that won't take advantage of it. If Trino can use this, I think it may be
worth keeping and investing in.

Ryan

-- 
Ryan Blue
Tabular

Re: [DISCUSS] UUID type

Posted by Jack Ye <ye...@gmail.com>.

Yes, I agree with Jacques that fixed binary is what it is in the end. I
think it is more about user experience: whether the conversion is done on
the user side or on the Iceberg and engine side. Many people just store UUID
as a 36-byte string instead of a 16-byte binary, so with an explicit UUID type,
Iceberg can optimize this common use case internally for users. There might
be some other benefits I overlooked, but maybe the complication introduced
by this type does not really justify the slightly better user experience. I
am also on the fence about it.

-Jack Ye

Re: [DISCUSS] UUID type

Posted by Jacques Nadeau <ja...@gmail.com>.

What specific arguments are there for making it a first-class type, besides
the fact that it exists elsewhere? Is there some kind of optimization Iceberg
or an engine could do if it were typed versus just a bucket of bits?
Fixed-width binary seems to cover the cases I see in terms of actual
functionality in the Iceberg libraries or engines…
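
As a rough sketch of the fixed-width binary alternative mentioned above, the conversion can be done on the user side with Python's standard `uuid` module. The helper names `to_fixed16` and `from_fixed16` are hypothetical and are not part of any Iceberg API.

```python
import uuid

FIXED_WIDTH = 16  # a UUID is 128 bits


def to_fixed16(text: str) -> bytes:
    """Convert UUID text to 16 raw bytes (hypothetical helper)."""
    return uuid.UUID(text).bytes


def from_fixed16(raw: bytes) -> str:
    """Convert 16 raw bytes back to canonical lowercase UUID text."""
    if len(raw) != FIXED_WIDTH:
        raise ValueError(f"expected {FIXED_WIDTH} bytes, got {len(raw)}")
    return str(uuid.UUID(bytes=raw))


# Equality works directly on the raw bytes, regardless of the
# letter case used in the text form.
a = to_fixed16("123e4567-e89b-12d3-a456-426614174000")
b = to_fixed16("123E4567-E89B-12D3-A456-426614174000")
print(a == b)
print(from_fixed16(a))
```

With this approach the column is just a 16-byte fixed value as far as the table is concerned, and the text representation is purely an application-side concern.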



On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yy...@gmail.com> wrote:

> One conversation I previously came across regarding UUID deprecation was in
> https://github.com/apache/iceberg/pull/1611
>
> Thanks,
> Yan
>
> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Joshua,
>>
>> I do not have a strong preference about the UUID type, but I would like
>> to highlight that the type is handled inconsistently in Iceberg across
>> different file formats. (See:
>> https://github.com/apache/iceberg/issues/1881)
>>
>> If we keep the type, it would be good to standardize the handling in
>> every file format.
>>
>> Thanks, Peter
>>
>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> UUID is a current data type according to the Iceberg spec (
>>> https://iceberg.apache.org/spec/#primitive-types), but there seems to
>>> have been some discussion about removing it? I could not find the original
>>> discussion, but a reference to the discussion can be found here (
>>> https://github.com/trinodb/trino/issues/6663).
>>>
>>> I generally agree with the consensus in the Trino issue to keep UUID in
>>> Iceberg. To summarize…
>>>
>>> - It makes sense to keep the type now that row identifiers are supported
>>> - Some engines (Trino) have support for the UUID type
>>> - Engines w/o support for UUID type can determine how to map
>>>
>>> Does anyone want to remove the type? If so, why?
>>
>>

Re: [DISCUSS] UUID type

Posted by Yan Yan <yy...@gmail.com>.

One conversation I previously came across regarding UUID deprecation was in
https://github.com/apache/iceberg/pull/1611

Thanks,
Yan

On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Joshua,
>
> I do not have a strong preference about the UUID type, but I would like
> to highlight that the type is handled inconsistently in Iceberg across
> different file formats. (See:
> https://github.com/apache/iceberg/issues/1881)
>
> If we keep the type, it would be good to standardize the handling in every
> file format.
>
> Thanks, Peter
>
> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com> wrote:
>
>> Hi.
>>
>> UUID is a current data type according to the Iceberg spec (
>> https://iceberg.apache.org/spec/#primitive-types), but there seems to
>> have been some discussion about removing it? I could not find the original
>> discussion, but a reference to the discussion can be found here (
>> https://github.com/trinodb/trino/issues/6663).
>>
>> I generally agree with the consensus in the Trino issue to keep UUID in
>> Iceberg. To summarize…
>>
>> - It makes sense to keep the type now that row identifiers are supported
>> - Some engines (Trino) have support for the UUID type
>> - Engines w/o support for UUID type can determine how to map
>>
>> Does anyone want to remove the type? If so, why?
>
>

Re: [DISCUSS] UUID type

Posted by Peter Vary <pv...@cloudera.com.INVALID>.

Hi Joshua,

I do not have a strong preference about the UUID type, but I would like to
highlight that the type is handled inconsistently in Iceberg across
different file formats. (See: https://github.com/apache/iceberg/issues/1881)

If we keep the type, it would be good to standardize the handling in every
file format.

Thanks, Peter

On Tue, 27 Jul 2021, 17:08 Joshua Howard, <jo...@gmail.com> wrote:

> Hi.
>
> UUID is a current data type according to the Iceberg spec (
> https://iceberg.apache.org/spec/#primitive-types), but there seems to
> have been some discussion about removing it? I could not find the original
> discussion, but a reference to the discussion can be found here (
> https://github.com/trinodb/trino/issues/6663).
>
> I generally agree with the consensus in the Trino issue to keep UUID in
> Iceberg. To summarize…
>
> - It makes sense to keep the type now that row identifiers are supported
> - Some engines (Trino) have support for the UUID type
> - Engines w/o support for UUID type can determine how to map
>
> Does anyone want to remove the type? If so, why?