Posted to dev@avro.apache.org by Fokko Driesprong <fo...@apache.org> on 2023/12/22 08:49:10 UTC

[DISCUSS] Extending the UUID logical type with Fixed[16]

Hey everyone,

For Iceberg we're using UUIDs in Avro and we're storing them as binary,
rather than a string. This has several advantages such as more compact
storage, more efficient reading, and more efficient skipping. For more
details, please check out the doc that I've created
<https://docs.google.com/document/d/16_oSWrEM7AFUCTe0uuraAEHxywezLfoEz5ahzwvhGUk/edit#heading=h.43xuauwfk7ow>
(and feel free to comment). I also created AVRO-3918
<https://issues.apache.org/jira/browse/AVRO-3918> on Jira to track this.
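
As a rough illustration of the "more compact storage" point (a sketch only, not taken from the linked doc; the sample UUID is arbitrary), the two encodings can be compared with Python's standard uuid module:

```python
import uuid

# Arbitrary sample value, used only for illustration.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

as_text = str(u).encode("utf-8")  # canonical 36-character hex-and-dash form
as_binary = u.bytes               # 16 raw bytes, big-endian

# An Avro string is written as a zig-zag varint length followed by the UTF-8
# bytes; a length of 36 fits in a single varint byte.
string_encoded_size = 1 + len(as_text)

# An Avro fixed carries no length prefix because the size lives in the schema.
fixed_encoded_size = len(as_binary)

print(string_encoded_size, fixed_encoded_size)
```

The constant size of the fixed form is also what makes skipping cheap: a reader can advance by 16 bytes without decoding a length first.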

Looking forward to hearing from y'all!

Kind regards and happy holidays,

Fokko Driesprong

Re: [DISCUSS] Extending the UUID logical type with Fixed[16]

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hey everyone,

Happy New Year! Best wishes for 2024 for you and your family.

I went ahead and created a PR for the spec change:
https://github.com/apache/avro/pull/2672 Let me know if there are any
questions or concerns.

Kind regards,
Fokko

On Fri, Dec 22, 2023 at 14:52, Fokko Driesprong <fo...@apache.org> wrote:

> Hi Martin and Scott,
>
> Thanks for the question, and that's a good one. I would suggest:
>
> {
>   "type": "fixed",
>   "size": 16,
>   "logicalType": "uuid"
> }
>
> This is in line with the other logicalTypes. For example with date:
>
> {
>   "type": "int",
>   "logicalType": "date"
> }
>
> If a reader doesn't support the date logical type, it can still read the
> underlying int (days since the Unix epoch).
>
> I've added a schema example to the Google doc and created a PR
> <https://github.com/apache/avro/pull/2646/> to clarify the current
> situation.
>
> I am curious about what you guys think of the proposed JSON-type
> representation.
>
> Kind regards,
> Fokko

Re: [DISCUSS] Extending the UUID logical type with Fixed[16]

Posted by Fokko Driesprong <fo...@apache.org>.
Hi Martin and Scott,

Thanks for the question, and that's a good one. I would suggest:

{
  "type": "fixed",
  "size": 16,
  "logicalType": "uuid"
}

This is in line with the other logicalTypes. For example with date:

{
  "type": "int",
  "logicalType": "date"
}

If a reader doesn't support the date logical type, it can still read the
underlying int (days since the Unix epoch).
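
The same fallback would apply to the proposed fixed-based UUID. Below is a minimal Python sketch of that behavior, under stated assumptions: `decode_fixed` and `known_logical_types` are hypothetical names for illustration and not part of any Avro library, and the schema adds a "name" field (here the made-up "uuid_fixed") because Avro requires fixed types to be named.

```python
import uuid

# Proposed schema from the thread, plus a hypothetical "name" because
# Avro fixed types must be named.
UUID_FIXED_SCHEMA = {
    "type": "fixed",
    "name": "uuid_fixed",
    "size": 16,
    "logicalType": "uuid",
}


def decode_fixed(raw: bytes, schema: dict, known_logical_types: set):
    """Hypothetical reader step: apply the logical type when the reader
    knows it, otherwise hand back the underlying fixed bytes unchanged."""
    if len(raw) != schema["size"]:
        raise ValueError(f"expected {schema['size']} bytes, got {len(raw)}")
    if schema.get("logicalType") == "uuid" and "uuid" in known_logical_types:
        return uuid.UUID(bytes=raw)
    return raw


sample = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary value
with_support = decode_fixed(sample.bytes, UUID_FIXED_SCHEMA, {"uuid"})
without_support = decode_fixed(sample.bytes, UUID_FIXED_SCHEMA, set())
```

A reader that knows the logical type gets a UUID object back; one that does not still gets the 16 raw bytes, just as a reader without date support still gets the int.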

I've added a schema example to the Google doc and created a PR
<https://github.com/apache/avro/pull/2646/> to clarify the current
situation.

I am curious about what you guys think of the proposed JSON-type
representation.

Kind regards,
Fokko


On Fri, Dec 22, 2023 at 14:25, Scott Belden <sc...@gmail.com> wrote:

> I think you'd have to go with something like one of the first two options
> (something in the schema) rather than some flag in a library. The problem
> with a flag in a library is that someone who has an Avro file they want to
> deserialize might not know whether it was encoded with UUIDs as bytes or
> as strings. They'd be left guessing one and retrying with the other if the
> first failed, which would not be a pleasant experience.
>
> -Scott

Re: [DISCUSS] Extending the UUID logical type with Fixed[16]

Posted by Scott Belden <sc...@gmail.com>.
I think you'd have to go with something like one of the first two options
(something in the schema) rather than some flag in a library. The problem
with a flag in a library is that someone who has an Avro file they want to
deserialize might not know whether it was encoded with UUIDs as bytes or
as strings. They'd be left guessing one and retrying with the other if the
first failed, which would not be a pleasant experience.
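
To illustrate why keeping the choice in the schema avoids the guessing, here is a minimal Python sketch. It is an assumption-laden illustration: `decoder_for` is a hypothetical helper rather than an Avro API, and it assumes the value has already been read as either a str or 16 raw bytes.

```python
import uuid
from typing import Callable, Union


def decoder_for(schema: dict) -> Callable[[Union[str, bytes]], uuid.UUID]:
    """Hypothetical helper: pick the UUID decoder from the writer schema
    stored with the data, so the reader never has to guess the encoding."""
    if schema.get("logicalType") != "uuid":
        raise ValueError("schema does not declare the uuid logical type")
    if schema["type"] == "string":
        return lambda value: uuid.UUID(value)        # 36-character text form
    if schema["type"] == "fixed" and schema.get("size") == 16:
        return lambda value: uuid.UUID(bytes=value)  # 16 raw bytes
    raise ValueError(f"unsupported underlying type: {schema['type']!r}")


sample = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")  # arbitrary value

from_string = decoder_for({"type": "string", "logicalType": "uuid"})(str(sample))
from_fixed = decoder_for(
    {"type": "fixed", "name": "uuid_fixed", "size": 16, "logicalType": "uuid"}
)(sample.bytes)
```

Because the decision is driven by the schema that travels with the data, both encodings decode to the same UUID without any trial and error.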

-Scott

On Fri, Dec 22, 2023 at 5:00 AM Martin Grigorov <mg...@apache.org>
wrote:

> Hi,
>
> How would the application tell Avro which storage type to use, string or
> bytes?
> - A new logical type? e.g. "logicalType": "uuid-bytes"
> - An extra attribute? e.g. { ..., "logicalType": "uuid", "storage-type":
> "bytes" }
> - A global switch that tells the library to always use "string" or "bytes"
> for all UUIDs?
> - ...
>
> Martin

Re: [DISCUSS] Extending the UUID logical type with Fixed[16]

Posted by Martin Grigorov <mg...@apache.org>.
Hi,

How would the application tell Avro which storage type to use, string or
bytes?
- A new logical type? e.g. "logicalType": "uuid-bytes"
- An extra attribute? e.g. { ..., "logicalType": "uuid", "storage-type":
"bytes" }
- A global switch that tells the library to always use "string" or "bytes"
for all UUIDs?
- ...

Martin

On Fri, Dec 22, 2023 at 10:49 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey everyone,
>
> For Iceberg we're using UUIDs in Avro and we're storing them as binary,
> rather than a string. This has several advantages such as more compact
> storage, more efficient reading, and more efficient skipping. For more
> details, please check out the doc that I've created
> <https://docs.google.com/document/d/16_oSWrEM7AFUCTe0uuraAEHxywezLfoEz5ahzwvhGUk/edit#heading=h.43xuauwfk7ow>
> (and feel free to comment). Also created AVRO-3918
> <https://issues.apache.org/jira/browse/AVRO-3918> on Jira to track this.
>
> Looking forward to hearing from y'all!
>
> Kind regards and happy holidays,
>
> Fokko Driesprong
>