Posted to dev@iceberg.apache.org by Walaa Eldin Moustafa <wa...@gmail.com> on 2022/03/09 17:45:13 UTC

Re: Hive table compatibility for Iceberg readers

The union type conversion PR is up:
https://github.com/apache/iceberg/pull/4242.

Thanks,
Walaa.


On Fri, Feb 11, 2022 at 8:53 AM Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> Thanks, Ryan! Yes, there is an active discussion on the PR about the spec
> aspect.
>
> On Fri, Feb 11, 2022 at 8:47 AM Ryan Blue <bl...@tabular.io> wrote:
>
>> Sounds great. Thanks for the update! That PR is on my list to take a look
>> at, but I still recommend starting with the spec changes. For example, how
>> should default values be stored in Iceberg metadata for each type?
>> Currently, the spec changes just mention defaults without going into detail
>> about how they are tracked and what rules there are about them.
>>
>> On Wed, Feb 9, 2022 at 6:32 PM Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com> wrote:
>>
>>> Thanks Ryan and Owen! Glad we have converged on this. Next steps for us:
>>>
>>> * Continuing the discussion on the default value PR (already ongoing
>>> [1]).
>>> * Filing the union type conversion PR (ETA end of next week).
>>> * Moving the listing-based Hive table scan using Iceberg to a separate repo
>>> (likely open source). For this, I expect to introduce some extension points
>>> to Iceberg, such as making some classes part of the SPI. I hope that the
>>> community is okay with that.
>>>
>>> By the way, Owen and I synced on the Hive casing behavior, and it is a
>>> bit more involved: Hive lowercases all field names (including nested
>>> fields) in the Avro case, but for other formats it lowercases only
>>> top-level field names and preserves the casing of inner fields (we
>>> experimented with ORC and Text). Hope this clarifies the confusion.
>>>
>>> [1] https://github.com/apache/iceberg/pull/2496
>>>
>>> Thanks,
>>> Walaa.
>>>
>>>
>>>
>>> On Wed, Feb 2, 2022 at 2:40 PM Ryan Blue <bl...@tabular.io> wrote:
>>>
>>>> Walaa, thanks for this list. I think most of these are definitely
>>>> useful. I think the best one to focus on first is the default values, since
>>>> those will make Iceberg tables behave more like standard SQL tables, which
>>>> is the goal.
>>>>
>>>> I'm really curious to learn more about #1, but I don't think that I
>>>> have enough detail to know whether it is something that fits in the Iceberg
>>>> project. At Netflix, we had an alternative implementation of Hive and Spark
>>>> tables (Spark tables are slightly different) that we similarly used. But we
>>>> didn't write to both at the same time.
>>>>
>>>> For the others, I'm interested in hearing what other people in the
>>>> community find valuable. I don't think I would use #2 or #3, for example.
>>>> That's because we already support a flag for case-insensitive column
>>>> resolution that is well supported throughout Iceberg. If you wanted to use
>>>> alternative names, then I'd probably recommend just turning that on...
>>>> although that may not be an option depending on how you're working with a
>>>> table. It would work in Spark, though. This may be a better feature for
>>>> your system that is built on Iceberg.
>>>>
>>>> Reading unions as structs has come up a couple of times, so it seems that
>>>> people will want it. I think someone attempted to add this support in the
>>>> past, but ran into issues because the spec is clear that these are NOT
>>>> Iceberg files. There is no guarantee that other implementations will read
>>>> them and Iceberg cannot write them in this form. I'm fairly confident that
>>>> not allowing unions to be written is a good choice, but I would support
>>>> being able to read them.
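[Editor's note: one common way to surface a union value as a struct is a tag plus one nullable field per branch. The sketch below illustrates that encoding only; the actual Iceberg mapping is defined by the spec discussion and the union-conversion PR, not by this snippet.]

```python
def union_as_struct(branch_index, value, num_branches):
    # Tagged-struct encoding of a union value: `tag` records which branch is
    # set, and exactly one fieldN is non-null. This mirrors one common mapping
    # for reading Avro-style unions as structs (illustrative only).
    struct = {"tag": branch_index}
    for i in range(num_branches):
        struct[f"field{i}"] = value if i == branch_index else None
    return struct

# A union of [int, string] holding the string "hello" (branch 1):
print(union_as_struct(1, "hello", 2))  # {'tag': 1, 'field0': None, 'field1': 'hello'}
```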
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Jan 31, 2022 at 4:32 PM Owen O'Malley <ow...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 27, 2022 at 10:26 PM Walaa Eldin Moustafa <
>>>>> wa.moustafa@gmail.com> wrote:
>>>>>
>>>>>> *2. Iceberg schema lower casing:* Before Iceberg, when users read
>>>>>> Hive tables from Spark, the returned schema is lowercase, since Hive stores
>>>>>> all metadata in lowercase. If users move to Iceberg, such readers
>>>>>> could break once Iceberg returns a properly cased schema. This feature adds
>>>>>> lowercasing for backward compatibility with existing scripts. It is
>>>>>> offered as an option and is not enabled by default.
>>>>>>
>>>>>
>>>>> This isn't quite correct. Hive lowercases top-level columns. It does
>>>>> not lowercase field names inside structs.
>>>>>
>>>>>
>>>>>> *3. Hive table proper casing:* Conversely, we leverage the Avro
>>>>>> schema to supplement the lowercase Hive schema when reading Hive tables.
>>>>>> This is useful if someone wants to get properly cased schemas while
>>>>>> still in Hive mode (to be forward-compatible with Iceberg). The same
>>>>>> flag used in (2) is used here.
>>>>>>
>>>>>
>>>>> Are there users of Avro schemas in Hive outside of LinkedIn? I've
>>>>> never seen it used. I don't think you should tie #2 and #3 together.
>>>>>
>>>>> Supporting default values and union types are useful extensions.
>>>>>
>>>>> .. Owen
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>