Posted to dev@iceberg.apache.org by Jack Ye <ye...@gmail.com> on 2021/09/08 04:31:41 UTC

Re: Proposal: Support for views in Iceberg

Hi everyone,

I have been thinking about the view support over the weekend, and I
realized there is a conflict: Trino today already claims to support
Iceberg views through the Hive metastore.

I believe we need to figure out a path forward on this issue before
voting to pass the current proposal, to avoid confusion for end users. I
have summarized the issue, along with a few potential solutions, here:

https://docs.google.com/document/d/1uupI7JJHEZIkHufo7sU4Enpwgg-ODCVBE6ocFUVD9oQ/edit?usp=sharing

Please let me know what you think.

Best,
Jack Ye

On Thu, Aug 26, 2021 at 3:29 PM Phillip Cloud <cp...@gmail.com> wrote:

> On Thu, Aug 26, 2021 at 6:07 PM Jacques Nadeau <ja...@gmail.com>
> wrote:
>
>>
>> On Thu, Aug 26, 2021 at 2:44 PM Ryan Blue <bl...@tabular.io> wrote:
>>
>>> Would a physical plan be portable for the purpose of an engine-agnostic
>>> view?
>>>
>>
>> My goal is that it would be. There may be optional "hints" that a particular
>> engine could leverage and others wouldn't, but I think the goal should be
>> that the IR is entirely engine-agnostic. Even in the Arrow project proper,
>> there are really two independent heavy-weight engines that have their own
>> capabilities and trajectories (c++ vs rust).
>>
>>
>>> Physical plan details seem specific to an engine to me, but maybe I'm
>>> thinking too much about how Spark is implemented. My inclination would be
>>> to accept only logical IR, which could just mean accepting a subset of the
>>> standard.
>>>
>>
>> I think it is very likely that different consumers will only support a
>> subset of plans. That being said, I'm not sure what you're specifically
>> trying to mitigate or avoid. I'd be inclined to simply allow the full
>> breadth of IR within Iceberg. If it is well specified, an engine can either
>> choose to execute or not (same as the proposal wrt SQL syntax, or if a
>> function is missing on an engine). The engine may even have internal
>> rewrites if it likes doing things a different way than what is requested.
>>
>
> I also believe that consumers will not be expected to support all plans.
> It will depend on the consumer, but many of the instantiations of Read/Write
> relations won't be executable for many consumers, for example.
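
As a rough illustration of the subset-support point above, here is a minimal
sketch (hypothetical relation types, not the actual Arrow IR API) of a consumer
that checks a plan up front and rejects anything it cannot execute, instead of
failing halfway through:

    import java.util.List;
    import java.util.Set;

    interface Relation {
        List<Relation> children();
    }

    record Scan(String table) implements Relation {
        public List<Relation> children() { return List.of(); }
    }

    record Filter(Relation input, String predicate) implements Relation {
        public List<Relation> children() { return List.of(input); }
    }

    record WriteRel(Relation input, String target) implements Relation {
        public List<Relation> children() { return List.of(input); }
    }

    class PlanValidator {
        // The subset of relations this particular consumer knows how to run.
        private static final Set<Class<?>> SUPPORTED = Set.of(Scan.class, Filter.class);

        static void requireSupported(Relation rel) {
            if (!SUPPORTED.contains(rel.getClass())) {
                throw new UnsupportedOperationException(
                    "This engine cannot execute relation: " + rel.getClass().getSimpleName());
            }
            // Recurse so the whole plan is vetted before any execution starts.
            rel.children().forEach(PlanValidator::requireSupported);
        }
    }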
>
>
>>
>>
>>> The document that Micah linked to is interesting, but I'm not sure that
>>> our goals are aligned.
>>>
>>
>> I think there is much commonality here and I'd argue it would be best to
>> really try to see if a unified set of goals works well. I think Arrow IR is
>> young enough that it can still be shaped/adapted. It may be that there
>> should be some give or take on each side. It's possible that the goals are
>> too far apart to unify but my gut is that they are close enough that we
>> should try since it would be a great force multiplier.
>>
>>
>>> For one thing, it seems to make assumptions about the IR being used for
>>> Arrow data (at least in Wes' proposal), when I think that it may be easier
>>> to be agnostic to vectorization.
>>>
>>
>> Other than using the Arrow schema/types, I'm not at all convinced that
>> the IR should be Arrow centric. I've actually argued to some that Arrow IR
>> should be independent of Arrow to be its best self. Let's try to review it
>> and see if/where we can avoid a tight coupling between plans and
>> Arrow-specific concepts.
>>
>
> Just to echo Jacques's comments here, the only thing that is Arrow
> specific right now is the use of its type system. Literals, for example,
> are encoded entirely in flatbuffers.
>
> Would love feedback on the current PR [1]. I'm looking to merge the first
> iteration soonish, so please review at your earliest convenience.
>
>
>>
>>
>>> It also delegates forward/backward compatibility to flatbuffers, when I
>>> think compatibility should be part of the semantics and not delegated to
>>> serialization. For example, if I have Join("inner", a.id, b.id) and I
>>> evolve that to allow additional predicates Join("inner", a.id, b.id,
>>> a.x < b.y) then just because I can deserialize it doesn't mean it is
>>> compatible.
>>>
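
To make the Join example above concrete, here is a minimal sketch (hypothetical
types, not the Arrow IR classes) of how a plan can stay deserializable after
evolution while silently changing meaning for an older consumer:

    // v1 of the node: equi-join only.
    record JoinV1(String joinType, String leftKey, String rightKey) {}

    // v2 adds an extra non-equi predicate. At the serialization layer (e.g. a new
    // optional field) this is a "compatible" change: v1 readers still parse it.
    record JoinV2(String joinType, String leftKey, String rightKey, String extraPredicate) {}

    class CompatibilitySketch {
        public static void main(String[] args) {
            JoinV2 written = new JoinV2("inner", "a.id", "b.id", "a.x < b.y");

            // A v1 consumer only understands the first three fields, so it silently
            // drops a.x < b.y and runs a plain equi-join -- deserializable, but
            // producing different rows than the plan's author intended.
            JoinV1 asExecutedByOldEngine =
                new JoinV1(written.joinType(), written.leftKey(), written.rightKey());
            System.out.println("v1 engine would run: " + asExecutedByOldEngine);
        }
    }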
>>
>> I don't think that flatbuffers alone can solve all compatibility
>> problems. It can solve some and I'd expect that implementation libraries
>> will have to solve others. Would love to hear if others disagree (and think
>> flatbuffers can solve everything wrt compatibility).
>>
>
> I agree, I think you need both to achieve sane versioning. The version
> needs to be shipped along with the IR, and libraries need to be able to deal
> with the different versions. I could be wrong, but I think it probably
> makes more sense to start versioning the IR once the dust has settled a bit.
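
As a rough sketch of shipping the version with the IR (assuming a hypothetical
envelope format, nothing specified yet), a reader can refuse versions it does
not understand rather than guessing at their semantics:

    // Hypothetical envelope: the IR version travels with the serialized plan.
    record SerializedPlan(int irVersion, byte[] payload) {}

    class PlanReader {
        private static final int MAX_SUPPORTED_IR_VERSION = 1;

        static byte[] read(SerializedPlan plan) {
            if (plan.irVersion() > MAX_SUPPORTED_IR_VERSION) {
                throw new IllegalArgumentException(
                    "Plan uses IR version " + plan.irVersion()
                        + " but this library only supports up to " + MAX_SUPPORTED_IR_VERSION);
            }
            // Per-version decoding logic for supported versions would go here.
            return plan.payload();
        }
    }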
>
>
>>
>> J
>>
>
> [1]: https://github.com/apache/arrow/pull/10934
>

Re: Proposal: Support for views in Iceberg

Posted by Anjali Norwood <an...@netflix.com.INVALID>.
Hi All,

Please see the spec in markdown format at the PR here, to facilitate
adding and responding to comments: https://github.com/apache/iceberg/pull/3188
Please review.

thanks,
Anjali
