Posted to dev@iceberg.apache.org by Jacques Nadeau <ja...@gmail.com> on 2021/09/08 15:20:41 UTC

A new project focused on serialized algebra

Hey all,

For some time I've been thinking that having a common serialized
representation of query plans would be helpful across multiple related
projects. I started working on something independently in this vein several
months ago. Since then, Arrow has started exploring "Arrow IR" and in
Iceberg, Piotr and others were proposing something similar to support a
cross-engine structured view. Given the different veins of interest, I
think we should combine forces on a consolidated consensus-driven solution.

As I've had more conversations with different people, I've come to the
conclusion that given the complexity of the task and people's
competing priorities, a separate "Switzerland" project is the best way to
find common ground. As such, I've started to sketch out a specification [1]
called Substrait. I'd love to collaborate with the Iceberg community to
ensure the specification does a good job of supporting the needs of this
project.

For those that are interested, please join Slack and/or start a discussion
on GitHub. My first goal is to come to consensus on the type system of
simple [2], compound [3] and physical [4] types. The general approach I'm
trying to follow is:

   - Use Spark, Trino, Arrow and Iceberg as the four indicators of whether
   something should be first class. It must exist in at least two systems to
   be formalized.
   - Avoid a formal distinction between logical and physical (types,
   operators, etc.)
   - Lean more towards simple types than compound types when systems
   generally use only a constrained set of parameters (e.g. timestamp(3) and
   timestamp(6) as opposed to timestamp(x)).
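
As a rough illustration of the first and third rules above, here is a
minimal Python sketch. It is hypothetical: the function and constant names
are made up for this example and are not part of the Substrait
specification, and the inputs are placeholders rather than real claims
about what any system supports.

```python
# Hypothetical sketch; none of these names come from the Substrait spec.

# The four systems used as indicators in the first rule.
REFERENCE_SYSTEMS = frozenset({"Spark", "Trino", "Arrow", "Iceberg"})

# A feature must exist in at least this many reference systems.
MIN_SYSTEMS = 2


def is_first_class(supporting_systems):
    """Return True if enough reference systems support the feature."""
    supported = REFERENCE_SYSTEMS & set(supporting_systems)
    return len(supported) >= MIN_SYSTEMS


# Third rule: list the commonly used precisions as separate simple types
# instead of one compound type parameterized by an arbitrary precision x.
SIMPLE_TIMESTAMP_TYPES = ("timestamp(3)", "timestamp(6)")


def timestamp_type_name(precision):
    """Map a precision to one of the enumerated simple timestamp types."""
    name = f"timestamp({precision})"
    if name not in SIMPLE_TIMESTAMP_TYPES:
        raise ValueError(f"no simple type for precision {precision}")
    return name
```

Under this sketch, a feature found in only one of the four reference
systems would not be formalized, and a precision outside the enumerated
set is rejected rather than handled through a parameterized type.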


Links for Substrait:
Site: https://substrait.io
Spec source: https://github.com/substrait-io/substrait/tree/main/site/docs
Binary format: https://github.com/substrait-io/substrait/tree/main/binary

Please let me know your thoughts,
Jacques

[1] https://substrait.io/spec/specification/#components
[2] https://substrait.io/types/simple_logical_types/
[3] https://substrait.io/types/compound_logical_types/
[4] https://substrait.io/types/physical_types/

Re: A new project focused on serialized algebra

Posted by Jacques Nadeau <ja...@gmail.com>.
There are also some good conversations happening on the GitHub
discussion forum [1].

We're trying to drive consensus around the type system to start [2]. We'd
love for Iceberg community members to weigh in, as the contributors are
fairly Arrow-heavy at the moment.

Thanks,
Jacques


[1] https://github.com/substrait-io/substrait/discussions
[2] https://github.com/substrait-io/substrait/discussions/2


On Fri, Sep 10, 2021 at 9:19 AM Ryan Blue <bl...@tabular.io> wrote:

> Never mind, I see there's a Substrait Slack community. Here's the invite
> link for anyone else that's interested:
> https://join.slack.com/t/substrait/shared_invite/zt-vivbux2c-~B1jEWcR0wYhq5k4LHuoLQ

Re: A new project focused on serialized algebra

Posted by Ryan Blue <bl...@tabular.io>.
Never mind, I see there's a Substrait Slack community. Here's the invite
link for anyone else that's interested:
https://join.slack.com/t/substrait/shared_invite/zt-vivbux2c-~B1jEWcR0wYhq5k4LHuoLQ

On Fri, Sep 10, 2021 at 9:16 AM Ryan Blue <bl...@tabular.io> wrote:

> Thanks, Jacques! I think it's a great idea to have this as an external
> project so that it doesn't get tied to a particular set of goals for an
> existing project.
>
> Where is a good place to discuss this? Should we create a #substrait room
> on Iceberg Slack? ASF Slack? On this thread?
>
> Ryan


-- 
Ryan Blue
Tabular

Re: A new project focused on serialized algebra

Posted by Ryan Blue <bl...@tabular.io>.
Thanks, Jacques! I think it's a great idea to have this as an external
project so that it doesn't get tied to a particular set of goals for an
existing project.

Where is a good place to discuss this? Should we create a #substrait room
on Iceberg Slack? ASF Slack? On this thread?

Ryan

-- 
Ryan Blue
Tabular