Posted to dev@beam.apache.org by Andrew Pilloud <ap...@google.com> on 2018/06/11 18:39:04 UTC

Building and visualizing the Beam SQL graph

We are currently converting the Calcite Rel tree to Beam by recursively
building a tree of nested PTransforms. This results in a weird nested graph
in the Dataflow UI, where each node contains its inputs nested inside of it.
I'm going to change the internal data structure used when converting the
tree from a PTransform to a PCollection, which will give a more accurate
representation of the tree structure being built and should simplify the
code as well. This will not change the public interface to SQL, which will
remain a PTransform. Any thoughts or objections?

I was also wondering if there are tools for visualizing the Beam graph
aside from the dataflow runner UI. What other tools exist?

Andrew

Re: Building and visualizing the Beam SQL graph

Posted by Huygaa Batsaikhan <ba...@google.com>.
I was also wondering the same thing. I don't think there is any
visualization tool for Beam. :(

On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com> wrote:

> We are currently converting the Calcite Rel tree to Beam by recursively
> building a tree of nested PTransforms. This results in a weird nested graph
> in the dataflow UI where each node contains its inputs nested inside of it.
> I'm going to change the internal data structure for converting the tree
> from a PTransform to a PCollection, which will result in a more accurate
> representation of the tree structure being built and should simplify the
> code as well. This will not change the public interface to SQL, which will
> remain a PTransform. Any thoughts or objections?
>
> I was also wondering if there are tools for visualizing the Beam graph
> aside from the dataflow runner UI. What other tools exist?
>
> Andrew
>

Re: Building and visualizing the Beam SQL graph

Posted by Andrew Pilloud <ap...@google.com>.
Not quite a revert: we still want to keep the actual transformation inside
a PTransform, but the input of that PTransform will be different for each
node type (joins have multiple inputs, for example). We have this function
as our builder right now:

PTransform<PCollectionTuple, PCollection<Row>> toPTransform();

When I'm done we will have this:

PCollection<Row> toPCollection(Pipeline pipeline);
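
To make the difference between the two shapes concrete, here is a
self-contained toy sketch (hypothetical classes, not Beam's actual API):
with the old shape, each node expands its inputs inside its own composite,
so child steps nest under the parent; with the new shape, inputs are
converted first as siblings and each rel node becomes one step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RelTreeSketch {
    // Stand-in for a Calcite rel node with child inputs.
    static class RelNode {
        final String name;
        final List<RelNode> inputs;
        RelNode(String name, RelNode... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    // Old shape: converting a node expands its inputs *inside* the node's
    // composite, so child steps show up nested under the parent in the UI.
    static void expandNested(RelNode node, String prefix, List<String> steps) {
        String path = prefix.isEmpty() ? node.name : prefix + "/" + node.name;
        for (RelNode in : node.inputs) {
            expandNested(in, path, steps);
        }
        steps.add(path);
    }

    // New shape: each node's inputs are converted first as siblings, then
    // the node applies its own transform, so every rel node is one step.
    static void expandFlat(RelNode node, List<String> steps) {
        for (RelNode in : node.inputs) {
            expandFlat(in, steps);
        }
        steps.add(node.name);
    }

    public static void main(String[] args) {
        RelNode tree =
            new RelNode("Project", new RelNode("Filter", new RelNode("Scan")));
        List<String> nested = new ArrayList<>();
        List<String> flat = new ArrayList<>();
        expandNested(tree, "", nested);
        expandFlat(tree, flat);
        System.out.println(nested); // [Project/Filter/Scan, Project/Filter, Project]
        System.out.println(flat);   // [Scan, Filter, Project]
    }
}
```

The flat expansion is what the proposed toPCollection recursion produces:
one graph node per rel node, rather than one composite wrapping its whole
subtree.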

Andrew

On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:

> Not answering the original question, but doesn't "explain" satisfy the SQL
> use case?
>
> Going forward we probably want to solve this in a more general way. We
> have at least 3 ways to represent the pipeline:
>  - how the runner executes it;
>  - what it looks like when it is constructed;
>  - what the user described in the DSL;
> And there will probably be more, if extra layers are built on top of DSLs.
>
> If possible, we probably should be able to map any level of abstraction to
> any other to better understand and debug the pipelines.
>
>
> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com> wrote:
>
>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>> at least in spirit? I agree :-)
>>
>> Kenn

Re: Building and visualizing the Beam SQL graph

Posted by Anton Kedin <ke...@google.com>.
From the visualization perspective, I really loved the interactive runner
demo where it shows the graph:
https://www.youtube.com/watch?v=c5CjA1e3Cqw&t=27s

On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles <kl...@google.com> wrote:

> Another thing to consider is that we might return something like a
> "SqlPCollection" that is the PCollection<Row> plus additional metadata that
> is useful to the shell / enumerable converter (such as whether the PCollection
> has a known finite size due to LIMIT, even if it is "unbounded", and the
> shell can return control to the user once it receives enough rows). After
> your proposed change this will be much more natural to do, so that's
> another point in favor of the refactor.
>
> Kenn
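
"SqlPCollection" is only a proposal in this thread; a minimal sketch of what
such a wrapper might carry, using toy stand-in types rather than Beam's
actual classes:

```java
// Hypothetical sketch of the proposed "SqlPCollection": the query result
// plus metadata the SQL shell / enumerable converter can use. All types
// here are toy stand-ins, not Beam's.
public class SqlPCollectionSketch {
    // Stand-in for Beam's PCollection<Row>.
    static class PCollectionOfRows {}

    static class SqlPCollection {
        final PCollectionOfRows rows;
        // True when a LIMIT bounds the result even though the source is
        // "unbounded", so the shell can stop reading after `limit` rows.
        final boolean finiteDespiteUnbounded;
        final long limit;

        SqlPCollection(PCollectionOfRows rows,
                       boolean finiteDespiteUnbounded,
                       long limit) {
            this.rows = rows;
            this.finiteDespiteUnbounded = finiteDespiteUnbounded;
            this.limit = limit;
        }

        // How many rows the shell should wait for before returning control:
        // the LIMIT if one is known, otherwise -1 meaning "keep reading".
        long rowsToAwait() {
            return finiteDespiteUnbounded ? limit : -1;
        }
    }

    public static void main(String[] args) {
        SqlPCollection limited =
            new SqlPCollection(new PCollectionOfRows(), true, 10);
        System.out.println(limited.rowsToAwait()); // 10
    }
}
```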
>
> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <ap...@google.com>
> wrote:
>
>> One of my goals is to make the graph easier to read and map back to the
>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
>> `toPCollection`) does make a big difference in that graph. I think it is
>> also important to have a common function to do the apply with consistent
>> naming. I think that will greatly help with ease of understanding. It
>> sounds like what we really want is this in the BeamRelNode interface:
>>
>> PInput buildPInput(Pipeline pipeline);
>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>
>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>     return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>> }
>>
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <mi...@gmail.com> wrote:
>>
>>> EXPLAIN shows the execution plan from the SQL perspective only. After
>>> converting to a Beam composite PTransform, there are more steps underneath,
>>> and each runner reorganizes the Beam PTransforms again, which makes the
>>> final pipeline hard to read. Within the SQL module itself, I don't see any
>>> difference between `toPTransform` and `toPCollection`. We could have an
>>> easy-to-understand step name when converting RelNodes, but it is the
>>> runners that show the graph to developers.
>>>
>>> Mingmin
>>>
>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <ap...@google.com>
>>> wrote:
>>>
>>>> That sounds correct. And because each rel node might have a different
>>>> input, there isn't a standard interface (like
>>>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform()).
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <kl...@google.com> wrote:
>>>>
>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>> there are some criteria in this case that might apply in other cases:
>>>>>
>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>> how it computes its result from its inputs.
>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>> rel node!
>>>>>
>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>
>>>>> Kenn
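
The buildPInput/buildPTransform split quoted above can be sketched
end-to-end with toy stand-ins for Pipeline and PCollection (hypothetical
code, not the actual Beam SQL implementation): each rel node supplies its
own input and its own transform, and a default method does the apply in one
common place.

```java
import java.util.function.Function;

public class BeamRelNodeSketch {
    // Toy stand-ins: a pipeline, and a graph node labeled by what built it.
    static class Pipeline {}
    static class Node {
        final String label;
        Node(String label) { this.label = label; }
    }

    interface BeamRelNode {
        // Node-specific input: a child's output, a tuple for joins, or the
        // bare pipeline for sources.
        Node buildPInput(Pipeline pipeline);
        // Node-specific logic, as a transform from input to rows.
        Function<Node, Node> buildPTransform();

        // Common glue: one place applies the transform (in real Beam this
        // is also where a consistent stage name would be attached).
        default Node toPCollection(Pipeline pipeline) {
            return buildPTransform().apply(buildPInput(pipeline));
        }
    }

    // A source rel: its input is the pipeline itself.
    static BeamRelNode scan(String table) {
        return new BeamRelNode() {
            public Node buildPInput(Pipeline p) { return new Node("pipeline"); }
            public Function<Node, Node> buildPTransform() {
                return in -> new Node("Scan(" + table + ")");
            }
        };
    }

    // A single-input rel: its input is its child's output PCollection.
    static BeamRelNode filter(BeamRelNode child) {
        return new BeamRelNode() {
            public Node buildPInput(Pipeline p) { return child.toPCollection(p); }
            public Function<Node, Node> buildPTransform() {
                return in -> new Node("Filter(" + in.label + ")");
            }
        };
    }

    public static void main(String[] args) {
        Node out = filter(scan("orders")).toPCollection(new Pipeline());
        System.out.println(out.label); // Filter(Scan(orders))
    }
}
```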

Re: Building and visualizing the Beam SQL graph

Posted by Andrew Pilloud <ap...@google.com>.
The name can't just be a constant string, as it needs to be unique for every
node in the graph. We currently build the name as the rel node class name
plus the unique ID from the Calcite graph (plus a unique ID from the Beam
graph, but that goes away with some bug fixes).
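
That naming scheme amounts to something like the following (a hypothetical
helper for illustration; the actual Beam SQL code may differ):

```java
// Sketch of the naming scheme described above: the rel node's class name
// plus the node's unique Calcite id, giving a name that is unique per
// graph node even when the same rel class appears several times.
public class StageNames {
    static String stageName(String relClassName, int calciteNodeId) {
        return relClassName + "_" + calciteNodeId;
    }

    public static void main(String[] args) {
        System.out.println(stageName("BeamCalcRel", 4));     // BeamCalcRel_4
        System.out.println(stageName("BeamIOSourceRel", 2)); // BeamIOSourceRel_2
    }
}
```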

I've written this up into a PR. https://github.com/apache/beam/pull/5673

Andrew

On Fri, Jun 15, 2018 at 7:51 AM Kenneth Knowles <kl...@google.com> wrote:


Re: Building and visualizing the Beam SQL graph

Posted by Kenneth Knowles <kl...@google.com>.
@Reuven: I think DSLs are better served by having their own wrappers than
by putting their data into generic attributes. They would need attributes
only if they had to put data in and get it back out, but usually the
DSL has a higher-level view and no need for Beam to propagate data on its
behalf; in fact, it is simpler to do it directly at the DSL level. That is
the case for SQL and LIMIT.

@Mingmin: Agree. The name on each node is the portable way to describe
transforms. It should be locally unique and the composite structure makes
them globally unique. Do all our runners use it to make their UIs pretty? I
don't know. It would be great to check on that and improve it.

@Andrew: Do we really want getStageName()? Can it just be a constant
string, with the composite structure giving context?

Kenn
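
The "locally unique, globally unique via composite structure" point can be
sketched in a few lines (illustrative only; runners derive full names from
the transform hierarchy):

```java
// Locally unique step names compose into globally unique full names:
// each composite prefixes its children's names with its own path,
// "/"-separated.
public class CompositeNames {
    static String fullName(String parentPath, String localName) {
        return parentPath.isEmpty() ? localName : parentPath + "/" + localName;
    }

    public static void main(String[] args) {
        String join = fullName("", "BeamJoinRel_7");
        // The local name "left" can repeat under different composites
        // without clashing, because the full path differs.
        System.out.println(fullName(join, "left"));            // BeamJoinRel_7/left
        System.out.println(fullName("BeamJoinRel_9", "left")); // BeamJoinRel_9/left
    }
}
```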

On Thu, Jun 14, 2018 at 11:12 AM Mingmin Xu <mi...@gmail.com> wrote:


Re: Building and visualizing the Beam SQL graph

Posted by Mingmin Xu <mi...@gmail.com>.
Is there a guideline about how the name provided in `PCollection.apply(
String name, PTransform<? super PCollection<Row>, PCollection<Row>> t)` is
adopted by different runners? I suppose that should be the goal: a readable
graph for all runners, instead of 'adjusting' it to make only the Dataflow
runner work.

On Thu, Jun 14, 2018 at 8:53 AM, Reuven Lax <re...@google.com> wrote:



-- 
----
Mingmin

Re: Building and visualizing the Beam SQL graph

Posted by Reuven Lax <re...@google.com>.
There was a previous discussion about having generic attributes on
PCollection. Maybe this is a good driving use case?


Re: Building and visualizing the Beam SQL graph

Posted by Kenneth Knowles <kl...@google.com>.
Another thing to consider is that we might return something like a
"SqlPCollection": the PCollection<Row> plus additional metadata that is
useful to the shell / enumerable converter (such as whether the PCollection
has a known finite size due to a LIMIT, even if it is "unbounded", so the
shell can return control to the user once it receives enough rows). After
your proposed change this will be much more natural to do, so that's
another point in favor of the refactor.

Kenn
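
[Editor's note: a toy sketch of the "SqlPCollection" idea above, using
stand-in types. The class and field names ("SqlPCollection",
"knownRowLimit") are illustrative assumptions, not actual Beam API; a real
implementation would wrap the SDK's PCollection<Row>.]

```java
import java.util.Optional;

public class SqlPCollectionSketch {
    // Stand-in for Beam's PCollection<Row>; real code would use the SDK type.
    static class PCollectionOfRows {}

    // Wraps the query result with metadata the shell / enumerable converter
    // could use; "knownRowLimit" is a hypothetical field, not Beam API.
    static class SqlPCollection {
        final PCollectionOfRows rows;
        final Optional<Long> knownRowLimit; // set when the query has a LIMIT n

        SqlPCollection(PCollectionOfRows rows, Optional<Long> knownRowLimit) {
            this.rows = rows;
            this.knownRowLimit = knownRowLimit;
        }

        // The shell can stop consuming once it has seen knownRowLimit rows,
        // even if the underlying PCollection is "unbounded".
        boolean shellCanStopAfter(long rowsSeen) {
            return knownRowLimit.map(limit -> rowsSeen >= limit).orElse(false);
        }
    }

    public static void main(String[] args) {
        SqlPCollection limited =
            new SqlPCollection(new PCollectionOfRows(), Optional.of(10L));
        SqlPCollection unbounded =
            new SqlPCollection(new PCollectionOfRows(), Optional.empty());
        System.out.println(limited.shellCanStopAfter(10));          // true
        System.out.println(unbounded.shellCanStopAfter(1_000_000)); // false
    }
}
```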

On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <ap...@google.com> wrote:

> One of my goals is to make the graph easier to read and map back to the
> SQL EXPLAIN output. The way the graph is currently built (`toPTransform` vs
> `toPCollection`) does make a big difference in that graph. I think it is
> also important to have a common function to do the apply with consistent
> naming. I think that will greatly help with ease of understanding. It
> sounds like what we really want is this in the BeamRelNode interface:
>
> PInput buildPInput(Pipeline pipeline);
> PTransform<PInput, PCollection<Row>> buildPTransform();
>
> default PCollection<Row> toPCollection(Pipeline pipeline) {
>     return buildPInput(pipeline).apply(getStageName(), buildPTransform());
> }
>
> Andrew
>
> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <mi...@gmail.com> wrote:
>
>> EXPLAIN shows the execution plan from the SQL perspective only. After
>> converting to a Beam composite PTransform there are more steps underneath,
>> and each Runner reorganizes the Beam PTransforms again, which makes the
>> final pipeline hard to read. In the SQL module itself, I don't see any
>> difference between `toPTransform` and `toPCollection`. We could have an
>> easy-to-understand step name when converting RelNodes, but it is the
>> Runners that show the graph to developers.
>>
>> Mingmin
>>
>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <ap...@google.com>
>> wrote:
>>
>>> That sounds correct. And because each rel node might have a different
>>> input there isn't a standard interface (like PTransform<PCollection<Row>,
>>> PCollection<Row>> toPTransform());
>>>
>>> Andrew
>>>
>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> Agree with that. It will be kind of tricky to generalize. I think there
>>>> are some criteria in this case that might apply in other cases:
>>>>
>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>> how it computes its result from its inputs.
>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>> rel node!
>>>>
>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:
>>>>
>>>>> Not answering the original question, but doesn't "explain" satisfy the
>>>>> SQL use case?
>>>>>
>>>>> Going forward we probably want to solve this in a more general way. We
>>>>> have at least 3 ways to represent the pipeline:
>>>>>  - how runner executes it;
>>>>>  - what it looks like when constructed;
>>>>>  - what the user was describing in DSL;
>>>>> And there will probably be more, if extra layers are built on top of
>>>>> DSLs.
>>>>>
>>>>> If possible, we probably should be able to map any level of
>>>>> abstraction to any other to better understand and debug the pipelines.
>>>>>
>>>>>
>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com>
>>>>> wrote:
>>>>>
>>>>>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>>>>>> at least in spirit? I agree :-)
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>> recursively building a tree of nested PTransforms. This results in a weird
>>>>>>> nested graph in the dataflow UI where each node contains its inputs nested
>>>>>>> inside of it. I'm going to change the internal data structure for
>>>>>>> converting the tree from a PTransform to a PCollection, which will result
>>>>>>> in a more accurate representation of the tree structure being built and
>>>>>>> should simplify the code as well. This will not change the public interface
>>>>>>> to SQL, which will remain a PTransform. Any thoughts or objections?
>>>>>>>
>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>> graph aside from the dataflow runner UI. What other tools exist?
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>
>>
>>
>> --
>> ----
>> Mingmin
>>
>

Re: Building and visualizing the Beam SQL graph

Posted by Andrew Pilloud <ap...@google.com>.
One of my goals is to make the graph easier to read and map back to the SQL
EXPLAIN output. The way the graph is currently built (`toPTransform` vs
`toPCollection`) does make a big difference in that graph. I think it is
also important to have a common function to do the apply with consistent
naming. I think that will greatly help with ease of understanding. It
sounds like what we really want is this in the BeamRelNode interface:

PInput buildPInput(Pipeline pipeline);
PTransform<PInput, PCollection<Row>> buildPTransform();

default PCollection<Row> toPCollection(Pipeline pipeline) {
    return buildPInput(pipeline).apply(getStageName(), buildPTransform());
}

Andrew
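
[Editor's note: a self-contained toy sketch of the shape proposed above,
with stand-in types (Strings instead of PInput/PCollection<Row>, a
StringBuilder instead of a Pipeline). The rel node class names are
illustrative; the point is the shared default apply with consistent
naming, and that each node consumes its input's *output* so the graph
comes out flat rather than nested.]

```java
import java.util.function.UnaryOperator;

public class BeamRelNodeSketch {
    // Stand-ins: the "pipeline" is just a log of stage names, and a node's
    // output is a String rather than a PCollection<Row>.
    interface RelNode {
        String buildInput(StringBuilder pipeline);       // ~ buildPInput
        UnaryOperator<String> buildTransform();          // ~ buildPTransform
        String stageName();

        // Shared apply with a consistent stage name, mirroring the proposed
        // default toPCollection.
        default String toOutput(StringBuilder pipeline) {
            String input = buildInput(pipeline);
            pipeline.append(stageName()).append(";");
            return buildTransform().apply(input);
        }
    }

    static class Scan implements RelNode {
        public String buildInput(StringBuilder p) { return "rows"; }
        public UnaryOperator<String> buildTransform() { return in -> in; }
        public String stageName() { return "BeamIOSourceRel"; }
    }

    static class Filter implements RelNode {
        final RelNode input;
        Filter(RelNode input) { this.input = input; }
        // The node's input is the upstream node's *output*, so the resulting
        // graph stays flat instead of nesting the input subtree inside Filter.
        public String buildInput(StringBuilder p) { return input.toOutput(p); }
        public UnaryOperator<String> buildTransform() { return in -> "filter(" + in + ")"; }
        public String stageName() { return "BeamFilterRel"; }
    }

    public static void main(String[] args) {
        StringBuilder pipeline = new StringBuilder();
        String out = new Filter(new Scan()).toOutput(pipeline);
        System.out.println(out);      // filter(rows)
        System.out.println(pipeline); // BeamIOSourceRel;BeamFilterRel;
    }
}
```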

On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <mi...@gmail.com> wrote:

> EXPLAIN shows the execution plan from the SQL perspective only. After
> converting to a Beam composite PTransform there are more steps underneath,
> and each Runner reorganizes the Beam PTransforms again, which makes the
> final pipeline hard to read. In the SQL module itself, I don't see any
> difference between `toPTransform` and `toPCollection`. We could have an
> easy-to-understand step name when converting RelNodes, but it is the
> Runners that show the graph to developers.
>
> Mingmin
>
> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <ap...@google.com>
> wrote:
>
>> That sounds correct. And because each rel node might have a different
>> input there isn't a standard interface (like PTransform<PCollection<Row>,
>> PCollection<Row>> toPTransform());
>>
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> Agree with that. It will be kind of tricky to generalize. I think there
>>> are some criteria in this case that might apply in other cases:
>>>
>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>> how it computes its result from its inputs.
>>> 2. The inputs to that PTransform should actually be the inputs to the
>>> rel node!
>>>
>>> So I tried to improve #1 but I probably made #2 worse.
>>>
>>> Kenn
>>>
>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:
>>>
>>>> Not answering the original question, but doesn't "explain" satisfy the
>>>> SQL use case?
>>>>
>>>> Going forward we probably want to solve this in a more general way. We
>>>> have at least 3 ways to represent the pipeline:
>>>>  - how runner executes it;
>>>>  - what it looks like when constructed;
>>>>  - what the user was describing in DSL;
>>>> And there will probably be more, if extra layers are built on top of
>>>> DSLs.
>>>>
>>>> If possible, we probably should be able to map any level of abstraction
>>>> to any other to better understand and debug the pipelines.
>>>>
>>>>
>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com>
>>>> wrote:
>>>>
>>>>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>>>>> at least in spirit? I agree :-)
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com>
>>>>> wrote:
>>>>>
>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>> recursively building a tree of nested PTransforms. This results in a weird
>>>>>> nested graph in the dataflow UI where each node contains its inputs nested
>>>>>> inside of it. I'm going to change the internal data structure for
>>>>>> converting the tree from a PTransform to a PCollection, which will result
>>>>>> in a more accurate representation of the tree structure being built and
>>>>>> should simplify the code as well. This will not change the public interface
>>>>>> to SQL, which will remain a PTransform. Any thoughts or objections?
>>>>>>
>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>> graph aside from the dataflow runner UI. What other tools exist?
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>
>
>
> --
> ----
> Mingmin
>

Re: Building and visualizing the Beam SQL graph

Posted by Mingmin Xu <mi...@gmail.com>.
EXPLAIN shows the execution plan from the SQL perspective only. After
converting to a Beam composite PTransform there are more steps underneath,
and each Runner reorganizes the Beam PTransforms again, which makes the
final pipeline hard to read. In the SQL module itself, I don't see any
difference between `toPTransform` and `toPCollection`. We could have an
easy-to-understand step name when converting RelNodes, but it is the
Runners that show the graph to developers.

Mingmin

On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <ap...@google.com> wrote:

> That sounds correct. And because each rel node might have a different
> input there isn't a standard interface (like PTransform<PCollection<Row>,
> PCollection<Row>> toPTransform());
>
> Andrew
>
> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <kl...@google.com> wrote:
>
>> Agree with that. It will be kind of tricky to generalize. I think there
>> are some criteria in this case that might apply in other cases:
>>
>> 1. Each rel node (or construct of a DSL) should have a PTransform for how
>> it computes its result from its inputs.
>> 2. The inputs to that PTransform should actually be the inputs to the rel
>> node!
>>
>> So I tried to improve #1 but I probably made #2 worse.
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:
>>
>>> Not answering the original question, but doesn't "explain" satisfy the
>>> SQL use case?
>>>
>>> Going forward we probably want to solve this in a more general way. We
>>> have at least 3 ways to represent the pipeline:
>>>  - how runner executes it;
>>>  - what it looks like when constructed;
>>>  - what the user was describing in DSL;
>>> And there will probably be more, if extra layers are built on top of
>>> DSLs.
>>>
>>> If possible, we probably should be able to map any level of abstraction
>>> to any other to better understand and debug the pipelines.
>>>
>>>
>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>>>> at least in spirit? I agree :-)
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com>
>>>> wrote:
>>>>
>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>> recursively building a tree of nested PTransforms. This results in a weird
>>>>> nested graph in the dataflow UI where each node contains its inputs nested
>>>>> inside of it. I'm going to change the internal data structure for
>>>>> converting the tree from a PTransform to a PCollection, which will result
>>>>> in a more accurate representation of the tree structure being built and
>>>>> should simplify the code as well. This will not change the public interface
>>>>> to SQL, which will remain a PTransform. Any thoughts or objections?
>>>>>
>>>>> I was also wondering if there are tools for visualizing the Beam graph
>>>>> aside from the dataflow runner UI. What other tools exist?
>>>>>
>>>>> Andrew
>>>>>
>>>>


-- 
----
Mingmin

Re: Building and visualizing the Beam SQL graph

Posted by Andrew Pilloud <ap...@google.com>.
That sounds correct. And because each rel node might have a different input,
there isn't a standard interface (like `PTransform<PCollection<Row>,
PCollection<Row>> toPTransform()`).

Andrew

On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <kl...@google.com> wrote:

> Agree with that. It will be kind of tricky to generalize. I think there
> are some criteria in this case that might apply in other cases:
>
> 1. Each rel node (or construct of a DSL) should have a PTransform for how
> it computes its result from its inputs.
> 2. The inputs to that PTransform should actually be the inputs to the rel
> node!
>
> So I tried to improve #1 but I probably made #2 worse.
>
> Kenn
>
> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:
>
>> Not answering the original question, but doesn't "explain" satisfy the
>> SQL use case?
>>
>> Going forward we probably want to solve this in a more general way. We
>> have at least 3 ways to represent the pipeline:
>>  - how runner executes it;
>>  - what it looks like when constructed;
>>  - what the user was describing in DSL;
>> And there will probably be more, if extra layers are built on top of DSLs.
>>
>> If possible, we probably should be able to map any level of abstraction
>> to any other to better understand and debug the pipelines.
>>
>>
>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com> wrote:
>>
>>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>>> at least in spirit? I agree :-)
>>>
>>> Kenn
>>>
>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com>
>>> wrote:
>>>
>>>> We are currently converting the Calcite Rel tree to Beam by recursively
>>>> building a tree of nested PTransforms. This results in a weird nested graph
>>>> in the dataflow UI where each node contains its inputs nested inside of it.
>>>> I'm going to change the internal data structure for converting the tree
>>>> from a PTransform to a PCollection, which will result in a more accurate
>>>> representation of the tree structure being built and should simplify the
>>>> code as well. This will not change the public interface to SQL, which will
>>>> remain a PTransform. Any thoughts or objections?
>>>>
>>>> I was also wondering if there are tools for visualizing the Beam graph
>>>> aside from the dataflow runner UI. What other tools exist?
>>>>
>>>> Andrew
>>>>
>>>

Re: Building and visualizing the Beam SQL graph

Posted by Kenneth Knowles <kl...@google.com>.
Agree with that. It will be kind of tricky to generalize. I think there are
some criteria in this case that might apply in other cases:

1. Each rel node (or construct of a DSL) should have a PTransform for how
it computes its result from its inputs.
2. The inputs to that PTransform should actually be the inputs to the rel
node!

So I tried to improve #1 but I probably made #2 worse.

Kenn
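
[Editor's note: a toy illustration of the two criteria above, with Strings
standing in for collections; names are illustrative, not Beam API. In the
"before" shape the node's transform closes over its child and expands it
internally, so the child nests inside the node; in the "after" shape the
transform is pure output-from-input and the child is wired in externally.]

```java
import java.util.function.UnaryOperator;

public class CriteriaSketch {
    // "Before": the transform expands its child internally, violating
    // criterion #2 (its input is the pipeline root, not the child's output).
    static UnaryOperator<String> nestedFilterTransform(UnaryOperator<String> child) {
        return pipelineRoot -> "filter(" + child.apply(pipelineRoot) + ")";
    }

    // "After": the transform only computes output from input; the caller
    // supplies the child's output, satisfying both criteria.
    static UnaryOperator<String> flatFilterTransform() {
        return input -> "filter(" + input + ")";
    }

    public static void main(String[] args) {
        UnaryOperator<String> scan = root -> "scan(" + root + ")";
        // Same result either way; what differs is where the child expands.
        System.out.println(nestedFilterTransform(scan).apply("p"));       // filter(scan(p))
        System.out.println(flatFilterTransform().apply(scan.apply("p"))); // filter(scan(p))
    }
}
```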

On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:

> Not answering the original question, but doesn't "explain" satisfy the SQL
> use case?
>
> Going forward we probably want to solve this in a more general way. We
> have at least 3 ways to represent the pipeline:
>  - how runner executes it;
>  - what it looks like when constructed;
>  - what the user was describing in DSL;
> And there will probably be more, if extra layers are built on top of DSLs.
>
> If possible, we probably should be able to map any level of abstraction to
> any other to better understand and debug the pipelines.
>
>
> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com> wrote:
>
>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>> at least in spirit? I agree :-)
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com>
>> wrote:
>>
>>> We are currently converting the Calcite Rel tree to Beam by recursively
>>> building a tree of nested PTransforms. This results in a weird nested graph
>>> in the dataflow UI where each node contains its inputs nested inside of it.
>>> I'm going to change the internal data structure for converting the tree
>>> from a PTransform to a PCollection, which will result in a more accurate
>>> representation of the tree structure being built and should simplify the
>>> code as well. This will not change the public interface to SQL, which will
>>> remain a PTransform. Any thoughts or objections?
>>>
>>> I was also wondering if there are tools for visualizing the Beam graph
>>> aside from the dataflow runner UI. What other tools exist?
>>>
>>> Andrew
>>>
>>

Re: Building and visualizing the Beam SQL graph

Posted by Anton Kedin <ke...@google.com>.
Not answering the original question, but doesn't "explain" satisfy the SQL
use case?

Going forward we probably want to solve this in a more general way. We have
at least 3 ways to represent the pipeline:
 - how runner executes it;
 - what it looks like when constructed;
 - what the user was describing in DSL;
And there will probably be more, if extra layers are built on top of DSLs.

If possible, we probably should be able to map any level of abstraction to
any other to better understand and debug the pipelines.


On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <kl...@google.com> wrote:

> In other words, revert https://github.com/apache/beam/pull/4705/files, at
> least in spirit? I agree :-)
>
> Kenn
>
> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com>
> wrote:
>
>> We are currently converting the Calcite Rel tree to Beam by recursively
>> building a tree of nested PTransforms. This results in a weird nested graph
>> in the dataflow UI where each node contains its inputs nested inside of it.
>> I'm going to change the internal data structure for converting the tree
>> from a PTransform to a PCollection, which will result in a more accurate
>> representation of the tree structure being built and should simplify the
>> code as well. This will not change the public interface to SQL, which will
>> remain a PTransform. Any thoughts or objections?
>>
>> I was also wondering if there are tools for visualizing the Beam graph
>> aside from the dataflow runner UI. What other tools exist?
>>
>> Andrew
>>
>

Re: Building and visualizing the Beam SQL graph

Posted by Kenneth Knowles <kl...@google.com>.
In other words, revert https://github.com/apache/beam/pull/4705/files, at
least in spirit? I agree :-)

Kenn

On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <ap...@google.com> wrote:

> We are currently converting the Calcite Rel tree to Beam by recursively
> building a tree of nested PTransforms. This results in a weird nested graph
> in the dataflow UI where each node contains its inputs nested inside of it.
> I'm going to change the internal data structure for converting the tree
> from a PTransform to a PCollection, which will result in a more accurate
> representation of the tree structure being built and should simplify the
> code as well. This will not change the public interface to SQL, which will
> remain a PTransform. Any thoughts or objections?
>
> I was also wondering if there are tools for visualizing the Beam graph
> aside from the dataflow runner UI. What other tools exist?
>
> Andrew
>