You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Igor Durovic <id...@gmail.com> on 2019/10/21 22:26:58 UTC

Interactive Beam Example Failing [BEAM-8451]

Hi everyone,

The interactive beam example using the DirectRunner fails after execution
of the last cell. The recursion limit is exceeded during the calculation of
the cache label because of a circular reference in the PipelineInfo object.

The constructor
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L375>
for
the PipelineInfo class creates a mapping from each pcollection to the
transforms that produce and consume it. The issue arises when there exists
a transform that is both a producer and a consumer for the same
pcollection. This occurs when a transform's expand method returns the same
pcoll object that's passed into it. The specific transform causing the
failure of the example is MaybeReshuffle
<https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L2464>,
which
is used in the Create transform. Replacing "return pcoll" with "return
pcoll | Map(lambda x: x)" seems to fix the problem.

A workaround for this issue on the interactive beam side would be fairly
simple, but it seems to me that there should be more validation of
pipelines to prevent the use of transforms that return the same pcoll
that's passed in, or at least a mention of this in the transform style
guide. My understanding is that pcollections are produced by a single
transform (they even have a field called "producer" that references only
one transform). If that's the case then that property of pcollections
should be enforced.

I made ticket BEAM-8451 to track this issue.

I'm still new to beam so I apologize if I'm fundamentally misunderstanding
something. I'm not exactly sure what the next step should be and would
appreciate some recommendations. I can submit a PR to solve the immediate
problem of the failing example but the underlying problem should also be
addressed at some point. I also apologize if people are already aware of
this problem.

Thank You!
Igor Durovic

Re: [EXT] Re: Interactive Beam Example Failing [BEAM-8451]

Posted by Chuck Yang <ch...@getcruise.com>.
IMO returning the input PCollection in a PTransform should be a valid
albeit trivial use case. I have put a suggested fix for supporting
these kinds of transforms in the interactive runner here:
https://github.com/apache/beam/pull/9865 . I'm also new to beam dev so
if there's something I'm missing please let me know :-)

On Mon, Oct 21, 2019 at 4:46 PM Robert Bradshaw <ro...@google.com> wrote:
>
> Thanks for trying this out. Yes, this is definitely something that
> should be supported (and tested).
>
> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic <id...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > The interactive beam example using the DirectRunner fails after execution of the last cell. The recursion limit is exceeded during the calculation of the cache label because of a circular reference in the PipelineInfo object.
> >
> > The constructor for the PipelineInfo class creates a mapping from each pcollection to the transforms that produce and consume it. The issue arises when there exists a transform that is both a producer and a consumer for the same pcollection. This occurs when a transform's expand method returns the same pcoll object that's passed into it. The specific transform causing the failure of the example is MaybeReshuffle, which is used in the Create transform. Replacing "return pcoll" with "return pcoll | Map(lambda x: x)" seems to fix the problem.
> >
> > A workaround for this issue on the interactive beam side would be fairly simple, but it seems to me that there should be more validation of pipelines to prevent the use of transforms that return the same pcoll that's passed in, or at least a mention of this in the transform style guide. My understanding is that pcollections are produced by a single transform (they even have a field called "producer" that references only one transform). If that's the case then that property of pcollections should be enforced.
> >
> > I made ticket BEAM-8451 to track this issue.
> >
> > I'm still new to beam so I apologize if I'm fundamentally misunderstanding something. I'm not exactly sure what the next step should be and would appreciate some recommendations. I can submit a PR to solve the immediate problem of the failing example but the underlying problem should also be addressed at some point. I also apologize if people are already aware of this problem.
> >
> > Thank You!
> > Igor Durovic

-- 


*Confidentiality Note:* We care about protecting our proprietary 
information, confidential material, and trade secrets. This message may 
contain some or all of those things. Cruise will suffer material harm if 
anyone other than the intended recipient disseminates or takes any action 
based on this message. If you have received this message (including any 
attachments) in error, please delete it immediately and notify the sender 
promptly.

Re: Interactive Beam Example Failing [BEAM-8451]

Posted by David Yan <da...@google.com>.
+Sam Rohde <sr...@google.com> is working on streaming support for
Interactive Beam.

The high level idea is to capture a bounded segment of the unbounded data
source for replayablity and determinism, and to use TestStream, which has
the ability to control the clock of the pipeline, to replay the data, so
streaming semantics that depend on processing time (e.g. processing time
triggers) will be deterministic.

David

On Thu, Oct 24, 2019 at 4:39 PM Robert Bradshaw <ro...@google.com> wrote:

> Yes, there are plans to support streaming for interactive beam. David
> Yan (cc'd) is leading this effort.
>
> On Thu, Oct 24, 2019 at 1:50 PM Harsh Vardhan <an...@google.com> wrote:
> >
> > Thanks, +1 to adding support for streaming on Interactive Beam (+David
> Yan)
> >
> >
> > On Thu, Oct 24, 2019 at 1:45 PM Hai Lu <lh...@apache.org> wrote:
> >>
> >> Hi Robert,
> >>
> >> We're trying out iBeam at LinkedIn for Python. As Igor mentioned, there
> seems to be some inconsistency in the behavior of interactive beam. We can
> suggest some fixes from our end but we would need some support from the
> community.
> >>
> >> Also, is there a plan to support iBeam for streaming mode? We're
> interested in that use case as well.
> >>
> >> Thanks,
> >> Hai
> >>
> >> On Mon, Oct 21, 2019 at 4:45 PM Robert Bradshaw <ro...@google.com>
> wrote:
> >>>
> >>> Thanks for trying this out. Yes, this is definitely something that
> >>> should be supported (and tested).
> >>>
> >>> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic <id...@gmail.com>
> wrote:
> >>> >
> >>> > Hi everyone,
> >>> >
> >>> > The interactive beam example using the DirectRunner fails after
> execution of the last cell. The recursion limit is exceeded during the
> calculation of the cache label because of a circular reference in the
> PipelineInfo object.
> >>> >
> >>> > The constructor for the PipelineInfo class creates a mapping from
> each pcollection to the transforms that produce and consume it. The issue
> arises when there exists a transform that is both a producer and a consumer
> for the same pcollection. This occurs when a transform's expand method
> returns the same pcoll object that's passed into it. The specific transform
> causing the failure of the example is MaybeReshuffle, which is used in the
> Create transform. Replacing "return pcoll" with "return pcoll | Map(lambda
> x: x)" seems to fix the problem.
> >>> >
> >>> > A workaround for this issue on the interactive beam side would be
> fairly simple, but it seems to me that there should be more validation of
> pipelines to prevent the use of transforms that return the same pcoll
> that's passed in, or at least a mention of this in the transform style
> guide. My understanding is that pcollections are produced by a single
> transform (they even have a field called "producer" that references only
> one transform). If that's the case then that property of pcollections
> should be enforced.
> >>> >
> >>> > I made ticket BEAM-8451 to track this issue.
> >>> >
> >>> > I'm still new to beam so I apologize if I'm fundamentally
> misunderstanding something. I'm not exactly sure what the next step should
> be and would appreciate some recommendations. I can submit a PR to solve
> the immediate problem of the failing example but the underlying problem
> should also be addressed at some point. I also apologize if people are
> already aware of this problem.
> >>> >
> >>> > Thank You!
> >>> > Igor Durovic
>

Re: Interactive Beam Example Failing [BEAM-8451]

Posted by Robert Bradshaw <ro...@google.com>.
Yes, there are plans to support streaming for interactive beam. David
Yan (cc'd) is leading this effort.

On Thu, Oct 24, 2019 at 1:50 PM Harsh Vardhan <an...@google.com> wrote:
>
> Thanks, +1 to adding support for streaming on Interactive Beam (+David Yan)
>
>
> On Thu, Oct 24, 2019 at 1:45 PM Hai Lu <lh...@apache.org> wrote:
>>
>> Hi Robert,
>>
>> We're trying out iBeam at LinkedIn for Python. As Igor mentioned, there seems to be some inconsistency in the behavior of interactive beam. We can suggest some fixes from our end but we would need some support from the community.
>>
>> Also, is there a plan to support iBeam for streaming mode? We're interested in that use case as well.
>>
>> Thanks,
>> Hai
>>
>> On Mon, Oct 21, 2019 at 4:45 PM Robert Bradshaw <ro...@google.com> wrote:
>>>
>>> Thanks for trying this out. Yes, this is definitely something that
>>> should be supported (and tested).
>>>
>>> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic <id...@gmail.com> wrote:
>>> >
>>> > Hi everyone,
>>> >
>>> > The interactive beam example using the DirectRunner fails after execution of the last cell. The recursion limit is exceeded during the calculation of the cache label because of a circular reference in the PipelineInfo object.
>>> >
>>> > The constructor for the PipelineInfo class creates a mapping from each pcollection to the transforms that produce and consume it. The issue arises when there exists a transform that is both a producer and a consumer for the same pcollection. This occurs when a transform's expand method returns the same pcoll object that's passed into it. The specific transform causing the failure of the example is MaybeReshuffle, which is used in the Create transform. Replacing "return pcoll" with "return pcoll | Map(lambda x: x)" seems to fix the problem.
>>> >
>>> > A workaround for this issue on the interactive beam side would be fairly simple, but it seems to me that there should be more validation of pipelines to prevent the use of transforms that return the same pcoll that's passed in, or at least a mention of this in the transform style guide. My understanding is that pcollections are produced by a single transform (they even have a field called "producer" that references only one transform). If that's the case then that property of pcollections should be enforced.
>>> >
>>> > I made ticket BEAM-8451 to track this issue.
>>> >
>>> > I'm still new to beam so I apologize if I'm fundamentally misunderstanding something. I'm not exactly sure what the next step should be and would appreciate some recommendations. I can submit a PR to solve the immediate problem of the failing example but the underlying problem should also be addressed at some point. I also apologize if people are already aware of this problem.
>>> >
>>> > Thank You!
>>> > Igor Durovic

Re: Interactive Beam Example Failing [BEAM-8451]

Posted by Harsh Vardhan <an...@google.com>.
Thanks, +1 to adding support for streaming on Interactive Beam (+David Yan
<da...@google.com>)


On Thu, Oct 24, 2019 at 1:45 PM Hai Lu <lh...@apache.org> wrote:

> Hi Robert,
>
> We're trying out iBeam at LinkedIn for Python. As Igor mentioned, there
> seems to be some inconsistency in the behavior of interactive beam. We can
> suggest some fixes from our end but we would need some support from the
> community.
>
> Also, is there a plan to support iBeam for streaming mode? We're
> interested in that use case as well.
>
> Thanks,
> Hai
>
> On Mon, Oct 21, 2019 at 4:45 PM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> Thanks for trying this out. Yes, this is definitely something that
>> should be supported (and tested).
>>
>> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic <id...@gmail.com> wrote:
>> >
>> > Hi everyone,
>> >
>> > The interactive beam example using the DirectRunner fails after
>> execution of the last cell. The recursion limit is exceeded during the
>> calculation of the cache label because of a circular reference in the
>> PipelineInfo object.
>> >
>> > The constructor for the PipelineInfo class creates a mapping from each
>> pcollection to the transforms that produce and consume it. The issue arises
>> when there exists a transform that is both a producer and a consumer for
>> the same pcollection. This occurs when a transform's expand method returns
>> the same pcoll object that's passed into it. The specific transform causing
>> the failure of the example is MaybeReshuffle, which is used in the Create
>> transform. Replacing "return pcoll" with "return pcoll | Map(lambda x: x)"
>> seems to fix the problem.
>> >
>> > A workaround for this issue on the interactive beam side would be
>> fairly simple, but it seems to me that there should be more validation of
>> pipelines to prevent the use of transforms that return the same pcoll
>> that's passed in, or at least a mention of this in the transform style
>> guide. My understanding is that pcollections are produced by a single
>> transform (they even have a field called "producer" that references only
>> one transform). If that's the case then that property of pcollections
>> should be enforced.
>> >
>> > I made ticket BEAM-8451 to track this issue.
>> >
>> > I'm still new to beam so I apologize if I'm fundamentally
>> misunderstanding something. I'm not exactly sure what the next step should
>> be and would appreciate some recommendations. I can submit a PR to solve
>> the immediate problem of the failing example but the underlying problem
>> should also be addressed at some point. I also apologize if people are
>> already aware of this problem.
>> >
>> > Thank You!
>> > Igor Durovic
>>
>

Re: Interactive Beam Example Failing [BEAM-8451]

Posted by Hai Lu <lh...@apache.org>.
Hi Robert,

We're trying out iBeam at LinkedIn for Python. As Igor mentioned, there
seems to be some inconsistency in the behavior of interactive beam. We can
suggest some fixes from our end but we would need some support from the
community.

Also, is there a plan to support iBeam for streaming mode? We're interested
in that use case as well.

Thanks,
Hai

On Mon, Oct 21, 2019 at 4:45 PM Robert Bradshaw <ro...@google.com> wrote:

> Thanks for trying this out. Yes, this is definitely something that
> should be supported (and tested).
>
> On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic <id...@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > The interactive beam example using the DirectRunner fails after
> execution of the last cell. The recursion limit is exceeded during the
> calculation of the cache label because of a circular reference in the
> PipelineInfo object.
> >
> > The constructor for the PipelineInfo class creates a mapping from each
> pcollection to the transforms that produce and consume it. The issue arises
> when there exists a transform that is both a producer and a consumer for
> the same pcollection. This occurs when a transform's expand method returns
> the same pcoll object that's passed into it. The specific transform causing
> the failure of the example is MaybeReshuffle, which is used in the Create
> transform. Replacing "return pcoll" with "return pcoll | Map(lambda x: x)"
> seems to fix the problem.
> >
> > A workaround for this issue on the interactive beam side would be fairly
> simple, but it seems to me that there should be more validation of
> pipelines to prevent the use of transforms that return the same pcoll
> that's passed in, or at least a mention of this in the transform style
> guide. My understanding is that pcollections are produced by a single
> transform (they even have a field called "producer" that references only
> one transform). If that's the case then that property of pcollections
> should be enforced.
> >
> > I made ticket BEAM-8451 to track this issue.
> >
> > I'm still new to beam so I apologize if I'm fundamentally
> misunderstanding something. I'm not exactly sure what the next step should
> be and would appreciate some recommendations. I can submit a PR to solve
> the immediate problem of the failing example but the underlying problem
> should also be addressed at some point. I also apologize if people are
> already aware of this problem.
> >
> > Thank You!
> > Igor Durovic
>

Re: Interactive Beam Example Failing [BEAM-8451]

Posted by Robert Bradshaw <ro...@google.com>.
Thanks for trying this out. Yes, this is definitely something that
should be supported (and tested).

On Mon, Oct 21, 2019 at 3:40 PM Igor Durovic <id...@gmail.com> wrote:
>
> Hi everyone,
>
> The interactive beam example using the DirectRunner fails after execution of the last cell. The recursion limit is exceeded during the calculation of the cache label because of a circular reference in the PipelineInfo object.
>
> The constructor for the PipelineInfo class creates a mapping from each pcollection to the transforms that produce and consume it. The issue arises when there exists a transform that is both a producer and a consumer for the same pcollection. This occurs when a transform's expand method returns the same pcoll object that's passed into it. The specific transform causing the failure of the example is MaybeReshuffle, which is used in the Create transform. Replacing "return pcoll" with "return pcoll | Map(lambda x: x)" seems to fix the problem.
>
> A workaround for this issue on the interactive beam side would be fairly simple, but it seems to me that there should be more validation of pipelines to prevent the use of transforms that return the same pcoll that's passed in, or at least a mention of this in the transform style guide. My understanding is that pcollections are produced by a single transform (they even have a field called "producer" that references only one transform). If that's the case then that property of pcollections should be enforced.
>
> I made ticket BEAM-8451 to track this issue.
>
> I'm still new to beam so I apologize if I'm fundamentally misunderstanding something. I'm not exactly sure what the next step should be and would appreciate some recommendations. I can submit a PR to solve the immediate problem of the failing example but the underlying problem should also be addressed at some point. I also apologize if people are already aware of this problem.
>
> Thank You!
> Igor Durovic