You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Daniel Collins <dp...@google.com> on 2021/10/14 13:22:42 UTC

Force usage of runner_v2 on dataflow?

Hi all,

I'd like to make use of SDF bundle finalizers which require runner_v2 on
dataflow. Is there any way to fail a pipeline quickly if not running on
runner_v2 or (better) test if the runner supports bundle finalizers?

-Daniel

Re: Force usage of runner_v2 on dataflow?

Posted by Luke Cwik <lc...@google.com>.
That's a good point Reuven, that should be able to be done.

On Thu, Oct 14, 2021 at 11:24 AM Reuven Lax <re...@google.com> wrote:

> If bundle finalizers don't work on Dataflow's runner v1, can we make
> Dataflow reject that graph earlier in the construction process?
>
> On Thu, Oct 14, 2021 at 10:07 AM Daniel Collins <dp...@google.com>
> wrote:
>
>> Hi Luke,
>>
>> I'm asking this as an author of an SDF, so this isn't particularly
>> useful. I'm trying to head off users trying to run my SDF without
>> runner_v2, since the error messages you get when using bundle finalizers
>> without it are very late in the process (i.e. you have to wait for the job
>> to start and for it to try to run the SDF that declares a bundle finalizer).
>>
>> Can I implement
>> https://beam.apache.org/releases/javadoc/2.33.0/org/apache/beam/sdk/transforms/PTransform.html#validate-org.apache.beam.sdk.options.PipelineOptions-
>> and check that the experiments contains "use_runner_v2" or "enable_prime"?
>> Or will this break for some use cases?
>>
>> -Daniel
>>
>> On Thu, Oct 14, 2021 at 12:34 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> During pipeline construction when you have access to the
>>> PipelineResult[1] and can cancel it. Unfortunately this object or one like
>>> it isn't available during pipeline execution easily. You could try to
>>> re-use the code within the DataflowPipelineJob#cancel method[2].
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/sdks/java/core/src/main/java/org/apache/beam/sdk/PipelineResult.java#L46
>>> 2:
>>> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L436
>>>
>>> On Thu, Oct 14, 2021 at 6:23 AM Daniel Collins <dp...@google.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to make use of SDF bundle finalizers which require runner_v2
>>>> on dataflow. Is there any way to fail a pipeline quickly if not running on
>>>> runner_v2 or (better) test if the runner supports bundle finalizers?
>>>>
>>>> -Daniel
>>>>
>>>

Re: Force usage of runner_v2 on dataflow?

Posted by Reuven Lax <re...@google.com>.
If bundle finalizers don't work on Dataflow's runner v1, can we make
Dataflow reject that graph earlier in the construction process?

On Thu, Oct 14, 2021 at 10:07 AM Daniel Collins <dp...@google.com>
wrote:

> Hi Luke,
>
> I'm asking this as an author of an SDF, so this isn't particularly useful.
> I'm trying to head off users trying to run my SDF without runner_v2, since
> the error messages you get when using bundle finalizers without it are very
> late in the process (i.e. you have to wait for the job to start and for it
> to try to run the SDF that declares a bundle finalizer).
>
> Can I implement
> https://beam.apache.org/releases/javadoc/2.33.0/org/apache/beam/sdk/transforms/PTransform.html#validate-org.apache.beam.sdk.options.PipelineOptions-
> and check that the experiments contains "use_runner_v2" or "enable_prime"?
> Or will this break for some use cases?
>
> -Daniel
>
> On Thu, Oct 14, 2021 at 12:34 PM Luke Cwik <lc...@google.com> wrote:
>
>> During pipeline construction when you have access to the
>> PipelineResult[1] and can cancel it. Unfortunately this object or one like
>> it isn't available during pipeline execution easily. You could try to
>> re-use the code within the DataflowPipelineJob#cancel method[2].
>>
>> 1:
>> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/sdks/java/core/src/main/java/org/apache/beam/sdk/PipelineResult.java#L46
>> 2:
>> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L436
>>
>> On Thu, Oct 14, 2021 at 6:23 AM Daniel Collins <dp...@google.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to make use of SDF bundle finalizers which require runner_v2 on
>>> dataflow. Is there any way to fail a pipeline quickly if not running on
>>> runner_v2 or (better) test if the runner supports bundle finalizers?
>>>
>>> -Daniel
>>>
>>

Re: Force usage of runner_v2 on dataflow?

Posted by Luke Cwik <lc...@google.com>.
You're right, it isn't a great solution. Checking for the experiment will
cause issues for those who want to use the SDF on other runners which
support bundle finalization and SDF (e.g. the direct runner which many
users use when they are starting to write their pipeline).

You could always write in the public documentation that Dataflow customers
need to use Dataflow Prime (aka runner v2).

With portability, there are a set of "requirements" that are written into
the pipeline proto (one of which is bundle finalization) that runners are
meant to reject if they don't support it. This won't help with non-portable
pipelines though (e.g. Dataflow runner v1).

On Thu, Oct 14, 2021 at 10:07 AM Daniel Collins <dp...@google.com>
wrote:

> Hi Luke,
>
> I'm asking this as an author of an SDF, so this isn't particularly useful.
> I'm trying to head off users trying to run my SDF without runner_v2, since
> the error messages you get when using bundle finalizers without it are very
> late in the process (i.e. you have to wait for the job to start and for it
> to try to run the SDF that declares a bundle finalizer).
>
> Can I implement
> https://beam.apache.org/releases/javadoc/2.33.0/org/apache/beam/sdk/transforms/PTransform.html#validate-org.apache.beam.sdk.options.PipelineOptions-
> and check that the experiments contains "use_runner_v2" or "enable_prime"?
> Or will this break for some use cases?
>
> -Daniel
>
> On Thu, Oct 14, 2021 at 12:34 PM Luke Cwik <lc...@google.com> wrote:
>
>> During pipeline construction when you have access to the
>> PipelineResult[1] and can cancel it. Unfortunately this object or one like
>> it isn't available during pipeline execution easily. You could try to
>> re-use the code within the DataflowPipelineJob#cancel method[2].
>>
>> 1:
>> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/sdks/java/core/src/main/java/org/apache/beam/sdk/PipelineResult.java#L46
>> 2:
>> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L436
>>
>> On Thu, Oct 14, 2021 at 6:23 AM Daniel Collins <dp...@google.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to make use of SDF bundle finalizers which require runner_v2 on
>>> dataflow. Is there any way to fail a pipeline quickly if not running on
>>> runner_v2 or (better) test if the runner supports bundle finalizers?
>>>
>>> -Daniel
>>>
>>

Re: Force usage of runner_v2 on dataflow?

Posted by Daniel Collins <dp...@google.com>.
Hi Luke,

I'm asking this as an author of an SDF, so this isn't particularly useful.
I'm trying to head off users trying to run my SDF without runner_v2, since
the error messages you get when using bundle finalizers without it are very
late in the process (i.e. you have to wait for the job to start and for it
to try to run the SDF that declares a bundle finalizer).

Can I implement
https://beam.apache.org/releases/javadoc/2.33.0/org/apache/beam/sdk/transforms/PTransform.html#validate-org.apache.beam.sdk.options.PipelineOptions-
and check that the experiments contains "use_runner_v2" or "enable_prime"?
Or will this break for some use cases?

-Daniel

On Thu, Oct 14, 2021 at 12:34 PM Luke Cwik <lc...@google.com> wrote:

> During pipeline construction when you have access to the PipelineResult[1]
> and can cancel it. Unfortunately this object or one like it isn't available
> during pipeline execution easily. You could try to re-use the code within
> the DataflowPipelineJob#cancel method[2].
>
> 1:
> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/sdks/java/core/src/main/java/org/apache/beam/sdk/PipelineResult.java#L46
> 2:
> https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L436
>
> On Thu, Oct 14, 2021 at 6:23 AM Daniel Collins <dp...@google.com>
> wrote:
>
>> Hi all,
>>
>> I'd like to make use of SDF bundle finalizers which require runner_v2 on
>> dataflow. Is there any way to fail a pipeline quickly if not running on
>> runner_v2 or (better) test if the runner supports bundle finalizers?
>>
>> -Daniel
>>
>

Re: Force usage of runner_v2 on dataflow?

Posted by Luke Cwik <lc...@google.com>.
During pipeline construction when you have access to the PipelineResult[1]
and can cancel it. Unfortunately this object or one like it isn't available
during pipeline execution easily. You could try to re-use the code within
the DataflowPipelineJob#cancel method[2].

1:
https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/sdks/java/core/src/main/java/org/apache/beam/sdk/PipelineResult.java#L46
2:
https://github.com/apache/beam/blob/bd0ac384b13c39e577ef211d5cb346ae0021e796/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L436

On Thu, Oct 14, 2021 at 6:23 AM Daniel Collins <dp...@google.com> wrote:

> Hi all,
>
> I'd like to make use of SDF bundle finalizers which require runner_v2 on
> dataflow. Is there any way to fail a pipeline quickly if not running on
> runner_v2 or (better) test if the runner supports bundle finalizers?
>
> -Daniel
>