Posted to user@beam.apache.org by Marcin Kuthan <ma...@gmail.com> on 2022/05/05 06:39:02 UTC

Dataflow runner v1 vs. v2 for Java pipelines

Hi

I experimented a bit with Dataflow Runner v2 and I do not see a strong
advantage for pure Java pipelines. I would like to hear from more
experienced Dataflow Runner v2 users:

* Do you run your production Java pipelines using Dataflow Runner V2?
* Streaming or batch?
* Custom containers or default ones?
* Have you observed any regression after migration from Dataflow Runner V1?

Finally, the WordCount example (Beam 2.38) does not work if only the
following parameter is set "--dataflowServiceOptions=use_runner_v2".

java.lang.RuntimeException: Failed to create a workflow job: Dataflow
Runner v2 requires a valid FnApi job, Please resubmit your job with a valid
configuration. Note that if using Templates, you may need to regenerate
your template with the '--use_runner_v2'.
    at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:1330)
    at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:196)
    at org.apache.beam.sdk.Pipeline.run (Pipeline.java:323)
    at org.apache.beam.sdk.Pipeline.run (Pipeline.java:309)
    at org.apache.beam.examples.WordCount.runWordCount (WordCount.java:196)
    at org.apache.beam.examples.WordCount.main (WordCount.java:203)

I'm able to run the job with the "--experiments=use_runner_v2" option, and it
seems that Runner v2 is enabled: the steps are different, especially for
the reader part of the pipeline.
If I specify both options, "--dataflowServiceOptions=use_runner_v2" and
"--experiments=use_runner_v2", the job looks identical to the previous one,
so the dataflowServiceOptions setting seems to be redundant. But the official
documentation clearly says to use the dataflowServiceOptions option to
enable Dataflow Runner v2:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2

Thanks for sharing your thoughts
Marcin Kuthan

Re: Dataflow runner v1 vs. v2 for Java pipelines

Posted by Robert Bradshaw <ro...@google.com>.
On Fri, May 6, 2022 at 7:08 AM Marcin Kuthan <ma...@gmail.com> wrote:
>
> Hi Robert
>
> Thank you for the answer and for creating the Beam issue. The documentation has also been fixed, very nice!
>
> Do you know when Runner v2 will become the default runner for Java pipelines? Is it a matter of months or years?

Good question. Hard to predict, but I'd say likely somewhere between
those two extremes. I'd be surprised if we don't start rolling out
before the end of this year, but it's not happening in the next month
or two. (Batch and Streaming might roll out at different cadences as
well.)

> BTW, for testing purposes I successfully deployed one of the Google-provided Dataflow templates (PubsubToBigQuery).
> Some hacks are required, though, as documented here: https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/382

Thanks for powering through this. We'll definitely want to make this
experience smoother.


Re: Dataflow runner v1 vs. v2 for Java pipelines

Posted by Marcin Kuthan <ma...@gmail.com>.
Hi Robert

Thank you for the answer and for creating the Beam issue. The documentation
has also been fixed, very nice!

Do you know when Runner v2 will become the default runner for Java
pipelines? Is it a matter of months or years?

BTW, for testing purposes I successfully deployed one of the Google-provided
Dataflow templates (PubsubToBigQuery).
Some hacks are required, though, as documented here:
https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/382



Re: Dataflow runner v1 vs. v2 for Java pipelines

Posted by Robert Bradshaw <ro...@google.com>.
FYI, I filed https://issues.apache.org/jira/browse/BEAM-14421


Re: Dataflow runner v1 vs. v2 for Java pipelines

Posted by Robert Bradshaw <ro...@google.com>.
On Wed, May 4, 2022 at 11:40 PM Marcin Kuthan <ma...@gmail.com> wrote:
>
> Hi
>
> I experimented a bit with Dataflow Runner v2 and I do not see a strong advantage for pure Java pipelines. I would like to hear from more experienced Dataflow Runner v2 users:
>
> * Do you run your production Java pipelines using Dataflow Runner V2?
> * Streaming or batch?
> * Custom containers or default ones?
> * Have you observed any regression after migration from Dataflow Runner V1?

Dataflow Runner v2 should mostly be an internal implementation detail,
although it does offer additional features that are not available for
Dataflow Runner v1 (e.g. fully splittable SDFs, multi-language, custom
containers). Eventually Dataflow will run all pipelines on this new
backend (e.g. all of Go and most of Python are already in this state)
though as you've noticed, you are welcome (and it's in fact
encouraged) to try it out sooner.

I'd welcome any other feedback people have in trying it out here as well.

> Finally, the WordCount example (Beam 2.38) does not work if only the following parameter is set "--dataflowServiceOptions=use_runner_v2".

Thanks for reporting this! This is unexpected: the two should
essentially be aliases (the latter was only named that way because the name
"experiments" was off-putting for mature features). I'll look into it.
In the meantime, go ahead and use the experiments flag.
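
As a rough illustration of that workaround, the sketch below makes sure
"--experiments=use_runner_v2" is among the command-line arguments before they
are handed to the pipeline. It is plain Java with no Beam dependency. The class
and method names (RunnerV2Args, withRunnerV2) are hypothetical, and it assumes
that multiple experiments may be given as a comma-separated list in a single
--experiments argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Beam: ensures the use_runner_v2 experiment
// is present in the arguments passed to a pipeline's main method.
final class RunnerV2Args {
    static final String PREFIX = "--experiments=";
    static final String RUNNER_V2 = "use_runner_v2";

    static String[] withRunnerV2(String[] args) {
        List<String> out = new ArrayList<>(Arrays.asList(args));
        int firstExperiments = -1;
        for (int i = 0; i < out.size(); i++) {
            String arg = out.get(i);
            if (!arg.startsWith(PREFIX)) {
                continue;
            }
            if (firstExperiments < 0) {
                firstExperiments = i;
            }
            // Assumption: experiments are comma-separated within one argument.
            String value = arg.substring(PREFIX.length());
            if (Arrays.asList(value.split(",")).contains(RUNNER_V2)) {
                return args.clone(); // already enabled, nothing to change
            }
        }
        if (firstExperiments < 0) {
            // No --experiments argument at all: append a new one.
            out.add(PREFIX + RUNNER_V2);
        } else {
            // Merge into the first existing --experiments argument.
            String existing = out.get(firstExperiments);
            String separator = existing.length() == PREFIX.length() ? "" : ",";
            out.set(firstExperiments, existing + separator + RUNNER_V2);
        }
        return out.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // In a real pipeline the result would be passed on to the options
        // parser (for example PipelineOptionsFactory.fromArgs) instead of printed.
        System.out.println(String.join(" ", withRunnerV2(args)));
    }
}
```

Arguments that already contain the experiment are returned unchanged, so the
helper can be applied unconditionally.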

> java.lang.RuntimeException: Failed to create a workflow job: Dataflow Runner v2 requires a valid FnApi job, Please resubmit your job with a valid configuration. Note that if using Templates, you may need to regenerate your template with the '--use_runner_v2'.
>     at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:1330)
>     at org.apache.beam.runners.dataflow.DataflowRunner.run (DataflowRunner.java:196)
>     at org.apache.beam.sdk.Pipeline.run (Pipeline.java:323)
>     at org.apache.beam.sdk.Pipeline.run (Pipeline.java:309)
>     at org.apache.beam.examples.WordCount.runWordCount (WordCount.java:196)
>     at org.apache.beam.examples.WordCount.main (WordCount.java:203)
>
> I'm able to run the job with "--experiments=use_runner_v2" option and it seems that runner v2 is enabled. The steps are different, especially for the reader part of the pipeline.
> If I specify both options "--dataflowServiceOptions=use_runner_v2" and "--experiments=use_runner_v2" the job looks identical to the previous one, so the dataflowServiceOptions seems to be redundant. But the official documentation says clearly to use the dataflowServiceOptions option to enable Dataflow Runner v2: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#dataflow-runner-v2.
>
> Thanks for sharing your thoughts
> Marcin Kuthan