Posted to user@beam.apache.org by Claire McGinty <cl...@gmail.com> on 2022/05/03 17:38:12 UTC

SplittableDoFn-based source doesn't efficiently scale up in Dataflow

Hi Beam users,

I'm looking for input on one of our IOs that we recently migrated
<https://github.com/spotify/scio/pull/4260> to SplittableDoFn. When running
in Dataflow we saw performance gains in every aspect (vCPU hours, total
memory time) except for total elapsed time: the SplittableDoFn
implementation took 1.5x as many minutes as the previous one for
~900GB of Parquet files.

It seems like the issue is that it isn't scaling up as much as the old
BoundedSource version. I ran the SplittableDoFn implementation a couple
times to be sure, and it reliably scaled up to only 30%-50% of the max number
of workers the old version reached. Both implementations of this IO have the same
base level of "splittability" (Parquet row groups) so I'm not sure what the
issue could be.

I saw in an older user@ thread that using Dataflow Runner V2 was suggested
as a mitigation. I retried my job using Dataflow Prime and saw significant
improvement, but we're not able to migrate our entire fleet to V2 at this
time.

Is there any workaround for Dataflow Runner V1 to improve the scale-up for
SplittableDoFn sources?

Thanks!
Claire

Re: SplittableDoFn-based source doesn't efficiently scale up in Dataflow

Posted by Claire McGinty <cl...@gmail.com>.

Can you clarify a bit what you mean by being over-aggressive in the
splitRestriction? We can't go any smaller than the current unit of
splittability (a single row group).

Thanks!
-Claire

On Tue, May 3, 2022 at 9:14 PM Robert Bradshaw <ro...@google.com> wrote:

> On Tue, May 3, 2022 at 10:39 AM Claire McGinty
> <cl...@gmail.com> wrote:
> >
> > Hi Beam users,
> >
> > I'm looking for input on one of our IOs that we recently migrated to
> SplittableDoFn. When running in Dataflow we saw performance gains in every
> aspect (VCPU hours, total memory time) except for total elapsed time: the
> SplittableDoFn implementation took 1.5x as many minutes as it did
> previously for about ~900GB of Parquet files.
> >
> > It seems like the issue is that it isn't scaling up as much as the old
> BoundedSource version. I ran the SplittableDoFn implementation a couple
> times to be sure, but reliably, it only scaled up to 30%-50% the max number
> of workers as it used to. Both implementations of this IO have the same
> base level of "splittability" (Parquet row groups) so I'm not sure what the
> issue could be.
> >
> > I saw in an older user@ thread, using Dataflow Runner V2 was suggested
> as a mitigation. I did re-try my job using Dataflow Prime and saw
> significant improvement; but we're not able to migrate our entire fleet to
> V2 at this time.
>
> Note that you can pass use_runner_v2 to use Dataflow Runner V2 if
> there are other Prime features that you're not ready for yet. (It
> would be good to understand what issues you're running into as well,
> if you're able to share.)
>
> > Is there any workaround for Dataflow Runner V1 to improve the scale-up
> for SplittableDoFn sources?
>
> Architectural constraints in Runner V1 keep it from executing
> SplittableDoFns as well as Runner V2 can. Upgrading to Runner V2
> really is the best mitigation. But one possible workaround on V1 might
> be to be over-aggressive in your splitRestriction implementation.
>

Re: SplittableDoFn-based source doesn't efficiently scale up in Dataflow

Posted by Robert Bradshaw <ro...@google.com>.

On Tue, May 3, 2022 at 10:39 AM Claire McGinty
<cl...@gmail.com> wrote:
>
> Hi Beam users,
>
> I'm looking for input on one of our IOs that we recently migrated to SplittableDoFn. When running in Dataflow we saw performance gains in every aspect (VCPU hours, total memory time) except for total elapsed time: the SplittableDoFn implementation took 1.5x as many minutes as it did previously for about ~900GB of Parquet files.
>
> It seems like the issue is that it isn't scaling up as much as the old BoundedSource version. I ran the SplittableDoFn implementation a couple times to be sure, but reliably, it only scaled up to 30%-50% the max number of workers as it used to. Both implementations of this IO have the same base level of "splittability" (Parquet row groups) so I'm not sure what the issue could be.
>
> I saw in an older user@ thread, using Dataflow Runner V2 was suggested as a mitigation. I did re-try my job using Dataflow Prime and saw significant improvement; but we're not able to migrate our entire fleet to V2 at this time.

Note that you can pass use_runner_v2 to use Dataflow Runner V2 if
there are other Prime features that you're not ready for yet. (It
would be good to understand what issues you're running into as well,
if you're able to share.)

> Is there any workaround for Dataflow Runner V1 to improve the scale-up for SplittableDoFn sources?

Architectural constraints in Runner V1 keep it from executing
SplittableDoFns as well as Runner V2 can. Upgrading to Runner V2
really is the best mitigation. But one possible workaround on V1 might
be to be over-aggressive in your splitRestriction implementation.
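
One reading of "over-aggressive" is to emit the smallest allowed
restrictions up front instead of relying on the runner to split large
ones later. A minimal sketch of that idea in plain Python, with no Beam
dependency: the restriction type (a half-open range of row-group indices),
the function name split_restriction, and the groups_per_split parameter
are all assumptions made for illustration, not the IO's actual code or a
Beam API.

```python
from typing import Iterator, Tuple

# Hypothetical restriction: a half-open range [start, stop) of row-group indices.
Restriction = Tuple[int, int]


def split_restriction(
    restriction: Restriction, groups_per_split: int = 1
) -> Iterator[Restriction]:
    """Eagerly split a row-group range into small sub-ranges.

    With groups_per_split=1 (the most aggressive setting), every row group
    becomes its own restriction, so no later dynamic split is needed to
    reach the minimum unit of work.
    """
    if groups_per_split < 1:
        raise ValueError("groups_per_split must be >= 1")
    start, stop = restriction
    for lo in range(start, stop, groups_per_split):
        # The last sub-range may be shorter than groups_per_split.
        yield (lo, min(lo + groups_per_split, stop))


# A file with 5 row groups becomes 5 independent restrictions.
splits = list(split_restriction((0, 5)))
```

In an actual SplittableDoFn the same loop would live in the method that
performs the initial split; whether this helps Runner V1 scale up would
still need to be measured on the real job.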