Posted to user@beam.apache.org by Steve Niemitz <sn...@apache.org> on 2019/03/12 13:55:26 UTC

Performance of stateful DoFn vs CombineByKey

Hi all.

I'm curious if anyone has done any comparison of the performance of a
pipeline that uses CombineByKey, vs one that uses a stateful DoFn with
combining state. [1]

More specifically, if I had a pipeline that had a CombineByKey configured
with early firings every N minutes, and I replaced the CBK with a stateful
DoFn with combining state and a timer that fired every N minutes instead,
would there be a (significant?) performance difference?  Specifically, I'm
using Dataflow (with Streaming Engine), but I'd be curious about other
runners as well.
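
For concreteness, the two shapes being compared can be modeled in plain
Python with no Beam APIs involved. This is only a toy sketch: SumCombineFn,
combine_per_key, and StatefulSum are hypothetical names, not Beam classes,
and the single global timer is a simplification (in Beam, state and timers
are scoped per key and window). It says nothing about real performance; it
only shows that both shapes reach the same per-key result, while the
stateful version manages its accumulator and firing schedule by hand.

```python
from collections import defaultdict


class SumCombineFn:
    """Toy stand-in for a CombineFn (hypothetical, not the Beam class)."""

    def create_accumulator(self):
        return 0

    def add_input(self, acc, value):
        return acc + value

    def extract_output(self, acc):
        return acc


def combine_per_key(pairs, fn):
    """Declarative shape: the 'runner' decides how to apply the CombineFn."""
    accs = defaultdict(fn.create_accumulator)
    for key, value in pairs:
        accs[key] = fn.add_input(accs[key], value)
    return {key: fn.extract_output(acc) for key, acc in accs.items()}


class StatefulSum:
    """Imperative shape: per-key combining state plus a periodic timer.

    Assumption: one global timer for brevity, unlike Beam's per-key timers.
    """

    def __init__(self, fn, interval):
        self.fn = fn
        self.interval = interval  # timer period, in the events' time unit
        self.state = {}           # key -> accumulator (the combining state)
        self.next_fire = None
        self.emitted = []         # (fire_time, {key: output}) snapshots

    def process(self, timestamp, key, value):
        if self.next_fire is None:
            self.next_fire = timestamp + self.interval
        # Fire any timers that are due before handling this element.
        while timestamp >= self.next_fire:
            self.on_timer(self.next_fire)
            self.next_fire += self.interval
        acc = self.state.get(key, self.fn.create_accumulator())
        self.state[key] = self.fn.add_input(acc, value)

    def on_timer(self, fire_time):
        # Emit the current value of every key without clearing the state.
        snapshot = {k: self.fn.extract_output(a) for k, a in self.state.items()}
        self.emitted.append((fire_time, snapshot))


# (timestamp, key, value) triples; the values are made up for illustration.
events = [(0, "a", 1), (1, "b", 2), (5, "a", 3), (11, "a", 4), (12, "b", 5)]
fn = SumCombineFn()

batch_result = combine_per_key([(k, v) for _, k, v in events], fn)

stateful = StatefulSum(fn, interval=10)
for ts, k, v in events:
    stateful.process(ts, k, v)
stateful.on_timer(stateful.next_fire)  # final flush
```

The early snapshot in stateful.emitted plays the role of an early firing,
and the final flush matches batch_result.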

If no one has tried this, I might run a benchmark myself; I'd be very
interested to see the results.

[1]
https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/state/CombiningState.html

Re: Performance of stateful DoFn vs CombineByKey

Posted by Steve Niemitz <sn...@apache.org>.
Interesting, thanks for the info!  Combiner lifting definitely makes sense
here, but as you mentioned I'm curious how much it helps performance in a
streaming pipeline.  The blog post you linked is great; I wonder if it is
possible to make this information more visible?  It's pretty buried in the
blog list now, and I'll admit I never even got that far, because there's
another post on stateful processing almost directly above it.

I still plan on trying to do some benchmarks here because it'd be
interesting to see the differences.  I'll make sure to post results when I
do.

On Thu, Mar 14, 2019 at 3:43 PM Kenneth Knowles <ke...@apache.org> wrote:

> Combine admits many more execution plans than stateful ParDo:
>
>  - "Combiner lifting" or "mapper-side combine", in which the CombineFn is
> used to reduce data before shuffling. This is tremendous in batch, but can
> still matter in streaming.
>  - Hot key fanout & recombine. This is important in both batch & streaming.
>
> I tried to cover the issues a little in this section of my blog post on
> state, because it also answers the converse question: why/when would you
> use state (without timers) when Combine is so similar?
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html#how-does-stateful-processing-fit-into-the-beam-model
>
> And here's a slide with the same idea but side-by-side illustrations:
> https://s.apache.org/ffsf-2017-beam-state#slide=id.g1dbf0d46d2_0_258
>
> Kenn
>

Re: Performance of stateful DoFn vs CombineByKey

Posted by Kenneth Knowles <ke...@apache.org>.
Combine admits many more execution plans than stateful ParDo:

 - "Combiner lifting" or "mapper-side combine", in which the CombineFn is
used to reduce data before shuffling. This is tremendous in batch, but can
still matter in streaming.
 - Hot key fanout & recombine. This is important in both batch & streaming.
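
As a rough illustration of the two plans listed above, here is a toy model
in plain Python. It is a sketch under stated assumptions, not how any runner
actually implements them: shuffle, the three combine_* functions, and the
round-robin sharding are all hypothetical, and the combine is a simple sum
(both plans rely on the combine being associative and commutative). It
counts how many records cross the simulated shuffle and how large the
biggest per-key group gets.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key; also report how many records moved."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped, len(pairs)


def combine_without_lifting(bundles):
    # Every raw element crosses the shuffle, then gets summed per key.
    all_pairs = [pair for bundle in bundles for pair in bundle]
    grouped, shuffled = shuffle(all_pairs)
    return {k: sum(vs) for k, vs in grouped.items()}, shuffled


def combine_with_lifting(bundles):
    # Each bundle pre-combines locally, so at most one record per key
    # per bundle crosses the shuffle.
    partials = []
    for bundle in bundles:
        local = defaultdict(int)
        for key, value in bundle:
            local[key] += value
        partials.extend(local.items())
    grouped, shuffled = shuffle(partials)
    return {k: sum(vs) for k, vs in grouped.items()}, shuffled


def combine_with_fanout(pairs, fanout):
    # Spread each key over `fanout` sub-keys, combine those, then recombine.
    # Round-robin sharding keeps this example deterministic (an assumption).
    sharded = [((key, i % fanout), value) for i, (key, value) in enumerate(pairs)]
    grouped, _ = shuffle(sharded)
    largest_group = max(len(vs) for vs in grouped.values())
    final = defaultdict(int)
    for (key, _shard), values in grouped.items():
        final[key] += sum(values)
    return dict(final), largest_group


# Two input bundles with one hot key; the data is made up for illustration.
bundles = [
    [("hot", 1)] * 4 + [("cold", 10)],
    [("hot", 1)] * 4,
]
plain, plain_shuffled = combine_without_lifting(bundles)
lifted, lifted_shuffled = combine_with_lifting(bundles)
flat = [pair for bundle in bundles for pair in bundle]
fanned, largest_group = combine_with_fanout(flat, fanout=4)
```

All three variants produce the same sums; lifting reduces the records
shuffled, and fanout caps how many values any single grouped key has to
absorb. A stateful ParDo sees each element for a key directly, so neither
rewrite is available to the runner there.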

I tried to cover the issues a little in this section of my blog post on
state, because it also answers the converse question: why/when would you
use state (without timers) when Combine is so similar?
https://beam.apache.org/blog/2017/02/13/stateful-processing.html#how-does-stateful-processing-fit-into-the-beam-model

And here's a slide with the same idea but side-by-side illustrations:
https://s.apache.org/ffsf-2017-beam-state#slide=id.g1dbf0d46d2_0_258

Kenn
