Posted to user@beam.apache.org by adam moore <ne...@gmail.com> on 2020/03/12 04:00:01 UTC

Splitting & merging with CoGroupByKey on a per-row basis

Hi beamers!

I'm enjoying using Apache Beam with Python.

I often find myself building pipelines that follow this pattern:

input -> (key, calc1 result)
input -> (key, calc2 result)
input -> (key, calc3 result)
...
input -> (key, calcN result)

CoGroupByKey -> merge results of calc1, calc2, calc3, ..., calcN
                into original input dict

(where input is a dict which represents a row)
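
For concreteness, here is a minimal sketch of that pattern with the Python SDK. The calculations (calc1, calc2), the "id" key field, and the row fields "price" and "qty" are hypothetical placeholders chosen for illustration, and only two calcs are shown instead of N.

```python
def calc1(row):
    # Hypothetical calculation: emit (key, partial result) for one row.
    return row["id"], {"total": row["price"] * row["qty"]}


def calc2(row):
    # Second hypothetical calculation on the same row.
    return row["id"], {"is_bulk": row["qty"] >= 10}


def merge_results(element):
    # element is (key, {"input": [row], "calc1": [res], "calc2": [res]}),
    # which is the shape CoGroupByKey produces for a dict of PCollections.
    _key, grouped = element
    merged = dict(list(grouped["input"])[0])  # copy the original row
    for tag in ("calc1", "calc2"):
        for partial in grouped[tag]:
            merged.update(partial)
    return merged


def run(rows):
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        inputs = p | "Create" >> beam.Create(rows)
        keyed = inputs | "KeyInput" >> beam.Map(lambda r: (r["id"], r))
        c1 = inputs | "Calc1" >> beam.Map(calc1)
        c2 = inputs | "Calc2" >> beam.Map(calc2)
        merged = (
            {"input": keyed, "calc1": c1, "calc2": c2}
            | "Join" >> beam.CoGroupByKey()
            | "Merge" >> beam.Map(merge_results)
        )
        merged | "Print" >> beam.Map(print)
```

Calling run() with a list of row dicts builds the branches, joins them with CoGroupByKey, and prints each merged row.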

Up until now I've been using CoGroupByKey, but I wonder if there's a
better way, so that as soon as the set of calcs for a row is finished,
that row can continue to be processed down-pipeline.

Given that:

- There's only one result for each calc for each row
- The calculations can be made in parallel

... waiting for the whole set to be realized at the merging CoGroupByKey
seems like an unnecessary bottleneck.

Is there a better pattern for splitting / processing in parallel / merging
results per-row?

Thanks,
Adam