You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by adam moore <ne...@gmail.com> on 2020/03/12 04:00:01 UTC
Splitting & merging with CoGroupByKey on a per-row basis
Hi beamers!
Enjoying using apache beam with Python.
I'm often finding myself creating patterns of:
input -> (key, calc1 result)
input -> (key, calc2 result)
input -> (key, calc3 result)
...
input -> (key, calcN result)
CoGroupByKey -> merge results of calc1, calc2, calc3....calcN
into original input dict
(where input is a dict which represents a row)
Up until now, I've been using CoGroupByKey, but I wonder if there's a
better way such that as a set of calcs for a row is finished it can
continue to be processed down-pipeline.
Given that:
- There's only one result for each calc for each row
- The calculations can be made in parallel
... waiting for the whole set to be realized at the merging CoGroupByKey
seems like an unnecessary bottleneck.
Is there a better pattern for splitting / processing in parallel / merging
results per-row?
Thanks,
Adam