Posted to dev@beam.apache.org by Anand Singh Kunwar <an...@gmail.com> on 2020/02/03 14:01:22 UTC

[Go SDK] Apache Beam GroupByKey

Hi
I have been trying out the Apache Beam Go SDK for running aggregations on
logs from Kafka. I have been trying to write a Kafka reader using ParDos
and aggregating the results with WindowInto and GroupByKey. However, the
`GroupByKey` only produces output once my Kafka-reading ParDo stops
emitting and returns. I am using the direct runner to test this. It looks
like a bug in the runner/GroupByKey implementation, since it seems to
assume the input will eventually be exhausted even when the input is
windowed.

Pipeline representation:
impulse -> Kafka consumption ParDo (consumes Kafka messages in a loop and
emits each metric with a beam.EventTime timestamp) -> add timestamp ParDo ->
WindowInto (fixed windows) -> add window max timestamp as key ParDo ->
*GroupByKey*
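
To make this concrete, here is a minimal sketch of what I mean. consumeKafka
is just a stand-in for my actual Kafka client, and I've folded the timestamp
step into the reader for brevity, so take it as an illustration rather than
my exact code:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/apache/beam/sdks/go/pkg/beam"
	"github.com/apache/beam/sdks/go/pkg/beam/core/graph/mtime"
	"github.com/apache/beam/sdks/go/pkg/beam/core/graph/window"
	"github.com/apache/beam/sdks/go/pkg/beam/x/beamx"
)

// consumeKafka stands in for the real Kafka client: it produces fake log
// lines forever, the way a real consumer loop never finishes.
func consumeKafka() <-chan string {
	ch := make(chan string)
	go func() {
		for i := 0; ; i++ {
			ch <- fmt.Sprintf("log-%d", i)
			time.Sleep(time.Second)
		}
	}()
	return ch
}

// readKafka emits every message with an event-time timestamp. Because the
// loop never returns, the bundle never finishes on the direct runner.
func readKafka(_ []byte, emit func(beam.EventTime, string)) {
	for msg := range consumeKafka() {
		emit(mtime.FromTime(time.Now()), msg)
	}
}

// keyByWindowMax keys each element by (roughly) the maximum timestamp of
// the one-minute fixed window it falls into.
func keyByWindowMax(ts beam.EventTime, msg string) (int64, string) {
	ms := int64(time.Minute / time.Millisecond)
	return (int64(ts)/ms+1)*ms - 1, msg
}

// printGroup prints each window key and the messages grouped under it.
func printGroup(key int64, values func(*string) bool) {
	var msg string
	for values(&msg) {
		fmt.Printf("window %d: %s\n", key, msg)
	}
}

func init() {
	beam.RegisterFunction(readKafka)
	beam.RegisterFunction(keyByWindowMax)
	beam.RegisterFunction(printGroup)
}

func main() {
	beam.Init()
	p := beam.NewPipeline()
	s := p.Root()

	imp := beam.Impulse(s)
	msgs := beam.ParDo(s, readKafka, imp)
	windowed := beam.WindowInto(s, window.NewFixedWindows(time.Minute), msgs)
	keyed := beam.ParDo(s, keyByWindowMax, windowed)
	grouped := beam.GroupByKey(s, keyed)
	beam.ParDo0(s, printGroup, grouped)

	if err := beamx.Run(context.Background(), p); err != nil {
		fmt.Println(err)
	}
}

Running this with the direct runner, printGroup never prints anything,
since readKafka never returns.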

The *GroupByKey* gets stuck until the first Kafka consumption ParDo exits,
which doesn't seem like the desired behaviour. I haven't been able to
pinpoint the exact line of code where this gets stuck. Can someone help me
out with this?

Thanks for your help.

Best
Anand Singh Kunwar

Re: [Go SDK] Apache Beam GroupByKey

Posted by Robert Burke <ro...@frantil.com>.
The Go SDK doesn't currently have the ability to self-split bundles. This
is being worked on as part of SplittableDoFns.

Unfortunately, this means the only source that works in streaming mode is
the Google PubSubIO source, which only works on Dataflow, since that runner
replaces that node with an internal implementation (the same as Java and
Python).

With proper windowing, the GroupByKey will behave as expected once ParDos
can self-checkpoint.
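
In the meantime, if you just want to see the aggregation work locally on
the direct runner, you can bound the reading ParDo yourself so the bundle
finishes and the GroupByKey can emit. Roughly something like the following,
reusing the consumer stand-in from your sketch. This is only a local
testing workaround I'm suggesting, not an SDK feature or a real streaming
source:

// readKafkaBounded drains the consumer until a deadline and then returns.
// Once it returns, the direct runner finishes the bundle and the windowed
// GroupByKey downstream emits its output.
func readKafkaBounded(_ []byte, emit func(beam.EventTime, string)) {
	deadline := time.After(5 * time.Minute)
	msgs := consumeKafka()
	for {
		select {
		case msg, ok := <-msgs:
			if !ok {
				return
			}
			emit(mtime.FromTime(time.Now()), msg)
		case <-deadline:
			return
		}
	}
}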

Another approach, which will be worked on over the summer, is supporting
Cross Language Transforms so that Go pipelines can re-use the Java IOs,
which would also resolve this problem.

Thank you for your understanding as we work to get the Go SDK out of its
experimental state.


