Posted to user@beam.apache.org by Nishu <ni...@gmail.com> on 2017/11/13 09:56:13 UTC

Apache Beam and Spark runner

Hi,

I am writing a streaming pipeline in Apache Beam using the Spark runner.
Use case: joining multiple Kafka streams using windowed collections. I use
GroupByKey to group the events on a common business key, and that output is
used as input for a join operation. The pipeline runs as expected on the
direct runner, but on a Spark cluster (v2.1) it throws an accumulator error:

"Exception in thread "main" java.lang.AssertionError: assertion failed:
copyAndReset must return a zero value copy"

I tried the same pipeline on a Spark cluster (v1.6); there it runs without
any error, but it doesn't perform the join operation on the streams.
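
For context, the pipeline shape is roughly the following. This is only a
minimal sketch: the broker address, topic names, and window size are
placeholders, and I'm showing CoGroupByKey as the join step (my actual key
extraction is more involved).

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.transforms.join.CoGbkResult;
    import org.apache.beam.sdk.transforms.join.CoGroupByKey;
    import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    // Reads one topic as KV<businessKey, payload> in 1-minute fixed windows.
    static PCollection<KV<String, String>> readKeyedStream(Pipeline p, String topic) {
      return p
          .apply("Read " + topic, KafkaIO.<String, String>read()
              .withBootstrapServers("kafka:9092")   // placeholder broker
              .withTopic(topic)                     // placeholder topic name
              .withKeyDeserializer(StringDeserializer.class)
              .withValueDeserializer(StringDeserializer.class)
              .withoutMetadata())
          .apply("Window " + topic, Window.<KV<String, String>>into(
              FixedWindows.of(Duration.standardMinutes(1))));
    }

    Pipeline p = Pipeline.create(options);

    // Tags identify each input stream in the joined result.
    final TupleTag<String> tag1 = new TupleTag<>();
    final TupleTag<String> tag2 = new TupleTag<>();

    PCollection<KV<String, String>> stream1 = readKeyedStream(p, "topic1");
    PCollection<KV<String, String>> stream2 = readKeyedStream(p, "topic2");

    // CoGroupByKey groups both streams on the common business key and joins them.
    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(tag1, stream1)
            .and(tag2, stream2)
            .apply(CoGroupByKey.create());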

I have a couple of questions.

1. Does the Spark runner support Spark 2.x?

2. Regarding triggers: currently only ProcessingTimeTrigger is listed as
supported in the Capability Matrix
<https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what>.
Can we expect support for more triggers in the near future? Also, are
GroupByKey and accumulating panes supported on Spark for streaming
pipelines? (A rough sketch of the trigger I mean is below.)
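
This is the kind of trigger I mean, applied to stream1 from the earlier
sketch. The window size and delay are illustrative only:

    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;

    // Fire repeatedly, 30s of processing time after the first element in a pane.
    // .accumulatingFiredPanes() is the variant I'm asking about above.
    PCollection<KV<String, String>> triggered = stream1.apply(
        Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());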

3. According to the documentation, the storage level
<https://beam.apache.org/documentation/runners/spark/#pipeline-options-for-the-spark-runner>
is set to IN_MEMORY for streaming pipelines. Can we configure it to use
disk as well? (Rough sketch of what I'd like to set is below.)
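
For question 3, I mean something like the following, assuming the
storageLevel option on SparkPipelineOptions accepts the standard Spark
storage-level names:

    import org.apache.beam.runners.spark.SparkPipelineOptions;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Assuming standard Spark storage-level names are accepted here:
    options.setStorageLevel("MEMORY_AND_DISK");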

4. Is checkpointing supported for the Spark runner? If a Beam pipeline
fails unexpectedly, can we restore its state from the last run? (Sketch of
what I'd hope to configure is below.)
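
For question 4, reusing the options object from the previous sketch; the
directory and interval are placeholders, and I'm assuming these setters
exist on SparkPipelineOptions:

    // Durable location (e.g. HDFS) so state could survive a driver restart.
    options.setCheckpointDir("hdfs://namenode:8020/beam/checkpoints");
    // How often to checkpoint (assumed setter; placeholder interval).
    options.setCheckpointDurationMillis(60000L);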

It would be great if someone could help with the above.

-- 
Thanks & Regards,
Nishu Tayal

Re: Apache Beam and Spark runner

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
See my answers on the dev mailing list.

NB: no need to "flood" both mailing lists ;)

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com