Posted to commits@beam.apache.org by "Amit Sela (JIRA)" <ji...@apache.org> on 2018/10/02 02:10:00 UTC

[jira] [Commented] (BEAM-5519) Spark Streaming Duplicated Encoding/Decoding Effort

    [ https://issues.apache.org/jira/browse/BEAM-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634849#comment-16634849 ] 

Amit Sela commented on BEAM-5519:
---------------------------------

[~winkelman.kyle] you bring up a good point.
IIRC we went through all this mess to guarantee that whenever Spark requires a shuffle, we control (initiate) it, and that it only ever applies to RDDs containing serialized data.
This might have been _before_ we got the "force default partitioner" in place, or a mix of both... not sure.
You can test your change (in streaming, which uses {{UpdateStateByKey}} and so creates shuffles etc.) to make sure that no shuffle occurs on RDDs containing deserialized data. In addition, you can use a non-Kryo-serializable value (if the runner still defaults to Kryo underneath...) and make sure the job doesn't fail.
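For the shuffle check, a rough sketch of what inspecting the lineage could look like ({{toDebugString}} is plain Spark API; the helper class and method names here are made up, not something from the runner's test suite):

{code:java}
import org.apache.spark.api.java.JavaPairRDD;

// Hypothetical helper: fail if Spark inserted a shuffle into this RDD's
// lineage. A "ShuffledRDD" entry in toDebugString() means the data was
// repartitioned (and therefore serialized) at that point.
final class ShuffleCheck {
  static void assertNoShuffle(JavaPairRDD<?, ?> rdd) {
    String lineage = rdd.toDebugString();
    if (lineage.contains("ShuffledRDD")) {
      throw new AssertionError("Unexpected shuffle in lineage:\n" + lineage);
    }
  }
}
{code}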
Hope that helps!

> Spark Streaming Duplicated Encoding/Decoding Effort
> ---------------------------------------------------
>
>                 Key: BEAM-5519
>                 URL: https://issues.apache.org/jira/browse/BEAM-5519
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: Major
>              Labels: spark, spark-streaming
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> When using the SparkRunner in streaming mode, there is a call to groupByKey followed by a call to updateStateByKey. BEAM-1815 fixed an issue where this used to cause 2 shuffles, but it still causes 2 encode/decode cycles.
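For anyone following along, a minimal sketch of the duplicated effort described above, using the runner's {{CoderHelpers}} ({{toByteArray}}/{{fromByteArray}} are real methods in the Spark runner; the flow itself is simplified here, not the runner's actual code path):

{code:java}
import org.apache.beam.runners.spark.coders.CoderHelpers;
import org.apache.beam.sdk.coders.StringUtf8Coder;

// Simplified illustration of the two encode/decode cycles: once around the
// groupByKey shuffle, and once again for updateStateByKey's state.
public final class DoubleCodingSketch {
  public static void main(String[] args) {
    StringUtf8Coder coder = StringUtf8Coder.of();
    String element = "some-value";

    // Cycle 1: encoded for the groupByKey shuffle, decoded on the other side.
    byte[] shuffled = CoderHelpers.toByteArray(element, coder);
    String afterShuffle = CoderHelpers.fromByteArray(shuffled, coder);

    // Cycle 2: encoded again for updateStateByKey's checkpointed state,
    // decoded again when the state is read back.
    byte[] state = CoderHelpers.toByteArray(afterShuffle, coder);
    String afterState = CoderHelpers.fromByteArray(state, coder);

    System.out.println(afterState.equals(element)); // true, but coded twice
  }
}
{code}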



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)