Posted to issues@spark.apache.org by "Kåre Blakstad (JIRA)" <ji...@apache.org> on 2015/10/05 12:41:26 UTC

[jira] [Comment Edited] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

    [ https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14943143#comment-14943143 ] 

Kåre Blakstad edited comment on SPARK-6404 at 10/5/15 10:41 AM:
----------------------------------------------------------------

I do believe there's an issue with this approach. First, one must broadcast at the specified batch interval. I would rather define this interval myself for each broadcast, since it may involve large database or file reads, which are not necessary every micro batch. Also, if you want to reuse some data across different broadcasts, e.g. do some transformations over it before it is broadcast, this would be much harder, since the expression is evaluated locally to the RDD transformation.

Today I solve this by using a mutable broadcast variable that is updated by an Akka scheduler after the previous broadcast is unpersisted, but I'm not sure the Spark internals approve of this as the best way.
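For reference, a minimal sketch of that workaround. The names (RefreshableBroadcast, loadData) are hypothetical, and it uses a plain ScheduledExecutorService rather than the Akka scheduler, given the actorSystem deprecation; it assumes a live SparkContext on the driver:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Hypothetical wrapper for the pattern described above: a mutable
// reference to a Broadcast that a background scheduler refreshes on
// its own interval, independent of the streaming batch interval.
// `loadData` stands in for the expensive database/file read.
class RefreshableBroadcast[T: ClassTag](sc: SparkContext,
                                        loadData: () => T,
                                        intervalSeconds: Long) {

  @volatile private var current: Broadcast[T] = sc.broadcast(loadData())

  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      val old = current
      current = sc.broadcast(loadData())  // re-broadcast fresh data
      old.unpersist(blocking = false)     // drop the stale copy
    }
  }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS)

  // Read this inside transform()/foreachRDD so each batch's closure
  // captures the latest Broadcast at driver-side construction time.
  def get: Broadcast[T] = current
}
```

Used from a streaming job, e.g. stream.transform { rdd => val b = refreshable.get; rdd.map(x => f(x, b.value)) }, so the lookup happens per batch on the driver rather than once at program start.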

EDIT: And with the deprecation of actorSystem, this will be even worse.



> Call broadcast() in each interval for spark streaming programs.
> ---------------------------------------------------------------
>
>                 Key: SPARK-6404
>                 URL: https://issues.apache.org/jira/browse/SPARK-6404
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Yifan Wang
>
> If I understand it correctly, Spark’s broadcast() function will be called only once at the beginning of the batch. For streaming applications that need to run 24/7, it is often necessary to dynamically update variables shared via broadcast(). It would be ideal if broadcast() could be called at the beginning of each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org