Posted to dev@spark.apache.org by Nipun Arora <ni...@gmail.com> on 2017/05/02 19:44:51 UTC

[Spark Streaming] Dynamic Broadcast Variable Update

Hi All,

To support our Spark Streaming based anomaly detection tool, we have
developed a patch for Spark 1.6.2 that allows broadcast variables to be
updated dynamically.

I'll first explain our use-case, which I believe is common to many Spark
Streaming applications. Broadcast variables are often used to store values
such as machine learning models, which are then applied to streaming data
to score it and produce the desired results (in our case, anomalies).
Unfortunately, in current Spark, broadcast variables are final and can only
be initialized once, before the streaming context is started. Hence, if a
new model is learned, the streaming system cannot pick it up without
shutting down the application, broadcasting again, and restarting the
application. Our goal was to re-broadcast variables without requiring any
downtime of the streaming service.
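For context, here is a minimal sketch of the usual static pattern in Spark
1.6 Streaming; the Model class, its threshold, and the socket source are
placeholders for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder for a trained model; a real deployment would broadcast
// whatever model object the anomaly detector uses.
case class Model(threshold: Double) {
  def isAnomaly(value: Double): Boolean = value > threshold
}

val conf = new SparkConf().setAppName("AnomalyDetection")
val ssc = new StreamingContext(conf, Seconds(10))

// Broadcast the model once, before the streaming context starts.
val model = ssc.sparkContext.broadcast(Model(threshold = 0.9))

val scores = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

// Every micro-batch reuses the same (final) broadcast value; there is
// no supported way to swap it in while the context is running.
scores.filter(v => model.value.isAnomaly(v)).print()

ssc.start()
ssc.awaitTermination()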

The key to this implementation is a live re-broadcastVariable() interface,
which can be triggered between micro-batch executions, with no restart of
the streaming application required. At a high level, the update is done by
re-fetching the broadcast variable's information from the Spark driver and
then re-distributing it to the workers. Micro-batch execution is blocked
while the update is made by taking a lock on the execution. We have already
tested this in a prototype deployment of our anomaly detection service and
can successfully re-broadcast the broadcast variables with no downtime.
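To make the intended semantics concrete, here is a rough driver-side sketch
of the idea. The RebroadcastWrapper name and its methods are ours and only
illustrate the interface shape; the actual patch works inside Spark itself
rather than over the public API:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import scala.reflect.ClassTag

// Hypothetical driver-side wrapper sketching the re-broadcast idea.
class RebroadcastWrapper[T: ClassTag](sc: SparkContext, initial: T) {
  @volatile private var bc: Broadcast[T] = sc.broadcast(initial)

  // Invoked between micro-batches: discard the old broadcast and
  // redistribute the new value to the workers. synchronized stands in
  // for the lock that blocks execution while the update is in flight.
  def update(newValue: T): Unit = synchronized {
    bc.unpersist(blocking = true)
    bc = sc.broadcast(newValue)
  }

  // The current broadcast handle, to be captured once per micro-batch.
  def current: Broadcast[T] = synchronized { bc }
}

In application code, the handle would then be captured per batch on the
driver, e.g. via transform(), so each micro-batch ships the freshest model
to the executors:

val anomalies = scores.transform { rdd =>
  val model = wrapper.current  // evaluated on the driver, once per batch
  rdd.filter(v => model.value.isAnomaly(v))
}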

We would like to contribute these changes to Spark. Can anyone please let
me know the process for submitting patches / new features to Spark? Also,
I understand that the current version of Spark is 2.1; however, our changes
have been developed and tested on Spark 1.6.2. Will this be a problem?

Thanks
Nipun