Posted to user@spark.apache.org by Nipun Arora <ni...@gmail.com> on 2017/05/02 18:56:49 UTC

[Spark Streaming] Dynamic Broadcast Variable Update

Hi All,

To support our Spark Streaming based anomaly detection tool, we have made a
patch in Spark 1.6.2 to dynamically update broadcast variables.

I'll first explain our use case, which I believe is common to many Spark
Streaming applications. Broadcast variables are often used to store values
such as machine learning models, which are then applied to streaming data
to score incoming records and produce the desired results (in our case,
anomalies). Unfortunately, in current Spark, broadcast variables are final
and can only be initialized once, before the streaming context is created.
Hence, if a new model is learned, the streaming system cannot be updated
without shutting down the application, broadcasting again, and restarting
it. Our goal was to re-broadcast variables without any downtime of the
streaming service.

The key to this implementation is a live re-broadcastVariable() interface,
which can be triggered between micro-batch executions without any restart
of the streaming application. At a high level, this is done by re-fetching
the broadcast variable's contents from the Spark driver and redistributing
them to the workers. Micro-batch execution is blocked while the update is
made, by taking a lock on the execution. We have already tested this in a
prototype deployment of our anomaly detection service and can successfully
re-broadcast broadcast variables with no downtime.
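For concreteness, here is a minimal, Spark-independent Python sketch of
the locking scheme described above. All names (RebroadcastableValue,
process_batch, rebroadcast) are hypothetical and do not correspond to the
actual patch; the real implementation operates on Spark's broadcast
machinery rather than on a plain Python object:

```python
import threading

class RebroadcastableValue:
    """Holds a value that micro-batches read, and that can be
    swapped atomically between batch executions (sketch only)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # serializes batches vs. updates

    def process_batch(self, batch):
        # A micro-batch holds the lock for its whole execution,
        # so an update can never land mid-batch.
        with self._lock:
            model = self._value
            return [model(x) for x in batch]

    def rebroadcast(self, new_value):
        # Blocks until any in-flight micro-batch finishes, then
        # installs the new value for all subsequent batches.
        with self._lock:
            self._value = new_value

# Swap an anomaly "model" (here just a threshold check) live:
bv = RebroadcastableValue(lambda x: x > 10)   # initial model
first = bv.process_batch([5, 15])             # -> [False, True]
bv.rebroadcast(lambda x: x > 100)             # live update, no restart
second = bv.process_batch([5, 150, 15])       # -> [False, True, False]
```

In the actual patch, the swapped value would be the broadcast variable's
payload, re-fetched from the driver and redistributed to the workers.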

We would like to contribute these changes to Spark. Can anyone please let
me know the process for submitting patches/new features to Spark? Also, I
understand that the current version of Spark is 2.1; however, our changes
were made and tested on Spark 1.6.2. Will this be a problem?

Thanks
Nipun

Re: [Spark Streaming] Dynamic Broadcast Variable Update

Posted by Tim Smith <se...@gmail.com>.
One, I think you should take this to the Spark developer list.

Two, I suspect broadcast variables aren't the best solution for the use
case you describe. Maybe an in-memory data/object/file store like Tachyon
is a better fit.

Thanks,

Tim


On Tue, May 2, 2017 at 11:56 AM, Nipun Arora <ni...@gmail.com>
wrote:





Re: [Spark Streaming] Dynamic Broadcast Variable Update

Posted by Pierce Lamb <ri...@gmail.com>.
Hi Nipun,

To expand a bit, you might find this stackoverflow answer useful:

http://stackoverflow.com/a/39753976/3723346

Most Spark + database combinations can handle a use case like this.
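As a sketch of that alternative, the model can live in an external store
and be re-read at each micro-batch boundary, so no rebroadcast (or
restart) is needed at all. In this illustrative Python snippet a plain
dict stands in for the external store (Alluxio, a database, etc.), and
all names are made up:

```python
# A plain dict stands in for the external store (Alluxio, a DB, ...).
model_store = {"anomaly_model": {"threshold": 10}}

def process_batch(batch):
    # Re-read the latest model at the start of every micro-batch;
    # publishing a new model then requires no restart or rebroadcast.
    model = model_store["anomaly_model"]
    return [x > model["threshold"] for x in batch]

before = process_batch([5, 15])                    # -> [False, True]
model_store["anomaly_model"] = {"threshold": 100}  # publish new model
after = process_batch([5, 15])                     # -> [False, False]
```

The trade-off versus a broadcast variable is one store lookup per batch
instead of a cached local copy on each executor.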

Hope this helps,

Pierce

On Thu, May 4, 2017 at 9:18 AM, Gene Pang <ge...@gmail.com> wrote:


Re: [Spark Streaming] Dynamic Broadcast Variable Update

Posted by Gene Pang <ge...@gmail.com>.
As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help you.
Here is some documentation on how to run Alluxio and Spark together
<http://www.alluxio.org/docs/1.4/en/Running-Spark-on-Alluxio.html>, and
here is a blog post on a Spark Streaming + Alluxio use case
<https://www.alluxio.com/blog/qunar-performs-real-time-data-analytics-up-to-300x-faster-with-alluxio>.

Hope that helps,
Gene

On Tue, May 2, 2017 at 11:56 AM, Nipun Arora <ni...@gmail.com>
wrote:
