Posted to dev@spark.apache.org by Daniel van der Ende <da...@gmail.com> on 2018/02/09 12:29:57 UTC

Structured Streaming Broadcasts

Hi all,

I'm working on a Structured Streaming job that uses a Spark ML algorithm.
The algorithm includes the broadcast of a dataframe. The idea is that the
Structured Streaming job will stay running, and process data as it comes in
from a Kafka bus. On startup, I have specified that I want at least 2
executors, so that incoming requests can be handled quickly without waiting
for the overhead of spinning up additional executors. The problem I'm having
now is that, if the executors stay idle for too long, the BlockManager and
ContextCleaner seem to remove the broadcast blocks from executor memory.
Then, when data comes in on Kafka, the broadcast data is no longer available
on the executors and has to be broadcast again. I'd like to avoid this. Is
there any way to mark a dataframe as "non-garbage-collectible"? (Or perhaps
some other best practice for dealing with broadcasts in long-running jobs?)
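For context, here is a minimal sketch of the kind of setup I mean. The broker
address, topic, lookup path and join key are just placeholders, and the real
job uses a Spark ML algorithm that broadcasts internally rather than an
explicit broadcast join, but the shape is the same:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object StreamingBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-broadcast")
      .getOrCreate()

    // Static lookup DataFrame that should stay resident on the executors.
    val lookup = spark.read.parquet("/path/to/lookup")   // placeholder path

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Stream-static join; the broadcast() hint ships the static side to every
    // executor. These are the blocks that disappear after the executors have
    // been idle for a while.
    val enriched = stream.join(broadcast(lookup), "key")

    enriched.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}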

Thanks!

Daniel van der Ende