Posted to user@spark.apache.org by Christiaan Ras <ch...@semmelwise.nl> on 2018/01/23 11:32:30 UTC

[Structured streaming] Merging streaming with semi-static datasets

Hi,

I’m currently doing some tests with Structured Streaming and I’m wondering how I can merge the streaming dataset with a more-or-less static dataset (from a JDBC source).
By more-or-less static I mean a dataset that does not change very often and could be cached by Spark for a while. Joining a static dataset with the stream is possible, but the static dataset is read from its source again on every batch, which increases batch duration.
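
For context, here is roughly what the join looks like. This is only a sketch: the Kafka source, JDBC connection details, table name, and the join column "key" are placeholders rather than my actual setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stream-static-join").getOrCreate()

// Semi-static lookup data from JDBC (placeholder connection details)
val lookup = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "lookup_table")
  .option("user", "user")
  .option("password", "secret")
  .load()

// Streaming input (placeholder Kafka source)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Stream-static join; the JDBC side appears to be re-read on every micro-batch
val joined = events.join(lookup, "key")

joined.writeStream
  .format("console")
  .start()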
With ‘traditional’ Spark Streaming (DStreams, non-structured) I use a counter and refresh the dataset (via unpersist() and cache()) when the counter hits a certain threshold; see the sketch below. I admit it is not a state-of-the-art solution, but it works. With Structured Streaming I was not able to get this mechanism working. It looks like the code between the input and the sinks only runs once…
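
For reference, this is roughly the counter-based refresh pattern I use on the DStream side. Again just a sketch: the socket source, JDBC options, refresh interval, schema, and the "key" column are illustrative only:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("refresh-sketch").getOrCreate()
import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

// Placeholder JDBC connection details
val jdbcOptions = Map(
  "url"      -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable"  -> "lookup_table",
  "user"     -> "user",
  "password" -> "secret")

def loadLookup(): DataFrame =
  spark.read.format("jdbc").options(jdbcOptions).load().cache()

var lookup = loadLookup()
var batchCounter = 0
val refreshEvery = 10 // refresh the cached dataset every 10 batches

// Placeholder input stream
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  batchCounter += 1
  if (batchCounter >= refreshEvery) {
    lookup.unpersist()    // drop the stale cached copy
    lookup = loadLookup() // re-read and re-cache from the JDBC source
    batchCounter = 0
  }
  val batchDF = rdd.toDF("key")             // convert the incoming batch as appropriate
  val merged  = batchDF.join(lookup, "key") // merge with the cached lookup data
  merged.write.mode("append").parquet("/tmp/out") // sink the results
}

ssc.start()
ssc.awaitTermination()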

Is there a way to cache external datasets, use them across consecutive batches (merging them with newly arriving streaming data, performing operations, and sinking the results), and refresh the external datasets after a specified number of batches?

Regards,
Chris