Posted to user@spark.apache.org by Ofer Eliassaf <of...@gmail.com> on 2018/05/17 05:18:37 UTC

PySpark Structured Streaming - using previous iteration computed results in current iteration

We would like to maintain arbitrary state between invocations of the
iterations of Structured Streaming in Python.

How can we maintain a static DataFrame that acts as state between the
iterations?

Several options that may be relevant:
1. In Spark memory (distributed across the workers)
2. An external memory solution (e.g. Elasticsearch / Redis)
3. Some other state maintenance mechanism that works with PySpark

Specifically: in iteration N we get a streaming DataFrame from Kafka and
apply a computation that produces a label column over the window of
samples from the last hour.
We want to keep the labels and the sample ids around for the next iteration
(N+1), where we want to join them with the new sample window so that
samples that already existed in iteration N inherit their labels.
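
To make the intended semantics concrete, here is a minimal plain-Python
sketch (no Spark involved) of the behaviour we are after. The names
label_window, compute_label and state are hypothetical and only illustrate
the logic. In the real job the dictionaries would be DataFrames and the
lookup would be a join on the sample id.

```python
from typing import Callable, Dict, Iterable, List


def label_window(
    window_ids: Iterable[str],
    previous_labels: Dict[str, str],
    compute_label: Callable[[str], str],
) -> Dict[str, str]:
    """Label one window, reusing labels kept from the previous iteration."""
    labels: Dict[str, str] = {}
    for sample_id in window_ids:
        if sample_id in previous_labels:
            # Sample already existed in iteration N: inherit its label.
            labels[sample_id] = previous_labels[sample_id]
        else:
            # New sample: run the (hypothetical) labelling computation.
            labels[sample_id] = compute_label(sample_id)
    return labels


# State carried from one iteration to the next (sample id -> label).
state: Dict[str, str] = {}
history: List[Dict[str, str]] = []

# Two overlapping windows stand in for iterations N and N+1.
windows = [["a", "b"], ["b", "c"]]
for n, window in enumerate(windows):
    state = label_window(window, state, lambda sid, n=n: f"label-from-iter-{n}")
    history.append(state)
```

After the second iteration, sample "b" still carries the label assigned in
the first iteration, while "c" gets a fresh one and "a" (no longer in the
window) is dropped from the state. The question is how to keep the
equivalent of `state` between Structured Streaming iterations.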


-- 
Regards,
Ofer Eliassaf