Posted to user@spark.apache.org by Dominik Safaric <do...@gmail.com> on 2017/05/16 19:54:54 UTC

Spark Streaming 2.1 recovery

Hi,

currently I am exploring Spark’s fault tolerance capabilities in terms of fault recovery. Namely, I run a Spark 2.1 standalone cluster with a master and four worker nodes. The application pulls data from a Kafka topic using the Kafka direct stream API over a (sliding) window of time and writes the transformed data back to another topic, roughly as sketched below. During this process I use a bash script to randomly kill a Worker process, expecting to gain insight into RDD recovery from the log4j logs written by the Driver. However, apart from messages stating that a Worker has been lost, I cannot find any traces indicating that Spark is recovering the data lost with the killed Worker. Hence, I have the following questions:
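
For reference, a minimal sketch of the job, assuming spark-streaming-kafka-0-10; the broker address, group id, topic names and window durations are placeholders rather than my actual configuration:

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object WindowedKafkaJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedKafkaJob")
    // 10-second batches; window/slide durations below are illustrative.
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",  // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "recovery-test",         // placeholder
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream: executors read the offset ranges straight from Kafka,
    // with no receiver involved.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("input-topic"), kafkaParams))

    // Sliding window over the last 60s, sliding every 10s; transformed
    // records are written back to another topic, one producer per partition.
    stream.map(_.value)
      .window(Seconds(60), Seconds(10))
      .foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          val props = new java.util.Properties()
          props.put("bootstrap.servers", "broker:9092")  // placeholder
          props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          partition.foreach { v =>
            producer.send(new ProducerRecord[String, String]("output-topic", v))
          }
          producer.close()
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}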

1. When using window transformations, does Spark checkpoint the actual data blocks or just the RDD metadata?
2. If not, is the state recovered from the memory of a Worker to which the data has been replicated, or only from the HDFS checkpoints?
3. If Spark checkpoints and recovers only the metadata, how are exactly-once processing semantics achieved? I refer to processing semantics, not output semantics, as the latter would require storing the data in a transactional data store.
4. When using Write Ahead Logs, would Spark recover the data from them in parallel instead of re-pulling the messages from Kafka? (See the configuration sketch after this list.)
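
Regarding question 4, this is roughly the driver-recovery setup I am referring to, assuming a placeholder checkpoint path on HDFS:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableJob {
  // Checkpoint directory on HDFS -- the path is a placeholder.
  val checkpointDir = "hdfs:///spark/checkpoints/recovery-test"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("WindowedKafkaJob")
      // Enables the write ahead log; per the docs this flag applies to
      // receiver-based inputs, whereas the direct Kafka stream recomputes
      // lost data by re-reading the checkpointed offset ranges from Kafka.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define the direct stream and windowed transformation here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the context, DStream lineage and
    // window metadata from the checkpoint instead of calling createContext.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}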

Thanks in advance for the clarification,
Dominik