Posted to dev@spark.apache.org by mkhaitman <ma...@chango.com> on 2015/02/23 19:53:31 UTC

StreamingContext textFileStream question

Hello,

I was interested in creating a StreamingContext textFileStream based job
that runs for long durations and can also recover from prolonged driver
failure. It seems that StreamingContext checkpointing is mainly used for
the case where the driver dies during the processing of an RDD, so that
the one in-flight RDD can be recovered. My question is specifically
whether there is also a way to recover the files that were missed in the
timeframe between the driver dying and being started back up (whether
manually or automatically).

Any assistance/suggestions with this one would be greatly appreciated!

Thanks,
Mark.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


RE: StreamingContext textFileStream question

Posted by mkhaitman <ma...@chango.com>.
Hi Jerry,

Thanks for the quick response! Looks like I'll need to come up with an
alternative solution in the meantime, since I'd like to avoid the
receiver-based input streams + WAL approach. :)

Thanks again,
Mark.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742p10745.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


RE: StreamingContext textFileStream question

Posted by "Shao, Saisai" <sa...@intel.com>.
Hi Mark,

For input streams like the text file input stream, only RDDs can be recovered from the checkpoint, not missed files; if a file is missing at recovery time, an exception will actually be raised. If you use HDFS, HDFS guarantees no data loss since it keeps three replicas of each block by default; otherwise, your application logic has to guarantee that no files are deleted before recovery.

For receiver-based input streams, like the Kafka or socket input streams, a WAL (write-ahead log) mechanism can be enabled to store the received data as well as metadata, so that data can be recovered after a failure.
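
For example, here is a minimal sketch of enabling the WAL for a
receiver-based stream (the socket source, port, checkpoint path, batch
interval, and object name are only placeholders for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wal-example")
      // Write received blocks to a write-ahead log under the checkpoint
      // directory before they are acknowledged, so they can be replayed.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/wal-checkpoint") // placeholder path

    // With the WAL enabled, replicating received blocks in memory is
    // redundant, so a serialized non-replicated storage level is enough.
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With the log enabled, received data is persisted to fault-tolerant storage
before processing, so it is not tied to the lifetime of the receiver or
the driver.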

Thanks
Jerry


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org