Posted to user@spark.apache.org by Chris Fregly <ch...@fregly.com> on 2014/06/30 21:27:02 UTC

Re: persistence and fault tolerance in Spark Streaming

@TD: Could you provide some guidance on this? These same types of
questions come up a lot in the field, and I'd like to have a solid answer
for folks.

Thanks so much!

-chris


On Wed, May 28, 2014 at 10:45 AM, Diana Carroll <dc...@cloudera.com>
wrote:

> As I understand it, Spark Streaming automatically persists (replication =
> 2) windowed DStreams, but not regular DStreams.
>
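> For concreteness, here's a minimal sketch of the kind of job I mean
> (Scala API; the hostname, port, and the explicit storage level are just
> illustrative):
>
>   import org.apache.spark.SparkConf
>   import org.apache.spark.storage.StorageLevel
>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>
>   val conf = new SparkConf().setAppName("LogFeed")
>   val ssc = new StreamingContext(conf, Seconds(1))
>
>   // Received blocks are stored at the level passed to the receiver;
>   // the *_2 levels replicate each block to a second node.
>   val logs = ssc.socketTextStream("loghost", 9999,
>     StorageLevel.MEMORY_AND_DISK_SER_2)
>
>   // A windowed DStream, the kind I'd expect to be persisted
>   // automatically.
>   val counts = logs.window(Seconds(30), Seconds(10)).count()
>   counts.print()
>
>   ssc.start()
>   ssc.awaitTermination()
>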
> So my question is, what happens in the case of worker node failure with a
> non-windowed DStream whose data source is a network stream?
>
> Say I'm getting a feed of log data, and one of my workers drops out
> halfway through an operation.  What happens?  Non-streaming RDDs are
> resilient because they can be recomputed from the source file, but in this
> case there is no source file.  If the original data from the stream wasn't
> replicated, does that mean it's just lost?  Will the task just fail?  Will
> the job fail?
>
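> If explicit replication is the answer for the non-windowed case, I'd
> expect something like the following to be enough (persist() takes a
> StorageLevel; parseLine is a made-up stand-in for my parsing logic):
>
>   // Hypothetical parser for a single log line.
>   def parseLine(line: String): (String, Int) = (line, 1)
>
>   // Explicitly persist a non-windowed DStream with 2x replication so
>   // its blocks can survive the loss of a single worker.
>   val parsed = logs.map(parseLine)
>   parsed.persist(StorageLevel.MEMORY_ONLY_SER_2)
>   parsed.print()
>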
> Also, I tried testing on a cluster with two workers and a windowed
> DStream.  The "Storage" tab in the app UI does show the data being
> persisted, but only with single replication.  Is that because my cluster is
> too small?
>
> [image: screenshot of the app UI's "Storage" tab]
>
> Thanks,
> Diana
>