Posted to user@spark.apache.org by Chris Fregly <ch...@fregly.com> on 2014/06/30 21:27:02 UTC
Re: persistence and fault tolerance in Spark Streaming
@TD: could you provide some guidance on this? These same types of
questions come up a lot in the field, and I'd like to have a solid answer
for folks.
Thanks so much!
-chris
On Wed, May 28, 2014 at 10:45 AM, Diana Carroll <dc...@cloudera.com>
wrote:
> As I understand it, Spark Streaming automatically persists windowed
> DStreams (replication = 2), but not regular DStreams.
>
> So my question is, what happens in the case of a worker node failure with
> a non-windowed DStream whose data source is a network stream?
>
> Say I'm getting a feed of log data, and one of my workers drops out
> halfway through an operation. What happens? Non-streaming RDDs are
> resilient because they can be recomputed from the source file, but in this
> case there is no source file. If the original data from the stream wasn't
> replicated, does that mean it's just lost? Will the task just fail? Will
> the job fail?
>
> Also, I tried testing on a cluster with two workers and a windowed
> DStream. The "Storage" tab in the app UI does show the data being
> persisted, but only with single replication. Is that because my cluster is
> too small?
>
> [image: screenshot of the app UI "Storage" tab showing the persisted
> DStream blocks with single replication]
>
> Thanks,
> Diana
>
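One detail worth noting while we wait for TD: the replication level of received
data is configurable per input stream rather than fixed. Receiver-based sources
such as socketTextStream accept a StorageLevel, and the "_2" variants replicate
each received block to a second node. A minimal Scala sketch, assuming a socket
source on localhost:9999 and a 1-second batch interval (both placeholders, not
from this thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("replication-sketch")
val ssc  = new StreamingContext(conf, Seconds(1))

// The StorageLevel argument controls how received blocks are stored;
// the *_2 variants keep a copy on two nodes, so blocks from a failed
// worker can be recomputed from the surviving replica.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)

// A derived DStream can also be persisted explicitly at a chosen level:
lines.persist(StorageLevel.MEMORY_ONLY_2)

lines.count().print()
ssc.start()
ssc.awaitTermination()
```

This doesn't answer the harder question (what happens to unreplicated blocks
that are lost before any copy exists), but it does let you raise the
replication of a non-windowed stream explicitly rather than relying on the
default.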