Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2014/02/19 16:29:44 UTC

How is a Spark node crash handled by Spark with respect to a running DStream

Let me shorten my initial email since there were no takers:

How is a Spark node crash handled by Spark with respect to a running DStream:

a) the DStream is restarted and read starting with the 1st RDD
b) the DStream restarts reading at the last RDD read before the crash
c) the DStream restarts anywhere in the sequence of RDDs

Thanks
-A
From: Adrian Mocanu [mailto:amocanu@verticalscope.com]
Sent: February-18-14 5:45 PM
To: user@spark.incubator.apache.org
Subject: RE: Is DStream read from the beginning upon node crash or from where it left off

Addendum

Forgot to mention that I create the StreamingContext like this: val streamingContext = new StreamingContext(conf, Seconds(10))
I also have no idea how option a), reading the DStream from the beginning, would be implemented in Spark; I suspect it would instead re-read from somewhere in the middle, or the last 100 RDDs, which amounts to the same thing: duplicates.
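For reference, here is roughly how the context is set up; the checkpoint call is only a sketch of what I assume I would add if checkpointing is what drives recovery (the app name and directory are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Context setup as described above; the app name is illustrative.
    val conf = new SparkConf().setAppName("stream-counts")
    val streamingContext = new StreamingContext(conf, Seconds(10))

    // Placeholder: where checkpointing would be enabled if recovery needs it.
    streamingContext.checkpoint("hdfs:///tmp/streaming-checkpoint")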

Also, for completeness, I use Calliope to insert into Cassandra.

From: Adrian Mocanu [mailto:amocanu@verticalscope.com]
Sent: February-18-14 5:32 PM
To: user@spark.incubator.apache.org
Subject: Is DStream read from the beginning upon node crash or from where it left off

Hi,
I have a question about achieving fault tolerance when counting with Spark and storing the aggregate count in Cassandra.

Example input: RDD 1 [a,a,a], RDD 2 [a,a]
After aggregating RDD 1 (i.e. map + reduceByKey) we get Map[a -> 3]
and after aggregating RDD 2 we get Map[a -> 2].
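The aggregation itself looks roughly like this (a sketch; words stands in for whatever input DStream is actually read):

    import org.apache.spark.streaming.dstream.DStream

    // Each batch RDD of words becomes (word, 1) pairs and is reduced by key,
    // so RDD 1 [a,a,a] yields (a,3) and RDD 2 [a,a] yields (a,2).
    val words: DStream[String] = ???   // placeholder for the real input stream
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)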

Now let's store these in Cassandra.
Someone here, Pankaj, mentioned the common trick of storing the last transaction id of the RDD in Cassandra along with the data.
This could work if, after a Spark node crash, the DStream is not replayed entirely but only from the last RDD onward.
For example, after saving RDD 1 into Cassandra, the table has name: a, count: 3, RDD_id: 1.
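Put differently, the row layout I have in mind is roughly this (illustrative only; the real table definition would live in Cassandra):

    // Illustrative row layout; in CQL it would be something along the lines of
    //   CREATE TABLE counts (name text PRIMARY KEY, count bigint, rdd_id bigint)
    case class CountRow(name: String, count: Long, rddId: Long)

    val afterRdd1 = CountRow(name = "a", count = 3, rddId = 1)  // state after saving RDD 1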

Now let's crash the Spark node running my DStream.

The DStream is now recovered in parallel on other Spark nodes.
If the DStream simply continues, RDD 2 is processed next, which correctly gives name: a, count: 5, RDD_id: 2.
But if the DStream is replayed from the beginning, the 'a' tuples from RDD 1 are counted again, producing duplicates in the count value (e.g. 8 instead of 5, if the new counts are simply added to what is already stored).

I need to know which way a node crash is handled:
a) the DStream is entirely restarted and read from the beginning, or
b) the DStream is read from the last RDD read,
so that I can design the db schema accordingly and make each update idempotent:
for a) I need to save an RDD id with every RDD's writes, while for b) all I need is the id of the last RDD (see the sketch after this list).
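To make that concrete, here is a rough sketch of design b); readLastRddId and writeCountWithRddId are made-up placeholders for the real Cassandra access (Calliope or otherwise):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical helpers standing in for the real Cassandra reads/writes.
    def readLastRddId(): Long = ???
    def writeCountWithRddId(name: String, count: Int, rddId: Long): Unit = ???

    val counts: DStream[(String, Int)] = ???  // the aggregated stream from the earlier sketch

    counts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
      val batchId = time.milliseconds        // use the batch time as the RDD id
      if (batchId > readLastRddId()) {       // skip batches already written before the crash
        rdd.collect().foreach { case (name, count) =>
          writeCountWithRddId(name, count, batchId)
        }
      }
    }

Under design a) the check would instead be a lookup of each RDD's own id before writing, rather than a single last-id comparison.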

I also welcome schema suggestions to make aggregation idempotent. Thanks a lot!
-Adrian