You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Tathagata Das (JIRA)" <ji...@apache.org> on 2015/05/22 20:32:17 UTC

[jira] [Comment Edited] (SPARK-7827) kafka streaming NotLeaderForPartition Exception could not be handled normally

    [ https://issues.apache.org/jira/browse/SPARK-7827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14556565#comment-14556565 ] 

Tathagata Das edited comment on SPARK-7827 at 5/22/15 6:31 PM:
---------------------------------------------------------------

The normal task retries tried to run the task 4 times (see "Job aborted due to stage failure: Task 11 in stage 3549.0 failed 4 times") and then gave up. 
Could you try increasing the task retry times using "spark.task.maxFailures".



was (Author: tdas):
The normal task retries tried to run the task 4 times (see "Job aborted due to stage failure: Task 11 in stage 3549.0 failed 4 times") and then gave up. 
Could you try increasing the task retry times using "spark.task.maxFailures".

TD

> kafka streaming NotLeaderForPartition Exception could not be handled normally
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-7827
>                 URL: https://issues.apache.org/jira/browse/SPARK-7827
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.1
>         Environment: spark 1.3.1,  on yarn,   hadoop 2.6.0  
>            Reporter: hotdog
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> This is my Cluster 's log, once the partition's leader could not be found, the total streaming task will fail...
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 3549.0 failed 4 times, most recent failure: Lost task 11.3 in stage 3549.0 (TID 385491, td-h85.cp): kafka.common.NotLeaderForPartitionException
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at java.lang.Class.newInstance(Class.java:374)
> 	at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:73)
> 	at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:142)
> 	at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:151)
> 	at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:162)
> 	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
> 	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> 	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> 	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> 	at scala.Option.foreach(Option.scala:236)
> 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I read the source code of KafkaRDD.scala:
>   private def handleFetchErr(resp: FetchResponse) {
>       if (resp.hasError) {
>         val err = resp.errorCode(part.topic, part.partition)
>         if (err == ErrorMapping.LeaderNotAvailableCode ||
>           err == ErrorMapping.NotLeaderForPartitionCode) {
>           log.error(s"Lost leader for topic ${part.topic} partition ${part.partition}, " +
>             s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms")
>           Thread.sleep(kc.config.refreshLeaderBackoffMs)
>         }
>         // Let normal rdd retry sort out reconnect attempts
>         throw ErrorMapping.exceptionFor(err)
>       }
>     }
> it seems the code throw a NotLeaderForPartition exception and expect the normal rdd retry.  but why the rdd retry was not performed and the job failed suddenly



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org