Posted to issues@spark.apache.org by "Saisai Shao (JIRA)" <ji...@apache.org> on 2016/01/18 20:03:39 UTC

[jira] [Comment Edited] (SPARK-12864) initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached

    [ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15105671#comment-15105671 ] 

Saisai Shao edited comment on SPARK-12864 at 1/18/16 7:03 PM:
--------------------------------------------------------------

Which Spark version are you using? I remember fixing a similar AM re-attempt issue before. As I recall, when the AM exits, all of its containers/executors exit as well, and when another attempt starts it initializes the AM from scratch, so there should be no problem.

Do you mean that when the AM fails, all the containers are still running in your cluster? If so, that is rather odd; could you please elaborate on what you saw? Thanks a lot.



>  initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12864
>                 URL: https://issues.apache.org/jira/browse/SPARK-12864
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2
>            Reporter: iward
>
> Currently, when the number of executor failures reaches *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one is registered. The new attempt creates a fresh *YarnAllocator* instance.
> However, the *executorIdCounter* field in *YarnAllocator* is reset to *0*, so the IDs of new executors start from 1 again. These IDs collide with executors created by the previous attempt, which causes a FetchFailedException.
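> A minimal Scala sketch (hypothetical names, not the actual Spark source) of the failure mode, and of one possible fix: seed the counter from the highest ID the driver has already handed out.
> {noformat}
> // Sketch only: each AM attempt builds a new allocator, so the counter restarts at 0
> // and attempt 2 hands out IDs 1, 2, 3, ... again.
> class NaiveAllocator {
>   private var executorIdCounter = 0
>   def nextExecutorId(): String = { executorIdCounter += 1; executorIdCounter.toString }
> }
>
> // Possible fix: seed the counter with the last ID the driver saw, so a new
> // attempt continues the sequence instead of reusing IDs from the old attempt.
> class SeededAllocator(lastAllocatedId: Int) {
>   private var executorIdCounter = lastAllocatedId
>   def nextExecutorId(): String = { executorIdCounter += 1; executorIdCounter.toString }
> }
> {noformat}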
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@BJHC-HERA-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor on *BJHC-HERA-16217.hadoop.jd.local* was registered with the same executor ID as the earlier one on *BJHC-HERA-17030.hadoop.jd.local*. This ID collision is what causes the FetchFailedException.
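> A small illustrative sketch (hypothetical and heavily simplified, not the driver's real bookkeeping) of why the reused ID is fatal: shuffle output locations are tracked per executor ID, so a re-registered executor *1* on a new host shadows the entry for the old executor *1*, and fetches can be routed to a host that never wrote the file.
> {noformat}
> import scala.collection.mutable
>
> // executorId -> host believed to hold that executor's shuffle files (simplified)
> val shuffleHosts = mutable.Map[String, String]()
>
> shuffleHosts("1") = "BJHC-HERA-17030.hadoop.jd.local"  // attempt 1: executor 1 wrote shuffle_5_2_0
> shuffleHosts("1") = "BJHC-HERA-16217.hadoop.jd.local"  // attempt 2: a different executor reuses ID 1
>
> // A reducer looking up "executor 1"'s map output now resolves to a host that
> // does not hold the index file -> FetchFailedException.
> println(shuffleHosts("1"))
> {noformat}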


