Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/01/17 13:58:39 UTC

[jira] [Assigned] (SPARK-12864) initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached

     [ https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12864:
------------------------------------

    Assignee:     (was: Apache Spark)

>  initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12864
>                 URL: https://issues.apache.org/jira/browse/SPARK-12864
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2
>            Reporter: iward
>
> Currently, when the number of executor failures reaches *maxNumExecutorFailures*, the *ApplicationMaster* is killed and a new one re-registers, and a new *YarnAllocator* instance is created along with it.
> However, the *executorIdCounter* field in *YarnAllocator* is reset to *0*, so the IDs of newly launched executors start from 1 again. These IDs collide with executors created before the restart, which causes a FetchFailedException.
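>
> In miniature, the bug is just a counter that does not survive the allocator's re-creation. The following Scala sketch is illustrative only (the *Allocator* class below stands in for *YarnAllocator*; it is not the actual Spark source):
> {noformat}
> // Illustrative sketch of the bug, not the real YarnAllocator.
> class Allocator {
>   // Reset to 0 every time an allocator is (re-)created -- this is the bug.
>   private var executorIdCounter = 0
>
>   def newExecutorId(): Int = {
>     executorIdCounter += 1
>     executorIdCounter
>   }
> }
>
> object BugDemo extends App {
>   val first = new Allocator
>   println(first.newExecutorId())   // 1 -- executor 1 registers on one host
>
>   // The ApplicationMaster is killed and re-registers; a fresh allocator is created.
>   val second = new Allocator
>   println(second.newExecutorId())  // 1 again -- collides with the earlier executor
> }
> {noformat}
>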
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@BJHC-HERA-16217.hadoop.jd.local:46538/user/Executor#-790726793]) with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffle_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor on *BJHC-HERA-16217.hadoop.jd.local* was registered with the same ID (1) as the pre-restart executor on *BJHC-HERA-17030.hadoop.jd.local*. The shuffle fetch for BlockManagerId(1, ...) therefore resolves to the wrong executor, and the task fails with a FetchFailedException.
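>
> As the issue title suggests, one way out is to initialize *executorIdCounter* from state that survives the ApplicationMaster restart, for example by asking the driver for the highest executor ID it has already handed out. The sketch below is a hypothetical illustration of that idea (the *Driver* class and *lastAllocatedExecutorId* field are invented names for this sketch, not Spark APIs):
> {noformat}
> // Hypothetical sketch: seed the counter from surviving driver state, not 0.
> class Driver {
>   @volatile var lastAllocatedExecutorId: Int = 0  // survives AM restarts
> }
>
> class Allocator(driver: Driver) {
>   // Initialize from the driver instead of resetting to 0 -- the proposed fix.
>   private var executorIdCounter: Int = driver.lastAllocatedExecutorId
>
>   def newExecutorId(): Int = {
>     executorIdCounter += 1
>     driver.lastAllocatedExecutorId = executorIdCounter
>     executorIdCounter
>   }
> }
>
> object FixDemo extends App {
>   val driver = new Driver
>   println(new Allocator(driver).newExecutorId())  // 1
>   // AM restart: a fresh allocator, but the driver's state survives.
>   println(new Allocator(driver).newExecutorId())  // 2 -- no ID collision
> }
> {noformat}
> With the counter seeded this way, executors launched after the restart get fresh IDs, so a BlockManagerId no longer maps to two different hosts.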



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org