You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by zhonghaihua <gi...@git.apache.org> on 2016/01/17 13:55:54 UTC

[GitHub] spark pull request: initialize executorIdCounter after Application...

GitHub user zhonghaihua opened a pull request:

    https://github.com/apache/spark/pull/10794

    initialize executorIdCounter after ApplicationMaster killed for max n…

    Currently, when max number of executor failures reached the `maxNumExecutorFailures`, `ApplicationMaster` will be killed and re-register another one.This time, `YarnAllocator` will be created a new instance.
    But, the value of property `executorIdCounter` in `YarnAllocator` will reset to `0`. Then the Id of new executor will starting from `1`. This will confuse with the executor has already created before, which will cause FetchFailedException.
    This PR introduce a mechanism to initialize `executorIdCounter` after `ApplicationMaster` killed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhonghaihua/spark initExecutorIdCounterAfterAMKilled

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10794.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10794
    
----
commit 30048ac7ac9fc95edc1b936076415dea335848ef
Author: zhonghaihua <79...@qq.com>
Date:   2016-01-17T12:46:44Z

    initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204715025
  
    Hi, @tgravescs , my jira id is `Iward`. Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188853203
  
    **[Test build #51971 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51971/consoleFull)** for PR 10794 at commit [`659c505`](https://github.com/apache/spark/commit/659c5050a2433fc2ddb056895be30b12985481b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57890700
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    --- End diff --
    
    Great, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57733482
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    I'm also a bit on the fence here whether we do this for cluster mode.  We don't need to so its just an extra overhead but at the same time its nice not having a bunch of if checks.. @vanzin  opinion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57729696
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    --- End diff --
    
    I think we need to clarify this to say this is required for client mode when driver isn't running on yarn.    this isn't an issue in cluster mode.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-201880646
  
    @andrewor14 @tgravescs @vanzin Could you verify this PR, or any thoughts or concerns for this ? Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178309811
  
    ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by lianhuiwang <gi...@git.apache.org>.

Github user lianhuiwang commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186128437
  
    @jerryshao when yarn-client, because driver is always running, when AM failure occurs, some executors that is created by previous AM may still exist after second AM start. so I think we need to reset in CoarseGrainedSchedulerBackend that can remove all historical executors before second AM allocate new executors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188714369
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51938/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186134181
  
    I see, that's what I worried about. I thought about this potential issue previously, also conflict executor id may bring in race conditions. Let me think about a proper way to address it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186010712
  
    @zhonghaihua @jerryshao How is this related to #11205?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-187185830
  
    So we never intended to support the AM restart in client mode and having the driver handle that properly.  I was expecting it to see the AM die and the driver to go away.  At one point the AM attempts was set to 1 and  I think we just never handled it when we changed it to be configurable.
    
    We probably either need to test it out fully or just set the attempts to 1 for client mode.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204568571
  
    LGTM, I'll leave it to @tgravescs to do a final review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54060394
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    Yes, this logic will also be used in yarn cluster mode, also in yarn cluster mode, AM is started before driver.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188633095
  
    **[Test build #51938 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51938/consoleFull)** for PR 10794 at commit [`3a1724c`](https://github.com/apache/spark/commit/3a1724c19ad3eb9e87d9a5b10007b6c53424aac0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53919614
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator(
       }
     
       /**
    +   * Init `executorIdCounter`
    +   */
    +  def initExecutorIdCounter(): Unit = {
    +    val port = sparkConf.getInt("spark.yarn.am.port", 0)
    +    SparkHadoopUtil.get.runAsSparkUser { () =>
    +      val init = RpcEnv.create(
    +        "executorIdCounterInit",
    +        Utils.localHostName,
    +        port,
    +        sparkConf,
    +        new SecurityManager(sparkConf))
    +      val driver = init.setupEndpointRefByURI(driverUrl)
    --- End diff --
    
    Hi @jerryshao , thanks for your comments. I see what you mean, I will fix it soon. Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57934269
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    --- End diff --
    
    There is a check in CoarseGrainedSchedulerBackend:
    
    ```
          case RegisterExecutor(executorId, executorRef, hostPort, cores, logUrls) =>
            if (executorDataMap.contains(executorId)) {
              context.reply(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
    ```
    
    That would cause the container failure count to go up, though. Probably ok? I don't see a way to fix that race without updating the driver with a synchronous RPC every time a new container is launched.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57838823
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    --- End diff --
    
    I think we can clarify this in `SPARK-12864` issue. @andrewor14 What's your opinion ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54059785
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    I think here we should also make sure `driverRef` is already established, both in yarn client and cluster mode, did you test locally?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53521525
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator(
       }
     
       /**
    +   * Init `executorIdCounter`
    +   */
    +  def initExecutorIdCounter(): Unit = {
    +    val port = sparkConf.getInt("spark.yarn.am.port", 0)
    +    SparkHadoopUtil.get.runAsSparkUser { () =>
    +      val init = RpcEnv.create(
    +        "executorIdCounterInit",
    +        Utils.localHostName,
    +        port,
    +        sparkConf,
    +        new SecurityManager(sparkConf))
    +      val driver = init.setupEndpointRefByURI(driverUrl)
    --- End diff --
    
    why create another endpoint here? Can't we just use `driverRef`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r58032934
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
               // in this block are read when requesting executors
               CoarseGrainedSchedulerBackend.this.synchronized {
                 executorDataMap.put(executorId, data)
    +            if (currentExecutorIdCounter < Integer.parseInt(executorId)) {
    +              currentExecutorIdCounter = Integer.parseInt(executorId)
    +            }
    --- End diff --
    
    @vanzin Thanks for your comments. I will optimize it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186993839
  
    Hi @andrewor14 , the reason of test failed seems `GitException`. Could you retest it ? Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188730628
  
    retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53521751
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveCurrentExecutorIdCounter extends CoarseGrainedClusterMessage
    --- End diff --
    
    I would call this retrieve max executor ID. Without context the reader has no idea what `executorIdCounter` is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53562669
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
               // in this block are read when requesting executors
               CoarseGrainedSchedulerBackend.this.synchronized {
                 executorDataMap.put(executorId, data)
    +            if (currentExecutorIdCounter < Integer.parseInt(executorId)) {
    +              currentExecutorIdCounter = Integer.parseInt(executorId)
    +            }
    --- End diff --
    
    Thank for review it. For my understanding, I don't think we can get the max executor ID in executorDataMap. Because, when AM is failure, all the executor are disconnect and be removed, by this time, as the code in method `CoarseGrainedSchedulerBackend.removeExecutor` show, the executor information in executorDataMap also be removed. 
    So, I think the executor information in executorDataMap is not complete. What do you think ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54057627
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    Is it necessary to request the executor id every time allocating a new executor?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204276708
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54685/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188789010
  
    **[Test build #51971 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51971/consoleFull)** for PR 10794 at commit [`659c505`](https://github.com/apache/spark/commit/659c5050a2433fc2ddb056895be30b12985481b9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186993136
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51653/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-185625228
  
    @andrewor14 @marmbrus @rxin ,  any thoughts or concerns for this patch ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-203429195
  
    Also please update the description of this PR and jira to describe that this is happening in client mode due to driver not running on yarn.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186016863
  
    Hi, @andrewor14 , I agree with @jerryshao , I think that is not related to it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-174830247
  
    @marmbrus @liancheng @yhuai Could you verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-179924619
  
    @andrewor14 Thanks for review it. Could this path merge to master ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r58009178
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
    --- End diff --
    
    @tgravescs Ok，I get your mean. Thanks a lot.
    Use this name `RetrieveLastAllocatedExecutorId` is ok ? @vanzin What's your opinion ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186103130
  
    So you mean we also need to clean the states in `CoarseGrainedSchedulerBackend` if AM failure occurs, even dynamic allocation is not enabled?
    
    What specific behavior did you see? @lianhuiwang 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57732142
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
    --- End diff --
    
    we seem to be intermixing current and Max.  Max makes me think that this is some limit so can we change things to be consistent and perhaps use CurrentExecutorId or LargestAllocated


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188714365
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54057531
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
       // Executors that have been lost, but for which we don't yet know the real exit reason.
       protected val executorsPendingLossReason = new HashSet[String]
     
    +  // The num of current max ExecutorId used to re-register appMaster
    +  var currentExecutorIdCounter = 0
    --- End diff --
    
    Please add scope keyword `protected` 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-202642060
  
    This looks OK. Any thoughts @vanzin @tgravescs?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54058368
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
       // Executors that have been lost, but for which we don't yet know the real exit reason.
       protected val executorsPendingLossReason = new HashSet[String]
     
    +  // The num of current max ExecutorId used to re-register appMaster
    +  var currentExecutorIdCounter = 0
    --- End diff --
    
    Hi @jerryshao , thanks for your comments. The master branch is different from branch-1.5.x version. In master branch,`CoarseGrainedSchedulerBackend` is belong to module `core` and `YarnSchedulerBackend` is belong to module `yarn` , while in branch-1.5.x version it is belong to the same package. So, from my understanding, `protected` is unsuited here, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57884582
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    --- End diff --
    
    I would prefer to do it here in this comment to better describe the situation this is needed.  It should be a line or two and I personally much prefer that then pointing at jiras unless its a big discussion/background required, then the jira makes more sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54058736
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    Hi @jerryshao , thanks for reviewing. Allocating a new executor won't execute that code. It just request the executor id when `YarnAllocator` being created.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-172323185
  
    cc @rxin @marmbrus @chenghao-intel @jeanlyn could you give some advice ？


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by lianhuiwang <gi...@git.apache.org>.

Github user lianhuiwang commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186102176
  
    @jerryshao I think it needs to reset in CoarseGrainedSchedulerBackend when dynamicAllocation is not enabled. because it can clear information and come back to initialization state.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178362312
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50524/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204570317
  
    @ zhonghaihua  what is your jira id so I can assign it to you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54059178
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
       // Executors that have been lost, but for which we don't yet know the real exit reason.
       protected val executorsPendingLossReason = new HashSet[String]
     
    +  // The num of current max ExecutorId used to re-register appMaster
    +  var currentExecutorIdCounter = 0
    --- End diff --
    
    I don't think so. Though they're in different modules, still they're under same package, please see other variables like `hostToLocalTaskCount`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186014165
  
    @andrewor14 from my understanding I don't think it is the same issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178283366
  
    Yes, this is the yarn-client only AM reattempt issue, I address this issue before by resetting the status of `ExecutorAllocationManager` and `CoarseGrainedSchedulerBackend`. But looks like there's still some stale state conflicts in BlockManager. For the details you could check the related JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54058886
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    I see, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54094456
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -78,6 +78,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
       // Executors that have been lost, but for which we don't yet know the real exit reason.
       protected val executorsPendingLossReason = new HashSet[String]
     
    +  // The num of current max ExecutorId used to re-register appMaster
    +  var currentExecutorIdCounter = 0
    --- End diff --
    
    Hi @jerryshao , you are right. I fix it now. Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10794


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57886080
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
    --- End diff --
    
    I understand what the variable is, but readers looking at just this message without other context wouldn't necessarily know. When I look at the name I think its giving me some limit for max executor id. For instance we have many configs that set max of things ( for instance spark.reducer.maxSizeInFlight, spark.shuffle.io.maxRetries, etc.)  That is why I would like the name clarified. 
    
    If we change the calling context, then its not just the max executor id for the last AM, its the last executor id that was allocated.  So perhaps rename to be  RetrieveLastAllocatedExecutorId
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57732500
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    --- End diff --
    
    So I think there is a race condition here.  For instance, AM dies right after it sends launch command for a new executor, if the AM comes up and get the currentExecutorId right before the executor registers, this number is going to be off by one (or possibly more then one if it sent launch commands for many)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204276705
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188713677
  
    **[Test build #51938 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51938/consoleFull)** for PR 10794 at commit [`3a1724c`](https://github.com/apache/spark/commit/3a1724c19ad3eb9e87d9a5b10007b6c53424aac0).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188853591
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51971/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-188853588
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53902497
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator(
       }
     
       /**
    +   * Init `executorIdCounter`
    +   */
    +  def initExecutorIdCounter(): Unit = {
    +    val port = sparkConf.getInt("spark.yarn.am.port", 0)
    +    SparkHadoopUtil.get.runAsSparkUser { () =>
    +      val init = RpcEnv.create(
    +        "executorIdCounterInit",
    +        Utils.localHostName,
    +        port,
    +        sparkConf,
    +        new SecurityManager(sparkConf))
    +      val driver = init.setupEndpointRefByURI(driverUrl)
    --- End diff --
    
    `YarnSchedulerBackend` extends `CoarseGrainedSchedulerBackend`, so what you mentioned can be achieved, you can check other codes inside the class to know how other codes handle this. Creating another endpoint is not necessary and weird here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178362311
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54060168
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    Yes, but just in yarn-client mode. Is it necessary to test in yarn-cluster mode ? @jerryshao 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-189095532
  
    Hi @andrewor14 , could you review this ?  Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-191830328
  
    Hi @andrewor14 , any thoughts or concerns for this patch ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: initialize executorIdCounter after Application...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-172322701
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186425239
  
    @jerryshao Can you clarify something: even after your fix in #9963 you still run into this issue right? Dynamic allocation or not, how did Spark ever continue to work across AM restarts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54062856
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    Great. Another thing is that current executor id is started from 1, not 0, we'd better keep that convention.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57931587
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
    --- End diff --
    
    how about `MaxKnownExecutorId`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186993134
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57838667
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    --- End diff --
    
    @tgravescs Thanks your comment. Yes. I think in this situation, when Am comes up, if the executor has already registered, the execuotr information must set in `executorDataMap`. And in executor registers process, it has a check to make sure we don't allow executor to register with same id. 
    You can see it in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L139


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204288682
  
    @andrewor14 @tgravescs @vanzin The code and the comment is optimized. And the description of this PR and jira is also updated. Please review it again. Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57837944
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveMaxExecutorId extends CoarseGrainedClusterMessage
    --- End diff --
    
    @tgravescs Thanks for reviewing. Because this variables is the max executorId in previous AM, and this is just called by initializing AM. Our intention is to get the max executorId from all executor in previous AM. So I think maxExecutorId is ok. 
    @andrewor14 What's your opinion ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178362213
  
    **[Test build #50524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50524/consoleFull)** for PR 10794 at commit [`30048ac`](https://github.com/apache/spark/commit/30048ac7ac9fc95edc1b936076415dea335848ef).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178218750
  
    @vanzin @jerryshao IIRC there's a similar patch somewhere to fix this issue?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204248244
  
    **[Test build #54685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54685/consoleFull)** for PR 10794 at commit [`ebe3c7f`](https://github.com/apache/spark/commit/ebe3c7f290929588c822137b8bf27b18fe75393f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53579600
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator(
       }
     
       /**
    +   * Init `executorIdCounter`
    +   */
    +  def initExecutorIdCounter(): Unit = {
    +    val port = sparkConf.getInt("spark.yarn.am.port", 0)
    +    SparkHadoopUtil.get.runAsSparkUser { () =>
    +      val init = RpcEnv.create(
    +        "executorIdCounterInit",
    +        Utils.localHostName,
    +        port,
    +        sparkConf,
    +        new SecurityManager(sparkConf))
    +      val driver = init.setupEndpointRefByURI(driverUrl)
    --- End diff --
    
    Hi @andrewor14 , `driverRef` doesn't work in this case. Because, for my understanding, `driverRef` which endpoint name called `YarnScheduler` send message to `YarnSchedulerEndpoint` (or get message from `YarnSchedulerEndpoint`), while we should get max executorId from `CoarseGrainedSchedulerBackend.DriverEndpoint` which endpoint name called `CoarseGrainedScheduler`.
    
    So, I think we should need a method to initialize `executorIdCounter`. And as you said, we should add huge comment huge comment related to SPARK-12864 to explain why we need to do this at this method. What‘s your opinion ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by tgravescs <gi...@git.apache.org>.

Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204569111
  
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54062244
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    Hi @jerryshao , I made a test just now. It can work in yarn cluster mode. From my understanding, `driverRef` is created by `ApplicationMaster`, so, I think at this time `driverEndpoint` is started. What's your opinion ? Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53562543
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala ---
    @@ -30,6 +30,8 @@ private[spark] object CoarseGrainedClusterMessages {
     
       case object RetrieveSparkProps extends CoarseGrainedClusterMessage
     
    +  case object RetrieveCurrentExecutorIdCounter extends CoarseGrainedClusterMessage
    --- End diff --
    
    @andrewor14 Yeah, you are right. I will fix it soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53522220
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -169,6 +172,24 @@ private[yarn] class YarnAllocator(
       }
     
       /**
    +   * Init `executorIdCounter`
    +   */
    +  def initExecutorIdCounter(): Unit = {
    --- End diff --
    
    actually, you can just initialize `executorIdCounter` to the value you get from the driver + 1 right?
    ```
    /**
     * [Add huge comment related to SPARK-12864 to explain why we need to do this here.]
     */
    private var executorCounter: Int = {
      driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-178318013
  
    **[Test build #50524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50524/consoleFull)** for PR 10794 at commit [`30048ac`](https://github.com/apache/spark/commit/30048ac7ac9fc95edc1b936076415dea335848ef).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by andrewor14 <gi...@git.apache.org>.

Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r53521663
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
               // in this block are read when requesting executors
               CoarseGrainedSchedulerBackend.this.synchronized {
                 executorDataMap.put(executorId, data)
    +            if (currentExecutorIdCounter < Integer.parseInt(executorId)) {
    +              currentExecutorIdCounter = Integer.parseInt(executorId)
    +            }
    --- End diff --
    
    this is kind of awkward. You don't need to keep track of another variable; just compute the max executor ID when the AM asks for it. You already have all the information you need in `executorDataMap`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r58042774
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    --- End diff --
    
    @tgravescs Ok, I will do it soon. Thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-204276434
  
    **[Test build #54685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54685/consoleFull)** for PR 10794 at commit [`ebe3c7f`](https://github.com/apache/spark/commit/ebe3c7f290929588c822137b8bf27b18fe75393f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/10794#issuecomment-186495337

Hi @andrewor14 , in our implementation, currently when AM is failed all the related executors will be exited automatically, and driver will be notified with disconnection events and remove the related states. After then when the AM restarts, new executors will be registered into driver.

Here we assume all the executors will be exited before AM restarts. I'm afraid AM will possibly be restarted before all the executors are exited. To try to fix this, here in #9963 I cleaned `executorDataMap` when reset is invoked, but it is only for dynamic allocation enabled situation. like what @lianhuiwang mentioned, for dynamic allocation disabled situation we should also clean this state.

Beside, what I'm thinking is that there might be conflicted executor id issue, since executor id will be recalculated when AM restarts, which will be conflicted with old one. The issue may not only be in the driver side, but also in the external shuffle service (since now executor shuffle service requires executor id to do some recovery works), but I haven't yet met such issue till now.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57933108
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    I don't think it makes a difference; it's just one extra RPC call during initialization. In cluster mode that's a local call which should also be much cheaper.
    
    An alternative would be to change the `RegisterClusterManager` message so that it returns this data to the AM (see `AMEndpoint.onStart` in ApplicationMaster.scala).



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by zhonghaihua <gi...@git.apache.org>.

Github user zhonghaihua commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r54063643
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
    @@ -81,8 +82,20 @@ private[yarn] class YarnAllocator(
         new ConcurrentHashMap[ContainerId, java.lang.Boolean])
     
       @volatile private var numExecutorsRunning = 0
    -  // Used to generate a unique ID per executor
    -  private var executorIdCounter = 0
    +
    +  /**
    +   * Used to generate a unique ID per executor
    +   *
    +   * Init `executorIdCounter`. when AM restart, `executorIdCounter` will reset to 0. Then
    +   * the id of new executor will start from 1, this will conflict with the executor has
    +   * already created before. So, we should initialize the `executorIdCounter` by getting
    +   * the max executorId from driver.
    +   *
    +   * @see SPARK-12864
    +   */
    +  private var executorIdCounter: Int = {
    +    driverRef.askWithRetry[Int](RetrieveMaxExecutorId) + 1
    --- End diff --
    
    I see, thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12864][YARN] initialize executorIdCount...

Posted by vanzin <gi...@git.apache.org>.

Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10794#discussion_r57931508
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
    @@ -155,6 +158,9 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
               // in this block are read when requesting executors
               CoarseGrainedSchedulerBackend.this.synchronized {
                 executorDataMap.put(executorId, data)
    +            if (currentExecutorIdCounter < Integer.parseInt(executorId)) {
    +              currentExecutorIdCounter = Integer.parseInt(executorId)
    +            }
    --- End diff --
    
    yes, with dynamic allocation, for example, the executor with the max known id may be gone already.
    
    minor: `executorId.toInt` instead of `Integer.parseInt`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org