Posted to reviews@spark.apache.org by drexin <gi...@git.apache.org> on 2014/07/10 18:40:26 UTC

[GitHub] spark pull request: mesos executor ids now consist of the slave id...

GitHub user drexin opened a pull request:

    https://github.com/apache/spark/pull/1358

    mesos executor ids now consist of the slave id and a counter to fix duplicate id problems

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/drexin/spark wip-fix-mesos-executor-id-drexin

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1358.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1358
    
----
commit dde05f38bdab590bec23c2b4a5b8062e62991036
Author: Dario Rexin <da...@r3-tech.de>
Date:   2014-07-10T16:39:41Z

    mesos executor ids now consist of the slave id and a counter to fix duplicate id problems

----



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by KashiErez <gi...@git.apache.org>.
Github user KashiErez commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-55237569
  
    I have encountered this issue:
    We have a Spark job running 24/7 on Mesos.
    It happens every 1-3 days.
    
    Here are 2 lines from my Driver log file:
    
    2014-09-10 18:50:44,510 ERROR [spark-akka.actor.default-dispatcher-46] TaskSchedulerImpl  - Lost executor 201408311047-3690990090-5050-30951-12 on spark106.us.taboolasyndication.com: remote Akka client disassociated
    
    2014-09-10 18:51:46,062 ERROR [spark-akka.actor.default-dispatcher-15] BlockManagerMasterActor  - Got two different block manager registrations on 201408311047-3690990090-5050-30951-12
    
    Looks like the driver is disassociated from the Spark worker.
    One minute later, the duplicated block manager registration happens.





[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by drexin <gi...@git.apache.org>.
Github user drexin closed the pull request at:

    https://github.com/apache/spark/pull/1358



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by brndnmtthws <gi...@git.apache.org>.
Github user brndnmtthws commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-55659794
  
    It seems that this is a symptom of the following issue:
    
    https://issues.apache.org/jira/browse/SPARK-3535




[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by gmalouf <gi...@git.apache.org>.
Github user gmalouf commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-52836949
  
    We've run into this issue a handful of times, including once today - is it possible the bug is in Mesos?




[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by drexin <gi...@git.apache.org>.
Github user drexin commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-48707209
  
    Created a JIRA issue here: https://issues.apache.org/jira/browse/SPARK-2445



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-50413581
  
    Sure, if you find it, let me know.



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-50250085
  
    Jenkins, test this please



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-50250198
  
    QA tests have started for PR 1358. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17229/consoleFull



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by drexin <gi...@git.apache.org>.
Github user drexin commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-48706534
  
    Hi Patrick,
    
    the problem is described in [this mailing list entry](http://mail-archives.apache.org/mod_mbox/mesos-user/201407.mbox/%3c53B66E6D.7090909@uninett.no%3e)
    
    If I understand the [documentation on run modes](http://spark.apache.org/docs/latest/running-on-mesos.html) and the code correctly, in fine-grained mode it starts a separate instance of `MesosExecutorBackend` for each Spark task. If this is correct, then as soon as 2 tasks run concurrently on the same machine we should run into this problem.
    
    On [this line](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala#L329) in the `BlockManagerMasterActor`, there is a check on the `BlockManagerId`, which will always differ per `Executor` instance because the port in it is randomly assigned. The `executorId`, however, is always set to the Mesos `slaveId`. This means we run into [this case](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala#L331-L335) as soon as we start two `Executor` instances on the same slave. This PR fixes that by appending a counter to the `executorId`. Please tell me if I overlooked something.
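    
    To make the collision concrete, here is a small, self-contained sketch of the check being described (illustrative only, not the actual Spark source; the class and member names below are invented). Two executors on the same slave share an `executorId` but have different `BlockManagerId`s because of the random port, so the second registration hits the "two different block manager registrations" branch:
    
        // Illustrative sketch (not Spark source): a registry keyed by executorId.
        object ExecutorIdCollisionSketch {
          final case class BlockManagerId(executorId: String, host: String, port: Int)
        
          private val registered = scala.collection.mutable.Map.empty[String, BlockManagerId]
        
          def register(id: BlockManagerId): Unit = registered.get(id.executorId) match {
            case Some(existing) if existing != id =>
              // Same executorId, different port: the error seen in the driver log.
              sys.error(s"Got two different block manager registrations on ${id.executorId}")
            case _ =>
              registered(id.executorId) = id
          }
        
          def main(args: Array[String]): Unit = {
            val slaveId = "some-mesos-slave-id"
            register(BlockManagerId(slaveId, "host-a", port = 55001))
            // A second executor on the same slave gets another random port and collides.
            register(BlockManagerId(slaveId, "host-a", port = 55017))
          }
        }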



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-60327322
  
    @tsliwowicz your fix seems good -- thanks for getting to the bottom of this!




[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-55978771
  
    BTW the delta from the original pull request would be that we only increment our counter when the old executor fails. If you want to implement that, please create a JIRA for it and send a new PR.




[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-50250903
  
    QA results for PR 1358:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17229/consoleFull



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-48630513
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-55978678
  
    I see, so maybe the problem is that an executor dies, and another is launched on the same Mesos machine with the same executor ID, which then breaks assumptions elsewhere in the code. In that case, our executor ID would need to be something like (Mesos executor ID) + (our attempt # on this executor). But you'd need to look throughout the MesosScheduler code and make sure this works -- in particular we have to send back the right ID when we launch tasks.
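    
    A rough sketch of what that scheme could look like (purely illustrative; the names `attemptOnSlave`, `executorIdFor` and `executorLost` are invented here, and the real change would have to live in the Mesos scheduler backend): keep one executor ID per slave and bump the attempt counter only when that executor is reported lost, so ordinary task launches keep reusing the same ID.
    
        // Illustrative sketch of the suggested ID scheme (invented names, not Spark code).
        object AttemptSuffixedExecutorIds {
          private val attemptOnSlave =
            scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
        
          // ID sent back when launching tasks on this slave; stable across task launches.
          def executorIdFor(slaveId: String): String =
            s"$slaveId/${attemptOnSlave(slaveId)}"
        
          // Called when the executor on this slave dies; its replacement gets a fresh ID.
          def executorLost(slaveId: String): Unit =
            attemptOnSlave(slaveId) = attemptOnSlave(slaveId) + 1
        }
    
    The key difference from the original pull request is that `executorIdFor` stays stable across task launches, so the Mesos executor is still reused; only a failure produces a new ID.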




[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by tsliwowicz <gi...@git.apache.org>.
Github user tsliwowicz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-59724452
  
    @mateiz - @KashiErez and I went a different route. The killer issue for us was the System.exit(1) in BlockManagerMasterActor, which was a huge robustness problem. At @taboola we run some pretty large clusters (processing many terabytes of data per day) that do real-time calculations and are mission critical. So we fixed the issue, and it has been running successfully in our production for a while now.
    
    I opened a new ticket - https://issues.apache.org/jira/browse/SPARK-4006
    And a pull request - https://github.com/apache/spark/pull/2854
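    
    Very roughly, the direction is to make the duplicate registration non-fatal. The following is an illustrative standalone sketch of that idea only, not the code in #2854: treat the previously registered entry as stale, evict it and accept the new registration instead of calling System.exit(1).
    
        // Illustrative sketch (not the contents of PR #2854): on a duplicate
        // registration for an executor ID, evict the stale entry and keep the
        // driver alive instead of exiting.
        object NonFatalReRegistration {
          final case class BlockManagerId(executorId: String, host: String, port: Int)
        
          private val byExecutor = scala.collection.mutable.Map.empty[String, BlockManagerId]
        
          def register(id: BlockManagerId): Unit = byExecutor.get(id.executorId) match {
            case Some(old) if old != id =>
              // Assume the old executor is gone; replace its registration and log it.
              println(s"Replacing stale block manager $old with $id")
              byExecutor(id.executorId) = id
            case _ =>
              byExecutor(id.executorId) = id
          }
        }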
    
    What do you think about our fix? 





[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-50250554
  
    So I don't quite understand: how can multiple executors be launched for the same Spark application on the same node right now? I thought we always reuse our executor across tasks.



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1358#discussion_r15436238
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala ---
    @@ -250,7 +252,7 @@ private[spark] class MesosSchedulerBackend(
         MesosTaskInfo.newBuilder()
           .setTaskId(taskId)
           .setSlaveId(SlaveID.newBuilder().setValue(slaveId).build())
    -      .setExecutor(createExecutorInfo(slaveId))
    +      .setExecutor(createExecutorInfo(nextExecutorId(slaveId)))
    --- End diff --
    
    Won't this change keep launching a new executor for each task? We want to reuse our Mesos executors.



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by drexin <gi...@git.apache.org>.
Github user drexin commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-50318921
  
    @mateiz: You are right. I don't see how an executor could be started more than once per slave, but it seems to happen sometimes (see the mailing list entry). I will close this PR and try to investigate this further. Thanks!



[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by brndnmtthws <gi...@git.apache.org>.
Github user brndnmtthws commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-55310679
  
    Yep, also hitting this same problem.  We're running Spark 1.0.2 and Mesos 0.20.0.
    
    From a quick analysis, it looks like a bug in Spark.




[GitHub] spark pull request: mesos executor ids now consist of the slave id...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1358#issuecomment-48701713
  
    Hey Dario - do you mind describing in a bit more detail the problem this fixes (ideally create a JIRA for it) and what the symptoms are?

