Posted to reviews@spark.apache.org by xuzhongxing <gi...@git.apache.org> on 2014/08/14 10:02:17 UTC

[GitHub] spark pull request: Fix spark driver hang in mesos fine-grained mo...

GitHub user xuzhongxing opened a pull request:

    https://github.com/apache/spark/pull/1940

    Fix spark driver hang in mesos fine-grained mode

    https://issues.apache.org/jira/browse/SPARK-3005

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuzhongxing/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1940.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1940
    
----
commit 9496604865d22dc22b0293ed61a8414a2f510d1d
Author: xuzhongxing <xu...@163.com>
Date:   2014-08-14T07:59:14Z

    Fix spark driver hang in mesos fine-grained mode
    
    https://issues.apache.org/jira/browse/SPARK-3005

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by tnachen <gi...@git.apache.org>.
Github user tnachen commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-56246740
  
    Just chiming in about the two different fixes for killTask: this PR does nothing, while Brenden's PR #2453 calls the Mesos driver's killTask.
    I think Brenden's approach is correct, since we don't know whether killTask was called because the task failed to launch, or because the task was actually cancelled while already running on Mesos.
    On the Mesos side, calling killTask on a non-existent task just produces a LOG(WARNING).



[GitHub] spark pull request: Fix spark driver hang in mesos fine-grained mo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-52154510
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-54694474
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-55342956
  
    @xuzhongxing I followed the conversation on the JIRA and it looks like we still don't have a good idea of why the Spark driver is hanging. Although we have a fix that makes the problem go away, the root cause is probably deeper, and the behavior you observed in fine-grained mode is just a symptom.
    
    My guess is that when an `UnsupportedOperationException` is thrown when we try to `killTask`, we never end up posting the job end event to the listeners. This may be a behavior introduced in #1219. @kayousterhout Can you comment on this? Any thoughts on why `DAGScheduler` hangs if we don't post a job end event?
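    To make the suspected hang mechanism concrete, here is a toy sketch (all names here are hypothetical stand-ins, not Spark's real JobWaiter) of why a waiter that is only released by a job-end event blocks forever when that event is never posted:

```scala
// Toy illustration: the latch is released only by jobEnded(), so if the
// job-end event is never posted (e.g. because abort() was skipped after an
// UnsupportedOperationException in killTask), the caller waits indefinitely.
import java.util.concurrent.{CountDownLatch, TimeUnit}

class ToyJobWaiter {
  private val done = new CountDownLatch(1)

  // In real Spark this would be driven by a listener receiving the job-end event.
  def jobEnded(): Unit = done.countDown()

  // Returns false if no job-end event arrives within the timeout.
  def awaitResult(timeoutMs: Long): Boolean =
    done.await(timeoutMs, TimeUnit.MILLISECONDS)
}
```

A bounded timeout is used here only so the sketch terminates; the real driver waits without one, which is exactly the hang.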



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-55364614
  
    @andrewor14 I think you're right that there's a deeper problem here.  I haven't tested this but here's what I think is going on:
    
    (1) In TaskSchedulerImpl.cancelTasks(), the killTask call throws an unsupported operation exception, as is logged (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L194).  As a result, tsm.abort() never gets called.  So, the TaskSetManager still thinks everything is hunky dory.
    (2) Slowly the rest of the tasks fail, triggering the handleFailedTask() code in TaskSetManager.  The TSM doesn't realize the task set is effectively dead because abort() was never called.
    (3) Now, what I would expect to happen is that the code here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L605 would trigger the task to be re-launched.  Eventually, a task would fail 4 times and the stage would get killed.  This isn't exactly the right behavior, but it still wouldn't lead to a hang.  It might be good to understand why that isn't happening.
    
    Regardless of what's going on with (3), I think the right way to fix this is to move the tsm.abort() call here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L196 up to before we try to kill the task.  That way, regardless of whether killTask() is successful, we'll mark the task set as aborted and send all the appropriate events.
    
    Also, whoever fixes this should definitely add a unit test!! It would be great to add a short unit test to show the problem first, so it's easier for others to reproduce, and then deal with the fix.
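    A minimal sketch of the reordering suggested above, using simplified stand-in types rather than Spark's real TaskSchedulerImpl/TaskSetManager:

```scala
// Sketch only: ToyTaskSetManager and Backend are hypothetical stand-ins,
// not Spark's real scheduler classes.
trait Backend { def killTask(taskId: Long): Unit }

// A backend like fine-grained Mesos at the time of this PR: killing is unsupported.
object NoKillBackend extends Backend {
  def killTask(taskId: Long): Unit = throw new UnsupportedOperationException
}

class ToyTaskSetManager {
  var aborted = false
  // In real Spark, abort() is what posts the task-set-failed / job-end events.
  def abort(msg: String): Unit = { aborted = true }
}

def cancelTasks(tsm: ToyTaskSetManager, running: Seq[Long], backend: Backend): Unit = {
  // Abort FIRST, so the task set is marked dead and end events fire even if
  // killTask below throws UnsupportedOperationException.
  tsm.abort("Stage cancelled")
  running.foreach { tid =>
    try backend.killTask(tid)
    catch { case _: UnsupportedOperationException => () } // best effort
  }
}
```

With the abort hoisted above the kill loop, the unsupported backend no longer prevents the task set from being marked aborted.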



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by xuzhongxing <gi...@git.apache.org>.
Github user xuzhongxing closed the pull request at:

    https://github.com/apache/spark/pull/1940



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-55696321
  
    @kayousterhout thanks for the thorough analysis. Do you have any thoughts on just defining killTasks to be "best effort"? I think that would generally simplify the code a lot here.
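    One possible shape for "best effort" kill semantics (a sketch under assumed names, not Spark's actual API): attempt the kill, swallow any failure, and report whether the request was delivered rather than letting the exception escape:

```scala
// Hypothetical helper illustrating best-effort semantics: a failed kill is
// logged and reported as false, never propagated to the caller.
import scala.util.{Failure, Success, Try}

def bestEffortKill(taskId: Long)(kill: Long => Unit): Boolean =
  Try(kill(taskId)) match {
    case Success(_) => true
    case Failure(e) =>
      println(s"WARN: could not kill task $taskId: ${e.getClass.getSimpleName}")
      false
  }
```

Under this contract the caller treats the boolean as "request delivered", not "task stopped", which matches the asynchronous semantics discussed below.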



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-55696539
  
    This seems like it could be ok -- my only concern is about the semantics of when we tell the user we've killed their job.  Currently I think we invoke a callback on JobWaiter saying that the job has been killed, iff the schedulerBackend implements killTask.  So, if we make killTask() best effort, the semantics of that callback will change.
    



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-56084970
  
    Yeah, I think we should just change it to say that the kill request has been acknowledged, but since killing is asynchronous and best-effort, the tasks may not have stopped executing. The semantics are already somewhat weird, because right now users will get that message even if tasks from their job are still running (since it's asynchronous).



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-53199778
  
    I commented on the JIRA - but we already have code that handles the fact that cancellation is not supported in Mesos. It's likely this is related to some other type of error.



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-58241317
  
    Hey @xuzhongxing I think this is resolved in #2453. Would you mind closing this issue?



[GitHub] spark pull request: [SPARK-3005] Fix spark driver hang in mesos fi...

Posted by tnachen <gi...@git.apache.org>.
Github user tnachen commented on the pull request:

    https://github.com/apache/spark/pull/1940#issuecomment-62014644
  
    Please close this PR, as it is no longer needed.

