Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2012/05/07 23:11:34 UTC

Review Request: Fix for slave segfault on framework exit

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and John Sirois.


Summary
-------

Fix for: https://issues.apache.org/jira/browse/MESOS-190

Also prevents the slave from retrying status updates to a dead framework indefinitely.


This addresses bug MESOS-190.
    https://issues.apache.org/jira/browse/MESOS-190


Diffs
-----

  src/slave/slave.cpp 09a8396 

Diff: https://reviews.apache.org/r/5057/diff


Testing
-------

Checked with long lived framework.

$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050


Thanks,

Vinod


Re: Review Request: Fix for slave segfault on framework exit

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/#review7701
-----------------------------------------------------------

Ship it!


Thanks Vinod.

- Benjamin


On 2012-05-08 17:09:42, Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/5057/
> -----------------------------------------------------------
> 
> (Updated 2012-05-08 17:09:42)
> 
> 
> Review request for mesos, Benjamin Hindman and John Sirois.
> 
> 
> Summary
> -------
> 
> Fix for: https://issues.apache.org/jira/browse/MESOS-190
> 
> Also prevents the slave from retrying status updates to a dead framework indefinitely.
> 
> 
> This addresses bug MESOS-190.
>     https://issues.apache.org/jira/browse/MESOS-190
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 09a8396 
>   src/tests/fault_tolerance_tests.cpp 6772daf 
> 
> Diff: https://reviews.apache.org/r/5057/diff
> 
> 
> Testing
> -------
> 
> Checked with long lived framework.
> 
> $ ./bin/mesos-master.sh
> $ ./bin/mesos-slave.sh --master=localhost:5050
> $ ./src/long-lived-framework localhost:5050
> 
> 
> Thanks,
> 
> Vinod
> 
>


Re: Review Request: Fix for slave segfault on framework exit

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/
-----------------------------------------------------------

(Updated 2012-05-08 17:09:42.129768)


Review request for mesos, Benjamin Hindman and John Sirois.


Changes
-------

Addressed John's comments. Added a test case.


Summary
-------

Fix for: https://issues.apache.org/jira/browse/MESOS-190

Also prevents the slave from retrying status updates to a dead framework indefinitely.


This addresses bug MESOS-190.
    https://issues.apache.org/jira/browse/MESOS-190


Diffs (updated)
-----

  src/slave/slave.cpp 09a8396 
  src/tests/fault_tolerance_tests.cpp 6772daf 

Diff: https://reviews.apache.org/r/5057/diff


Testing
-------

Checked with long lived framework.

$ ./bin/mesos-master.sh
$ ./bin/mesos-slave.sh --master=localhost:5050
$ ./src/long-lived-framework localhost:5050


Thanks,

Vinod


Re: Review Request: Fix for slave segfault on framework exit

Posted by Vinod Kone <vi...@gmail.com>.

> On 2012-05-07 21:50:01, John Sirois wrote:
> > src/slave/slave.cpp, line 1487
> > <https://reviews.apache.org/r/5057/diff/2/?file=107599#file107599line1487>
> >
> >     Is there a test that could be tweaked to ensure this is happening?  Presumably it wasn't before via executorExited?

Added a test.


> On 2012-05-07 21:50:01, John Sirois wrote:
> > src/slave/slave.cpp, line 1483
> > <https://reviews.apache.org/r/5057/diff/2/?file=107599#file107599line1483>
> >
> >     Does this new api call still transition live tasks to LOST/FAILED?

This is a bit nuanced. When a framework is shut down, the slave sends a shutdown message to the executor. One of two things can then happen:

1) EXECUTOR_SHUTDOWN_TIMEOUT_SECONDS elapses before the isolation module reports the lost executor. The slave sends a TASK_LOST to the master, but the master drops it on the floor because the framework is dead.

2) The isolation module reports the lost executor before EXECUTOR_SHUTDOWN_TIMEOUT_SECONDS elapses. The slave doesn't send a TASK_LOST.

In either case, the master never sends the TASK_LOST to the dead framework, which is the right thing to do.
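
To make the two orderings concrete, here is a minimal sketch of the decision logic. It is only a model of the description above, not the code in src/slave/slave.cpp: the names ShutdownEvent, Outcome, and resolveShutdown are hypothetical and were made up for illustration.

```cpp
// Hypothetical model of the two shutdown orderings described above.
// None of these names come from the Mesos source tree.

// Which event the slave observes first after asking the executor to shut down.
enum class ShutdownEvent {
  TimeoutElapsed,        // case 1: EXECUTOR_SHUTDOWN_TIMEOUT_SECONDS fires first
  ExecutorLostReported   // case 2: the isolation module reports the exit first
};

struct Outcome {
  bool slaveSendsTaskLost;      // slave -> master
  bool masterForwardsTaskLost;  // master -> framework
};

// frameworkAlive is false in the scenario discussed here (framework has exited).
Outcome resolveShutdown(ShutdownEvent first, bool frameworkAlive) {
  Outcome outcome{false, false};

  // Only the timeout path makes the slave generate a TASK_LOST.
  if (first == ShutdownEvent::TimeoutElapsed) {
    outcome.slaveSendsTaskLost = true;
  }

  // The master drops the update unless the framework is still registered.
  outcome.masterForwardsTaskLost =
      outcome.slaveSendsTaskLost && frameworkAlive;

  return outcome;
}
```

With frameworkAlive set to false, both orderings leave masterForwardsTaskLost false, which matches the conclusion above that the dead framework never receives a TASK_LOST.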


This might be different once slave recovery is implemented, but the logic there for handling status updates is very different. In other words, this fix will probably go away when we merge the slave recovery work.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/#review7657
-----------------------------------------------------------



Re: Review Request: Fix for slave segfault on framework exit

Posted by John Sirois <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5057/#review7657
-----------------------------------------------------------



src/slave/slave.cpp
<https://reviews.apache.org/r/5057/#comment16872>

    Does this new api call still transition live tasks to LOST/FAILED?



src/slave/slave.cpp
<https://reviews.apache.org/r/5057/#comment16873>

    Is there a test that could be tweaked to ensure this is happening?  Presumably it wasn't before via executorExited?


- John

