You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Greg Mann <Gr...@gmail.com> on 2015/08/12 23:07:19 UTC

Review Request 37415: Fix flaky ExamplesTest.JavaFramework

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/
-----------------------------------------------------------

Review request for mesos, Joris Van Remoortere and Till Toenshoff.


Bugs: MESOS-830
    https://issues.apache.org/jira/browse/MESOS-830


Repository: mesos


Description
-------

Fix flaky ExamplesTest.JavaFramework


Diffs
-----

  src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 

Diff: https://reviews.apache.org/r/37415/diff/


Testing
-------

This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:

GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["


Thanks,

Greg Mann


Re: Review Request 37415: Fix flaky ExamplesTest.JavaFramework

Posted by Greg Mann <Gr...@gmail.com>.

> On Aug. 13, 2015, 4:22 p.m., Till Toenshoff wrote:
> > src/examples/java/TestFramework.java, line 275
> > <https://reviews.apache.org/r/37415/diff/1/?file=1038842#file1038842line275>
> >
> >     Why 500 - not 200 or 1000? If I am not mistaken, then using 500 worked fine on your test-machines but chances are that someone comes in with a heavily loaded machine where those 500ms might not do, correct? Let us be honest and let the audience know that this is just a value chosen out of some test runs done - again, if I did not miss something :)
> >     
> >     I think it would also be a great idea to add a todo here that captures what you wrote within the JIRA issue already;
> >     "TODO(*): Ideally, we would inspect the status of the driver and its associated tasks via the Java API and wait until their teardown is complete to exit."
> 
> Greg Mann wrote:
>     You are correct! :-) Thanks, comments addressed.
> 
> Joseph Wu wrote:
>     Agreed.  I don't think a sleep if a good enough solution.
>     
>     (Note: This next part is just supposition.)
>     
>     If you take a look at the comment here:
>     https://github.com/apache/mesos/blob/437e7018e184128c9ec7c4113da09bbf9914721f/src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp#L485
>     
>     You'll notice that the SchedulerDriver is instantiated as a weak reference.  So the GC is relied upon to destroy the object.
>     See: https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak
>     
>     Have you tried changing it to a normal global reference and adding the explicit cleanup code into the `stop` function?

Yea, I'm not crazy about the sleep solution either. I just tried your suggestion, Joseph, putting a DeleteGlobalRef() call in the stop() method, and it didn't work unfortunately. I think that short of an explicit confirmation message upon teardown completion (which doesn't currently exist, and is perhaps not advisable anyway), I don't see a way of resolving this in an elegant fashion. Even if we're sure that the SchedulerDriver is gone before we call System.exit(), the associated Executors and the Slave are doing logging as they shut down, and it's the premature destruction of the mutexes used by glog that is causing the problem.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/#review95264
-----------------------------------------------------------


On Aug. 13, 2015, 4:49 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37415/
> -----------------------------------------------------------
> 
> (Updated Aug. 13, 2015, 4:49 p.m.)
> 
> 
> Review request for mesos, Joris Van Remoortere, Joseph Wu, and Till Toenshoff.
> 
> 
> Bugs: MESOS-830
>     https://issues.apache.org/jira/browse/MESOS-830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fix flaky ExamplesTest.JavaFramework
> 
> 
> Diffs
> -----
> 
>   src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 
> 
> Diff: https://reviews.apache.org/r/37415/diff/
> 
> 
> Testing
> -------
> 
> This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:
> 
> GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Re: Review Request 37415: Fix flaky ExamplesTest.JavaFramework

Posted by Marco Massenzio <ma...@mesosphere.io>.

> On Aug. 13, 2015, 4:22 p.m., Till Toenshoff wrote:
> > src/examples/java/TestFramework.java, line 275
> > <https://reviews.apache.org/r/37415/diff/1/?file=1038842#file1038842line275>
> >
> >     Why 500 - not 200 or 1000? If I am not mistaken, then using 500 worked fine on your test-machines but chances are that someone comes in with a heavily loaded machine where those 500ms might not do, correct? Let us be honest and let the audience know that this is just a value chosen out of some test runs done - again, if I did not miss something :)
> >     
> >     I think it would also be a great idea to add a todo here that captures what you wrote within the JIRA issue already;
> >     "TODO(*): Ideally, we would inspect the status of the driver and its associated tasks via the Java API and wait until their teardown is complete to exit."
> 
> Greg Mann wrote:
>     You are correct! :-) Thanks, comments addressed.
> 
> Joseph Wu wrote:
>     Agreed.  I don't think a sleep if a good enough solution.
>     
>     (Note: This next part is just supposition.)
>     
>     If you take a look at the comment here:
>     https://github.com/apache/mesos/blob/437e7018e184128c9ec7c4113da09bbf9914721f/src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp#L485
>     
>     You'll notice that the SchedulerDriver is instantiated as a weak reference.  So the GC is relied upon to destroy the object.
>     See: https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak
>     
>     Have you tried changing it to a normal global reference and adding the explicit cleanup code into the `stop` function?
> 
> Greg Mann wrote:
>     Yea, I'm not crazy about the sleep solution either. I just tried your suggestion, Joseph, putting a DeleteGlobalRef() call in the stop() method, and it didn't work unfortunately. I think that short of an explicit confirmation message upon teardown completion (which doesn't currently exist, and is perhaps not advisable anyway), I don't see a way of resolving this in an elegant fashion. Even if we're sure that the SchedulerDriver is gone before we call System.exit(), the associated Executors and the Slave are doing logging as they shut down, and it's the premature destruction of the mutexes used by glog that is causing the problem.

Well, @joseph - the comment you point to, pretty much negates that option, doesn't it?

    // Create a weak global reference to the MesosSchedulerDriver
    // instance (we want a global reference so the GC doesn't collect
    // the instance but we make it weak so the JVM can exit).
  
Bearing in mind that this is test code, which is only run at exit (ie, one-off and not in a performance-critical section) I think that adding a sleep is acceptable (if not exciting).

Sure, a callback that says "hey, I'm done now, you can go away" would be great, but I'd guess, probably overkill.

Totally +1 to the comment and the TODO - hard-coded 'magic numbers" are A Bad Thing, so definitely worth going the extra mile and explaining them.


- Marco


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/#review95264
-----------------------------------------------------------


On Aug. 13, 2015, 4:49 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37415/
> -----------------------------------------------------------
> 
> (Updated Aug. 13, 2015, 4:49 p.m.)
> 
> 
> Review request for mesos, Joris Van Remoortere, Joseph Wu, and Till Toenshoff.
> 
> 
> Bugs: MESOS-830
>     https://issues.apache.org/jira/browse/MESOS-830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fix flaky ExamplesTest.JavaFramework
> 
> 
> Diffs
> -----
> 
>   src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 
> 
> Diff: https://reviews.apache.org/r/37415/diff/
> 
> 
> Testing
> -------
> 
> This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:
> 
> GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Re: Review Request 37415: Fix flaky ExamplesTest.JavaFramework

Posted by Greg Mann <Gr...@gmail.com>.

> On Aug. 13, 2015, 4:22 p.m., Till Toenshoff wrote:
> > src/examples/java/TestFramework.java, line 275
> > <https://reviews.apache.org/r/37415/diff/1/?file=1038842#file1038842line275>
> >
> >     Why 500 - not 200 or 1000? If I am not mistaken, then using 500 worked fine on your test-machines but chances are that someone comes in with a heavily loaded machine where those 500ms might not do, correct? Let us be honest and let the audience know that this is just a value chosen out of some test runs done - again, if I did not miss something :)
> >     
> >     I think it would also be a great idea to add a todo here that captures what you wrote within the JIRA issue already;
> >     "TODO(*): Ideally, we would inspect the status of the driver and its associated tasks via the Java API and wait until their teardown is complete to exit."

You are correct! :-) Thanks, comments addressed.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/#review95264
-----------------------------------------------------------


On Aug. 13, 2015, 4:49 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37415/
> -----------------------------------------------------------
> 
> (Updated Aug. 13, 2015, 4:49 p.m.)
> 
> 
> Review request for mesos, Joris Van Remoortere, Joseph Wu, and Till Toenshoff.
> 
> 
> Bugs: MESOS-830
>     https://issues.apache.org/jira/browse/MESOS-830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fix flaky ExamplesTest.JavaFramework
> 
> 
> Diffs
> -----
> 
>   src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 
> 
> Diff: https://reviews.apache.org/r/37415/diff/
> 
> 
> Testing
> -------
> 
> This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:
> 
> GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Re: Review Request 37415: Fix flaky ExamplesTest.JavaFramework

Posted by Joseph Wu <jo...@mesosphere.io>.

> On Aug. 13, 2015, 9:22 a.m., Till Toenshoff wrote:
> > src/examples/java/TestFramework.java, line 275
> > <https://reviews.apache.org/r/37415/diff/1/?file=1038842#file1038842line275>
> >
> >     Why 500 - not 200 or 1000? If I am not mistaken, then using 500 worked fine on your test-machines but chances are that someone comes in with a heavily loaded machine where those 500ms might not do, correct? Let us be honest and let the audience know that this is just a value chosen out of some test runs done - again, if I did not miss something :)
> >     
> >     I think it would also be a great idea to add a todo here that captures what you wrote within the JIRA issue already;
> >     "TODO(*): Ideally, we would inspect the status of the driver and its associated tasks via the Java API and wait until their teardown is complete to exit."
> 
> Greg Mann wrote:
>     You are correct! :-) Thanks, comments addressed.

Agreed.  I don't think a sleep if a good enough solution.

(Note: This next part is just supposition.)

If you take a look at the comment here:
https://github.com/apache/mesos/blob/437e7018e184128c9ec7c4113da09bbf9914721f/src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp#L485

You'll notice that the SchedulerDriver is instantiated as a weak reference.  So the GC is relied upon to destroy the object.
See: https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak

Have you tried changing it to a normal global reference and adding the explicit cleanup code into the `stop` function?


- Joseph


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/#review95264
-----------------------------------------------------------


On Aug. 13, 2015, 9:49 a.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37415/
> -----------------------------------------------------------
> 
> (Updated Aug. 13, 2015, 9:49 a.m.)
> 
> 
> Review request for mesos, Joris Van Remoortere, Joseph Wu, and Till Toenshoff.
> 
> 
> Bugs: MESOS-830
>     https://issues.apache.org/jira/browse/MESOS-830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fix flaky ExamplesTest.JavaFramework
> 
> 
> Diffs
> -----
> 
>   src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 
> 
> Diff: https://reviews.apache.org/r/37415/diff/
> 
> 
> Testing
> -------
> 
> This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:
> 
> GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Re: Review Request 37415: Fix flaky ExamplesTest.JavaFramework

Posted by Till Toenshoff <to...@me.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/#review95264
-----------------------------------------------------------



src/examples/java/TestFramework.java (line 275)
<https://reviews.apache.org/r/37415/#comment150115>

    Why 500 - not 200 or 1000? If I am not mistaken, then using 500 worked fine on your test-machines but chances are that someone comes in with a heavily loaded machine where those 500ms might not do, correct? Let us be honest and let the audience know that this is just a value chosen out of some test runs done - again, if I did not miss something :)
    
    I think it would also be a great idea to add a todo here that captures what you wrote within the JIRA issue already;
    "TODO(*): Ideally, we would inspect the status of the driver and its associated tasks via the Java API and wait until their teardown is complete to exit."


- Till Toenshoff


On Aug. 12, 2015, 9:07 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37415/
> -----------------------------------------------------------
> 
> (Updated Aug. 12, 2015, 9:07 p.m.)
> 
> 
> Review request for mesos, Joris Van Remoortere and Till Toenshoff.
> 
> 
> Bugs: MESOS-830
>     https://issues.apache.org/jira/browse/MESOS-830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fix flaky ExamplesTest.JavaFramework
> 
> 
> Diffs
> -----
> 
>   src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 
> 
> Diff: https://reviews.apache.org/r/37415/diff/
> 
> 
> Testing
> -------
> 
> This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:
> 
> GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["
> 
> 
> Thanks,
> 
> Greg Mann
> 
>


Re: Review Request 37415: Fix flaky ExamplesTest.JavaFramework

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37415/#review95203
-----------------------------------------------------------


Patch looks great!

Reviews applied: [37415]

All tests passed.

- Mesos ReviewBot


On Aug. 12, 2015, 9:07 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37415/
> -----------------------------------------------------------
> 
> (Updated Aug. 12, 2015, 9:07 p.m.)
> 
> 
> Review request for mesos, Joris Van Remoortere and Till Toenshoff.
> 
> 
> Bugs: MESOS-830
>     https://issues.apache.org/jira/browse/MESOS-830
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Fix flaky ExamplesTest.JavaFramework
> 
> 
> Diffs
> -----
> 
>   src/examples/java/TestFramework.java 265005b9d1a6a053b812c133a4817c3d9e61179e 
> 
> Diff: https://reviews.apache.org/r/37415/diff/
> 
> 
> Testing
> -------
> 
> This test may fail only infrequently, and the bug is made worse by verbose logging, so to verify correctness (or reproduce the bug before patching) try:
> 
> GLOG_v=2 GTEST_FILTER="ExamplesTest.JavaFramework" bin/mesos-tests.sh --verbose --gtest_repeat=100 2>/dev/null | grep "\["
> 
> 
> Thanks,
> 
> Greg Mann
> 
>