Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2013/05/16 01:21:23 UTC
Re: Review Request: Fix the flaky AllocatorZookeeperTests


> On April 25, 2013, 8:07 p.m., Benjamin Hindman wrote:
> > Can you elaborate on why AtMost(1) was not sufficient?
> 
> Thomas Marshall wrote:
>     I honestly have no idea, although I just realized that the JIRA issue I linked to is actually referring to a different problem. That issue (timing out on waiting for the registered future) is fixed by the recent change that extended the amount of time that we wait on futures. The problem being fixed by this patch looks something more like:
>     
>     ...
>     I0425 13:37:03.516690 30336 hierarchical_allocator_process.hpp:423] Removed slave 201304251337-16842879-37747-30328-0
>     I0425 13:37:03.516121 30334 slave.cpp:486] Slave asked to shut down by master@127.0.1.1:37747
>     I0425 13:37:03.517014 30334 slave.cpp:1099] Asked to shut down framework 201304251337-16842879-37747-30328-0000 by master@127.0.1.1:37747
>     W0425 13:37:03.517163 30334 slave.cpp:1120] Ignoring shutdown framework 201304251337-16842879-37747-30328-0000 because it is terminating
>     I0425 13:37:03.518982 30334 slave.cpp:1867] master@127.0.1.1:37747 exited
>     [Thread 0x7fffb0cff700 (LWP 30378) exited]
>     W0425 13:37:03.519126 30334 slave.cpp:1870] Master disconnected! Waiting for a new master to be elected
>     [Thread 0x7fffabfff700 (LWP 30379) exited]
>     I0425 13:37:03.521286 30328 slave.cpp:441] Slave terminating
>     I0425 13:37:03.521467 30328 slave.cpp:1099] Asked to shut down framework 201304251337-16842879-37747-30328-0000 by @0.0.0.0:0
>     W0425 13:37:03.521649 30328 slave.cpp:1120] Ignoring shutdown framework 201304251337-16842879-37747-30328-0000 because it is terminating
>     [Thread 0x7fffb1d01700 (LWP 30374) exited]
>     [Thread 0x7fffab7fe700 (LWP 30376) exited]
>     
>     Program received signal SIGPIPE, Broken pipe.
>     [Switching to Thread 0x7fffb230e700 (LWP 30370)]
>     0x00007ffff6724ccd in write () from /lib/x86_64-linux-gnu/libpthread.so.0
>     (gdb) where
>     #0  0x00007ffff6724ccd in write () from /lib/x86_64-linux-gnu/libpthread.so.0
>     #1  0x00007fffb2316ab2 in Java_sun_nio_ch_FileDispatcherImpl_write0 () from /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/libnio.so
>     #2  0x00007fffcc39ef90 in ?? ()
>     #3  0x0000000000000000 in ?? ()
>     
>     That error is occurring somewhere deep down inside zookeeper, and I don't really know what's causing it, other than obviously something related to the slave not being fully shut down yet when we shut down the master. If it seems important to you, I can continue investigating, but I suspect that it's a quirk of how our test infrastructure interacts with zookeeper, and I don't think that it's likely to come up in practice.
> 
> Benjamin Hindman wrote:
>     So, I don't see an error in the output that you've shown. When running tests with gdb and a JVM (which you get with ZooKeeper tests) you need to ignore most signals since that's the JVM's normal operation. I did see a segfault that the JVM propagated in a previous test run, but it's not clear to me why this fix will cover that segfault ...
> 
> Thomas Marshall wrote:
>     Sorry, I didn't realize that was expected behavior. In that case, the problem this patch is solving looks something more like:
>     
>     ...
>     I0426 10:29:39.232862 12073 master.cpp:774] Asked to unregister framework 201304261029-16842879-56235-12059-0000
>     I0426 10:29:39.233058 12073 master.hpp:300] Removing task with resources cpus=1; mem=500 on slave 201304261029-16842879-56235-12059-0
>     I0426 10:29:39.233082 12079 hierarchical_allocator_process.hpp:359] Deactivated framework 201304261029-16842879-56235-12059-0000
>     I0426 10:29:39.233254 12073 master.hpp:318] Removing offer with resources cpus=1; mem=524; ports=[31000-32000]; disk=9053 on slave 201304261029-16842879-56235-12059-0
>     I0426 10:29:39.233487 12079 hierarchical_allocator_process.hpp:544] Recovered cpus=1; mem=500 (total allocatable: cpus=1; mem=500; ports=[]; disk=0) on slave 201304261029-16842879-56235-12059-0 from framework 201304261029-16842879-56235-12059-0000
>     I0426 10:29:39.233636 12073 master.cpp:477] Master terminating
>     I0426 10:29:39.233777 12059 master.cpp:283] Shutting down master
>     I0426 10:29:39.233770 12079 hierarchical_allocator_process.hpp:544] Recovered cpus=1; mem=524; ports=[31000-32000]; disk=9053 (total allocatable: cpus=2; mem=1024; ports=[31000-32000]; disk=9053) on slave 201304261029-16842879-56235-12059-0 from framework 201304261029-16842879-56235-12059-0000
>     I0426 10:29:39.233077 12074 slave.cpp:1099] Asked to shut down framework 201304261029-16842879-56235-12059-0000 by master@127.0.1.1:56235
>     W0426 10:29:39.234302 12074 slave.cpp:1120] Ignoring shutdown framework 201304261029-16842879-56235-12059-0000 because it is terminating
>     I0426 10:29:39.234310 12078 hierarchical_allocator_process.hpp:423] Removed slave 201304261029-16842879-56235-12059-0
>     I0426 10:29:39.234422 12074 slave.cpp:486] Slave asked to shut down by master@127.0.1.1:56235
>     I0426 10:29:39.234792 12074 slave.cpp:1099] Asked to shut down framework 201304261029-16842879-56235-12059-0000 by master@127.0.1.1:56235
>     W0426 10:29:39.234848 12074 slave.cpp:1120] Ignoring shutdown framework 201304261029-16842879-56235-12059-0000 because it is terminating
>     I0426 10:29:39.234915 12074 slave.cpp:441] Slave terminating
>     I0426 10:29:39.234956 12074 slave.cpp:1099] Asked to shut down framework 201304261029-16842879-56235-12059-0000 by @0.0.0.0:0
>     W0426 10:29:39.234998 12074 slave.cpp:1120] Ignoring shutdown framework 201304261029-16842879-56235-12059-0000 because it is terminating
>     pure virtual method called
>     terminate called without an active exception
>     Aborted (core dumped)

I've seen this "pure virtual method" error before, but I'm unsure of the cause. Do you know why/how your patch fixes this?
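
For reference, one common cause of this message in general (not confirmed for this test) is a virtual call that reaches an object whose derived part has already been destroyed, for example when another thread still dispatches to an object that is being torn down. Below is a minimal C++11 sketch of the dispatch rule involved. All names in it (Listener, AllocatorListener, namesSeenDuringTeardown) are hypothetical and are not taken from Mesos.

```cpp
#include <string>
#include <vector>

// Hypothetical types for illustration only; nothing here comes from Mesos.
struct Listener {
  explicit Listener(std::vector<std::string>* log) : log_(log) {}

  virtual ~Listener() {
    // By the time ~Listener runs, the derived part is already gone, so the
    // dynamic type is Listener and this call resolves to Listener::name().
    // If name() were pure virtual, this call would be undefined behavior,
    // which GCC's runtime typically reports as "pure virtual method called".
    log_->push_back(name());
  }

  virtual std::string name() const { return "Listener"; }

 protected:
  std::vector<std::string>* log_;
};

struct AllocatorListener : Listener {
  using Listener::Listener;  // Inherit the constructor.

  ~AllocatorListener() override {
    // The derived part is still alive here, so the override is used.
    log_->push_back(name());
  }

  std::string name() const override { return "AllocatorListener"; }
};

// Creates and destroys one object, returning the name each destructor saw.
std::vector<std::string> namesSeenDuringTeardown() {
  std::vector<std::string> log;
  {
    AllocatorListener listener(&log);
  }  // Destructors run here: derived first, then base.
  return log;
}
```

Calling namesSeenDuringTeardown() yields the derived name followed by the base name, which shows the object "becoming" its base class partway through destruction. Whether that is what happens in the AllocatorZookeeperTests shutdown path is an open question.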


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10786/#review19724
-----------------------------------------------------------


On April 25, 2013, 8:05 p.m., Thomas Marshall wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10786/
> -----------------------------------------------------------
> 
> (Updated April 25, 2013, 8:05 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Vinod Kone, and Ben Mahler.
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> This addresses bug MESOS-441.
>     https://issues.apache.org/jira/browse/MESOS-441
> 
> 
> Diffs
> -----
> 
>   src/tests/allocator_zookeeper_tests.cpp 2c7deb1 
> 
> Diff: https://reviews.apache.org/r/10786/diff/
> 
> 
> Testing
> -------
> 
> bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=1000 --gtest_filter=*AllocatorZoo*
> 
> 
> Thanks,
> 
> Thomas Marshall
> 
>