Posted to dev@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2014/03/13 01:45:46 UTC
[jira] [Comment Edited] (MESOS-1088)
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
is flaky
[ https://issues.apache.org/jira/browse/MESOS-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932704#comment-13932704 ]
Yan Xu edited comment on MESOS-1088 at 3/13/14 12:43 AM:
---------------------------------------------------------
I suspect that the problem lies in the [latch|https://github.com/apache/mesos/blob/ea1ce107bb2aadc947563f1b59c7d08d1b7125f3/3rdparty/libprocess/src/latch.cpp].
{code:title=latch.cpp}
void Latch::trigger()
{
  if (!triggered) {
    terminate(pid);
{code}
It's possible for {{process::wait(pid, duration)}} below to return and, in turn, for {{Latch::await(...)}} to return {{false}} before the next line executes, right?
{code:title=latch.cpp (continued)}
    triggered = true;
  }
}

bool Latch::await(const Duration& duration)
{
  if (!triggered) {
    process::wait(pid, duration); // Explicit to disambiguate.
    // It's possible that we failed to wait because:
    //   (1) Our process has already terminated.
    //   (2) We timed out (i.e., duration was not "infinite").
    // In the event of (1) we might need to return 'true' since a
    // terminated process might imply that the latch has been
    // triggered. To capture this we simply return the value of
    // 'triggered' (which will also capture cases where we actually
    // timed out but have since triggered, which seems like an
    // acceptable semantics given such a "tie").
    return triggered;
  }
  return true;
}
{code}
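To make the suspected interleaving concrete, here is a minimal standalone sketch. It does not use libprocess: {{ToyLatch}}, {{triggerRacy}}, {{triggerFlagFirst}}, the {{between}} hook and {{observedInWindow}} are all hypothetical names, and a mutex plus condition variable stands in for process termination. The hook runs a waiter exactly in the window between the two steps of {{trigger()}}, so the outcome is deterministic. Setting the flag before terminating is only one possible ordering that closes the window, not necessarily the fix that should go in.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the libprocess Latch; not the real class.
class ToyLatch
{
public:
  // Same order as the quoted code: terminate first, then set the flag.
  // 'between' runs in the window between the two steps.
  void triggerRacy(const std::function<void()>& between = [] {})
  {
    if (!triggered.load()) {
      terminate();
      between();
      triggered.store(true);
    }
  }

  // Alternative order (an assumption): publish the flag before any
  // waiter can be woken up.
  void triggerFlagFirst(const std::function<void()>& between = [] {})
  {
    bool expected = false;
    if (triggered.compare_exchange_strong(expected, true)) {
      between();
      terminate();
    }
  }

  bool await(std::chrono::milliseconds duration)
  {
    if (!triggered.load()) {
      // Stands in for process::wait(pid, duration): returns as soon as
      // the "process" has terminated, or when the timeout expires.
      std::unique_lock<std::mutex> lock(mtx);
      cv.wait_for(lock, duration, [this] { return terminated; });
      return triggered.load();
    }
    return true;
  }

private:
  void terminate()
  {
    {
      std::lock_guard<std::mutex> lock(mtx);
      terminated = true;
    }
    cv.notify_all();
  }

  std::atomic<bool> triggered{false};
  std::mutex mtx;
  std::condition_variable cv;
  bool terminated = false;
};

// Returns what a waiter sees when it runs inside the window between
// the two steps of the chosen trigger variant.
bool observedInWindow(bool flagFirst)
{
  ToyLatch latch;
  bool observed = false;
  auto waiter = [&] {
    observed = latch.await(std::chrono::milliseconds(10000));
  };
  if (flagFirst) {
    latch.triggerFlagFirst(waiter);
  } else {
    latch.triggerRacy(waiter);
  }
  return observed;
}
```

With the quoted ordering, the waiter returns {{false}} immediately even though it was given a 10-second timeout; with the flag set first, it returns {{true}}.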
was (Author: xujyan):
I suspect that it's the problem lies in the [latch|https://github.com/apache/mesos/blob/ea1ce107bb2aadc947563f1b59c7d08d1b7125f3/3rdparty/libprocess/src/latch.cpp].
{code:title=latch.cpp}
void Latch::trigger()
{
  if (!triggered) {
    terminate(pid);
{code}
It's possible for {{process::wait(pid, duration)}} below to return and, in turn, for {{Latch::await(...)}} to return {{false}} before the next line executes, right?
{code:title=latch.cpp (continued)}
    triggered = true;
  }
}

bool Latch::await(const Duration& duration)
{
  if (!triggered) {
    process::wait(pid, duration); // Explicit to disambiguate.
    // It's possible that we failed to wait because:
    //   (1) Our process has already terminated.
    //   (2) We timed out (i.e., duration was not "infinite").
    // In the event of (1) we might need to return 'true' since a
    // terminated process might imply that the latch has been
    // triggered. To capture this we simply return the value of
    // 'triggered' (which will also capture cases where we actually
    // timed out but have since triggered, which seems like an
    // acceptable semantics given such a "tie").
    return triggered;
  }
  return true;
}
{code}
> ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster is flaky
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-1088
> URL: https://issues.apache.org/jira/browse/MESOS-1088
> Project: Mesos
> Issue Type: Bug
> Components: test
> Reporter: Yan Xu
> Assignee: Yan Xu
> Fix For: 0.19.0
>
>
> {code}
> [ RUN ] ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
> I0312 15:50:02.733414 2029 zookeeper_test_server.cpp:158] Started ZooKeeperTestServer on port 32925
> 2014-03-12 15:50:02,733:2029(0x7fc285609700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2014-03-12 15:50:02,733:2029(0x7fc285609700):ZOO_INFO@log_env@716: Client environment:host.name=fedora-20
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.6-200.fc20.x86_64
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 7 17:02:28 UTC 2014
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@733: Client environment:user.name=jenkins
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@log_env@741: Client environment:user.home=/home/jenkins
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@log_env@753: Client environment:user.dir=/var/jenkins/workspace/vinod-test/compiler/clang/os/fedora-20/src
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=127.0.0.1:32925 sessionTimeout=10000 watcher=0x7fc28df599f0 sessionId=0 sessionPasswd=<null> context=0x7fc264019490 flags=0
> I0312 15:50:02.738956 2050 contender.cpp:127] Joining the ZK group
> 2014-03-12 15:50:02,743:2029(0x7fc2532d1700):ZOO_INFO@check_events@1703: initiated connection to server [127.0.0.1:32925]
> 2014-03-12 15:50:02,750:2029(0x7fc2532d1700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:32925], sessionId=0x144b87cfc6c0000, negotiated timeout=10000
> I0312 15:50:02.752624 2051 group.cpp:310] Group process ((1177)@192.168.122.164:46605) connected to ZooKeeper
> I0312 15:50:02.752657 2051 group.cpp:778] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
> I0312 15:50:02.752666 2051 group.cpp:382] Trying to create path '/mesos' in ZooKeeper
> I0312 15:50:02.770174 2052 contender.cpp:243] New candidate (id='0') has entered the contest for leadership
> I0312 15:50:02.773874 2051 detector.cpp:134] Detected a new leader: (id='0')
> I0312 15:50:02.774001 2051 group.cpp:655] Trying to get '/mesos/info_0000000000' in ZooKeeper
> I0312 15:50:02.778889 2051 detector.cpp:377] A new leading master (UPID=@128.150.152.0:10000) is detected
> tests/master_contender_detector_tests.cpp:738: Failure
> Failed to wait 10secs for detected
> I0312 15:50:02.779384 2029 contender.cpp:182] Now cancelling the membership: 0
> 2014-03-12 15:50:02,780:2029(0x7fc28f738880):ZOO_INFO@zookeeper_close@2505: Closing zookeeper sessionId=0x144b87cfc6c0000 to [127.0.0.1:32925]
> I0312 15:50:02.784046 2029 zookeeper_test_server.cpp:122] Shutdown ZooKeeperTestServer on port 32925
> [ FAILED ] ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster (55 ms)
> {code}
> Note that only 55 ms elapsed in this test even though the failure message reports a 10-second wait, and the Clock is not paused.
--
This message was sent by Atlassian JIRA
(v6.2#6252)