Posted to dev@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2014/03/13 01:45:46 UTC

[jira] [Comment Edited] (MESOS-1088) ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster is flaky

    [ https://issues.apache.org/jira/browse/MESOS-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932704#comment-13932704 ] 

Yan Xu edited comment on MESOS-1088 at 3/13/14 12:43 AM:
---------------------------------------------------------

I suspect that the problem lies in the [latch|https://github.com/apache/mesos/blob/ea1ce107bb2aadc947563f1b59c7d08d1b7125f3/3rdparty/libprocess/src/latch.cpp].

{code:title=latch.cpp}
void Latch::trigger()
{
  if (!triggered) {
    terminate(pid);
{code}

It's possible for {{process::wait(pid, duration)}} below to return and, in turn, for {{Latch::await(...)}} to return {{false}} before the next line executes, right?

{code:title=latch.cpp (continued)}
    triggered = true;
  }
}


bool Latch::await(const Duration& duration)
{
  if (!triggered) {
    process::wait(pid, duration); // Explicit to disambiguate.
    // It's possible that we failed to wait because:
    //   (1) Our process has already terminated.
    //   (2) We timed out (i.e., duration was not "infinite").

    // In the event of (1) we might need to return 'true' since a
    // terminated process might imply that the latch has been
    // triggered. To capture this we simply return the value of
    // 'triggered' (which will also capture cases where we actually
    // timed out but have since triggered, which seems like an
    // acceptable semantics given such a "tie").
    return triggered;
  }

  return true;
}
{code}
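
To make the suspected ordering problem concrete, here is a minimal standalone sketch of one way to close the window: publish {{triggered}} before waking any waiter. This is not the libprocess code. {{SketchLatch}}, {{terminated}} and {{raceFreeDemo}} are hypothetical names, and a mutex plus condition variable stands in for {{terminate(pid)}} and {{process::wait(pid, duration)}}.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the latch discussed above. 'terminated' models
// the latch's process having terminated; it is NOT a libprocess concept.
class SketchLatch
{
public:
  // Returns true only for the call that actually fires the latch.
  bool trigger()
  {
    bool expected = false;
    // Publish 'triggered' BEFORE waking waiters, so a waiter that wakes
    // up can never observe triggered == false.
    if (!triggered.compare_exchange_strong(expected, true)) {
      return false; // Already triggered by an earlier call.
    }
    {
      std::lock_guard<std::mutex> lock(mutex);
      terminated = true; // Plays the role of terminate(pid).
    }
    cond.notify_all();
    return true;
  }

  // Returns true if the latch was triggered before or during the wait.
  bool await(std::chrono::milliseconds duration)
  {
    if (!triggered.load()) {
      std::unique_lock<std::mutex> lock(mutex);
      // Plays the role of process::wait(pid, duration).
      cond.wait_for(lock, duration, [this] { return terminated; });
      // As in the original, report 'triggered' rather than the wait result.
      return triggered.load();
    }
    return true;
  }

private:
  std::atomic<bool> triggered{false};
  bool terminated = false; // Guarded by 'mutex'.
  std::mutex mutex;
  std::condition_variable cond;
};

// Triggers from another thread while the caller awaits.
bool raceFreeDemo()
{
  SketchLatch latch;
  std::thread t([&latch] { latch.trigger(); });
  bool result = latch.await(std::chrono::seconds(5));
  t.join();
  return result;
}
```

In this model the wake-up can only happen after {{triggered}} is already {{true}}, so {{await}} cannot return {{false}} because of the ordering inside {{trigger}}. Whether the same reordering is appropriate for the real latch depends on libprocess details not shown here.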


> ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster is flaky
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-1088
>                 URL: https://issues.apache.org/jira/browse/MESOS-1088
>             Project: Mesos
>          Issue Type: Bug
>          Components: test
>            Reporter: Yan Xu
>            Assignee: Yan Xu
>             Fix For: 0.19.0
>
>
> {code}
> [ RUN      ] ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster
> I0312 15:50:02.733414  2029 zookeeper_test_server.cpp:158] Started ZooKeeperTestServer on port 32925
> 2014-03-12 15:50:02,733:2029(0x7fc285609700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2014-03-12 15:50:02,733:2029(0x7fc285609700):ZOO_INFO@log_env@716: Client environment:host.name=fedora-20
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.6-200.fc20.x86_64
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 7 17:02:28 UTC 2014
> 2014-03-12 15:50:02,734:2029(0x7fc285609700):ZOO_INFO@log_env@733: Client environment:user.name=jenkins
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@log_env@741: Client environment:user.home=/home/jenkins
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@log_env@753: Client environment:user.dir=/var/jenkins/workspace/vinod-test/compiler/clang/os/fedora-20/src
> 2014-03-12 15:50:02,735:2029(0x7fc285609700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=127.0.0.1:32925 sessionTimeout=10000 watcher=0x7fc28df599f0 sessionId=0 sessionPasswd=<null> context=0x7fc264019490 flags=0
> I0312 15:50:02.738956  2050 contender.cpp:127] Joining the ZK group
> 2014-03-12 15:50:02,743:2029(0x7fc2532d1700):ZOO_INFO@check_events@1703: initiated connection to server [127.0.0.1:32925]
> 2014-03-12 15:50:02,750:2029(0x7fc2532d1700):ZOO_INFO@check_events@1750: session establishment complete on server [127.0.0.1:32925], sessionId=0x144b87cfc6c0000, negotiated timeout=10000
> I0312 15:50:02.752624  2051 group.cpp:310] Group process ((1177)@192.168.122.164:46605) connected to ZooKeeper
> I0312 15:50:02.752657  2051 group.cpp:778] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
> I0312 15:50:02.752666  2051 group.cpp:382] Trying to create path '/mesos' in ZooKeeper
> I0312 15:50:02.770174  2052 contender.cpp:243] New candidate (id='0') has entered the contest for leadership
> I0312 15:50:02.773874  2051 detector.cpp:134] Detected a new leader: (id='0')
> I0312 15:50:02.774001  2051 group.cpp:655] Trying to get '/mesos/info_0000000000' in ZooKeeper
> I0312 15:50:02.778889  2051 detector.cpp:377] A new leading master (UPID=@128.150.152.0:10000) is detected
> tests/master_contender_detector_tests.cpp:738: Failure
> Failed to wait 10secs for detected
> I0312 15:50:02.779384  2029 contender.cpp:182] Now cancelling the membership: 0
> 2014-03-12 15:50:02,780:2029(0x7fc28f738880):ZOO_INFO@zookeeper_close@2505: Closing zookeeper sessionId=0x144b87cfc6c0000 to [127.0.0.1:32925]
> I0312 15:50:02.784046  2029 zookeeper_test_server.cpp:122] Shutdown ZooKeeperTestServer on port 32925
> [  FAILED  ] ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster (55 ms)
> {code}
> Notice that only 55 ms elapsed for this test, although the failure message reports a 10 secs wait, and the Clock is not paused.



--
This message was sent by Atlassian JIRA
(v6.2#6252)