Posted to dev@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2014/05/07 01:42:17 UTC

[jira] [Commented] (MESOS-1318) ProcessWatcher triggers seg fault

    [ https://issues.apache.org/jira/browse/MESOS-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991348#comment-13991348 ] 

Yan Xu commented on MESOS-1318:
-------------------------------

Here is what I think happened:

1. ZooKeeperImpl publishes itself to the ZK C library through {{zookeeper_init}} [in its constructor|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/zookeeper.cpp#L65].
2. At this point the ZooKeeperImpl constructor has not returned, so the new object has definitely not been [assigned to its wrapper class's impl member|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/zookeeper.cpp#L375].
3. If the ZK client fires an event at this time, it is processed by ProcessWatcher, which [calls getSessionId()|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/watcher.hpp#L33].
4. getSessionId() then tries to use [impl, which is not assigned yet|https://github.com/apache/mesos/blob/f76ab279b55bef3e1b9b0982cbd401ac300f2a82/src/zookeeper/zookeeper.cpp#L393]!
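
The sequence above can be reproduced in miniature. The following is only a sketch under stated assumptions: {{Wrapper}}, {{Impl}}, {{onEvent}} and {{eventSawNullImpl}} are hypothetical stand-ins, not the Mesos classes, and the "event" is fired synchronously from the constructor instead of from a ZK client thread. Instead of dereferencing the pointer (which is what crashes in the real trace), the callback just records that the pointer was still null.

```cpp
#include <cassert>

class Wrapper;

// Hypothetical stand-in for ZooKeeperImpl: its constructor hands the
// owner to a callback before construction has finished.
class Impl {
public:
  explicit Impl(Wrapper* owner);
};

// Hypothetical stand-in for the ZooKeeper wrapper class.
class Wrapper {
public:
  Wrapper() { impl = new Impl(this); }  // impl is assigned only after Impl's ctor returns.
  ~Wrapper() { delete impl; }
  Wrapper(const Wrapper&) = delete;
  Wrapper& operator=(const Wrapper&) = delete;

  // Stand-in for the watcher callback. A real callback would dereference
  // impl here; this one only records whether it was still null.
  void onEvent() { sawNullImpl = (impl == nullptr); }

  bool sawNullImpl = false;

private:
  Impl* impl = nullptr;
};

// Simulates an event arriving while the Impl constructor is still running.
Impl::Impl(Wrapper* owner) {
  assert(owner != nullptr);
  owner->onEvent();
}

// Returns true if the callback observed an unassigned impl pointer.
bool eventSawNullImpl() {
  Wrapper w;
  return w.sawNullImpl;
}
```

Calling {{eventSawNullImpl()}} shows that the callback ran while the wrapper's pointer was still unassigned, which is the state a dereference in getSessionId() would hit.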

The crux is that publishing an object before it's fully initialized is dangerous.
We can:
1. Add a void ZooKeeperImpl::start() which calls zookeeper_init.
2. Have ZooKeeper ctor call {{impl->start()}}.
{code}
ZooKeeper::ZooKeeper(const string& servers,
                     const Duration& timeout,
                     Watcher* watcher)
{
  impl = new ZooKeeperImpl(this, servers, timeout, watcher);
  impl->start();
}
{code}

This way impl is guaranteed to be fully initialized before any event can arrive. However, nothing prevents the ZK client from sending us events before start() returns, that is, before the ZooKeeper object is fully initialized and returned to its caller. So perhaps it's prudent to expose the start() (or init()) method to the users of the ZooKeeper class as well.

So in Group we'd do

{code}
void GroupProcess::initialize()
{
  // Doing initialization here allows us to avoid the race between
  // instantiating the ZooKeeper instance and being spawned ourselves.
  watcher = new ProcessWatcher<GroupProcess>(self());
  zk = new ZooKeeper(servers, timeout, watcher);
  zk->start();
  state = CONNECTING;
}
{code}
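
For comparison, here is a minimal self-contained sketch of the two-step construction proposed above. All names ({{Client}}, {{ClientImpl}}, {{sessionIdSeenByEvent}}, the session id value 7) are hypothetical and chosen for illustration; the event is again fired synchronously from start() to stand in for a callback from the ZK client, so this shows the ordering guarantee only, not real threading.

```cpp
#include <cassert>

class Client;

// Hypothetical stand-in for ZooKeeperImpl: the constructor only stores
// state, and nothing is published until start() is called.
class ClientImpl {
public:
  explicit ClientImpl(Client* owner) : owner(owner) {}
  void start();  // Stand-in for the step that would call zookeeper_init.
  long sessionId() const { return 7; }  // Arbitrary illustrative value.

private:
  Client* owner;
};

// Hypothetical stand-in for the ZooKeeper wrapper class.
class Client {
public:
  Client() : impl(new ClientImpl(this)) {}
  ~Client() { delete impl; }
  Client(const Client&) = delete;
  Client& operator=(const Client&) = delete;

  // Exposed to callers, who invoke it once they hold the finished object.
  void start() { impl->start(); }

  // Stand-in for the watcher callback; impl is already assigned because
  // events can only fire after start().
  void onEvent() { lastSessionId = impl->sessionId(); }

  long lastSessionId = -1;

private:
  ClientImpl* impl;
};

// Simulates the first event arriving once the client has been started.
void ClientImpl::start() {
  assert(owner != nullptr);
  owner->onEvent();
}

// Returns the session id the callback observed after two-step construction.
long sessionIdSeenByEvent() {
  Client c;   // Step 1: construct; no callbacks are possible yet.
  c.start();  // Step 2: publish; callbacks now see a complete object.
  return c.lastSessionId;
}
```

Because the caller invokes start() only after the constructor has returned, the callback always finds a valid impl pointer in this sketch.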

> ProcessWatcher triggers seg fault
> ---------------------------------
>
>                 Key: MESOS-1318
>                 URL: https://issues.apache.org/jira/browse/MESOS-1318
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Yan Xu
>             Fix For: 0.19.0
>
>
> Likely exposed by the fix for MESOS-1265.
> {noformat}
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@716: Client environment:host.name=<redacted>
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@724: Client environment:os.arch=<redacted>
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Mon Apr 7 15:24:34 PDT 2014
> 2014-05-06 18:01:18,943:17653(0x7f27f1117940):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> I0506 18:01:18.947623 17653 slave.cpp:244] Slave resources: cpus(*):0.01; mem(*):160; disk(*):480; ports(*):[31780-31780]
> I0506 18:01:19.030995 17653 slave.cpp:272] Slave hostname: <redacted>
> I0506 18:01:19.031070 17653 slave.cpp:273] Slave checkpoint: true
> I0506 18:01:19.049667 17674 state.cpp:33] Recovering state from '/var/lib/mesos/slaves/20140416-015639-1890854154-5050-1354-24096/frameworks/201103282247-0000000019-0000/executors/thermos-1399399159295-mesos-test-meta-slave-1-424-bb99b160-9bb9-4f9f-ac75-378ca9ef5957/runs/09c67d7a-11f3-4054-bcde-3256f1d17dc6/sandbox/work_3/meta'
> I0506 18:01:19.051961 17674 status_update_manager.cpp:193] Recovering status update manager
> I0506 18:01:19.052105 17674 mesos_containerizer.cpp:201] Recovering containerizer
> I0506 18:01:19.052505 17674 slave.cpp:2943] Finished recovery
> 2014-05-06 18:01:19,057:17653(0x7f27f1117940):ZOO_INFO@log_env@741: Client environment:user.home=/home/mesos
> 2014-05-06 18:01:19,057:17653(0x7f27f1117940):ZOO_INFO@log_env@753: Client environment:user.dir=/var/lib/mesos/slaves/20140416-015639-1890854154-5050-1354-24096/frameworks/201103282247-0000000019-0000/executors/thermos-1399399159295-mesos-test-meta-slave-1-424-bb99b160-9bb9-4f9f-ac75-378ca9ef5957/runs/09c67d7a-11f3-4054-bcde-3256f1d17dc6/sandbox
> 2014-05-06 18:01:19,057:17653(0x7f27f1117940):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=<redacted> sessionTimeout=10000 watcher=0x7f27f8f311f0 sessionId=0 sessionPasswd=<null> context=0x249aed0 flags=0
> 2014-05-06 18:01:19,352:17653(0x7f27ec4fe940):ZOO_INFO@check_events@1703: initiated connection to server [10.36.79.123:2181]
> 2014-05-06 18:01:19,354:17653(0x7f27ec4fe940):ZOO_INFO@check_events@1750: session establishment complete on server [10.36.79.123:2181], sessionId=0x245af1f5caa8812, negotiated timeout=10000
> *** Aborted at 1399399279 (unix time) try "date -d @1399399279" if you are using GNU date ***
> PC: @     0x7f27f8f2f7c0 ZooKeeper::getSessionId()
> *** SIGSEGV (@0x0) received by PID 17653 (TID 0x7f27ebcfd940) from PID 0; stack trace: ***
>     @     0x7f27f8616ca0 (unknown)
>     @     0x7f27f8f2f7c0 ZooKeeper::getSessionId()
>     @     0x7f27f8f4ffcc ProcessWatcher<>::process()
>     @     0x7f27f8f31238 ZooKeeperImpl::event()
>     @     0x7f27f92292d2 deliverWatchers
>     @     0x7f27f921fe33 process_completions
>     @     0x7f27f9224bc1 do_completion
>     @     0x7f27f860e83d start_thread
>     @     0x7f27f737626d clone
> /bin/bash: line 1: 17653 Segmentation fault      (core dumped) META_THERMOS_ROOT=$(pwd)/work_3 /usr/local/sbin/mesos-slave --work_dir="$(pwd)/work_3" --mastsandbox/.logs/mesos-slave-3/0/stderr 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)