You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/07/20 19:38:49 UTC

[jira] [Updated] (MESOS-465) Failures due to ZooKeeper operation timeouts in the master detector.

     [ https://issues.apache.org/jira/browse/MESOS-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-465:
----------------------------------

    Summary: Failures due to ZooKeeper operation timeouts in the master detector.  (was: Failure due to ZooKeeper authentication operation timeout in the master detector.)
    
> Failures due to ZooKeeper operation timeouts in the master detector.
> --------------------------------------------------------------------
>
>                 Key: MESOS-465
>                 URL: https://issues.apache.org/jira/browse/MESOS-465
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Jiang Yan Xu
>
> Observed this in production when ZK connectivity issues were present:
> I0508 19:15:48.485126 18915 slave.cpp:951] Process exited: @0.0.0.0:0
> W0508 19:15:48.485252 18915 slave.cpp:954] WARNING! Master disconnected! Waiting for a new master to be elected.
> 2013-05-08 19:15:51,621:18889(0x4ddeb940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [10.40.93.122:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 3ms)
> 2013-05-08 19:15:51,621:18889(0x4ddeb940):ZOO_INFO@check_events@1585: initiated connection to server [10.40.89.122:2181]
> 2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [10.40.89.122:2181] zk retcode=-112, errno=116(Stale NFS file handle): sessionId=0x33e8587f8b50141 has expired.
> 2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
> 2013-05-08 19:15:51,623:18889(0x4ddeb940):ZOO_DEBUG@do_io@317: IO thread terminated
> 2013-05-08 19:15:51,623:18889(0x4d5ea940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> W0508 19:15:51.623631 18904 detector.cpp:393] Master detector (slave(1)@IP:PORT)  ZooKeeper session expired!
> 2013-05-08 19:15:51,623:18889(0x4d5ea940):ZOO_DEBUG@do_completion@336: completion thread terminated
> 2013-05-08 19:15:51,623:18889(0x46ddd940):ZOO_INFO@zookeeper_close@2321: Freeing zookeeper resources for sessionId=0x33e8587f8b50141
> 2013-05-08 19:15:51,625:18889(0x46ddd940):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@662: Client environment:host.name=HOSTNAME
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@669: Client environment:os.name=Linux
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@670: Client environment:os.arch=2.6.39-t7.el5
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Tue Oct 23 23:59:44 UTC 2012
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@679: Client environment:user.name=(null)
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@687: Client environment:user.home=/root
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@log_env@699: Client environment:user.dir=/
> 2013-05-08 19:15:51,626:18889(0x46ddd940):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=ZKHOST:2181 sessionTimeout=10000 watcher=0x7f6da70adc00 sessionId=0 sessionPasswd
> =<null> context=0x7f6d9c186640 flags=0
> 2013-05-08 19:15:51,629:18889(0x46ddd940):ZOO_DEBUG@start_threads@152: starting threads...
> 2013-05-08 19:15:51,629:18889(0x4d5ea940):ZOO_DEBUG@do_io@279: started IO thread
> 2013-05-08 19:15:51,629:18889(0x4ddeb940):ZOO_DEBUG@do_completion@326: started completion thread
> 2013-05-08 19:15:51,630:18889(0x4d5ea940):ZOO_INFO@check_events@1585: initiated connection to server [IP:2181]
> 2013-05-08 19:15:51,632:18889(0x4d5ea940):ZOO_INFO@check_events@1632: session establishment complete on server [ZK_IP:2181], sessionId=0x3e85920347000b, negotiated timeout=10000
> 2013-05-08 19:15:51,632:18889(0x4d5ea940):ZOO_DEBUG@check_events@1638: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_CONNECTED_STATE
> 2013-05-08 19:15:51,632:18889(0x4ddeb940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> I0508 19:15:51.632997 18908 detector.cpp:283] Master detector (slave(1)@IP:5051) connected to ZooKeeper ...
> I0508 19:15:51.633085 18908 detector.cpp:288] Authenticating to ZooKeeper using scheme 'digest'
> 2013-05-08 19:15:51,633:18889(0x48de1940):ZOO_DEBUG@send_last_auth_info@1265: Sending auth info request to ZK_IP:2181
> I0508 19:15:53.453980 18913 http.cpp:280] HTTP request for '/slave(1)/stats.json'
> 2013-05-08 19:15:58,302:18889(0x4d5ea940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [ZK_IP:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 4ms)
> 2013-05-08 19:15:58,302:18889(0x4d5ea940):ZOO_DEBUG@handle_error@1141: Calling a watcher for a ZOO_SESSION_EVENT and the state=CONNECTING_STATE
> 2013-05-08 19:15:58,302:18889(0x4ddeb940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> F0508 19:15:58.303388 18908 detector.cpp:291] Failed to authenticate with ZooKeeper: operation timeout
> *** Check failure stack trace: ***
>     @     0x7f6da732669d  google::LogMessage::Fail()
>     @     0x7f6da732c307  google::LogMessage::SendToLog()
>     @     0x7f6da7327f4c  google::LogMessage::Flush()
>     @     0x7f6da73281b6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f6da70762e0  mesos::internal::ZooKeeperMasterDetectorProcess::connected()
>     @     0x7f6da7077298  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f6da7078e14  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f6da7221d7b  std::tr1::function<>::operator()()
>     @     0x7f6da71daf9f  process::ProcessBase::visit()
>     @     0x7f6da71efec8  process::DispatchEvent::visit()
>     @     0x7f6da71e411d  process::ProcessManager::resume()
>     @     0x7f6da71e4968  process::schedule()
>     @     0x7f6da68fd73d  start_thread
>     @     0x7f6da52e1f6d  clone
> Here's another case where we don't handle operation timeout (this was on 0.12.0 but we have a similar issue on 0.13.0 when checking for existence of a node, see: MESOS-409):
> F0720 14:29:33.791141 28951 detector.cpp:315] Failed to create '/home/mesos/prod/master' in ZooKeeper: operation timeout
> *** Check failure stack trace: ***
> 2013-07-20 14:29:33,792:28924(0x7ff109727940):ZOO_ERROR@handle_socket_error_msg@1621: Socket [10.35.97.137:2181] zk retcode=-112, errno=116(Stale NFS file handle): s
> essionId=0x23f2f2c0d7eb014 has expired.
> 2013-07-20 14:29:33,792:28924(0x7ff109727940):ZOO_DEBUG@handle_error@1138: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_EXPIRED_SESSION_STATE
> 2013-07-20 14:29:33,793:28924(0x7ff109727940):ZOO_DEBUG@do_io@317: IO thread terminated
> 2013-07-20 14:29:33,793:28924(0x7ff108f26940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>     @     0x7ff11691969d  google::LogMessage::Fail()
>     @     0x7ff11691f307  google::LogMessage::SendToLog()
>     @     0x7ff11691af4c  google::LogMessage::Flush()
>     @     0x7ff11691b1b6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7ff1166696ad  mesos::internal::ZooKeeperMasterDetectorProcess::connected()
>     @     0x7ff11666a298  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7ff11666be14  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7ff116814d7b  std::tr1::function<>::operator()()
>     @     0x7ff1167cdf9f  process::ProcessBase::visit()
>     @     0x7ff1167e2ec8  process::DispatchEvent::visit()
>     @     0x7ff1167d711d  process::ProcessManager::resume()
>     @     0x7ff1167d7968  process::schedule()
>     @     0x7ff115eef83d  start_thread
>     @     0x7ff1148d1f8d  clone

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira