Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/05/21 19:59:39 UTC

[jira] [Commented] (MESOS-1326) Retry policy for zookeeper_init failures

    [ https://issues.apache.org/jira/browse/MESOS-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004990#comment-14004990 ] 

Benjamin Mahler commented on MESOS-1326:
----------------------------------------

Looking at a few of these cases, it's not clear how much benefit retries would add, since I see failures up to 2 minutes after the original failure:

{noformat}
I0521 13:17:34.652848  2729 group.cpp:469] ZooKeeper session expired
2014-05-21 13:17:34,653:2711(0x7f7304dba940):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x456db1b7d4483c
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@716: Client environment:host.name=foo
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@724: Client environment:os.arch=2.6.44-t14.el5
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Thu Dec 19 12:29:49 PST 2013
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@733: Client environment:user.name=(null)
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@741: Client environment:user.home=/root
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@log_env@753: Client environment:user.dir=/
2014-05-21 13:17:34,654:2711(0x7f7304dba940):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=foobar:2181 sessionTimeout=10000 watcher=0x7f731135d070 sessionId=0 sessionPasswd=<null> context=0x7f72fc0bfbf0 flags=0
I0521 13:17:41.288115  2724 slave.cpp:2658] Current usage 7.32%. Max allowed age: 5.787852137166100days
I0521 13:18:11.256170  2731 http.cpp:245] HTTP request for '/slave(1)/stats.json'
I0521 13:18:27.135423  2733 http.cpp:245] HTTP request for '/slave(1)/stats.json'
I0521 13:18:41.290091  2724 slave.cpp:2658] Current usage 7.32%. Max allowed age: 5.787919608946412days
2014-05-21 13:18:54,728:2711(0x7f7304dba940):ZOO_ERROR@getaddrs@599: getaddrinfo: Invalid argument

F0521 13:18:54.729014  2729 zookeeper.cpp:74] Failed to create ZooKeeper, zookeeper_init: Invalid argument [22]
*** Check failure stack trace: ***
    @     0x7f73116105fd  google::LogMessage::Fail()
    @     0x7f7311612444  google::LogMessage::SendToLog()
    @     0x7f73116101ec  google::LogMessage::Flush()
    @     0x7f73116103f9  google::LogMessage::~LogMessage()
    @     0x7f7311611372  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7f731135b561  ZooKeeper::ZooKeeper()
    @     0x7f7311365f38  zookeeper::GroupProcess::expired()
    @     0x7f7311366198  zookeeper::GroupProcess::timedout()
    @     0x7f73115461c2  process::ProcessManager::resume()
    @     0x7f73115464bc  process::schedule()
    @     0x7f7310aba83d  start_thread
    @     0x7f730f82226d  clone
/usr/local/bin/mesos-slave.sh: line 115:  2711 Aborted                 (core dumped) $debug /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 134


WARNING: Logging before InitGoogleLogging() is written to STDERR
F0521 13:19:42.063138  3766 process.cpp:1491] Failed to initialize, gethostbyname2: Host name lookup failure
*** Check failure stack trace: ***
/usr/local/bin/mesos-slave.sh: line 115:  3766 Aborted                 (core dumped) $debug /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 134


WARNING: Logging before InitGoogleLogging() is written to STDERR
F0521 13:20:52.733873  4141 process.cpp:1491] Failed to initialize, gethostbyname2: Host name lookup failure
*** Check failure stack trace: ***
/usr/local/bin/mesos-slave.sh: line 115:  4141 Aborted                 (core dumped) $debug /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 134


I0521 13:21:52.826442  4877 logging.cpp:106] Logging INFO level started!
I0521 13:21:52.827071  4877 main.cpp:126] Build: 2014-04-24 19:52:05 by mockbuild
I0521 13:21:52.827087  4877 main.cpp:128] Version: 0.19.0-tw3
W0521 13:21:52.827102  4877 containerizer.cpp:169] The 'cgroups' isolation flag is deprecated, please update your flags to '--isolation=cgroups/cpu,cgroups/mem'.
I0521 13:21:52.827235  4877 containerizer.cpp:177] Using isolation: cgroups/cpu,cgroups/mem
I0521 13:21:52.839269  4877 cgroups_launcher.cpp:58] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the cgroups launcher
I0521 13:21:52.839462  4877 main.cpp:149] Starting Mesos slave
{noformat}

We may want to consider exiting in these cases instead of aborting, so that they can be distinguished from other crashes for alerting purposes.
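
To make the exit-versus-abort idea concrete, here is a minimal sketch. It is not Mesos code: the enum, the constants, and both function names are hypothetical, and the exit code values are arbitrary picks for illustration. The only fact taken from the log above is that an abort surfaces in the wrapper script as status 134 (128 + SIGABRT), so any dedicated code just needs to differ from that.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical failure categories for illustration; not Mesos identifiers.
enum class StartupFailure { ZooKeeperInit, HostnameLookup };

// Hypothetical exit codes. The values are arbitrary; they only need to
// differ from 134, which is what the shell reports after abort().
constexpr int kExitZooKeeperInit = 70;
constexpr int kExitHostnameLookup = 71;

// Maps a failure category to the exit code an alerting rule could match on.
int exitCodeFor(StartupFailure failure) {
  switch (failure) {
    case StartupFailure::ZooKeeperInit:
      return kExitZooKeeperInit;
    case StartupFailure::HostnameLookup:
      return kExitHostnameLookup;
  }
  return EXIT_FAILURE;  // Fallback for values outside the enum.
}

// Logs the message and exits with the mapped code instead of aborting,
// so no core dump is produced and the status identifies the cause.
[[noreturn]] void exitWith(StartupFailure failure, const std::string& message) {
  std::cerr << message << std::endl;
  std::exit(exitCodeFor(failure));
}
```

With something along these lines, a wrapper script could alert on status 134 as a genuine crash and treat the dedicated codes as an environment problem such as DNS.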

> Retry policy for zookeeper_init failures
> ----------------------------------------
>
>                 Key: MESOS-1326
>                 URL: https://issues.apache.org/jira/browse/MESOS-1326
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.19.0
>            Reporter: Jie Yu
>              Labels: reliability
>
> Currently, we fatal when we have a zookeeper_init failure. Sometimes, this is annoying because during a DNS failover, we may experience this a lot and we don't necessarily need to fatal in those cases.
> I am wondering whether we can retry on zookeeper_init failures?
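
One possible shape for such a retry, as a rough sketch rather than a proposed patch: a bounded exponential backoff around the init call. The helper name `retryWithBackoff` and all of its parameters are hypothetical. The `init` callable is assumed to wrap `zookeeper_init` and return true when it yields a non-NULL handle; the sketch does not call the ZooKeeper C client itself.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of Mesos or the ZooKeeper client.
// Calls `init` until it succeeds, sleeping between attempts with a
// doubling delay capped at `maxDelay`. Gives up once the total time
// spent sleeping has reached `maxElapsed`.
bool retryWithBackoff(
    const std::function<bool()>& init,
    std::chrono::milliseconds initialDelay,
    std::chrono::milliseconds maxDelay,
    std::chrono::milliseconds maxElapsed) {
  std::chrono::milliseconds delay = initialDelay;
  std::chrono::milliseconds waited(0);

  while (true) {
    if (init()) {
      return true;  // Assumed to mean zookeeper_init returned a handle.
    }
    if (waited >= maxElapsed) {
      return false;  // Budget exhausted; the caller decides how to fail.
    }
    std::this_thread::sleep_for(delay);
    waited += delay;
    delay = std::min(delay * 2, maxDelay);
  }
}
```

In the log quoted in the comment, hostname lookups keep failing for roughly two minutes after the first zookeeper_init failure, so a `maxElapsed` budget would have to exceed that for retries to ride out a failover of that length.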


