You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "analogue (JIRA)" <ji...@apache.org> on 2016/05/20 22:00:14 UTC

[jira] [Created] (MESOS-5427) Mesos master locks up after slave fails to authenticate

analogue created MESOS-5427:
-------------------------------

             Summary: Mesos master locks up after slave fails to authenticate
                 Key: MESOS-5427
                 URL: https://issues.apache.org/jira/browse/MESOS-5427
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 0.20.1
         Environment: Linux XXXXXX-XXXXXXXXX 3.13.0-49-generic #81-Ubuntu SMP Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
Ubuntu 10.04.1 LTS
AWS/8cores/16GB
            Reporter: analogue


In a mesos master cluster with one leader and two backups, a single slave attempting to authenticate with the leader locked up the master and resulted in 2 CPU cores pegged at 100% CPU usage until restarted.

master
{noformat}
I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating slave(1)@10.85.20.76:5051
I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL connection
I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
...
W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
(last long line repeats until mesos-master restarted)
{noformat}

slave
{noformat}
Log file created at: 2016/05/16 02:37:52
Running on machine: 10-85-20-76-uswest2btestopia
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: fe0a39112f3304283f970f1b08b322b1e970829d
I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: cgroups/cpu,cgroups/mem
I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 1)@10.85.20.76:5051
I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for authentication from '/etc/seagull_mesos_creds'
W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential file '/etc/seagull_mesos_creds' are too open. It is recommended that your credential file is NOT accessible by others.
I0516 02:37:52.128968 10246 slave.cpp:265] Slave using credential for: seagull_slave
I0516 02:37:52.129612 10246 slave.cpp:278] Slave resources: cpus(*):31; disk(*):140000; mem(*):59382; ports(*):[31000-32000]
I0516 02:37:52.132064 10255 group.cpp:313] Group process (group(1)@10.85.20.76:5051) connected to ZooKeeper
I0516 02:37:52.132086 10255 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0516 02:37:52.132097 10255 group.cpp:385] Trying to create path '/mesos-releng' in ZooKeeper
I0516 02:37:52.200781 10246 slave.cpp:306] Slave hostname: 10-85-20-76-uswest2btestopia.dev.yelpcorp.com
I0516 02:37:52.200804 10246 slave.cpp:307] Slave checkpoint: true
I0516 02:37:52.201323 10262 detector.cpp:138] Detected a new leader: (id='4733')
I0516 02:37:52.201484 10258 group.cpp:658] Trying to get '/mesos-releng/info_0000004733' in ZooKeeper
I0516 02:37:52.202786 10262 detector.cpp:426] A new leading master (UPID=master@10.85.3.50:5050) is detected
I0516 02:37:52.203071 10257 state.cpp:33] Recovering state from '/ephemeral/mesos-slave/meta'
I0516 02:37:52.203229 10253 status_update_manager.cpp:193] Recovering status update manager
I0516 02:37:52.203358 10247 containerizer.cpp:252] Recovering containerizer
I0516 02:37:52.205095 10253 slave.cpp:3198] Finished recovery
I0516 02:37:52.206739 10264 slave.cpp:589] New master detected at master@10.85.3.50:5050
I0516 02:37:52.207399 10264 slave.cpp:663] Authenticating with master master@10.85.3.50:5050
I0516 02:37:52.207417 10253 status_update_manager.cpp:167] New master detected at master@10.85.3.50:5050
I0516 02:37:52.207474 10264 slave.cpp:636] Detecting new master
I0516 02:37:52.207484 10256 authenticatee.hpp:104] Initializing client SASL
I0516 02:37:53.521589 10256 authenticatee.hpp:128] Creating new client SASL connection
W0516 02:37:57.210635 10257 slave.cpp:737] Authentication timed out
W0516 02:37:57.210777 10257 slave.cpp:701] Failed to authenticate with master 
{noformat}

No idea why the timestamps don't match up between the master and the slave but ntp logs to syslog indicate there was no time skew at play here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)