You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "analogue (JIRA)" <ji...@apache.org> on 2016/05/20 22:00:14 UTC

[jira] [Updated] (MESOS-5427) Mesos master locks up after slave fails to authenticate

     [ https://issues.apache.org/jira/browse/MESOS-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

analogue updated MESOS-5427:
----------------------------
    Priority: Minor  (was: Major)

> Mesos master locks up after slave fails to authenticate
> -------------------------------------------------------
>
>                 Key: MESOS-5427
>                 URL: https://issues.apache.org/jira/browse/MESOS-5427
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.20.1
>         Environment: Linux XXXXXX-XXXXXXXXX 3.13.0-49-generic #81-Ubuntu SMP Tue Mar 24 19:29:48 UTC 2015 x86_64 GNU/Linux
> Ubuntu 10.04.1 LTS
> AWS/8cores/16GB
>            Reporter: analogue
>            Priority: Minor
>
> In a mesos master cluster with one leader and two backups, a single slave attempting to authenticate with the leader locked up the master and resulted in 2 CPU cores pegged at 100% CPU usage until restarted.
> master
> {noformat}
> I0516 02:55:39.945566 32126 master.cpp:3612] Authenticating slave(1)@10.85.20.76:5051
> I0516 02:55:39.945757 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.945802 32123 authenticator.hpp:156] Creating new server SASL connection
> I0516 02:55:39.945991 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946030 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946063 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946095 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946126 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946158 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946189 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946221 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946252 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946285 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946316 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946347 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> I0516 02:55:39.946379 32126 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> ...
> W0516 02:55:44.945811 32124 master.cpp:3670] Authentication timed out
> I0516 02:55:49.290623 32121 master.cpp:3598] Queuing up authentication request from slave(1)@10.85.20.76:5051 because authentication is still in progress
> (last long line repeats until mesos-master restarted)
> {noformat}
> slave
> {noformat}
> Log file created at: 2016/05/16 02:37:52
> Running on machine: 10-85-20-76-uswest2btestopia
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> I0516 02:37:52.112509 10198 logging.cpp:142] INFO level logging started!
> I0516 02:37:52.112761 10198 main.cpp:126] Build: 2014-12-12 00:52:32 by
> I0516 02:37:52.112772 10198 main.cpp:128] Version: 0.20.1
> I0516 02:37:52.112778 10198 main.cpp:131] Git tag: 0.20.1
> I0516 02:37:52.112783 10198 main.cpp:135] Git SHA: fe0a39112f3304283f970f1b08b322b1e970829d
> I0516 02:37:52.112793 10198 containerizer.cpp:89] Using isolation: cgroups/cpu,cgroups/mem
> I0516 02:37:52.125773 10198 linux_launcher.cpp:78] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0516 02:37:52.126652 10198 main.cpp:149] Starting Mesos slave
> I0516 02:37:52.128687 10246 slave.cpp:167] Slave started on 1)@10.85.20.76:5051
> I0516 02:37:52.128708 10246 credentials.hpp:84] Loading credential for authentication from '/etc/seagull_mesos_creds'
> W0516 02:37:52.128865 10246 credentials.hpp:99] Permissions on credential file '/etc/seagull_mesos_creds' are too open. It is recommended that your credential file is NOT accessible by others.
> I0516 02:37:52.128968 10246 slave.cpp:265] Slave using credential for: seagull_slave
> I0516 02:37:52.129612 10246 slave.cpp:278] Slave resources: cpus(*):31; disk(*):140000; mem(*):59382; ports(*):[31000-32000]
> I0516 02:37:52.132064 10255 group.cpp:313] Group process (group(1)@10.85.20.76:5051) connected to ZooKeeper
> I0516 02:37:52.132086 10255 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0516 02:37:52.132097 10255 group.cpp:385] Trying to create path '/mesos-releng' in ZooKeeper
> I0516 02:37:52.200781 10246 slave.cpp:306] Slave hostname: 10-85-20-76-uswest2btestopia.dev.yelpcorp.com
> I0516 02:37:52.200804 10246 slave.cpp:307] Slave checkpoint: true
> I0516 02:37:52.201323 10262 detector.cpp:138] Detected a new leader: (id='4733')
> I0516 02:37:52.201484 10258 group.cpp:658] Trying to get '/mesos-releng/info_0000004733' in ZooKeeper
> I0516 02:37:52.202786 10262 detector.cpp:426] A new leading master (UPID=master@10.85.3.50:5050) is detected
> I0516 02:37:52.203071 10257 state.cpp:33] Recovering state from '/ephemeral/mesos-slave/meta'
> I0516 02:37:52.203229 10253 status_update_manager.cpp:193] Recovering status update manager
> I0516 02:37:52.203358 10247 containerizer.cpp:252] Recovering containerizer
> I0516 02:37:52.205095 10253 slave.cpp:3198] Finished recovery
> I0516 02:37:52.206739 10264 slave.cpp:589] New master detected at master@10.85.3.50:5050
> I0516 02:37:52.207399 10264 slave.cpp:663] Authenticating with master master@10.85.3.50:5050
> I0516 02:37:52.207417 10253 status_update_manager.cpp:167] New master detected at master@10.85.3.50:5050
> I0516 02:37:52.207474 10264 slave.cpp:636] Detecting new master
> I0516 02:37:52.207484 10256 authenticatee.hpp:104] Initializing client SASL
> I0516 02:37:53.521589 10256 authenticatee.hpp:128] Creating new client SASL connection
> W0516 02:37:57.210635 10257 slave.cpp:737] Authentication timed out
> W0516 02:37:57.210777 10257 slave.cpp:701] Failed to authenticate with master 
> {noformat}
> No idea why the timestamps don't match up between the master and the slave but ntp logs to syslog indicate there was no time skew at play here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)