You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@brooklyn.apache.org by "Aled Sage (JIRA)" <ji...@apache.org> on 2017/02/08 15:41:41 UTC

[jira] [Commented] (BROOKLYN-434) Provisioning of CentOS 7.3 (via jclouds) can leave sshd in bad state

    [ https://issues.apache.org/jira/browse/BROOKLYN-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858132#comment-15858132 ] 

Aled Sage commented on BROOKLYN-434:
------------------------------------

A workaround is to change jclouds {{org.jclouds.scriptbuilder.statements.ssh.SshdConfig}} so that it does not call {{service sshd reload}}. Instead, one can call {{service sshd stop; service sshd start}}. For example:

{noformat}
-      Statement reloadSshdConfig = exec("hash service 2>&- && service ssh reload 2>&- || service sshd reload 2>&- || /etc/init.d/ssh* reload");
+      Statement reloadSshdConfig = exec("( hash service 2>&- && ( service ssh reload || ( service sshd stop; service sshd start ) ) ) || /etc/init.d/ssh* reload");
{noformat}

---
However, Yavor has also observed strange behaviour where the file {{/var/run/sshd.pid}} is missing, even though the process is running! I'm not sure exactly when that happens or how to reproduce it. In that situation, it's not enough to call stop + start, because an sshd is still running in the background listening on port 22, and systemd never finds out about it.

For this, Yavor came up with the solution below - it finds the process listening on port 22 and kills it:
{noformat}
String cmd = "systemctl stop sshd;ss -lntp4|grep ':22'|awk -v FS=\"(pid=|,)\" '{if (NR) system(\"kill \" $3)}';systemctl start sshd";
{noformat}

(I suspect we'd need to be a bit more careful with the {{grep ':22'}} so that it doesn't kill a process listening on port 220).

This is a belt-and-braces fix!

> Provisioning of CentOS 7.3 (via jclouds) can leave sshd in bad state
> --------------------------------------------------------------------
>
>                 Key: BROOKLYN-434
>                 URL: https://issues.apache.org/jira/browse/BROOKLYN-434
>             Project: Brooklyn
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>            Reporter: Aled Sage
>
> **Underlying Problem**
> There is a bug in CentOS 7.3 (and RHEL 7.3) related to reloading of the SSH daemon configuration. See:
> * RHEL bug report at https://bugzilla.redhat.com/show_bug.cgi?id=1381997 (which gives the best description of the underlying problem in https://bugzilla.redhat.com/show_bug.cgi?id=1381997#c4).
> * CentOS bug report at https://bugs.centos.org/view.php?id=12757
> * The change introduced the problematic behavior at https://bugzilla.redhat.com/show_bug.cgi?id=1291172
> **Brooklyn Code calling sshd reload**
> The problems occurs when jclouds (or {{org.apache.brooklyn.policy.jclouds.os.CreateUserPolicy}}) modifies {{/etc/ssh/sshd_config}} and then calls {{service sshd reload}}. This can leave systemd's view of sshd in a bad state. A subsequent call to {{service sshd reload}} or {{service sshd restart}} will not have the desired effect!
> The jclouds code calls {{service sshd reload}} if it is configured to lock-down sshd, which is the default - it will configure {{PasswordAuthentication no}} and {{PermitRootLogin no}} (see https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/location/jclouds/JcloudsLocation.java#L2293).
> Subsequent use of {{CreateUserPolicy}} will also call {{service sshd reload}} (see https://github.com/apache/brooklyn-server/blob/rel/apache-brooklyn-0.10.0/locations/jclouds/src/main/java/org/apache/brooklyn/policy/jclouds/os/CreateUserPolicy.java#L159-L164).
> **Observed Behaviour: initial jclouds provisioning**
> The exact behaviour we see on CentOS 7.3 is non-deterministic. Below we attempt to describe what can happen.
> *TL;DR: systemd thinks sshd is inactive/dead, but the process is really running.*
> On a newly provisioned CentOS 7.3 VM, immediately after we've provisioned it through jclouds...
> Looking at the output from jclouds (having tweaked it so that the stderr isn't lost), it executed:
> {noformat}
> exec 3<> /etc/ssh/sshd_config && awk -v TEXT="PasswordAuthentication no
> PermitRootLogin no
> " 'BEGIN {print TEXT}{print}' /etc/ssh/sshd_config >&3
> hash service && service ssh reload || service sshd reload || /etc/init.d/ssh* reload
> {noformat}
> The relevant stderr is:
> {noformat}
> Redirecting to /bin/systemctl reload  sshd.service
> Redirecting to /bin/systemctl status  sshd.service
> {noformat}
> The status of sshd (according the systemd) is:
> {noformat}
> $ service sshd status
> Redirecting to /bin/systemctl status  sshd.service
> * sshd.service - OpenSSH server daemon
>    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
>    Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 5s ago
>      Docs: man:sshd(8)
>            man:sshd_config(5)
>   Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
>   Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 10348 (code=exited, status=0/SUCCESS)
>    CGroup: /system.slice/sshd.service
>            └─28942 /usr/sbin/sshd
> {noformat}
> However, the process is really running and we can ssh to it:
> {noformat}
> $ cat /var/run/sshd.pid
> 28942
> $ ps aux | grep "/usr/sbin/sshd"
> root     28942  0.0  0.0  82468  1260 ?        Ss   15:16   0:00 /usr/sbin/sshd
> {noformat}
> **Observed Behaviour: sshd reload again**
> *TL;DR: running reload again fails - the process keeps running, but does not re-load the sshd_config file.*
> After running {{service sshd reload}}, if the same command is run again we get:
> {noformat}
> $ sudo service sshd reload
> Redirecting to /bin/systemctl reload  sshd.service
> Job for sshd.service invalid.
> $ echo $?
> 1
> $ cat /var/run/sshd.pid
> 28942
> $ ps aux | grep "/usr/sbin/sshd"
> root     28942  0.0  0.0  82468  1260 ?        Ss   15:16   0:00 /usr/sbin/sshd
> $ service sshd status
> Redirecting to /bin/systemctl status  sshd.service
> * sshd.service - OpenSSH server daemon
>    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
>    Active: inactive (dead) since Wed 2017-02-08 15:16:18 UTC; 1min 9s ago
>      Docs: man:sshd(8)
>            man:sshd_config(5)
>   Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
>   Process: 10346 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 10348 (code=exited, status=0/SUCCESS)
>    CGroup: /system.slice/sshd.service
>            └─28942 /usr/sbin/sshd
> {noformat}
> In {{/var/log/messages}}, we have:
> {noformat}
> Feb  8 15:17:23 host-10-104-0-188 systemd: Unit sshd.service cannot be reloaded because it is inactive.
> {noformat}
> **Observed Behaviour: restart, after a failed reload**
> *TL;DR: systemd does not restart the process (same pid) - the existing process is now recognised as running though, but the {{/etc/ssh/sshd_config}} has not be reloaded.*
> {noformat}
> $ sudo service sshd restart
> Redirecting to /bin/systemctl restart  sshd.service
> $ echo $?
> 0
> $ cat /var/run/sshd.pid
> 28942
> $ ps aux | grep "/usr/sbin/sshd"
> root     28942  0.0  0.0  82468  1260 ?        Ss   15:16   0:00 /usr/sbin/sshd
> $ service sshd status
> Redirecting to /bin/systemctl status  sshd.service
> * sshd.service - OpenSSH server daemon
>    Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
>    Active: active (running) since Wed 2017-02-08 15:21:39 UTC; 5s ago
>      Docs: man:sshd(8)
>            man:sshd_config(5)
>   Process: 28941 ExecReload=/bin/kill -HUP $MAINPID (code=exited, status=0/SUCCESS)
>   Process: 29726 ExecStart=/usr/sbin/sshd $OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 28942 (sshd)
>    CGroup: /system.slice/sshd.service
>            └─28942 /usr/sbin/sshd
> Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Starting OpenSSH server daemon...
> Feb 08 15:21:39 qa-machine-en-fcap6ij4ya-7bf systemd[1]: Started OpenSSH server daemon.
> {noformat}
> In {{/var/log/messages}}, we have:
> {noformat}
> Feb  8 15:21:39 host-10-104-0-188 systemd: Starting OpenSSH server daemon...
> Feb  8 15:21:39 host-10-104-0-188 systemd: Started OpenSSH server daemon.
> {noformat}
> And in {{/var/log/secure}}:
> {noformat}
> Feb  8 15:21:39 host-10-104-0-188 sudo:     amp : TTY=pts/0 ; PWD=/home/users/amp ; USER=root ; COMMAND=/usr/sbin/service sshd restart
> Feb  8 15:21:39 host-10-104-0-188 polkitd[497]: Registered Authentication Agent for unix-process:29709:569722 (system bus name :1.29 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8)
> Feb  8 15:21:39 host-10-104-0-188 polkitd[497]: Unregistered Authentication Agent for unix-process:29709:569722 (system bus name :1.29, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) (disconnected from bus)
> Feb  8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
> Feb  8 15:21:39 host-10-104-0-188 sshd[29727]: error: Bind to port 22 on :: failed: Address already in use.
> Feb  8 15:21:39 host-10-104-0-188 sshd[29727]: fatal: Cannot bind any address.
> {noformat}
> My wild guess at what has happened here is...
> 1. systemd thinks sshd is not running, so did not stop it.
> 2. systemd starts another sshd process, which fails because process 28942 is already running and listening on port 22.
> 3. systemd looks at the pid file, and finds 28942 written there - that sshd process is running, so it reports sshd as "active (running)".
> The sshd process (28942) has *not* re-loaded the sshd_config file, so changes will not have taken effect.
> **Observed Behaviour: stop + start, after a failed reload**
> Note that calling {{service sshd stop; service sshd start}} after a bad reload will behave in the same way as the {{service sshd restart}}.
> **Observed Behaviour: subsequent restart, after failed restart**
> A subsequent call to restart now succeeds as expected. The process is replaced, and the changes made in sshd_config take effect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)