Posted to dev@jclouds.apache.org by "Ashley Kasim (akasim)" <ak...@cisco.com> on 2015/08/26 20:23:32 UTC
jclouds SshClient and ScriptBuilder
Hi,
I've been using jclouds ScriptBuilder and SshClient to run scripts on nodes I've created with jclouds. Everything runs correctly, except that when the last command is "service start", the service gets killed on SSH disconnect, and when I SSH into the machine and check the status of the service I see the message "node app stopped but pid file exists". However, adding an extra "service status" line at the end of the script prevents the service from being stopped. Why is this? As a side note, appending anything other than "service status" (for example, "true") does not have the same effect.
Any insight as to why this is happening? Is this a deliberate implementation decision? Is there a better way around this?
Here's the script I'm running, for reference, where hatjitsu is an open source planning poker node app:
mv /tmp/hatjitsu /etc/rc.d/init.d/
chmod +x /etc/rc.d/init.d/hatjitsu
yum -y install git
curl -sL https://rpm.nodesource.com/setup | bash -
yum -y install nodejs
mkdir -p /var/www/hatjitsu
cd /var/www/hatjitsu; npm install richarcher/Hatjitsu
chkconfig --add hatjitsu
chmod +x /etc/init.d/hatjitsu
service hatjitsu start
# prevent the service from being stopped on ssh disconnect
service hatjitsu status
Java code to run the script:
LoginCredentials login = LoginCredentials.builder()
    .user("root")
    .privateKey(privateKey)
    .build();
SshClient client = compute.getContext().utils().sshForNode()
    .apply(NodeMetadataBuilder.fromNodeMetadata(md)
        .credentials(login)
        .build());

ScriptBuilder scriptBuilder = new ScriptBuilder();
for (String s : sshCmds) {
    scriptBuilder.addStatement(Statements.exec(s));
}
String script = scriptBuilder.render(OsFamily.UNIX);
client.exec(script);
Thanks for any insight you can provide,
Ashley Kasim
Re: jclouds SshClient and ScriptBuilder
Posted by Andrew Phillips <an...@apache.org>.
Hi Ashley
> Any insight as to why this is happening? Is this a deliberate
> implementation decision? Is there a better way around this?
I haven't had a chance to look at your scenario in detail, but this
sounds very similar to a common problem related to a nohup/ssh race
condition.
Basically, what can happen in many remote automation scenarios is as
follows:
* You open an SSH connection to a box which calls something like a
"service start" script
* The "service start" script returns as soon as it has run its final
command, which is something like "nohup /my/service/run.sh &"
Now there is a race condition between nohup doing its job and SSH
closing the connection. More specifically, there is a short period of
time before nohup has been able to disconnect the service process from
sshd. If the sshd process terminates before the service process has been
disconnected, it will kill the service, as that is (still) a child
process of sshd.
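The race is easiest to see from the shape of a typical SysV "start" action. Here is a minimal sketch (assumed names throughout; the real hatjitsu init script may look different), with `sleep 30` standing in for the daemon:

```shell
#!/bin/sh
# Hypothetical start() in the style many SysV init scripts use.
start() {
    # nohup detaches the daemon from the terminal, but the process is
    # still a child of the invoking shell (and, over SSH, of sshd)
    # until it is reparented -- that window is the race described above.
    nohup sleep 30 >/dev/null 2>&1 &   # stand-in for the real daemon
    echo $! > /tmp/myservice.pid
}

start
# start() returns immediately; nothing here waits for the daemon,
# so the SSH session can close while the daemon is still a child of sshd.
```

If sshd tears down its session during that window, the not-yet-reparented child is killed along with it, which matches the "pid file exists but the process is gone" symptom.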
The solution here is to ensure that nohup has had a chance to kick in
before the sshd process terminates. Probably the best way to do this is
to ensure your service start script does not return until the service is
actually up.
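One way to do that is to end the start script with a polling loop that blocks until the service reports healthy. A sketch, where `service_is_up` is a placeholder (a real script would run "service hatjitsu status" or probe the app's TCP port instead):

```shell
#!/bin/sh
# Block until the service is actually up before the script returns.
service_is_up() {
    [ -f /tmp/hatjitsu.up ]    # placeholder readiness check (assumption)
}

wait_for_service() {
    tries=0
    while [ "$tries" -lt 30 ]; do
        # Return as soon as the readiness check passes.
        service_is_up && return 0
        tries=$((tries + 1))
        sleep 1
    done
    echo "service did not come up within 30s" >&2
    return 1
}
```

By the time the loop succeeds, the daemon has long since been detached from the session, so closing the SSH connection no longer kills it.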
Another common way is to simply change the command you're running via
SSH from "service start" to "service start && sleep 2", although all
that's really doing is giving nohup two more seconds to do its job.
Without knowing more about what your service script is doing, I can't
say whether this is actually what you're seeing, but the symptoms
certainly sound comparable.
Regards
ap