Posted to dev@jclouds.apache.org by "Ashley Kasim (akasim)" <ak...@cisco.com> on 2015/08/26 20:23:32 UTC

jclouds SshClient and ScriptBuilder

Hi,

I've been using jclouds ScriptBuilder and SshClient to run scripts on nodes I've created with jclouds. Everything runs correctly, with one exception: when the last command is "service start", the service gets killed on ssh disconnect, and I see the message "node app stopped but pid file exists" when I ssh into the machine and check the status of the service. However, adding "service status" as an additional line at the end of the script prevents the service from being stopped. Why is this? On a side note, appending anything other than "service status" (for example, "true") does not have the same effect.

Any insight as to why this is happening? Is this a deliberate implementation decision? Is there a better way around this?

Here's the script I'm running, for reference, where hatjitsu is an open source planning poker node app:

mv /tmp/hatjitsu /etc/rc.d/init.d/
chmod +x /etc/rc.d/init.d/hatjitsu
yum -y install git
curl -sL https://rpm.nodesource.com/setup | bash -
yum -y install nodejs
mkdir -p /var/www/hatjitsu
cd /var/www/hatjitsu; npm install richarcher/Hatjitsu
chkconfig --add hatjitsu
chmod +x /etc/init.d/hatjitsu
service hatjitsu start
# prevent the service from being stopped on ssh disconnect
service hatjitsu status

Java code to run the script:

LoginCredentials login = LoginCredentials.builder().user("root").privateKey(privateKey).build();

SshClient client = compute.getContext().utils().sshForNode().apply(
        NodeMetadataBuilder.fromNodeMetadata(md).credentials(login).build());

ScriptBuilder scriptBuilder = new ScriptBuilder();
for (String s : sshCmds) {
    scriptBuilder.addStatement(Statements.exec(s));
}
String script = scriptBuilder.render(OsFamily.UNIX);
client.exec(script);

Thanks for any insight you can provide,

Ashley Kasim

Re: jclouds SshClient and ScriptBuilder

Posted by Andrew Phillips <an...@apache.org>.

Hi Ashley

> Any insight as to why this is happening? Is this a deliberate
> implementation decision? Is there a better way around this?

I haven't had a chance to look at your scenario in detail, but this 
sounds very similar to a common problem related to a nohup/ssh race 
condition.

Basically, what can happen in many remote automation scenarios is as 
follows:

* You open an SSH connection to a box which calls something like a 
"service start" script
* The "service start" script returns as soon as it has run its final 
command, which is something like "nohup /my/service/run.sh &"

Now there is a race condition between nohup doing its job and SSH 
closing the connection. More specifically, there is a short period of 
time before nohup has been able to disconnect the service process from 
sshd. If the sshd process terminates before the service process has been 
disconnected, it will kill the service, as that is (still) a child 
process of sshd.
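
To make the window concrete, here is a rough shell sketch of the kind of start action described above. It is not the actual hatjitsu init script; the function name start_service and the pid-file handling are hypothetical and only illustrate the pattern.

```shell
# Hypothetical start action: launch a command in the background under nohup
# and return right away. Not taken from any real init script.
start_service() {
  # usage: start_service PIDFILE COMMAND [ARGS...]
  pidfile=$1
  shift
  # Detach stdin/stdout/stderr so the process holds no handle on the session.
  nohup "$@" >/dev/null 2>&1 </dev/null &
  # Record the pid of the background process.
  echo $! > "$pidfile"
  # The function returns here immediately. Nothing has checked that the
  # process is actually up, which is where the race window opens.
}
```

If this is the last thing the remote script does, the SSH session can close almost immediately after the background process was launched.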

The solution here is to ensure that nohup has had a chance to kick in 
before the sshd process terminates. Probably the best way to do this is 
to ensure your service start script does not return until the service is 
actually up.
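
One possible way to do that from the calling script, sketched in shell. It assumes the init script's "status" action exits with 0 once the service is running (the usual LSB convention; I have not checked this for the hatjitsu script). The helper name wait_until and the 30-second timeout are made up for illustration.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) is reached.
wait_until() {
  # usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
  timeout_s=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1   # gave up: the command never succeeded
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0       # the command succeeded
}

# Possible use at the end of the provisioning script (assumes "status"
# exits 0 when the service is running):
#   service hatjitsu start
#   wait_until 30 service hatjitsu status
```

With this in place, the script returns a non-zero exit code if the service does not come up within the timeout, so the caller can tell a failed start from a successful one.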

Another common way is to simply change the command you're running via 
SSH from "service start" to "service start && sleep 2", although all 
that's really doing is giving nohup two more seconds to do its job.

Without knowing more about what your service script is doing, I can't 
say whether this is actually what you're seeing, but the symptoms 
certainly sound comparable.

Regards

ap