Posted to dev@airavata.apache.org by "Pierce, Marlon" <ma...@iu.edu> on 2016/05/12 18:04:57 UTC

Connection timeout settings

We occasionally hit connection timeouts when performing remote SSH operations. A potentially bad side effect is that a large job launches successfully but we never get the job ID back. One straightforward fix is to set a connection timeout longer than the default in the JSch clients.

Looking through the code, I don’t see that we are doing this. Is this correct? Would there be any unintended consequences of using something longer, like 60 seconds? The default is 20 seconds.
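
As a rough illustration of making the timeout explicit and configurable, here is a minimal Java sketch. The `TimedConnect` interface, the `ConnectTimeoutSketch` class, and the `ssh.connect.timeout.millis` property name are all hypothetical and not taken from the Airavata code base; the interface only mirrors the shape of JSch's `Session.connect(int connectTimeout)`, which takes a timeout in milliseconds. The 60-second default is simply the value floated above.

```java
/** Hypothetical stand-in for an SSH session; JSch's Session.connect(int) has this shape. */
interface TimedConnect {
    void connect(int timeoutMillis) throws Exception;
}

final class ConnectTimeoutSketch {
    // Hypothetical property name, not an existing Airavata setting.
    static final String TIMEOUT_PROPERTY = "ssh.connect.timeout.millis";
    // 60 seconds, the value suggested in this thread (an assumption, not a tested recommendation).
    static final int DEFAULT_TIMEOUT_MILLIS = 60_000;

    /** Reads the timeout from a system property, falling back to the default. */
    static int resolveTimeoutMillis() {
        int value = Integer.getInteger(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_MILLIS);
        if (value <= 0) {
            throw new IllegalArgumentException(TIMEOUT_PROPERTY + " must be positive, got " + value);
        }
        return value;
    }

    /** Connects with the resolved timeout and returns the value that was used. */
    static int connect(TimedConnect target) throws Exception {
        int timeoutMillis = resolveTimeoutMillis();
        target.connect(timeoutMillis);
        return timeoutMillis;
    }
}
```

With JSch on the classpath, the call would presumably look like `ConnectTimeoutSketch.connect(session::connect)`; that wiring is untested here.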

There is also a longer discussion to be had about the right way to handle these events in the first place. We may not want to depend on the standard output at all. Increasing the timeouts would at least put a band-aid on the current issue.

Marlon




Re: Connection timeout settings

Posted by K Yoshimoto <ke...@sdsc.edu>.
For the longer discussion, maybe some type of journaling system to
track, redo, and clean up remote executions might be good...
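
To make the journaling idea a little more concrete, here is a minimal in-memory sketch in Java. Every name in it (`SubmissionJournalSketch`, the states, the methods) is hypothetical, and a real journal would need durable storage rather than a map. The point is only that an intent is recorded before the remote call, so an entry that never reaches `SUBMITTED` flags a job that may be running remotely without a known job ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SubmissionJournalSketch {
    enum State { INTENT_RECORDED, SUBMITTED, CLEANED_UP }

    static final class Entry {
        State state = State.INTENT_RECORDED;
        String jobId; // null until the scheduler's job ID is known
    }

    // In-memory stand-in for what would have to be durable storage.
    private final Map<String, Entry> entries = new LinkedHashMap<>();

    /** Call before the remote submission is attempted. */
    void recordIntent(String experimentId) {
        entries.put(experimentId, new Entry());
    }

    /** Call once the job ID has come back from the remote side. */
    void recordSubmitted(String experimentId, String jobId) {
        Entry entry = require(experimentId);
        entry.jobId = jobId;
        entry.state = State.SUBMITTED;
    }

    /** Call after remote artifacts for this entry have been cleaned up. */
    void recordCleanedUp(String experimentId) {
        require(experimentId).state = State.CLEANED_UP;
    }

    /** Entries whose submission outcome is unknown and needs a redo or cleanup decision. */
    List<String> unresolved() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Entry> e : entries.entrySet()) {
            if (e.getValue().state == State.INTENT_RECORDED) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    private Entry require(String experimentId) {
        Entry entry = entries.get(experimentId);
        if (entry == null) {
            throw new IllegalStateException("no journal entry for " + experimentId);
        }
        return entry;
    }
}
```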


Re: Connection timeout settings

Posted by Amila Jayasekara <th...@gmail.com>.
Hi Marlon,

I am trying to recall what I can about the job submission
implementations.
As far as I can remember, we implemented a two-phase commit protocol with
GSISSH. With the two-phase commit protocol, we first get the job ID and then
submit the job in a single atomic step (so losing the job ID is not a problem).
We had some discussions about implementing the same for JSch (at least a
design), but I am not sure whether it was ever integrated into Airavata. Maybe
Lahiru (or whoever is working on GFac) can give more information about this.

However, the two-phase commit protocol only works if the underlying job
scheduler can issue a job ID without actually submitting the job. As far
as I can remember, Moab is capable of doing that, but I am not sure about
job schedulers such as Slurm.
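
One possible reading of the two-phase approach described above, sketched in Java. The `ReservingScheduler` interface is hypothetical and does not correspond to any real scheduler or Airavata API; it just assumes the scheduler can hand out a job ID before the job is submitted. The list of known IDs stands in for whatever persistent store would be used in practice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical scheduler API that can reserve a job ID before submission. */
interface ReservingScheduler {
    String reserveJobId() throws Exception;
    void submit(String reservedJobId, String jobScript) throws Exception;
}

final class TwoPhaseSubmitSketch {
    // Stand-in for persistent storage of job IDs.
    private final List<String> knownJobIds = new ArrayList<>();

    /**
     * Phase 1 obtains and records the job ID; phase 2 submits under that ID.
     * If phase 2 fails or times out, the ID is already recorded, so the job
     * can still be looked up on the scheduler afterwards.
     */
    String submit(ReservingScheduler scheduler, String jobScript) throws Exception {
        String jobId = scheduler.reserveJobId(); // phase 1
        knownJobIds.add(jobId);                  // record before anything is launched
        scheduler.submit(jobId, jobScript);      // phase 2
        return jobId;
    }

    /** Read-only snapshot of the IDs recorded so far. */
    List<String> knownJobIds() {
        return Collections.unmodifiableList(new ArrayList<>(knownJobIds));
    }
}
```

The design choice here is that the ID is recorded before the submit call, so a timeout during submission leaves something to query rather than nothing.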

IMO, "longer connection timeout" is not the perfect solution but it could
be a good workaround.

Thanks
-Thejaka Amila
