Posted to dev@aurora.apache.org by Hussein Elgridly <hu...@broadinstitute.org> on 2015/04/08 17:53:32 UTC

Preventing Thermos from repeatedly retrying killed processes

We're finding a lot of our jobs are getting stuck in a state where Thermos
is repeatedly retrying failed processes.

I ran through one of these with Brian Wickman, who noted that in that
particular case the process in question was exiting with -6 (SIGABRT),
which Thermos doesn't consider a fatal enough signal to be concerned
with, so it retries.

I'm now seeing another process that seems to be getting killed with -9
(SIGKILL), and Thermos is still restarting it; the following is pulled
from thermos_runner.DEBUG:

https://gist.github.com/helgridly/e4413fd01d45b8c6d1c0
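For context on the negative exit codes above, here's a small sketch of how the shell encodes signal deaths: a child terminated by a signal exits with status 128+signum, which Python's subprocess layer (and hence Thermos) surfaces as a negative returncode like -6 or -9. The commands below are just an illustration, not anything from our jobs:

```shell
#!/bin/sh
# A child killed by a signal exits with status 128+signum.
sh -c 'kill -ABRT $$'; echo "after SIGABRT: $?"   # prints 134 (128+6)
sh -c 'kill -KILL $$'; echo "after SIGKILL: $?"   # prints 137 (128+9)
```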

So: what's going on here?

All our tasks are marked as max_failures = 1, but that doesn't seem to be
preventing Thermos from retrying processes. It looks like Thermos is
interpreting various kill signals as "lost" rather than failed, and
retrying without incrementing the failure count. What I can't find is
what's calling on_killed in runner.py. Nor can I figure out what to do
about any of this.

Brian and I talked about wrapping all commands in a shell script that exits
0 if its child command did and 1 in all other cases. While this might work,
I don't really understand why this is necessary; can someone explain the
reasoning behind Thermos ever deciding a process got "lost" rather than
simply failed?
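For the record, the wrapper we discussed would look something like the sketch below (the function name run_wrapped is my own placeholder). It collapses every non-zero outcome, including signal deaths, into a plain exit 1, so Thermos would see an ordinary process failure rather than a "lost" process:

```shell
#!/bin/sh
# Hypothetical wrapper sketch: run the real command line and normalize
# its exit status. Signal deaths (status 128+signum) become a plain 1.
run_wrapped() {
    "$@"                 # run the actual command
    if [ $? -eq 0 ]; then
        return 0         # child succeeded: propagate success
    else
        return 1         # child failed or was killed: plain failure
    fi
}

# Example usage: run_wrapped my_command --flag arg
```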

Thanks,
Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard