Posted to general@gump.apache.org by Leo Simons <ma...@leosimons.com> on 2005/04/15 14:51:47 UTC
Re: termination algorithm
Hi Greg, Ed, gang,
I've integrated this into our development version of gump. There were a few
kinks to hammer out (for example catch OSError and not IOError, exit out of
the "while not timed out" as early as possible, no "-" for killpg), but it
seems to be basically working at this point. For now I'm just using a single
process group for all children and not really object-orienting things as
much as I could, but it does what we need. The code is at
https://svn.apache.org/repos/asf/gump/branches/Gump3/pygump/python/gump/util/executor.py
It's kinda cool how easy it was to combine this with the new "subprocess"
module in python2.4, which is kind enough to provide a hook for inserting
the "set child group" functionality.
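[The executor.py code itself isn't quoted here, but the subprocess hook Leo means is presumably Popen's preexec_fn argument, which runs a callable in the forked child just before exec. A minimal modern sketch; the sleep command is just a stand-in for a real build tool:]

```python
import os
import signal
import subprocess

# Spawn a child in its own process group. preexec_fn runs in the
# forked child before exec, so os.setpgrp() puts the child (and
# anything it later spawns) into a fresh group we can signal as a unit.
proc = subprocess.Popen(
    ["sleep", "60"],
    preexec_fn=os.setpgrp,  # child calls setpgrp() before exec
)

# After setpgrp(), the child's pid doubles as its process-group id.
pgrp = proc.pid
assert os.getpgid(proc.pid) == pgrp

# Tear the whole group down, then reap the direct child.
os.killpg(pgrp, signal.SIGTERM)
proc.wait()
```

[On current Python you could pass start_new_session=True instead, which calls setsid() and avoids the preexec_fn caveats around threads.]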
It's still a little ugly in some ways, since we want to be sure that a
process group stays around even if it's mostly empty. That makes stuff like
introspection a little easier. I just created a child that runs for the
length of the application to visibly keep the group around. Ugly, but the
only thing I could think of atm.
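[That keeper child can be sketched in a few lines. A hypothetical stand-in, not the Gump code; the names are mine:]

```python
import os
import signal
import time

def start_group_keeper():
    """Fork an idle child that leads a fresh process group, and return
    the group id. While the keeper lives, the group stays around even
    if every real worker in it has already exited."""
    pid = os.fork()
    if pid == 0:
        os.setpgid(0, 0)      # child: start a new group, become leader
        while True:
            time.sleep(60)    # do nothing, just keep the group alive
    # Parent sets the group too; doing it on both sides closes the
    # race between fork() returning and the child calling setpgid().
    try:
        os.setpgid(pid, pid)
    except OSError:
        pass                  # child won the race; nothing to do
    return pid

pgrp = start_group_keeper()
assert os.getpgid(pgrp) == pgrp   # the group id is live immediately

# Cleanup for this sketch: kill the keeper's group and reap it.
os.killpg(pgrp, signal.SIGKILL)
_, status = os.waitpid(pgrp, 0)
```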
Thanks for your help!
Cheers,
Leo
On 21-03-2005 03:46, "Greg Stein" <gs...@lyra.org> wrote:
> Hi all,
>
> Was talking with Leo here at the infrastructure gathering, and he
> mentioned that Gump was having issues cleaning up zombie processes. He
> asked me how to make that happen in Linux. The general reply is "use
> os.waitpid()" (in Python; waitpid() is the general POSIX thing).
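[The basic reaping call looks like this; until the parent waits, an exited child lingers as a zombie holding its exit status:]

```python
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(7)                # child exits immediately with status 7

time.sleep(0.1)                # child is now a zombie, awaiting reaping
reaped, status = os.waitpid(pid, 0)
assert reaped == pid
assert os.WIFEXITED(status)
code = os.WEXITSTATUS(status)  # recovers the 7 the child exited with
```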
>
> Ed Korthof and I explored this a bit more to come up with a general
> algorithm for cleaning up "everything" after a Gump run.
>
> To start with, Gump should put all fork'd children into their own process
> groups, and then remember those groups' ids. This will enable you to kill
> any grandchild process or other things that get spawned. Even if the
> process gets re-parented to the init process, you can give it the
> smackdown via the process group. Of course, if somebody else monkeys with
> process groups, you'll lose track of them. There are limits to cleanup :-)
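[The re-parenting claim is easy to demonstrate: a grandchild orphaned to init is still reachable through its group. A Linux-specific sketch (it peeks at /proc to check the outcome; the pipe just carries the grandchild's pid back to us):]

```python
import os
import signal
import time

r, w = os.pipe()
child = os.fork()
if child == 0:
    os.close(r)
    os.setpgid(0, 0)              # child: new group, inherited by its kids
    grandchild = os.fork()
    if grandchild == 0:
        os.close(w)
        time.sleep(120)           # grandchild hangs around...
        os._exit(0)
    os.write(w, str(grandchild).encode())
    os.close(w)
    os._exit(0)                   # ...while its parent exits: orphaned

os.close(w)
gpid = int(os.read(r, 32))        # learn the grandchild's pid
os.close(r)
os.waitpid(child, 0)              # reap the direct child; orphan remains

pgrp = child                      # the group id is the dead child's pid
os.killpg(pgrp, signal.SIGKILL)   # still reaches the orphaned grandchild
time.sleep(0.2)

# The orphan should now be gone (or at worst a zombie awaiting init).
try:
    with open("/proc/%d/stat" % gpid) as f:
        state = f.read().rsplit(")", 1)[1].split()[0]
    survived = state not in ("Z", "X")
except FileNotFoundError:
    survived = False
```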
>
> When you want to clean up, you can send every process group SIGTERM. If
> any killpg() call throws an exception with ESRCH (no processes in that
> group), then remove it from the saved list of groups. Next, you would
> loop, waiting for all processes to exit or for a timeout on that wait to
> expire. You want to loop quickly over everything that has already exited,
> terminate the loop when there is nothing more to reap, and pause a second
> if stuff is still busy shutting down. If the timeout expires and some
> processes are left, then SIGKILL them and go reap again. The algorithm
> would look like:
>
> import errno
> import os
> import signal
> import time
>
> def clean_up_processes(pgrp_list):
>     # send SIGTERM to everything, and update pgrp_list to just those
>     # process groups which have processes in them.
>     kill_groups(pgrp_list, signal.SIGTERM)
>
>     # pass a copy of the process groups. we want to remember every
>     # group that we SIGTERM'd so that we can SIGKILL them later. it
>     # is possible that a process in the pgrp was reparented to the
>     # init process. those will be invisible to wait(), so we don't
>     # want to mistakenly think we've killed all processes in the
>     # group. thus, we preserve the list and SIGKILL it later.
>     reap_children(pgrp_list[:])
>
>     # SIGKILL everything, editing pgrp_list again.
>     kill_groups(pgrp_list, signal.SIGKILL)
>
>     # reap everything left, but don't really bother waiting on them.
>     # if we exit, then init will reap them.
>     reap_children(pgrp_list, 60)
>
> def kill_groups(pgrp_list, sig):
>     # NOTE: this function edits pgrp_list
>
>     for pgrp in pgrp_list[:]:
>         try:
>             # killpg() takes the group id itself, no leading "-"
>             os.killpg(pgrp, sig)
>         except OSError, e:
>             if e.errno == errno.ESRCH:
>                 # no processes left in that group; forget it
>                 pgrp_list.remove(pgrp)
>
> def reap_children(pgrp_list, timeout=300):
>     # NOTE: this function edits pgrp_list
>
>     # keep reaping until the timeout expires, or we finish
>     end_time = time.time() + timeout
>
>     # keep reaping until all pgrps are done, or we run out of time
>     while pgrp_list and time.time() < end_time:
>         # pause for a bit while processes work on exiting. this pause is
>         # at the top, so we can also pause right after the killpg()
>         time.sleep(1)
>
>         # go through all pgrps to reap them
>         for pgrp in pgrp_list[:]:
>             # loop quickly to clean everything in this pgrp
>             while 1:
>                 try:
>                     pid, status = os.waitpid(-pgrp, os.WNOHANG)
>                 except OSError, e:
>                     if e.errno == errno.ECHILD:
>                         # no more children in this pgrp.
>                         pgrp_list.remove(pgrp)
>                         break
>                     raise
>                 if pid == 0:
>                     # some stuff has not exited yet, and WNOHANG avoided
>                     # blocking. go ahead and move to the next pgrp.
>                     break
>
> That should clean up everything. If stuff *still* hasn't exited, then
> there isn't much you can do. But you will have tried :-)
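[For readers on current Python, here is the same three-phase cleanup rendered in Python 3 and exercised against a few throwaway groups; the short timeout and the sleep children are just for the demonstration, not part of the algorithm:]

```python
import errno
import os
import signal
import subprocess
import time

def kill_groups(pgrp_list, sig):
    """Signal every saved group; drop groups that no longer exist."""
    for pgrp in pgrp_list[:]:
        try:
            os.killpg(pgrp, sig)
        except OSError as e:
            if e.errno != errno.ESRCH:
                raise
            pgrp_list.remove(pgrp)

def reap_children(pgrp_list, timeout=5):
    """Reap exited children group by group until done or timed out."""
    end_time = time.time() + timeout
    while pgrp_list and time.time() < end_time:
        time.sleep(0.1)            # let processes work on exiting
        for pgrp in pgrp_list[:]:
            while True:            # drain everything ready in this group
                try:
                    pid, status = os.waitpid(-pgrp, os.WNOHANG)
                except OSError as e:
                    if e.errno == errno.ECHILD:
                        pgrp_list.remove(pgrp)   # nothing left to wait on
                        break
                    raise
                if pid == 0:
                    break          # still running; move to the next group

def clean_up_processes(pgrp_list):
    kill_groups(pgrp_list, signal.SIGTERM)
    reap_children(pgrp_list[:])    # copy: keep groups for the SIGKILL pass
    kill_groups(pgrp_list, signal.SIGKILL)
    reap_children(pgrp_list, 5)

# Exercise it: one sleeping child per group.
procs = [subprocess.Popen(["sleep", "60"], preexec_fn=os.setpgrp)
         for _ in range(3)]
groups = [p.pid for p in procs]
clean_up_processes(groups)
```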
>
> Hope that helps! EdK and I haven't built test cases for the above, but it
> has been doubly-reviewed, so we think the algorithm/code should work.
>
> Cheers,
> -g
>
> p.s. note that we aren't on general@gump, so CC: if you reply...
---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
For additional commands, e-mail: general-help@gump.apache.org