Posted to general@gump.apache.org by Greg Stein <gs...@lyra.org> on 2005/03/21 03:46:52 UTC

termination algorithm

Hi all,

Was talking with Leo here at the infrastructure gathering, and he
mentioned that Gump was having issues cleaning up zombie processes. He
asked me how to make that happen in Linux. The general reply is "use
os.waitpid()" (in Python; waitpid() is the general POSIX thing).

Ed Korthof and I explored this a bit more to come up with a general
algorithm for cleaning up "everything" after a Gump run.

To start with, Gump should put all fork'd children into their own process
groups, and then remember those groups' ids. This will enable you to kill
any grandchild process or other things that get spawned. Even if the
process gets re-parented to the init process, you can give it the
smackdown via the process group. Of course, if somebody else monkeys with
process groups, you'll lose track of them. There are limits to cleanup :-)
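
A minimal sketch of the "own process group" step, assuming a plain
fork/exec spawner. The helper name spawn_in_own_group is hypothetical and
is not taken from the Gump code. Both parent and child call setpgid(), so
the group is in place no matter which side runs first:

```python
import os


def spawn_in_own_group(argv):
    """Fork, make the child leader of a new process group, then exec argv.

    Returns (pid, pgid). Hypothetical helper, for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # child: become leader of a new process group, then exec
        try:
            os.setpgid(0, 0)
            os.execvp(argv[0], argv)
        finally:
            # only reached if setpgid() or exec failed
            os._exit(127)
    # parent: set the group as well, to close the race with the child
    try:
        os.setpgid(pid, pid)
    except OSError:
        # the child has already exec'd (EACCES) or is gone; in the first
        # case it set its own group before the exec
        pass
    # a group leader's pgid is equal to its pid
    return pid, pid
```

The returned pgid is the value you would append to the saved list of
groups and later hand to os.killpg().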

When you want to clean up, you can send every process group SIGTERM. If
any killpg() call throws an exception with ESRCH (no processes in that
group), then remove it from the saved list of groups. Next, you would
start looping to wait for all processes to exit, or to reach a timer on
that wait. You want to quickly loop on everything that exits, terminate
the loop when there is nothing more, and then pause a second if stuff is
still busy shutting down. If you time out and some are left, then SIGKILL
them and go reap again. The algorithm would look like:

import errno
import os
import signal
import time

def clean_up_processes(pgrp_list):
  # send SIGTERM to everything, and update pgrp_list to just those
  # process groups which have processes in them.
  kill_groups(pgrp_list, signal.SIGTERM)
  
  # pass a copy of the process groups. we want to remember every
  # group that we SIGTERM'd so that we can SIGKILL them later. it
  # is possible that a process in the pgrp was reparented to the
  # init process. those will be invisible to wait(), so we don't
  # want to mistakenly think we've killed all processes in the
  # group. thus, we preserve the list and SIGKILL it later.
  reap_children(pgrp_list[:])

  # SIGKILL everything, editing pgrp_list again.
  kill_groups(pgrp_list, signal.SIGKILL)
  
  # reap everything left, but don't really bother waiting on them.
  # if we exit, then init will reap them.
  reap_children(pgrp_list, 60)

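def kill_groups(pgrp_list, sig):
  # NOTE: this function edits pgrp_list

  for pgrp in pgrp_list[:]:
    try:
      # killpg() takes the positive process group id
      os.killpg(pgrp, sig)
    except OSError as e:
      # the os functions raise OSError, not IOError
      if e.errno == errno.ESRCH:
        # no processes left in this group
        pgrp_list.remove(pgrp)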

def reap_children(pgrp_list, timeout=300):
  # NOTE: this function edits pgrp_list

  # keep reaping until the timeout expires, or we finish
  end_time = time.time() + timeout

  # keep reaping until all pgrps are done, or we run out of time
  while pgrp_list and time.time() < end_time:
    # pause for a bit while processes work on exiting. this pause is
    # at the top, so we can also pause right after the killpg()
    time.sleep(1)

    # go through all pgrps to reap them
    for pgrp in pgrp_list[:]:
      # loop quickly to clean everything in this pgrp
      while 1:
        try:
          # a negative pid means "any child in process group pgrp"
          pid, status = os.waitpid(-pgrp, os.WNOHANG)
        except OSError as e:
          if e.errno == errno.ECHILD:
            # no more children in this pgrp.
            pgrp_list.remove(pgrp)
            break
          raise
        if pid == 0:
          # some stuff has not exited yet, and WNOHANG avoided
          # blocking. go ahead and move to the next pgrp.
          break

That should clean up everything. If stuff *still* hasn't exited, then
there isn't much you can do. But you will have tried :-)
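
To illustrate why the list of groups is preserved for the SIGKILL pass,
here is a small sketch. The function names are hypothetical, and it
assumes a POSIX system with sh and sleep available. A shell backgrounds a
sleep and exits, so the sleep is reparented: waitpid() can no longer see
anything in the group, yet killpg() can still reach it.

```python
import errno
import os
import signal
import subprocess


def group_has_members(pgid):
    """Return True if at least one process is still in process group pgid."""
    try:
        os.killpg(pgid, 0)  # signal 0 only checks; nothing is delivered
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False
        raise
    return True


def has_waitable_children(pgid):
    """Return True if we still have children in pgid (may reap one of them)."""
    try:
        os.waitpid(-pgid, os.WNOHANG)
    except OSError as e:
        if e.errno == errno.ECHILD:
            return False
        raise
    return True


def orphan_demo():
    # the shell leads a new group, starts a background sleep, and exits
    shell = subprocess.Popen(["sh", "-c", "sleep 60 &"],
                             preexec_fn=os.setpgrp)
    pgid = shell.pid
    shell.wait()  # reap the shell; the sleep is now somebody else's child
    waitable = has_waitable_children(pgid)
    alive = group_has_members(pgid)
    # clean up the orphan through its process group
    os.killpg(pgid, signal.SIGKILL)
    return waitable, alive
```

On a typical Linux box orphan_demo() should report no waitable children
while the group still has a member, which is the case the second
kill_groups() call is there for.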

Hope that helps! EdK and I haven't built test cases for the above, but it
has been doubly-reviewed, so we think the algorithm/code should work.

Cheers,
-g

p.s. note that we aren't on general@gump, so CC: if you reply...

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@gump.apache.org
For additional commands, e-mail: general-help@gump.apache.org

Re: termination algorithm

Posted by "Adam R. B. Jack" <aj...@apache.org>.

Gentlemen,

Thanks for your input. I recently (an hour ago, after a year+ of fighting
this & finding ineffective solutions) implemented spawning along the
lines of:

    1) fork, getting forkPID for the child process.
    2) [child]:  setpgrp [become a group leader], run child processes...
    3) [parent]: wait w/ a timer [60 mins] to sigkill the group
       [using, with some checking, the effect of
       os.killpg(os.getpgid(forkPID))]

This ought to allow us to clean up as we go, with the limitations you stated. That
said, there are some additional good things to learn from your solution, so
I'll capture those sophistications also.
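
A rough sketch of the three steps above, under some assumptions: the name
run_with_deadline and the polling interval are made up for illustration,
and the "checking" is replaced by having the parent call setpgid() too, so
the group to kill is always the child's pid and never the parent's own
group.

```python
import os
import signal
import time


def run_with_deadline(argv, timeout, poll_interval=0.1):
    """Run argv in its own process group; SIGKILL the group on timeout.

    Returns the raw wait status of the direct child.
    """
    pid = os.fork()                     # step 1
    if pid == 0:
        try:
            os.setpgrp()                # step 2: become a group leader
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)               # only reached on failure
    try:
        os.setpgid(pid, pid)            # same effect, done from the parent
    except OSError:
        pass                            # child already exec'd or exited

    deadline = time.time() + timeout    # step 3: wait with a timer
    while True:
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return status
        if time.time() >= deadline:
            break
        time.sleep(poll_interval)

    try:
        os.killpg(pid, signal.SIGKILL)  # the group id equals the child pid
    except OSError:
        pass                            # group vanished in the meantime
    _, status = os.waitpid(pid, 0)
    return status
```

Note that this only kills the group when the timer fires. Grandchildren
left behind after a normal exit would still need the cleanup pass from the
original mail.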

Thanks for the info.

regards,

Adam

Re: termination algorithm

Posted by Leo Simons <ma...@leosimons.com>.

Hi Greg, Ed, gang,

I've integrated this into our development version of gump. There were a few
kinks to hammer out (for example catch OSError and not IOError, exit out of
the "while not timed out" as early as possible, no "-" for killpg), but it
seems to be basically working at this point. For now I'm just using a single
process group for all children and not really object-orienting things as
much as I could, but it does what we need. The code is at

https://svn.apache.org/repos/asf/gump/branches/Gump3/pygump/python/gump/util/executor.py

It's kinda cool how easy it was to combine this with the new "subprocess"
module in python2.4, which is kind enough to provide a hook for inserting
the "set child group" functionality.
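
A minimal sketch of what such a hook can look like, assuming it is the
preexec_fn argument of subprocess.Popen (the mail does not name it). The
helper names are hypothetical and this is not the code in executor.py:

```python
import os
import signal
import subprocess


def popen_in_own_group(argv, **kwargs):
    """Start argv with the child as leader of a new process group."""
    # preexec_fn runs in the child between fork() and exec(). The
    # subprocess documentation warns that it is not safe to use when the
    # parent has multiple threads.
    return subprocess.Popen(argv, preexec_fn=os.setpgrp, **kwargs)


def kill_group(proc, sig=signal.SIGTERM):
    """Signal every process in the group led by proc."""
    os.killpg(proc.pid, sig)
```

Later Python versions also offer start_new_session=True (3.2+) and
process_group=0 (3.11+) on Popen, which avoid preexec_fn altogether.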

It's still a little ugly in some ways, since we want to be sure that a
process group stays around even if mostly empty. That makes stuff like
introspection a little easier. I just created a child that runs for the
length of the application to visibly keep the group around. Ugly, but the
only thing I could think of atm.
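
One way to sketch the "single group plus placeholder child" idea, with
made-up helper names and again assuming the preexec_fn hook; the real
executor.py may do this differently:

```python
import os
import signal
import subprocess
import sys


def start_group_anchor():
    """Start an idle child that leads a new process group for the run."""
    # the anchor just sleeps until it receives a signal
    return subprocess.Popen(
        [sys.executable, "-c", "import signal; signal.pause()"],
        preexec_fn=os.setpgrp)


def popen_in_group(argv, pgid, **kwargs):
    """Start argv as a member of the existing process group pgid."""
    return subprocess.Popen(
        argv, preexec_fn=lambda: os.setpgid(0, pgid), **kwargs)


def signal_group(pgid, sig=signal.SIGTERM):
    """Send sig to every process in group pgid, anchor included."""
    os.killpg(pgid, sig)
```

The anchor's pid doubles as the group id, so every later child is started
with popen_in_group(argv, anchor.pid) and a single signal_group() call
reaches all of them.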

Thanks for your help!

Cheers,

Leo
