Posted to dev@httpd.apache.org by Jeff Trawick <tr...@rdu26-58-158.nc.rr.com> on 2001/07/27 03:19:26 UTC

threaded.c assigning children to wrong slot

It seems that before a process has initialized fully we'll give
another process the same slot.

Here is a trace I added to the end of make_child():

Index: server/mpm/threaded/threaded.c
===================================================================
RCS file: /home/cvspublic/httpd-2.0/server/mpm/threaded/threaded.c,v
retrieving revision 1.50
diff -u -r1.50 threaded.c
--- server/mpm/threaded/threaded.c      2001/07/26 18:11:53     1.50
+++ server/mpm/threaded/threaded.c      2001/07/27 01:13:06
@@ -899,6 +899,11 @@
 
         clean_child_exit(0);
     }
+ap_log_error(APLOG_MARK, APLOG_CRIT, 0, ap_server_conf,
+             "just created process %d for slot %d (old value %d)",
+             pid, slot,
+             ap_scoreboard_image->parent[slot].pid);
+
     /* else */
     ap_scoreboard_image->parent[slot].pid = pid;
     return 0;

Here is an example of what is traced:

startup - three processes are created

[Thu Jul 26 21:02:02 2001] [crit] just created process 5471 for slot 0 (old value 0)
[Thu Jul 26 21:02:02 2001] [crit] just created process 5472 for slot 1 (old value 0)
[Thu Jul 26 21:02:02 2001] [crit] just created process 5473 for slot 2 (old value 0)
[Thu Jul 26 21:02:02 2001] [notice] Apache/2.0.22-dev (Unix) DAV/2 configured -- resuming normal operations
[Thu Jul 26 21:02:02 2001] [info] Server built: Jul 26 2001 20:24:29

started pounding the server, trying to create more processes

t0 - create one process

no problem yet

[Thu Jul 26 21:02:05 2001] [crit] just created process 5556 for slot 3 (old value 0)

t1 - create two processes

note that we screw up slot 3 'cause it is in use by process 5556,
which probably hasn't had time to fully initialize so I guess one of
the worker scores had the wrong status value

[Thu Jul 26 21:02:06 2001] [crit] just created process 5565 for slot 3 (old value 5556)
[Thu Jul 26 21:02:06 2001] [crit] just created process 5566 for slot 4 (old value 0)

t2 - create one process

again we steal the slot for a process which we kicked off in the
previous idle_server_maintenance window

[Thu Jul 26 21:02:07 2001] [crit] just created process 5594 for slot 3 (old value 5565)

[Thu Jul 26 21:02:08 2001] [crit] just created process 5595 for slot 4 (old value 5566)
[Thu Jul 26 21:02:08 2001] [crit] just created process 5602 for slot 5 (old value 0)
[Thu Jul 26 21:02:08 2001] [crit] just created process 5603 for slot 6 (old value 0)

[Thu Jul 26 21:02:09 2001] [info] server seems busy, (you may need to increase StartServers, ThreadsPerChild or Min/MaxSpareThreads), spawning 4 children, there are around 8 idle threads, and 8 total children

I don't think we should be taking over an in-use slot unless

1) in use by a previous generation process gracefully dying due to
apachectl restart
or
2) in use by a current generation process gracefully dying due to
reaching MaxRequestsPerChild

Currently, we're taking over slots just because the process hasn't had
a chance to initialize.
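
As a rough sketch of the rule above, the parent could consult a predicate like the following before starting a child in a slot. Everything here is an assumption for illustration: `struct slot_view`, its fields, and `slot_may_be_taken_over()` are hypothetical names, not the scoreboard structures or functions in threaded.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one process slot. */
struct slot_view {
    int  pid;               /* 0 means no process owns the slot */
    bool gracefully_dying;  /* owner is exiting after a restart or after
                               reaching MaxRequestsPerChild */
};

/* Return true only if the slot is free or its owner is gracefully dying.
 * A process that is merely still initializing keeps its slot. */
bool slot_may_be_taken_over(const struct slot_view *s)
{
    assert(s != NULL);
    if (s->pid == 0)
        return true;
    return s->gracefully_dying;
}
```

Under this sketch, slot 3 in the trace above (owned by 5556, not dying) would not have been handed to 5565.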

I can sorta see Ryan's design for allowing a process in a new
generation to use a slot still in use by the previous generation, but

                                ---/---

I'm getting a segfault with threaded.c on this particular Linux box I
haven't solved yet.  I had hoped that the one I found/fixed earlier
today would take care of this problem, but no such luck.  

no coredump, no clues

-- 
Jeff Trawick | trawick@attglobal.net | PGP public key at web site:
       http://www.geocities.com/SiliconValley/Park/9289/
             Born in Roswell... married an alien...

Re: threaded.c assigning children to wrong slot

Posted by Ryan Bloom <rb...@covalent.net>.
Okay, this is a bug in the implementation, not the design.  As I tried to explain,
the design does not allow for this.  We should fix the bug in the code, not reinvent
the design again.

As it happens, moving to a single-listen/multi-acceptor model makes this bug FAR easier to fix,
because we can always have the listen thread in the t0 slot for a given process.
Then, when we want to determine whether a process is shutting down, we just check the t0
slot.

Today, we check all threads, which is a bit bogus.
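
A minimal sketch of the t0 check described above, assuming the listener is always pinned to index 0. The type `struct child_slot`, the state names, and `THREADS_PER_CHILD` are hypothetical and chosen for illustration; they are not the scoreboard definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { THREADS_PER_CHILD = 4 };                      /* illustrative value */
enum thread_state { T_DEAD, T_BUSY, T_QUIESCING };   /* hypothetical states */

struct child_slot {
    /* thread[0] is assumed to hold the listener thread */
    enum thread_state thread[THREADS_PER_CHILD];
};

/* With the listener pinned to index 0, a single read answers the
 * question; the other thread entries are not consulted. */
bool child_is_shutting_down(const struct child_slot *c)
{
    assert(c != NULL);
    return c->thread[0] == T_QUIESCING;
}
```

A dead worker elsewhere in the slot no longer makes the process look as if it is going away, which is the point of checking only t0.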

Ryan


On Friday 27 July 2001 12:12, Jeff Trawick wrote:
> Ryan Bloom <rb...@covalent.net> writes:
> > As I have tried to explain MULTIPLE times, we do not have two workers
> > fighting over the same field.  The threads are always separated
> > correctly.  We can have two processes using the same process_score, but
> > that only affects the pid, generation, and sb_type.  The only thing that
> > matters in that case is the pid, and that is easily fixable, by moving it
> > to the worker_score where it belongs.
>
> t0:
>
>   some thread slot is SERVER_DEAD so parent forks a new child (call it
>   "X") to take over that set of slots
>
> t1:
>
>   "X" is not done initializing
>
>   some thread slot is SERVER_DEAD so parent forks a new child (call it
>   "Y") to take over that same set of slots
>
> t2:
>
>   start_threads() in process "X" sees that an entry is SERVER_DEAD and
>   gets timesliced just before starting a thread to take over that
>   entry
>
> t3:
>
>   start_threads() in process "Y" sees that same entry is SERVER_DEAD
>   and starts a thread to take over that entry
>
> t4:
>
>   process "X" wakes up again and takes over that entry
>
> we now have threads in two different processes using the same slot
>
> Even if the threads were always separated correctly, which they
> aren't, why would we allow more than one process to take over the
> slots for a dying process?  That uses more system resources.

-- 

_____________________________________________________________________________
Ryan Bloom                        	rbb@apache.org
Covalent Technologies			rbb@covalent.net
-----------------------------------------------------------------------------

Re: threaded.c assigning children to wrong slot

Posted by Jeff Trawick <tr...@attglobal.net>.
Ryan Bloom <rb...@covalent.net> writes:

> As I have tried to explain MULTIPLE times, we do not have two workers fighting over
> the same field.  The threads are always separated correctly.  We can have two processes using
> the same process_score, but that only affects the pid, generation, and sb_type.  The only thing
> that matters in that case is the pid, and that is easily fixable, by moving it to the worker_score
> where it belongs.

t0:

  some thread slot is SERVER_DEAD so parent forks a new child (call it
  "X") to take over that set of slots

t1:

  "X" is not done initializing

  some thread slot is SERVER_DEAD so parent forks a new child (call it
  "Y") to take over that same set of slots

t2:

  start_threads() in process "X" sees that an entry is SERVER_DEAD and
  gets timesliced just before starting a thread to take over that
  entry

t3:

  start_threads() in process "Y" sees that same entry is SERVER_DEAD
  and starts a thread to take over that entry

t4: 

  process "X" wakes up again and takes over that entry

we now have threads in two different processes using the same slot

Even if the threads were always separated correctly, which they
aren't, why would we allow more than one process to take over the
slots for a dying process?  That uses more system resources.
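
The interleaving above is a classic check-then-act race. A minimal sketch, assuming C11 atomics and hypothetical status constants (`SLOT_DEAD`, `SLOT_STARTING`) rather than the real scoreboard fields: the first function mirrors the racy pattern, and the second claims the entry with a single compare-and-swap so that only one caller can win. The actual scoreboard lives in memory shared between processes, so whether this approach fits there is an open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { SLOT_DEAD = 0, SLOT_STARTING = 1 };   /* hypothetical status values */

/* Racy: the test and the store are separate steps, so two callers can
 * both observe SLOT_DEAD before either one writes (t2 through t4). */
bool claim_slot_racy(atomic_int *status)
{
    if (atomic_load(status) == SLOT_DEAD) {
        /* a timeslice here lets another process pass the same test */
        atomic_store(status, SLOT_STARTING);
        return true;
    }
    return false;
}

/* Atomic: the compare-and-swap succeeds for exactly one caller. */
bool claim_slot_atomic(atomic_int *status)
{
    int expected = SLOT_DEAD;
    return atomic_compare_exchange_strong(status, &expected, SLOT_STARTING);
}
```

With the atomic version, "Y" would claim the entry at t3 and "X" would see its claim fail at t4 instead of silently sharing the slot.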


Re: threaded.c assigning children to wrong slot

Posted by Ryan Bloom <rb...@covalent.net>.

As I have tried to explain MULTIPLE times, we do not have two workers fighting over
the same field.  The threads are always separated correctly.  We can have two processes using
the same process_score, but that only affects the pid, generation, and sb_type.  The only thing
that matters in that case is the pid, and that is easily fixable, by moving it to the worker_score
where it belongs.

Ryan

On Friday 27 July 2001 11:44, Jeff Trawick wrote:
> "Paul J. Reder" <re...@raleigh.ibm.com> writes:
> > Actually, it dawned on me that this is worse than I stated.
> >
> > There is no limit to processes joining a slot. If
> > perform_idle_server_maintenance deems it necessary to start more
> > processes, a new process can start within a slot where other processes
> > are still starting, but haven't yet grabbed all of the unused workers.
> >
> > Because there is no locking, more than one process can grab the same
> > worker slot. I have not looked to see what problems this can cause, but
> > two or more processes each starting a worker in the same slot can't be
> > good.
>
> This definitely sucks.  The two workers in the same slot fight over
> the same status fields.  This is a bug that needs to be fixed.
>
> We don't want more than one process taking over the slots for a
> process which is going away.


Re: threaded.c assigning children to wrong slot

Posted by Jeff Trawick <tr...@attglobal.net>.
"Paul J. Reder" <re...@raleigh.ibm.com> writes:

> Actually, it dawned on me that this is worse than I stated.
> 
> There is no limit to processes joining a slot. If perform_idle_server_maintenance
> deems it necessary to start more processes, a new process can start within
> a slot where other processes are still starting, but haven't yet grabbed
> all of the unused workers.
> 
> Because there is no locking, more than one process can grab the same worker
> slot. I have not looked to see what problems this can cause, but two or
> more processes each starting a worker in the same slot can't be good.

This definitely sucks.  The two workers in the same slot fight over
the same status fields.  This is a bug that needs to be fixed.

We don't want more than one process taking over the slots for a
process which is going away.

Re: threaded.c assigning children to wrong slot

Posted by Ryan Bloom <rb...@covalent.net>.
On Monday 30 July 2001 07:10, Paul J. Reder wrote:
> Ryan Bloom wrote:
> > This is an edge case, not a main-line case.  How big a problem is this in
> > a real-world situation, instead of a benchmarking situation?
>
> When I run my abuse test, this is a big problem. My abuse test consists of
> replaying an apache.org access_log. I think that is a real-world situation.
> The only non-real-world aspect is that I replay it without delay. The file
> sizes and mix of request types, however, are very real.
>
> If you set MaxClients to 30 and ThreadsPerChild to 50, there is a big
> difference between 30 processes + 1500 workers and 1500 processes + 1500
> workers. How does a user config the threaded mpm for a given set of
> resources if they can't control the number of processes?

I would consider this to be a poorly configured server.  Either MaxRequestsPerChild is too
low, or you have too many threads per child.  In this situation, you are basically saying that
your server never reaches a steady state; it always has processes shutting down.

The other problem is the inability of our current threaded server to do graceful shutdowns
correctly.  Give it a few days/weeks to get the worker MPM working correctly, and try it again.
I am willing to bet that with an MPM that handles shutdowns correctly 90% of your problems
with this will go away.

Ryan


Re: threaded.c assigning children to wrong slot

Posted by "Paul J. Reder" <re...@raleigh.ibm.com>.
Ryan Bloom wrote:
> This is an edge case, not a main-line case.  How big a problem is this in a real-world situation,
> instead of a benchmarking situation?

When I run my abuse test, this is a big problem. My abuse test consists of
replaying an apache.org access_log. I think that is a real-world situation.
The only non-real-world aspect is that I replay it without delay. The file
sizes and mix of request types, however, are very real.

If you set MaxClients to 30 and ThreadsPerChild to 50, there is a big difference
between 30 processes + 1500 workers and 1500 processes + 1500 workers. How does a
user config the threaded mpm for a given set of resources if they can't control
the number of processes?
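
The gap between the two figures can be worked through in a few lines. This is only a sketch of the arithmetic: `struct mpm_counts` and both functions are hypothetical helpers, and the worst case assumes, as described elsewhere in this thread, that up to ThreadsPerChild processes can end up sharing each slot.

```c
#include <assert.h>

/* Hypothetical helper type for this illustration only. */
struct mpm_counts {
    long processes;
    long workers;
};

/* What an admin would expect from the two directives. */
struct mpm_counts expected_counts(long max_clients, long threads_per_child)
{
    struct mpm_counts c;
    assert(max_clients > 0 && threads_per_child > 0);
    c.processes = max_clients;
    c.workers = max_clients * threads_per_child;
    return c;
}

/* Worst case: every slot is shared by up to ThreadsPerChild processes,
 * so the process count can climb to the worker count. */
struct mpm_counts worst_case_counts(long max_clients, long threads_per_child)
{
    struct mpm_counts c = expected_counts(max_clients, threads_per_child);
    c.processes = max_clients * threads_per_child;
    return c;
}
```

For MaxClients 30 and ThreadsPerChild 50 this gives 30 processes and 1500 workers in the expected case, against 1500 processes and 1500 workers in the worst case.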

-- 
Paul J. Reder
-----------------------------------------------------------
"The strength of the Constitution lies entirely in the determination of each
citizen to defend it.  Only if every single citizen feels duty bound to do
his share in this defense are the constitutional rights secure."
-- Albert Einstein

Re: threaded.c assigning children to wrong slot

Posted by Ryan Bloom <rb...@covalent.net>.
On Friday 27 July 2001 11:23, Bill Stoddard wrote:
> > Guys, this is as designed.  The worst case is that, while the server is
> > stopping and starting child processes (gracefully only) very quickly, we
> > can get MaxClients*ThreadsPerChild processes.  This is okay,
>
> Sounds like a problem to me. With Apache 1.3, admins use MaxClients to set
> an upper limit on the number of processes that may be started in order to
> prevent resource problems.

But that isn't how MaxClients is used in the threaded MPM.  If we want to rename the current
MaxClients from the threaded MPM and implement a real MaxClients, then cool we can do that.

Today though, MaxClients is not really used for resource limiting.  It is used in conjunction with
ThreadsPerChild to limit resource usage.  We have two options:

Create a lot of processes, don't hit MaxClients * ThreadsPerChild

The user has told us to have MaxClients*ThreadsPerChild threads dealing with requests, and I
believe it is a BIG problem if we ever have more than that number doing work, and a problem if we
have fewer than that during a heavy load.

Remember, this is only an issue if we are constantly shutting down processes that are serving long-lived
requests.  In other words,

while true; do
	kill -WINCH `cat httpd.pid`
done

while at the same time, we are making a lot of requests for VERY large files.  If we only make
requests for small files, then the old processes will die as soon as they can.

This is an edge case, not a main-line case.  How big a problem is this in a real-world situation,
instead of a benchmarking situation?

Ryan


Re: threaded.c assigning children to wrong slot

Posted by Bill Stoddard <bi...@wstoddard.com>.
> Guys, this is as designed.  The worst case is that, while the server is stopping and starting
> child processes (gracefully only) very quickly, we can get MaxClients*ThreadsPerChild
> processes.  This is okay,

Sounds like a problem to me. With Apache 1.3, admins use MaxClients to set an upper limit
on the number of processes that may be started in order to prevent resource problems.

Bill


Re: threaded.c assigning children to wrong slot

Posted by Ryan Bloom <rb...@covalent.net>.
Guys, this is as designed.  The worst case is that, while the server is stopping and starting
child processes (gracefully only) very quickly, we can get MaxClients*ThreadsPerChild
processes.  This is okay, because the user is really trying to tell us how many threads they want
handling requests, not how many processes they want.  We only ever go over MaxClients if
we have processes shutting down.

I am willing to believe that we are starting processes too quickly, but that is a bug in the
implementation, not the design.

Ryan

On Friday 27 July 2001 06:40, Paul J. Reder wrote:
> Actually, it dawned on me that this is worse than I stated.
>
> There is no limit to processes joining a slot. If
> perform_idle_server_maintenance deems it necessary to start more processes,
> a new process can start within a slot where other processes are still
> starting, but haven't yet grabbed all of the unused workers.
>
> Because there is no locking, more than one process can grab the same worker
> slot. I have not looked to see what problems this can cause, but two or
> more processes each starting a worker in the same slot can't be good.


Re: threaded.c assigning children to wrong slot

Posted by "Paul J. Reder" <re...@raleigh.ibm.com>.
Actually, it dawned on me that this is worse than I stated.

There is no limit to processes joining a slot. If perform_idle_server_maintenance
deems it necessary to start more processes, a new process can start within
a slot where other processes are still starting, but haven't yet grabbed
all of the unused workers.

Because there is no locking, more than one process can grab the same worker
slot. I have not looked to see what problems this can cause, but two or
more processes each starting a worker in the same slot can't be good.


Re: threaded.c assigning children to wrong slot

Posted by "Paul J. Reder" <re...@raleigh.ibm.com>.
Jeff,

This is the core of Ryan's patch to fix the threaded mpm problem where
we were left with N processes each with a small number of retiring
workers due to MaxRequestsPerChild being hit. 

Ryan's fix was to allow multiple (up to ThreadsPerChild) processes to
occupy the same slot (each process overwriting the previous owner's pid, even
if that pid still exists). Httpd no longer pays any attention to the
MaxClients value. In perform_idle_server_maintenance, in the inner "for" loop,
it looks to see if there are "any_dead_threads". If there are *any* dead
threads (as opposed to all dead threads) it will start a new process within
that same slot.
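
The difference between the two conditions can be sketched as follows. This is an illustration under stated assumptions, not the loop in perform_idle_server_maintenance: `WORKER_DEAD`, `WORKER_BUSY`, `any_dead()` and `all_dead()` are hypothetical names operating on a plain array of status values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { WORKER_DEAD = 0, WORKER_BUSY = 1 };   /* hypothetical status values */

/* True as soon as one worker entry in the slot is dead. */
bool any_dead(const int *status, size_t n)
{
    assert(status != NULL);
    for (size_t i = 0; i < n; i++)
        if (status[i] == WORKER_DEAD)
            return true;
    return false;
}

/* True only when every worker entry in the slot is dead. */
bool all_dead(const int *status, size_t n)
{
    assert(status != NULL);
    for (size_t i = 0; i < n; i++)
        if (status[i] != WORKER_DEAD)
            return false;
    return true;
}
```

A slot with one retired worker and the rest busy satisfies `any_dead()` but not `all_dead()`, so a policy based on the former starts a new process there while the old one is still running.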

You can end up with a worst case scenario of (MaxClients * ThreadsPerChild)
number of processes (with ThreadsPerChild processes occupying the same slot,
where the last process into the slot is listed as the owner).

As each of these processes is started it tries to grab all the available
workers in that slot. This is why we now have the new start_threads thread.
Each new process is started in its own thread so that it can keep looping
until it eventually grabs ThreadsPerChild workers (or discovers
workers_can_exit).

The user has *no* control over how many processes get started - well, they
can limit it by defining MaxClients and ThreadsPerChild... But at least you
never end up with dead workers when there is work to do simply because we
were out of processes. The user also never gets an indication of how many
processes have been started since we don't report that in the status. Because
the pids get overwritten, the status appears to contain only
MaxClients pids.

So what you are seeing is just the code working as it is designed to.


Re: threaded.c assigning children to wrong slot

Posted by Jeff Trawick <tr...@attglobal.net>.
Jeff Trawick <tr...@rdu26-58-158.nc.rr.com> writes:

> I'm getting a segfault with threaded.c on this particular Linux box I
> haven't solved yet.  I had hoped that the one I found/fixed earlier
> today would take care of this problem, but no such luck.  
> 
> no coredump, no clues

darn...  after adding a bazillion traces it seems that
pthread_create() is the culprit...  the call to it looks okay...  I
guess glibc hit a rare error path and puked... at the same time other
Apache processes are creating threads okay
