Posted to dev@httpd.apache.org by Joshua Slive <jo...@slive.ca> on 2001/11/09 19:05:32 UTC

stuck in keepalive (apache 1.3)

I'm curious if anyone knows the cause/solution of this one.  I've never seen
it personally, but I've seen at least half a dozen reports in the bug
database and newsgroups.

Basically, people report that processes get "stuck" in keepalive and
eventually fill up all the process slots, bringing the server to a halt:

http://bugs.apache.org/index.cgi/full/8725
http://bugs.apache.org/index.cgi/full/8261
http://bugs.apache.org/index.cgi/full/8045
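
One way to watch for the symptom described above is to poll mod_status and count the slots sitting in the keepalive state. The sketch below assumes mod_status is enabled and that the machine-readable page (/server-status?auto) contains a "Scoreboard:" line in which "K" marks a keepalive read. The sample text and the function names are made up for illustration.

```python
def count_scoreboard_states(status_text):
    """Count each state character on the 'Scoreboard:' line of the status text."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            counts = {}
            for ch in board:
                counts[ch] = counts.get(ch, 0) + 1
            return counts
    return {}  # no scoreboard line found


def keepalive_fraction(status_text):
    """Fraction of all scoreboard slots currently in keepalive ('K')."""
    counts = count_scoreboard_states(status_text)
    total = sum(counts.values())
    return counts.get("K", 0) / total if total else 0.0


# Made-up sample, not output captured from a real server.
SAMPLE_STATUS = "Total Accesses: 120\nScoreboard: KKK_W_..\n"

if __name__ == "__main__":
    print(count_scoreboard_states(SAMPLE_STATUS))
    print(keepalive_fraction(SAMPLE_STATUS))
```

A fraction that climbs over days and never drops back would match the reports linked above.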

Some people report success with a signal change:

http://bugs.apache.org/index.cgi/full/3906

Joshua.

Re: stuck in keepalive (apache 1.3)

Posted by dean gaudet <de...@arctic.org>.

On Wed, 14 Nov 2001, Stipe Tolj wrote:

> I have reported such hanging keepalive children on the Cygwin 1.x
> platform.
>
> These come up after some days of load and usually go up to 50-60
> "blocked" keepalive children. Recently (after 16 days httpd uptime) the
> whole scoreboard "flushed" and the hanging keepalive processes
> disappeared (without restarting apache).

any chance you've got a NAT which times out connections after 15 days?

> I thought this was a Cygwin-specific problem, but as the PRs report
> a similar effect I think this is related to Apache itself.

thing is, i never see it on my systems :)

maybe there's a pattern to the request previous to blocking -- what you
can do is log the PID for each request in your access_log.  then when you
discover hung children you can backtrack in the logs to see what the
previous request was...  er actually i guess you can get this from the
scoreboard.
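
The backtracking step could look roughly like the sketch below. It assumes the access log uses a LogFormat that appends the child PID (%P in mod_log_config) as the last field, for example a common-log-style format ending in "%>s %b %P". The sample lines and the function name are hypothetical.

```python
def last_request_by_pid(log_lines):
    """Map each child PID to the text of the last request it logged.

    Assumes the PID is the final whitespace-separated field on each line.
    """
    last = {}
    for line in log_lines:
        line = line.rstrip("\n")
        head, _, pid_text = line.rpartition(" ")
        if not pid_text.isdigit():
            continue  # blank or malformed line: skip it
        last[int(pid_text)] = head  # later lines overwrite earlier ones
    return last


# Made-up sample lines, not taken from a real server.
SAMPLE_LOG = [
    '10.0.0.1 - - [14/Nov/2001:10:00:01 +0100] "GET /a.html HTTP/1.1" 200 512 4312',
    '10.0.0.2 - - [14/Nov/2001:10:00:02 +0100] "GET /b.html HTTP/1.1" 200 128 4313',
    '10.0.0.1 - - [14/Nov/2001:10:00:05 +0100] "GET /c.gif HTTP/1.1" 304 - 4312',
]

if __name__ == "__main__":
    for pid, request in sorted(last_request_by_pid(SAMPLE_LOG).items()):
        print(pid, request)
```

Given the PID of a hung child, the value stored for that PID is the request it served just before blocking.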

thing is, there is a race condition in the OPTIMIZE_TIMEOUT code, but it's
the opposite problem from what folks are describing -- the race condition
can mean a SIGALRM is delivered anyway if the child receives data right
when the timeout happens.  (there's no way around this without resorting
to cpu-specific knowledge to implement memory barriers... and i don't
particularly think it's important.)
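
The kind of race described above can be pictured with a toy model. This is not the Apache source: it only assumes that the parent watches a per-child progress counter and sends the timeout signal when the counter has not moved for longer than the timeout. The class names, field names, and the 15-second value are all hypothetical.

```python
class ChildSlot:
    """Hypothetical shared slot; the child bumps `progress` when data arrives."""

    def __init__(self, timeout_len):
        self.timeout_len = timeout_len
        self.progress = 0


class ParentView:
    """What the parent remembers about a child between checks."""

    def __init__(self, now):
        self.last_progress = 0
        self.last_change = now


def parent_check(slot, view, now):
    """Return True if the parent would decide to send the timeout signal."""
    if slot.progress != view.last_progress:
        # The child made progress since the last check: reset the clock.
        view.last_progress = slot.progress
        view.last_change = now
        return False
    return slot.timeout_len > 0 and now - view.last_change > slot.timeout_len


def simulate_race():
    """Interleave the parent's decision with late-arriving data at the child."""
    slot = ChildSlot(timeout_len=15)
    view = ParentView(now=0)
    events = []
    decided = parent_check(slot, view, now=16)
    events.append(("parent decides to signal", decided))
    slot.progress += 1  # data arrives after the decision, before delivery
    events.append(("child received data", slot.progress))
    stale = decided and slot.progress != view.last_progress
    events.append(("signal delivered despite progress", stale))
    return events


if __name__ == "__main__":
    for event in simulate_race():
        print(event)
```

In this model the failure is a spurious timeout, not a missed one, which is why it does not explain children that stay stuck.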

if you can reproduce the problem easily you might also want to disable
OPTIMIZE_TIMEOUT and see what happens... maybe there's another race i
haven't spotted.

-dean

Re: stuck in keepalive (apache 1.3)

Posted by Stipe Tolj <to...@wapme-systems.de>.

I have reported such hanging keepalive children on the Cygwin 1.x
platform.

These come up after some days of load and usually go up to 50-60
"blocked" keepalive children. Recently (after 16 days httpd uptime) the
whole scoreboard "flushed" and the hanging keepalive processes
disappeared (without restarting apache).

I thought this was a Cygwin-specific problem, but as the PRs report
a similar effect I think this is related to Apache itself.

BTW, this was not caused by third-party modules! -- the installation
had only distribution modules activated.

Regards,
Stipe

tolj@wapme-systems.de
-------------------------------------------------------------------
Wapme Systems AG

Münsterstr. 248
40470 Düsseldorf

Tel: +49-211-74845-0
Fax: +49-211-74845-299

E-Mail: info@wapme-systems.de
Internet: http://www.wapme-systems.de
-------------------------------------------------------------------
wapme.net - wherever you are

Re: stuck in keepalive (apache 1.3)

Posted by dean gaudet <de...@arctic.org>.

On Fri, 9 Nov 2001, Bill Stoddard wrote:

> Curious what insight Dean had into suggesting a change from SIGALRM to
> SIGUSR2.

the two places the children were hanging (in PR#3906) were in places that
the parent was supposed to deliver SIGALRM.  they weren't hard hangs
because the SIGHUP was working... so i suspected that it was just a
problem with SIGALRM delivery.  such as some 3rd party library/module (or
even solaris libc, or even our own rfc1413) changing the ALRM handler.
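
The suspicion above, that some other code replaces the ALRM handler, can be illustrated in miniature. The sketch below is Python on a Unix-like system rather than Apache's C code, and both the "expected" handler and the careless library are hypothetical stand-ins.

```python
import signal


def expected_timeout_handler(signum, frame):
    """Stand-in for the handler the server installed (hypothetical)."""
    raise TimeoutError("timed out")


def handler_intact(signum, expected):
    """True if `expected` is still the installed handler for `signum`."""
    return signal.getsignal(signum) is expected


def careless_library_init():
    """Hypothetical third-party code that silently takes over SIGALRM."""
    signal.signal(signal.SIGALRM, signal.SIG_IGN)


def demo():
    """Return (intact_before, intact_after) around the careless library call."""
    previous = signal.signal(signal.SIGALRM, expected_timeout_handler)
    try:
        before = handler_intact(signal.SIGALRM, expected_timeout_handler)
        careless_library_init()
        after = handler_intact(signal.SIGALRM, expected_timeout_handler)
    finally:
        signal.signal(signal.SIGALRM, previous)  # restore the original handler
    return before, after


if __name__ == "__main__":
    print(demo())
```

Once the handler is replaced, the parent can keep sending SIGALRM and the child never acts on it, while other signals such as SIGHUP still work.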

i actually was wondering if maybe it wasn't legal to do SIGALRM like
this... well, not legal on solaris that is -- 'cause i don't think posix
has anything to say about it.

-dean

RE: stuck in keepalive (apache 1.3)

Posted by Joshua Slive <jo...@slive.ca>.

> From: Bill Stoddard [mailto:bill@wstoddard.com]

> The only way that I can see Apache 1.3 getting stuck on keepalive
> reads is:
> 1. storage overlay that whacked the timeout value to some BIG
>    NUMBER (unlikely I think)
> 2. signals are not being delivered either because of a bug in the
>    OS or because some errant module has blocked signals.
> 3. Some module has started one or more additional threads w/o
>    blocking signals in the new thread. The new thread could be
>    catching the keep-alive timeout signal rather than the thread
>    blocked on the read.

2 or 3 sound right to me.  It is amazing the variety of wacky libraries that
Apache winds up sharing process space with.
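
On Linux, case 2 can be checked from outside a hung child, because /proc/<pid>/status exposes the blocked, ignored, and caught signal sets as hex masks (SigBlk, SigIgn, SigCgt). The sketch below decodes them. The sample text is made up, the helper names are hypothetical, and the value 14 for SIGALRM is the Linux numbering.

```python
SIGALRM_LINUX = 14  # assumption: signal number of SIGALRM on Linux


def parse_signal_masks(status_text):
    """Extract the SigBlk/SigIgn/SigCgt masks from /proc/<pid>/status text."""
    masks = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("SigBlk", "SigIgn", "SigCgt"):
            masks[key] = int(value.strip(), 16)
    return masks


def mask_has(mask, signum):
    """Signal number N is represented by bit N-1 of the mask."""
    return bool(mask & (1 << (signum - 1)))


def read_masks(pid):
    """Read and decode the masks of a live process (Linux only)."""
    with open("/proc/%d/status" % pid) as fh:
        return parse_signal_masks(fh.read())


# Made-up sample: SIGALRM is blocked and is not in the caught set.
SAMPLE_PROC_STATUS = (
    "Name:\thttpd\n"
    "SigBlk:\t0000000000002000\n"
    "SigIgn:\t0000000000001000\n"
    "SigCgt:\t0000000000000a03\n"
)

if __name__ == "__main__":
    masks = parse_signal_masks(SAMPLE_PROC_STATUS)
    print("blocked:", mask_has(masks["SigBlk"], SIGALRM_LINUX))
    print("caught:", mask_has(masks["SigCgt"], SIGALRM_LINUX))
```

A stuck child whose SigBlk includes SIGALRM, or whose SigCgt lacks it, would point at a module interfering with signal handling.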

Joshua.

Re: stuck in keepalive (apache 1.3)

Posted by Bill Stoddard <bi...@wstoddard.com>.

Curious what insight Dean had into suggesting a change from SIGALRM to SIGUSR2.

Seemed to work in the Solaris case. Not clear if the fellow who made the change on Linux
got any relief.

The only way that I can see Apache 1.3 getting stuck on keepalive reads is:
1. storage overlay that whacked the timeout value to some BIG NUMBER (unlikely I think)
2. signals are not being delivered either because of a bug in the OS or because some
errant module has blocked signals.
3. Some module has started one or more additional threads w/o blocking signals in the new
thread. The new thread could be catching the keep-alive timeout signal rather than the
thread blocked on the read.

In general it is bad mojo to introduce threads to an Apache 1.3 process but I believe it
can be done if you are really, really careful...
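
Being careful in case 3 mostly means blocking signals before creating the thread, since a new POSIX thread inherits the signal mask of its creator. The sketch below shows that inheritance using Python on a Unix-like system. It is only an illustration: Python always runs its own signal handlers in the main thread, whereas a C module would make the equivalent pthread_sigmask calls around pthread_create. The helper names are hypothetical.

```python
import signal
import threading


def start_thread_with_signals_blocked(target, blocked=(signal.SIGALRM,)):
    """Start `target` in a thread whose inherited mask has `blocked` set."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, blocked)
    try:
        thread = threading.Thread(target=target)
        thread.start()  # the new thread inherits the current mask
    finally:
        # Restore the creator's mask so it still receives the timeout signal.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return thread


def demo():
    """Return the masks seen by the creator (before/after) and by the worker."""
    seen = {}

    def worker():
        # Blocking an empty set changes nothing and returns the current mask.
        seen["worker"] = signal.pthread_sigmask(signal.SIG_BLOCK, [])

    seen["main_before"] = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    thread = start_thread_with_signals_blocked(worker)
    thread.join()
    seen["main_after"] = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    return seen


if __name__ == "__main__":
    masks = demo()
    print("worker blocks SIGALRM:", signal.SIGALRM in masks["worker"])
    print("creator mask unchanged:", masks["main_before"] == masks["main_after"])
```

With the mask set this way, the timeout signal can only be delivered to the thread that is blocked on the read.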

Bill
