Posted to dev@httpd.apache.org by Rainer Jung <ra...@kippdata.de> on 2019/01/27 13:40:32 UTC

Prefork MPM and mod_watchdog

Hi all,

Since around 2.4.26 I have noticed occasional hangs during release 
testing, but only on Solaris 10 Sparc and only with the prefork MPM.

Those hangs were always observed during proxy testing. In most cases 
(all?) only one running httpd process was left, so it would accept 
incoming requests but could not proxy them, and each request then ran 
into the proxy timeout (I think 1 minute in the test). The remaining 
process was in fact an httpd child process; the parent process and the 
other children were gone.

I have now narrowed it down some more. I am not sure this is the only 
situation, but it is at least the one I observe most often:

- no problems with event or worker MPM

- no problems on RHEL and SLES (Linux)

- even on Solaris, when running without mod_watchdog (and therefore 
without mod_heartbeat, mod_heartmonitor, mod_lbmethod_heartbeat and 
mod_proxy_hcheck), the hang does not seem to occur.

- as soon as I enable mod_watchdog alone (but none of the above modules 
that would use it), the hangs start to happen every now and then.

The pattern seems to be (from the error log):

[Sat Jan 26 21:36:01.855207 2019] [core:trace4] [pid 7850] 
mpm_common.c(536): mpm child 7853 (gen 0/slot 0) started
[Sat Jan 26 21:36:01.875826 2019] [mpm_prefork:notice] [pid 7850] 
AH00163: Apache/2.4.38 (Unix) OpenSSL/1.0.2 configured -- resuming 
normal operations
[Sat Jan 26 21:36:01.875975 2019] [mpm_prefork:info] [pid 7850] AH00164: 
Server built: Jan 18 2019 17:54:50
[Sat Jan 26 21:36:01.876208 2019] [core:notice] [pid 7850] AH00094: 
Command line: '/path/to/bin/httpd -d /path/to/t -f 
/path/to/t/conf/httpd.conf -D APACHE2 -D APACHE2_4'
[Sat Jan 26 21:36:01.876340 2019] [core:debug] [pid 7850] log.c(1571): 
AH02639: Using SO_REUSEPORT: no (1)
[Sat Jan 26 21:36:01.876417 2019] [mpm_prefork:debug] [pid 7850] 
prefork.c(923): AH00165: Accept mutex: pthread (default: pthread)
[Sat Jan 26 21:36:01.875753 2019] [core:trace4] [pid 7850] 
mpm_common.c(536): mpm child 7854 (gen 0/slot 1) started
...
[Sat Jan 26 21:36:18.969742 2019] [core:trace4] [pid 7850] 
mpm_common.c(536): mpm child 7889 (gen 0/slot 2) started
...
SHOULD BE FINE UNTIL HERE
...
[Sat Jan 26 21:38:29.016961 2019] [mpm_prefork:error] [pid 7850] 
AH00161: server reached MaxRequestWorkers setting, consider raising the 
MaxRequestWorkers setting
...
UNCLEAR WHETHER THAT IS UNEXPECTED, BUT THEN THE REAL ERROR
...
[Sat Jan 26 21:42:48.213005 2019] [mpm_prefork:emerg] [pid 7889] 
(45)Deadlock situation detected/avoided: AH00144: couldn't grab the 
accept mutex
[Sat Jan 26 21:42:48.490884 2019] [mpm_prefork:emerg] [pid 7853] 
(45)Deadlock situation detected/avoided: AH00144: couldn't grab the 
accept mutex
...
[Sat Jan 26 21:42:49.129401 2019] [core:alert] [pid 7850] AH00050: Child 
7853 returned a Fatal error... Apache is exiting!
[Sat Jan 26 21:42:49.129519 2019] [:emerg] [pid 7850] AH02818: MPM run 
failed, exiting

bin/httpd -V
Server version: Apache/2.4.38 (Unix)
Server built:   Jan 18 2019 17:54:50
Server's Module Magic Number: 20120211:83
Server loaded:  APR 1.6.5, APR-UTIL 1.6.1
Compiled using: APR 1.6.5, APR-UTIL 1.6.1
Architecture:   32-bit
Server MPM:     prefork
   threaded:     no
     forked:     yes (variable process count)
Server compiled with....
  -D APR_HAS_SENDFILE
  -D APR_HAS_MMAP
  -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
  -D APR_USE_PROC_PTHREAD_SERIALIZE
  -D APR_USE_PTHREAD_SERIALIZE
  -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
  -D APR_HAS_OTHER_CHILD
  -D AP_HAVE_RELIABLE_PIPED_LOGS
  -D DYNAMIC_MODULE_LIMIT=256
  -D HTTPD_ROOT="..."
  -D SUEXEC_BIN=".../bin/suexec"
  -D DEFAULT_PIDLOG="logs/httpd.pid"
  -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
  -D DEFAULT_ERRORLOG="logs/error_log"
  -D AP_TYPES_CONFIG_FILE="conf/mime.types"
  -D SERVER_CONFIG_FILE="conf/httpd.conf"

I remember Solaris having very poor deadlock detection, but that goes 
back to using fcntl locks; here we are using pthread locks. And since 
this is APR 1.6, my old observation about pthread timed-lock problems on 
Solaris 10 does not apply here either.
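
For reference, the accept serialization in prefork boils down to each 
child wrapping accept() in an APR cross-process mutex. Below is a 
minimal standalone sketch of that pattern, not httpd's actual prefork 
code; listen_fd is assumed to be an already bound and listening socket, 
error handling is trimmed, and APR_LOCK_PROC_PTHREAD is picked 
explicitly to mirror the pthread default shown in the log above.

   #include <apr_general.h>
   #include <apr_proc_mutex.h>
   #include <sys/socket.h>
   #include <unistd.h>

   /* Sketch of a prefork-style accept loop: each child must hold the
    * cross-process "accept mutex" while calling accept(), so that only
    * one child wakes up per incoming connection. */
   static void child_accept_loop(apr_proc_mutex_t *accept_mutex, int listen_fd)
   {
       for (;;) {
           if (apr_proc_mutex_lock(accept_mutex) != APR_SUCCESS)
               _exit(1);                  /* the AH00144 case */
           int conn_fd = accept(listen_fd, NULL, NULL);
           if (apr_proc_mutex_unlock(accept_mutex) != APR_SUCCESS)
               _exit(1);                  /* the AH00146 case */
           if (conn_fd >= 0) {
               /* ... read the request and send a response ... */
               close(conn_fd);
           }
       }
   }

   int main(void)
   {
       apr_pool_t *pool;
       apr_proc_mutex_t *accept_mutex;

       apr_initialize();
       apr_pool_create(&pool, NULL);
       /* Created in the parent before fork(); the children inherit it. */
       apr_proc_mutex_create(&accept_mutex, NULL, APR_LOCK_PROC_PTHREAD, pool);
       /* fork() children here; each one runs child_accept_loop(...). */
       return 0;
   }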

I wonder what is special about prefork and mod_watchdog.

In addition, I find it strange that the accept mutex errors led to 
child crashes, but not to the start of new child processes. Instead, 
only one of the two child crashes (the second one) seems to be noticed 
by the parent:

[core:alert] [pid 7850] AH00050: Child 7853 returned a Fatal error... 
Apache is exiting!
[:emerg] [pid 7850] AH02818: MPM run failed, exiting

and indeed the parent goes away, but the third and last child process 
stays and continues to handle requests. That doesn't seem right.

I also see complaints about reaching MaxRequestWorkers in the prefork 
tests on Solaris. I haven't verified it, but it seems they run with a 
fixed limit of three child processes. On the one hand, I would expect 
only two of them to be in use at any time (one for the request plus one 
for the proxied request), since the tests are not executed in parallel, 
so the message is a bit strange. On the other hand, a limit of three (if 
I am right) seems very low.

Any hints, ideas, comments about prefork plus mod_watchdog, the 
incomplete shutdown after AH00050 plus AH02818, the limit of 3 child 
processes and the MaxRequestWorkers warning?

Questions, questions, ... ;)

Regards,

Rainer

Re: Solaris, prefork, accept mutex and mod_ext_filter (Was: Prefork MPM and mod_watchdog)

Posted by Rainer Jung <ra...@kippdata.de>.
Am 30.03.2020 um 11:28 schrieb Joe Orton:
> On Mon, Mar 30, 2020 at 12:31:05AM +0200, Rainer Jung wrote:
>> I can now see the same problem on Linux (eg. RHEL 7, SLES 12 and SLES 15)
>> when doing testing for 2.4.43. I think it is not a regression and for me it
>> is not a showstopper, but something that would be nice to fix. Symptom is
>>
>> [Sat Mar 28 15:53:21.121461 2020] [mpm_prefork:debug] [pid 4574]
>> prefork.c(919): AH00165: Accept mutex: pthread (default: pthread)
>> [Sat Mar 28 15:58:19.251858 2020] [mpm_prefork:emerg] [pid 6509] (22)Invalid
>> argument: AH00144: couldn't grab the accept mutex
>> [Sat Mar 28 15:58:20.624995 2020] [mpm_prefork:emerg] [pid 4902] (22)Invalid
>> argument: AH00146: couldn't release the accept mutex
>> [Sat Mar 28 15:58:21.410552 2020] [:emerg] [pid 4574] AH02818: MPM run
>> failed, exiting
>>
>> happening during t/modules/ext_filter.t and as a result all httpd processes
>> are gone.
>>
>> Although it happens only rarely, I think it does not happen when using APR
>> 1.6, but only for 1.7 (1.7.0 and also for latest head). The accept mutex is
>> pthread based. There are changes in APR 1.7 for those.
> 
> This is interesting, I saw some similar mutex state snafu (Fedora 31)
> but assumed it was some kind of PEBKAC issue and moved on.  You have
> only seen this with prefork?

Yes, only with prefork. If the root cause also applied to the 
multi-threaded MPMs, the escalation might be different, i.e. not leading 
to all child processes exiting, so I cannot rule that out.

> May be completely unrelated but since I added a grep for child segfaults
> in error_log that has triggered with prefork at least once:
> 
> https://travis-ci.org/github/apache/httpd/jobs/665874906

That looks more like the prefork shutdown crashes I rarely observe when 
mod_watchdog shuts down after some other deinitialization has already 
happened.

Regards,

Rainer

> Regards, Joe
> 
>> Since I only observe it for mod_ext_filter, there should be some dependency
>> with the forked perl process.
>>
>> I didn't yet have the opportunity to check, how much of the following
>> description for Solaris also holds for Linux.
>>
>> Regards,
>>
>> Rainer
>>
>> Am 03.02.2019 um 13:30 schrieb Rainer Jung:
>>> I can now frequently reproduce running t/modules/ext_filter.t only. I
>>> stripped the reproducer down to the part of t/modules/ext_filter.t where
>>> it runs
>>>
>>> POST "/apache/extfilter/out-limit/modules/cgi/perl_echo.pl", content =>
>>> "foo and bar\n"
>>>
>>> 10 times in a row. If I increase the iteration count slightly to 20, the
>>> problem happens nearly every time. I also replaced perl_echo.pl and
>>> eval-cmd.pl by small C programs doing the echo and the s/foo/bar/ and
>>> can still reproduce.
>>>
>>> This test involved mod_ext_filter and LimitRequestBody.
>>>
>>> It seems I can not reproduce, if I shorten the POST body, so that it no
>>> longer hits the LimitRequestBody 6 configured for
>>> /apache/extfilter/out-limit/.
>>>
>>> What happens, but I do not understand is:
>>>
>>> - the first few requests are handled by two children in an alternating
>>> way. I can see the accept mutex being passed between these two children
>>> using lwp_mutex_timedlock(ADDRESS, 0x00000000) and
>>> lwp_mutex_unlock(ADDRESS) using truss (Solaris variant of strace). So
>>> Solaris seems to implement the pthread mutexes via these lwp mutexes.
>>>
>>> - after a few requests alternating between the two children, something
>>> strange happens:
>>>
>>>     - child A returns from port_getn() for 22 (probably part of the
>>> accept() impl)
>>>     - child A calls accept()
>>>     - child A unlocks the accept mutex using lwp_mutex_unlock()
>>>     - child B locks the accept mutex using lwp_mutex_timedlock()
>>>     - child B calls port_associate for 22 (probably part of accept() impl)
>>>     - child A handles the request
>>> ! - child A also calls port_associate for 22 (no sign of lock acquire)
>>> ! - child A returns from port_getn() for 22
>>>     - child A calls accept() and starts to handle the connection
>>> ! - child B also returns from port_getn() for 22
>>> ! - child B gets EAGAIN for the accept()
>>>     - child B calls port_associate for 22
>>>     - child A handles the request
>>> ! - child A gets EDEADLK for pthread_mutex_lock() (this is from the
>>> error_log; there's no system call for this instance of
>>> pthread_mutex_lock() in the truss output). EDEADLK for
>>> pthread_mutex_lock() means that this process already has that lock.
>>>     - child B returns from port_getn() for 22
>>>     - child B calls accept() and starts to handle the connection
>>>     - child A exits (now B is the only child left)
>>>     - child B proceeds request handling
>>>     - child B does all further port_associate(), port_getn(), and
>>> accept() plus connection and request handling. No more
>>> lwp_mutex_timedlock() or lwp_mutex_unlock() for B, maybe due to some
>>> optimization when only one process for a lock is left.
>>>
>>>
>>> It is quite possible, that there is an underlying lwp_mutex or portfs
>>> bug here. But it is strange, that this only comes up when used with
>>> mod_ext_filter in combination with LimitRequestBody.
>>>
>>> There was the fix
>>>
>>> https://bz.apache.org/bugzilla/show_bug.cgi?id=60375
>>>
>>> but I don't see, how this would influence the above list.
>>>
>>> I can try to further narrow down (using static content to eliminate one
>>> forked child; using LimitRequestBody without mod_ext_filter etc.), but
>>> maybe someone already has an idea?
>>>
>>> Regards,
>>>
>>> Rainer

Re: Solaris, prefork, accept mutex and mod_ext_filter (Was: Prefork MPM and mod_watchdog)

Posted by Joe Orton <jo...@redhat.com>.
On Mon, Mar 30, 2020 at 12:31:05AM +0200, Rainer Jung wrote:
> I can now see the same problem on Linux (eg. RHEL 7, SLES 12 and SLES 15)
> when doing testing for 2.4.43. I think it is not a regression and for me it
> is not a showstopper, but something that would be nice to fix. Symptom is
> 
> [Sat Mar 28 15:53:21.121461 2020] [mpm_prefork:debug] [pid 4574]
> prefork.c(919): AH00165: Accept mutex: pthread (default: pthread)
> [Sat Mar 28 15:58:19.251858 2020] [mpm_prefork:emerg] [pid 6509] (22)Invalid
> argument: AH00144: couldn't grab the accept mutex
> [Sat Mar 28 15:58:20.624995 2020] [mpm_prefork:emerg] [pid 4902] (22)Invalid
> argument: AH00146: couldn't release the accept mutex
> [Sat Mar 28 15:58:21.410552 2020] [:emerg] [pid 4574] AH02818: MPM run
> failed, exiting
> 
> happening during t/modules/ext_filter.t and as a result all httpd processes
> are gone.
>
> Although it happens only rarely, I think it does not happen when using APR
> 1.6, but only for 1.7 (1.7.0 and also for latest head). The accept mutex is
> pthread based. There are changes in APR 1.7 for those.

This is interesting; I saw some similar mutex state snafu (Fedora 31) 
but assumed it was some kind of PEBKAC issue and moved on.  You have 
only seen this with prefork?

This may be completely unrelated, but since I added a grep for child 
segfaults in error_log, it has triggered with prefork at least once:

https://travis-ci.org/github/apache/httpd/jobs/665874906

Regards, Joe

> Since I only observe it for mod_ext_filter, there should be some dependency
> with the forked perl process.
> 
> I didn't yet have the opportunity to check, how much of the following
> description for Solaris also holds for Linux.
> 
> Regards,
> 
> Rainer
> 
> Am 03.02.2019 um 13:30 schrieb Rainer Jung:
> > I can now frequently reproduce running t/modules/ext_filter.t only. I
> > stripped the reproducer down to the part of t/modules/ext_filter.t where
> > it runs
> > 
> > POST "/apache/extfilter/out-limit/modules/cgi/perl_echo.pl", content =>
> > "foo and bar\n"
> > 
> > 10 times in a row. If I increase the iteration count slightly to 20, the
> > problem happens nearly every time. I also replaced perl_echo.pl and
> > eval-cmd.pl by small C programs doing the echo and the s/foo/bar/ and
> > can still reproduce.
> > 
> > This test involved mod_ext_filter and LimitRequestBody.
> > 
> > It seems I can not reproduce, if I shorten the POST body, so that it no
> > longer hits the LimitRequestBody 6 configured for
> > /apache/extfilter/out-limit/.
> > 
> > What happens, but I do not understand is:
> > 
> > - the first few requests are handled by two children in an alternating
> > way. I can see the accept mutex being passed between these two children
> > using lwp_mutex_timedlock(ADDRESS, 0x00000000) and
> > lwp_mutex_unlock(ADDRESS) using truss (Solaris variant of strace). So
> > Solaris seems to implement the pthread mutexes via these lwp mutexes.
> > 
> > - after a few requests alternating between the two children, something
> > strange happens:
> > 
> >    - child A returns from port_getn() for 22 (probably part of the
> > accept() impl)
> >    - child A calls accept()
> >    - child A unlocks the accept mutex using lwp_mutex_unlock()
> >    - child B locks the accept mutex using lwp_mutex_timedlock()
> >    - child B calls port_associate for 22 (probably part of accept() impl)
> >    - child A handles the request
> > ! - child A also calls port_associate for 22 (no sign of lock acquire)
> > ! - child A returns from port_getn() for 22
> >    - child A calls accept() and starts to handle the connection
> > ! - child B also returns from port_getn() for 22
> > ! - child B gets EAGAIN for the accept()
> >    - child B calls port_associate for 22
> >    - child A handles the request
> > ! - child A gets EDEADLK for pthread_mutex_lock() (this is from the
> > error_log; there's no system call for this instance of
> > pthread_mutex_lock() in the truss output). EDEADLK for
> > pthread_mutex_lock() means that this process already has that lock.
> >    - child B returns from port_getn() for 22
> >    - child B calls accept() and starts to handle the connection
> >    - child A exits (now B is the only child left)
> >    - child B proceeds request handling
> >    - child B does all further port_associate(), port_getn(), and
> > accept() plus connection and request handling. No more
> > lwp_mutex_timedlock() or lwp_mutex_unlock() for B, maybe due to some
> > optimization when only one process for a lock is left.
> > 
> > 
> > It is quite possible, that there is an underlying lwp_mutex or portfs
> > bug here. But it is strange, that this only comes up when used with
> > mod_ext_filter in combination with LimitRequestBody.
> > 
> > There was the fix
> > 
> > https://bz.apache.org/bugzilla/show_bug.cgi?id=60375
> > 
> > but I don't see, how this would influence the above list.
> > 
> > I can try to further narrow down (using static content to eliminate one
> > forked child; using LimitRequestBody without mod_ext_filter etc.), but
> > maybe someone already has an idea?
> > 
> > Regards,
> > 
> > Rainer
> 


Re: Solaris, prefork, accept mutex and mod_ext_filter (Was: Prefork MPM and mod_watchdog)

Posted by Rainer Jung <ra...@kippdata.de>.
I can now see the same problem on Linux (e.g. RHEL 7, SLES 12 and SLES 
15) when testing for 2.4.43. I think it is not a regression, and for me 
it is not a showstopper, but it is something that would be nice to fix. 
The symptom is

[Sat Mar 28 15:53:21.121461 2020] [mpm_prefork:debug] [pid 4574] 
prefork.c(919): AH00165: Accept mutex: pthread (default: pthread)
[Sat Mar 28 15:58:19.251858 2020] [mpm_prefork:emerg] [pid 6509] 
(22)Invalid argument: AH00144: couldn't grab the accept mutex
[Sat Mar 28 15:58:20.624995 2020] [mpm_prefork:emerg] [pid 4902] 
(22)Invalid argument: AH00146: couldn't release the accept mutex
[Sat Mar 28 15:58:21.410552 2020] [:emerg] [pid 4574] AH02818: MPM run 
failed, exiting

happening during t/modules/ext_filter.t and as a result all httpd 
processes are gone.

Although it happens only rarely, I think it does not happen with APR 
1.6, only with 1.7 (1.7.0 and also the latest head). The accept mutex is 
pthread based, and APR 1.7 contains changes in that area.

Since I only observe it with mod_ext_filter, there is probably some 
dependency on the forked perl process.

I haven't yet had the opportunity to check how much of the following 
description for Solaris also holds for Linux.

Regards,

Rainer

Am 03.02.2019 um 13:30 schrieb Rainer Jung:
> I can now frequently reproduce running t/modules/ext_filter.t only. I 
> stripped the reproducer down to the part of t/modules/ext_filter.t where 
> it runs
> 
> POST "/apache/extfilter/out-limit/modules/cgi/perl_echo.pl", content => 
> "foo and bar\n"
> 
> 10 times in a row. If I increase the iteration count slightly to 20, the 
> problem happens nearly every time. I also replaced perl_echo.pl and 
> eval-cmd.pl by small C programs doing the echo and the s/foo/bar/ and 
> can still reproduce.
> 
> This test involved mod_ext_filter and LimitRequestBody.
> 
> It seems I can not reproduce, if I shorten the POST body, so that it no 
> longer hits the LimitRequestBody 6 configured for 
> /apache/extfilter/out-limit/.
> 
> What happens, but I do not understand is:
> 
> - the first few requests are handled by two children in an alternating 
> way. I can see the accept mutex being passed between these two children 
> using lwp_mutex_timedlock(ADDRESS, 0x00000000) and 
> lwp_mutex_unlock(ADDRESS) using truss (Solaris variant of strace). So 
> Solaris seems to implement the pthread mutexes via these lwp mutexes.
> 
> - after a few requests alternating between the two children, something 
> strange happens:
> 
>    - child A returns from port_getn() for 22 (probably part of the 
> accept() impl)
>    - child A calls accept()
>    - child A unlocks the accept mutex using lwp_mutex_unlock()
>    - child B locks the accept mutex using lwp_mutex_timedlock()
>    - child B calls port_associate for 22 (probably part of accept() impl)
>    - child A handles the request
> ! - child A also calls port_associate for 22 (no sign of lock acquire)
> ! - child A returns from port_getn() for 22
>    - child A calls accept() and starts to handle the connection
> ! - child B also returns from port_getn() for 22
> ! - child B gets EAGAIN for the accept()
>    - child B calls port_associate for 22
>    - child A handles the request
> ! - child A gets EDEADLK for pthread_mutex_lock() (this is from the 
> error_log; there's no system call for this instance of 
> pthread_mutex_lock() in the truss output). EDEADLK for 
> pthread_mutex_lock() means that this process already has that lock.
>    - child B returns from port_getn() for 22
>    - child B calls accept() and starts to handle the connection
>    - child A exits (now B is the only child left)
>    - child B proceeds request handling
>    - child B does all further port_associate(), port_getn(), and 
> accept() plus connection and request handling. No more 
> lwp_mutex_timedlock() or lwp_mutex_unlock() for B, maybe due to some 
> optimization when only one process for a lock is left.
> 
> 
> It is quite possible, that there is an underlying lwp_mutex or portfs 
> bug here. But it is strange, that this only comes up when used with 
> mod_ext_filter in combination with LimitRequestBody.
> 
> There was the fix
> 
> https://bz.apache.org/bugzilla/show_bug.cgi?id=60375
> 
> but I don't see, how this would influence the above list.
> 
> I can try to further narrow down (using static content to eliminate one 
> forked child; using LimitRequestBody without mod_ext_filter etc.), but
> maybe someone already has an idea?
> 
> Regards,
> 
> Rainer

Solaris, prefork, accept mutex and mod_ext_filter (Was: Prefork MPM and mod_watchdog)

Posted by Rainer Jung <ra...@kippdata.de>.
Am 31.01.2019 um 10:31 schrieb Stefan Eissing:
> 
> 
>> Am 27.01.2019 um 14:40 schrieb Rainer Jung <ra...@kippdata.de>:
>>
>> - as soon as I enable mod_watchdog only (but not the above modules that would use it), the hangs start to happen every now and then.
> 
> Hmm, that sounds strange. As I understood the code, none of the parent/child watchdogs would be active then. So, its child_init should not do anything either.
> 
> But, if one of the proxy monitors runs against the server itself, could it connect against the child process it (the watchdog) is running on?

mod_watchdog was a red herring.

I can now reproduce this frequently by running t/modules/ext_filter.t 
alone. I stripped the reproducer down to the part of 
t/modules/ext_filter.t where it runs

POST "/apache/extfilter/out-limit/modules/cgi/perl_echo.pl", content => 
"foo and bar\n"

10 times in a row. If I increase the iteration count slightly to 20, 
the problem happens nearly every time. I also replaced perl_echo.pl and 
eval-cmd.pl with small C programs doing the echo and the s/foo/bar/ 
substitution, and I can still reproduce it.

This test involved mod_ext_filter and LimitRequestBody.

It seems I cannot reproduce it if I shorten the POST body so that it no 
longer hits the LimitRequestBody 6 configured for 
/apache/extfilter/out-limit/.
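
For readers without the test suite in front of them, the combination 
boils down to configuration of roughly the following shape. This is an 
illustrative reconstruction, not the literal test configuration; the 
directives are real httpd directives, but the filter name and command 
path are placeholders:

   # mod_ext_filter pipes the response body through an external command,
   # while LimitRequestBody rejects request bodies larger than 6 bytes.
   ExtFilterDefine fooToBar mode=output cmd="/path/to/filter-cmd"

   <Location "/apache/extfilter/out-limit/">
       SetOutputFilter fooToBar
       LimitRequestBody 6
   </Location>

The point is that the posted body ("foo and bar\n", 12 bytes) exceeds 
the 6-byte limit; as noted above, the problem does not reproduce once 
the body stays under that limit.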

What happens, but what I do not understand, is:

- the first few requests are handled by two children in an alternating 
way. Using truss (the Solaris equivalent of strace) I can see the accept 
mutex being passed between these two children via 
lwp_mutex_timedlock(ADDRESS, 0x00000000) and lwp_mutex_unlock(ADDRESS). 
So Solaris seems to implement the pthread mutexes on top of these lwp 
mutexes.

- after a few requests alternating between the two children, something 
strange happens:

   - child A returns from port_getn() for 22 (probably part of the 
accept() impl)
   - child A calls accept()
   - child A unlocks the accept mutex using lwp_mutex_unlock()
   - child B locks the accept mutex using lwp_mutex_timedlock()
   - child B calls port_associate for 22 (probably part of accept() impl)
   - child A handles the request
! - child A also calls port_associate for 22 (no sign of lock acquire)
! - child A returns from port_getn() for 22
   - child A calls accept() and starts to handle the connection
! - child B also returns from port_getn() for 22
! - child B gets EAGAIN for the accept()
   - child B calls port_associate for 22
   - child A handles the request
! - child A gets EDEADLK for pthread_mutex_lock() (this is from the 
error_log; there is no system call for this instance of 
pthread_mutex_lock() in the truss output). EDEADLK for 
pthread_mutex_lock() means that this process already holds that lock 
(see the small demo after this list).
   - child B returns from port_getn() for 22
   - child B calls accept() and starts to handle the connection
   - child A exits (now B is the only child left)
   - child B proceeds request handling
   - child B does all further port_associate(), port_getn(), and 
accept() plus connection and request handling. No more 
lwp_mutex_timedlock() or lwp_mutex_unlock() for B, maybe due to some 
optimization when only one process for a lock is left.
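
As a side note on that EDEADLK: it is exactly what a POSIX 
error-checking (deadlock-detecting) mutex returns when the current owner 
tries to lock it a second time. A tiny standalone demo, plain pthreads 
and not httpd code, just to illustrate the errno:

   #include <errno.h>
   #include <pthread.h>
   #include <stdio.h>
   #include <string.h>

   int main(void)
   {
       pthread_mutexattr_t attr;
       pthread_mutex_t mtx;

       pthread_mutexattr_init(&attr);
       /* Error-checking mutexes detect relocking by the current owner. */
       pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
       pthread_mutex_init(&mtx, &attr);

       pthread_mutex_lock(&mtx);            /* first lock succeeds       */
       int rc = pthread_mutex_lock(&mtx);   /* second lock by same owner */
       printf("second lock: %s\n", strerror(rc));   /* EDEADLK expected  */
       return rc == EDEADLK ? 0 : 1;
   }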


It is quite possible that there is an underlying lwp_mutex or portfs 
bug here. But it is strange that this only comes up when mod_ext_filter 
is used in combination with LimitRequestBody.

There was the fix

https://bz.apache.org/bugzilla/show_bug.cgi?id=60375

but I don't see how this would influence the sequence above.

I can try to narrow it down further (using static content to eliminate 
one forked child, using LimitRequestBody without mod_ext_filter, etc.), 
but maybe someone already has an idea?

Regards,

Rainer

Re: Prefork MPM and mod_watchdog

Posted by Stefan Eissing <st...@greenbytes.de>.

> Am 27.01.2019 um 14:40 schrieb Rainer Jung <ra...@kippdata.de>:
> 
> - as soon as I enable mod_watchdog only (but not the above modules that would use it), the hangs start to happen every now and then.

Hmm, that sounds strange. As I understand the code, none of the parent/child watchdogs would be active then, so its child_init should not do anything either.

But if one of the proxy monitors runs against the server itself, could it connect to the very child process it (the watchdog) is running in?

-Stefan

Re: Prefork MPM and mod_watchdog

Posted by Rainer Jung <ra...@kippdata.de>.
Only about the part "the limit of 3 child processes and the 
MaxRequestWorkers warning": that also happens on Linux. The sizing is 
part of the test framework and is insufficient, because apart from one 
child for the request plus another one for the proxied request on the 
backend, more child processes can be busy sitting in KeepAliveTimeout 
from previous proxied requests (we are using persistent backend 
connections). That timeout is 5 seconds.
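
For illustration, the kind of prefork sizing the generated test 
configuration ends up with looks roughly like the following; the 
directive names are real, but the values here are examples rather than 
the exact ones the framework computes:

   # Illustrative prefork sizing for the test server: with only 3 workers,
   # one child serving the client request, one serving the proxied backend
   # request and another one lingering in KeepAliveTimeout on a persistent
   # backend connection already exhaust the pool.
   <IfModule mpm_prefork_module>
       StartServers          1
       MinSpareServers       1
       MaxSpareServers       3
       ServerLimit           3
       MaxRequestWorkers     3
   </IfModule>
   KeepAliveTimeout 5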

I added a little headroom to the prefork sizing in r1852306 to 
https://svn.apache.org/repos/asf/perl/Apache-Test/trunk.

The real problem, pthread mutex failures on Solaris with prefork and 
mod_watchdog, is still open.

Regards,

Rainer

Am 27.01.2019 um 14:40 schrieb Rainer Jung:
> Hi all,
> 
> since around 2.4.26 I notice occasional hangs during release testing 
> only on Solaris 10 Sparc when using MPM prefork.
> 
> Those hangs were always observed during proxy testing. The situation was 
> in most cases (all?), that only one running httpd process was left, so 
> it would accept the incoming request but could not proxy them and each 
> request ran into the proxy timeout (I think 1 minute in the test). The 
> process was in fact an httpd child process. Parent process and other 
> children were gone.
> 
> I now narrowed it down some more. I am not sure, this is the only 
> situation, but at least this is one I do observe most often:
> 
> - no problems with event or worker MPM
> 
> - no problems on RHEL and SLES (Linux)
> 
> - even on Solaris, running without mod_watchdog (and therefore without 
> mod_heartbeat, mod_heartmonitor, mod_lbmethod_heartbeat and 
> mod_proxy_hcheck) it seems the hang does not occur.
> 
> - as soon as I enable mod_watchdog only (but not the above modules that 
> would use it), the hangs start to happen every now and then.
> 
> The pattern seems to be (from the error log):
> 
> [Sat Jan 26 21:36:01.855207 2019] [core:trace4] [pid 7850] 
> mpm_common.c(536): mpm child 7853 (gen 0/slot 0) started
> [Sat Jan 26 21:36:01.875826 2019] [mpm_prefork:notice] [pid 7850] 
> AH00163: Apache/2.4.38 (Unix) OpenSSL/1.0.2 configured -- resuming 
> normal operations
> [Sat Jan 26 21:36:01.875975 2019] [mpm_prefork:info] [pid 7850] AH00164: 
> Server built: Jan 18 2019 17:54:50
> [Sat Jan 26 21:36:01.876208 2019] [core:notice] [pid 7850] AH00094: 
> Command line: '/path/to/bin/httpd -d /path/to/t -f 
> /path/to/t/conf/httpd.conf -D APACHE2 -D APACHE2_4'
> [Sat Jan 26 21:36:01.876340 2019] [core:debug] [pid 7850] log.c(1571): 
> AH02639: Using SO_REUSEPORT: no (1)
> [Sat Jan 26 21:36:01.876417 2019] [mpm_prefork:debug] [pid 7850] 
> prefork.c(923): AH00165: Accept mutex: pthread (default: pthread)
> [Sat Jan 26 21:36:01.875753 2019] [core:trace4] [pid 7850] 
> mpm_common.c(536): mpm child 7854 (gen 0/slot 1) started
> ...
> [Sat Jan 26 21:36:18.969742 2019] [core:trace4] [pid 7850] 
> mpm_common.c(536): mpm child 7889 (gen 0/slot 2) started
> ...
> SHOULD BE FINE UNTIL HERE
> ...
> [Sat Jan 26 21:38:29.016961 2019] [mpm_prefork:error] [pid 7850] 
> AH00161: server reached MaxRequestWorkers setting, consider raising the 
> MaxRequestWorkers setting
> ...
> UNCLEAR, WHETHER THAT IS UNEXPECTED, BUT THEN THE REAL ERROR
> ...
> [Sat Jan 26 21:42:48.213005 2019] [mpm_prefork:emerg] [pid 7889] 
> (45)Deadlock situation detected/avoided: AH00144: couldn't grab the 
> accept mutex
> [Sat Jan 26 21:42:48.490884 2019] [mpm_prefork:emerg] [pid 7853] 
> (45)Deadlock situation detected/avoided: AH00144: couldn't grab the 
> accept mutex
> ...
> [Sat Jan 26 21:42:49.129401 2019] [core:alert] [pid 7850] AH00050: Child 
> 7853 returned a Fatal error... Apache is exiting!
> [Sat Jan 26 21:42:49.129519 2019] [:emerg] [pid 7850] AH02818: MPM run 
> failed, exiting
> 
> bin/httpd -V
> Server version: Apache/2.4.38 (Unix)
> Server built:   Jan 18 2019 17:54:50
> Server's Module Magic Number: 20120211:83
> Server loaded:  APR 1.6.5, APR-UTIL 1.6.1
> Compiled using: APR 1.6.5, APR-UTIL 1.6.1
> Architecture:   32-bit
> Server MPM:     prefork
>    threaded:     no
>      forked:     yes (variable process count)
> Server compiled with....
>   -D APR_HAS_SENDFILE
>   -D APR_HAS_MMAP
>   -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
>   -D APR_USE_PROC_PTHREAD_SERIALIZE
>   -D APR_USE_PTHREAD_SERIALIZE
>   -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
>   -D APR_HAS_OTHER_CHILD
>   -D AP_HAVE_RELIABLE_PIPED_LOGS
>   -D DYNAMIC_MODULE_LIMIT=256
>   -D HTTPD_ROOT="..."
>   -D SUEXEC_BIN=".../bin/suexec"
>   -D DEFAULT_PIDLOG="logs/httpd.pid"
>   -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
>   -D DEFAULT_ERRORLOG="logs/error_log"
>   -D AP_TYPES_CONFIG_FILE="conf/mime.types"
>   -D SERVER_CONFIG_FILE="conf/httpd.conf"
> 
> I remember Solaris having a very bad deadlock detection but that goes 
> back to using fcntl locks. Here we are using pthread locks. Since this 
> is APR 1.6, my old observation about pthread timed lock problems on 
> Solaris 10 also does not apply here.
> 
> I wonder what is special about prefork and mod_watchdog.
> 
> In addition, what I find strange is, that the accept mutex error lead to 
> child crashes, but not to a start of new child processes. Instead one of 
> the two child crashes (the second one) seems to be noticed by the parent:
> 
> [core:alert] [pid 7850] AH00050: Child 7853 returned a Fatal error... 
> Apache is exiting!
> [:emerg] [pid 7850] AH02818: MPM run failed, exiting
> 
> and in fact the parent goes away but the third and last child process 
> stays and continues to handle requests. That doesn't seem right.
> 
> I also see complaints about reaching MaxRequestWorkers in the prefork 
> tests on Solaris. I haven't verified it, but it seems they run with a 
> fixed limit of three child processes. On the one hand, I would only 
> expect two of them to be actually used at any time (request plus proxied 
> request), since the tests are not executed in parallel. So the message 
> is a bit strange. On the other hand the limit three (if I am right) 
> seems very limited.
> 
> Any hints, ideas, comments about prefork plus mod_watchdog, the 
> incomplete shutdown after AH00050 plus AH02818, the limit of 3 child 
> processes and the MaxRequestWorkers warning?
> 
> Questions, questions, ... ;)
> 
> Regards,
> 
> Rainer