Posted to users@httpd.apache.org by Matthijs van der Klip <ma...@spill.nl> on 2004/07/08 11:16:14 UTC
[users@httpd] segmentation faults and zombie processes

Hi,

Hoping that someone can shed some light on the following (this has become
quite a long read I'm afraid):


ENVIRONMENT
-----------

I'm one of the administrators of a pool of seven webservers being
loadbalanced by LVS (http://linuxvirtualserver.org).  Linux / Apache /
MySQL / PHP installation on all seven webservers is identical:

- RedHat Linux 8.0 with kernel 2.4.20-28.8smp
- Apache 1.3.31 with patches applied most of which originate from RedHat
  (See below for a list of patches, and configure script)
- MySQL 4.0.20 client installed from RPMs from http://mysql.com
  (shared library linked to PHP module)
- PHP 4.3.7 without patches applied compiled as a module
  (See below for configure script)


Basically this consists of a customized RedHat 7.3 Apache 1.3.27
installation ported to RedHat 8.0 and upgraded to Apache 1.3.31.

At peak times (15:00 - 22:00) up to 100 (mostly PHP) requests per second 
per server are handled. Server hardware mostly consists of Dual Xeon 
2.4GHz machines which handle this kind of load with around 80% idle time 
available. Memory consumption is close to ideal: all 2GB of RAM are in use 
and no swapping is done.


RedHat supplied Apache patches (in order applied):

http://www.vdklip.nl/apache/apache_1.3.29-config.patch
http://www.vdklip.nl/apache/apache_1.3.31-apxs.patch
http://www.vdklip.nl/apache/apache_1.3.14-mkstemp.patch
http://www.vdklip.nl/apache/apache_1.3.14-redhat.patch
http://www.vdklip.nl/apache/apache_1.3.20-apachectl-init.patch
http://www.vdklip.nl/apache/apache_1.3.23-dbmdb.patch
http://www.vdklip.nl/apache/apache_1.3.27-db.patch
http://www.vdklip.nl/apache/apache_1.3.22-CAN-2003-0020.patch


Custom Apache patches (applied after RedHat patches):

http://www.vdklip.nl/apache/hard_server_limit.patch
http://www.vdklip.nl/apache/apachectl_status.patch


Apache configure script:

http://www.vdklip.nl/apache/apache.configure.txt


PHP configure script:

http://www.vdklip.nl/apache/php.configure.txt


PROBLEMS
--------

Our servers crash more often than is desirable. This does not cause any
major interruption to our websites because we're loadbalancing, but it is
a pain in the ass for us administrators because we have not been able to
find the cause of these crashes. The servers just sit there: they still
respond to ping, but no login is possible either through SSH or on the
console. It is what I would call a 'userland' crash. We suspect that
somehow no more processes can be created. Sometimes a server crashes
multiple times in one day, at other times it runs for months
uninterrupted. All seven servers have experienced this kind of crash at
some point though. After rebooting, nothing is to be found in the logs.

Recently we experienced some issues with Apache that point in the 
direction of the crashes described above. We found out some Apache 
processes end in a segmentation fault:

[Wed Jul  7 16:50:55 2004] [notice] child pid 3848 exit signal Segmentation fault (11)


Restarting Apache (the hard way; doing a full stop and a clean start)
seemed to get rid of these segmentation faults for a while. While
examining where these segmentation faults originated, we decided to add a
special case to our Apache monitoring script in the meantime. This
monitoring script runs every five minutes and checks the state of the
Apache server. To these checks we added that whenever a segmentation fault
is found in the last 10 lines of the main error log (which is separate
from our virtualhost error logs), Apache is restarted (again: the hard
way).
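
For illustration only, a minimal sketch in Python of a check like the one
described above. This is not the actual monitoring script: the function
name, the regular expression and the parameters are all hypothetical. The
five-minute window is an added assumption that is not part of the check as
described; it is meant to show one way of keeping an old log entry from
triggering a restart on every later run.

```python
import re
from datetime import datetime, timedelta

# Matches the Apache 1.3 error-log prefix, e.g. "[Wed Jul  7 16:50:55 2004]".
STAMP = re.compile(r"^\[(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4})\]")


def recent_segfault(lines, now, tail=10, window=timedelta(minutes=5)):
    """Return True if one of the last `tail` log lines reports a
    segmentation fault that was logged within `window` before `now`."""
    for line in lines[-tail:]:
        if "Segmentation fault" not in line:
            continue
        match = STAMP.match(line)
        if match is None:
            continue  # no parsable timestamp, ignore the line
        when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        if timedelta(0) <= now - when <= window:
            return True
    return False
```

A caller would read the main error log into a list of lines, pass
datetime.now() as `now`, and only restart Apache when the function returns
True.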

On several occasions though this check (or rather its acting on finding a
segmentation fault) has caused a server crash as described above. Order of
events:

1) Apache child process exits with segmentation fault. This is being 
   logged in the main error log:

[Wed Jul  7 16:50:55 2004] [notice] child pid 3848 exit signal Segmentation fault (11)


2) Our monitoring script is run by cron and finds the segmentation fault
   in the log. It then decides to restart the Apache server by issuing a
   '/etc/rc.d/init.d/httpd restart', which sends a SIGTERM to the main
   httpd process and tries to start a new one.

3) Apache receives a SIGTERM signal and tries to shut down, but some child 
   processes are not willing to cooperate:

[Wed Jul  7 16:54:02 2004] [notice] caught SIGTERM, shutting down
[Wed Jul  7 16:59:02 2004] [warn] child process 4634 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4639 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4727 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4740 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4796 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4823 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4855 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4861 still did not exit, sending a SIGTERM
[Wed Jul  7 16:59:02 2004] [warn] child process 4883 still did not exit, sending a SIGTERM

   I'm not sure how to explain the five-minute gap between receipt of
   the main SIGTERM signal and Apache warning that its child processes
   did not exit. Could it be that Apache waits exactly five minutes?
   Alternatively it could be a second run of our monitoring script, but
   that seems unlikely: we have not received output from such a second
   run, and there is no second '[notice] caught SIGTERM, shutting down'
   message.

4) In the meantime the monitoring script tries to start a new Apache 
   instance:

[Wed Jul  7 16:54:03 2004] [notice] Apache/1.3.31 (Unix)  (Red-Hat/Linux) configured -- resuming normal operations
[Wed Jul  7 16:54:03 2004] [notice] Accept mutex: sysvsem (Default: sysvsem)

   It seems to succeed in this, although I have seen other occasions
   where this ends in an 'Address already in use' message, meaning the
   'old' Apache process had not exited before the 'new' one was started.

5) Something (Apache?) goes berserk and the server becomes very
   unresponsive. This is observed by the loadbalancer, which starts
   removing and re-adding this server because it sometimes responds to
   requests and sometimes does not.

6) Some time later (around 17:30) this is noticed by an operator, who
   informs me. About half an hour later I try to log in but the server
   does not respond. Somewhat earlier the loadbalancer stopped re-adding
   the server too, so it seems it has crashed.

7) I cycle power on the system and it becomes available again:

[Wed Jul  7 18:24:07 2004] [notice] Apache/1.3.31 (Unix)  (Red-Hat/Linux) configured -- resuming normal operations
[Wed Jul  7 18:24:07 2004] [notice] Accept mutex: sysvsem (Default: sysvsem)

8) Logging in again I monitor the server for a while and I begin noticing 
   some load spikes which coincide with some 'httpd <defunct>' processes 
   (zombies). They disappear after some seconds but do increase the load 
   noticeably.
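
To make the zombie observation in step 8 easier to quantify, here is a
minimal sketch in Python that picks defunct processes out of the text
produced by 'ps -eo pid,stat,comm'. It is an illustration only and not
something we currently run; the function name and the sample PIDs in the
usage note are hypothetical. It assumes the procps output format, in which
the STAT column of a zombie starts with 'Z'.

```python
def zombie_pids(ps_output, name="httpd"):
    """Return the PIDs of zombie processes called `name`, given the text
    output of 'ps -eo pid,stat,comm' (first line is the header)."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)  # pid, stat, rest of the line
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        # A state starting with 'Z' marks a defunct (zombie) process; the
        # command column then reads e.g. "httpd <defunct>".
        if stat.startswith("Z") and comm.split()[0] == name:
            pids.append(int(pid))
    return pids
```

Given a header line followed by the lines ' 4634 Z    httpd <defunct>' and
' 4639 S    httpd', the function returns only PID 4634. Running it once a
second and logging the count would show whether the zombies really
coincide with the load spikes.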


At this point I'm open to any suggestions. Specifically I'm interested in:

1) Is my current Apache setup any good? Meaning:

   a) Do the RedHat patches applied to the Apache 1.3.31 distribution make
      any sense? As far as I can tell from inspecting them they do, but
      I'd like to hear from the experts.

   b) It seems unlikely, but is there anything wrong with my 
      configuration?:

      http://www.vdklip.nl/apache/httpd.conf

      Virtualhosts are included but only contain 'VirtualHost *' sections.

2) I know a full stop and start isn't the recommended way of restarting 
   Apache, but it seems the only way when dealing with an unstable Apache 
   server (causing segmentation faults). Isn't it?

3) Is it normal to experience zombie httpd processes or shouldn't they 
   appear at all on a properly configured Apache server?

4) Is it normal to have some child processes which cannot be terminated by 
   Apache on the first occasion ('child process still did not exit' 
   warnings)?

5) It seems most likely (to me) that the main cause of all this lies
   within my PHP setup and not my Apache setup. I have seen some bug
   reports which suggest compiling PHP with the '--enable-sigchild'
   option:

   http://bugs.php.net/bug.php?id=6805

   Does anyone on this list have any experience with this? I cannot find 
   (google) any information on this option which tells me exactly what it 
   does...

6) We're in the process of upgrading from RedHat 8.0 to Fedora Core 2,
   mainly because of the inclusion of the 2.6 kernel. Anyone out there
   running Apache 1.3 / PHP 4 on FC2 on a large scale? Maybe we could
   exchange some tips...


As said earlier, any help is appreciated. I hope this is a comprehensive
report and I did not leave anything out.


Best regards,

-- 
Matthijs van der Klip
System Administrator
Spill E-Projects
The Netherlands










