Posted to apache-bugdb@apache.org by "David J. MacKenzie" <dj...@web.us.uu.net> on 2000/11/21 05:20:04 UTC

Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout

The following reply was made to PR os-solaris/1190; it has been noted by GNATS.

From: "David J. MacKenzie" <dj...@web.us.uu.net>
To: apbugs@apache.org
Cc: djm@uu.net, rse@engelschall.com
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Mon, 20 Nov 2000 23:16:39 -0500 (EST)

 We have just started experiencing what seems to be the same problem
 as http://bugs.apache.org/index.cgi/full/1190
 which was reported by a Solaris 2.5.1 user in 1998 and never resolved.
 That person was also using mod_ssl and PHP, which seems to be relevant.
 Also http://bugs.apache.org/index.cgi/full/6211 may be related,
 though today I applied the patch in that PR to no apparent effect.
 
 We are using the newest versions of (almost) everything, on BSDI
 BSD/OS 4.0.1.  I have some additional data which should be helpful.
 In short, the finger *seems* to point at mod_ssl as the culprit,
 though I haven't looked at the code to see how that might be plausible.
 
 A week ago UUNET upgraded our server farm of about 800 servers, of which
 a few dozen have SSL, from apache 1.3.12 (for most servers) or
 Stronghold 2.4.2 (for those that have SSL).  They are now running:
 
 apache 1.3.14, with two patches from bugs.apache.org to fix corrupting
  PDF files and mod_rewrite maps (the Bugtraq patch)
 mod_ssl 2.7.1
 OpenSSL 0.9.5a
 PHP 4.0.3pl1
 mod_auth_kerb configured for Kerberos v5
 
 All modules except http_core and mod_so are loaded as DSOs.  All of
 the servers are using the same apache binary and DSOs, compiled with
 EAPI, but we only LoadModule mod_ssl for those servers that have SSL
 keys and certs.  We're not using Java or Perl modules, or anything
 that multithreads.  The BSD/OS pthreads are user-space anyway.
 
 root@enniskillen 39 $ ldd /usr/local/libexec/apache
         libkrb5.so => /usr/local/krb5/lib/libkrb5.so (0xc054000)
         libk5crypto.so => /usr/local/krb5/lib/libk5crypto.so (0xc0b4000)
         libmm.so.11 => /usr/local/lib/libmm.so.11 (0xc0ce000)
         libdl.so => /shlib/libdl.so (0xc0d2000)
         libgcc.so => /shlib/libgcc.so (0xc0d5000)
         libc.so => /shlib/libc.so (0xc0d8000)
         libcom_err.so => /usr/local/krb5/lib/libcom_err.so (0xc15b000)
 
 Our new apache+mod_ssl installation is not always handling HTTP
 Keepalive correctly.  It's configured to keep connections alive for 5
 seconds, but it's not letting some of them go.  We see the same
 behavior described in PR 1190, in which over the course of a few hours
 gradually most of the process slots become filled with Keepalive
 connections that are much older than is supposed to be allowed.
 Eventually our monitoring systems start alerting that they can't
 connect to the servers.  Some of the old connections eventually go
 away on their own, perhaps those from dialup lines; I'm not sure.
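
 One way to check from the client side whether a server closes an idle
 keepalive connection on schedule is to send a single request and time
 how long the socket stays open afterwards.  The following is only a
 minimal Python sketch: the function name and its parameters are
 hypothetical, and it models a simple idle client, not the browsers
 described in this report.

```python
import socket
import time


def seconds_until_server_closes(host, port, path="/", max_wait=30.0):
    """Send one keep-alive request, then time how long the idle
    connection stays open.  Returns the idle time in seconds, or None
    if the server has not closed the connection within max_wait."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=max_wait) as sock:
        sock.sendall(request)
        last_activity = time.monotonic()
        while True:
            remaining = max_wait - (time.monotonic() - last_activity)
            if remaining <= 0:
                return None
            sock.settimeout(remaining)
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                return None
            if not chunk:
                # An empty read means the server closed its end.
                return time.monotonic() - last_activity
            # Still receiving the response; restart the idle clock.
            last_activity = time.monotonic()
```

 With a 5 second keepalive timeout one would expect a result of roughly
 5.  A result of None with a much larger max_wait would be consistent
 with the behavior described above, though it would not by itself
 identify the cause.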
 
 I sampled the mod_status pages of several of our customers, loading
 the page, waiting 30 seconds or more, and loading it again in a second
 window, and comparing the lists.  I looked for which child processes
 had connections in the Keepalive state, and checked whether the amount
 of data transferred had changed.
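
 The comparison described above can be sketched in Python.  This is
 only an illustration under stated assumptions: the two status pages
 are assumed to have already been parsed into records, and the Slot
 type, its field names, and the sample values are hypothetical.  The
 only detail taken from mod_status itself is that "K" in the mode
 column marks a child in the Keepalive state.

```python
from typing import List, NamedTuple


class Slot(NamedTuple):
    pid: int        # child process ID
    mode: str       # mod_status mode column; "K" means Keepalive
    client: str     # client IP address
    kbytes: float   # data transferred, as shown on the status page


def stuck_keepalive_slots(before: List[Slot], after: List[Slot]) -> List[Slot]:
    """Return children that are in Keepalive in both samples, for the
    same client, with no change in the amount of data transferred."""
    earlier = {slot.pid: slot for slot in before}
    stuck = []
    for slot in after:
        prev = earlier.get(slot.pid)
        if (
            prev is not None
            and prev.mode == "K"
            and slot.mode == "K"
            and prev.client == slot.client
            and prev.kbytes == slot.kbytes
        ):
            stuck.append(slot)
    return stuck


# Made-up sample data: two snapshots taken 30 or more seconds apart.
first = [
    Slot(101, "K", "192.0.2.10", 12.0),
    Slot(102, "W", "192.0.2.20", 3.0),
    Slot(103, "K", "192.0.2.30", 1.0),
]
second = [
    Slot(101, "K", "192.0.2.10", 12.0),  # unchanged: suspicious
    Slot(102, "K", "192.0.2.20", 4.0),   # was busy, now idle: normal
    Slot(103, "K", "192.0.2.30", 2.5),   # served more data: normal
]
suspects = stuck_keepalive_slots(first, second)
```

 A child that shows up here is only a candidate; with a 5 second
 timeout, any slot that stays in Keepalive across samples taken 30
 seconds apart without transferring data is worth a closer look.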
 
 The random sample of about a dozen non-SSL customers I checked all
 looked normal.  Some of the customers I checked who have SSL showed
 the problem.  For example, one server got a few http (not https)
 requests at 7:29 this morning from IP address 212.250.100.120, and
 none since.  12 hours later, the TCP connection is still open, and
 taking up 3 apache process slots in the Keepalive state.  The browser
 is "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)".
 
 Another server shows the same sort of problem, with activity at 1:13
 this afternoon from 192.44.136.113 which lasted 3 seconds, yet the TCP
 connections are still open:
 
 root@platform-33: netstat -an | grep 192.44.136.113
 tcp        0      0  208.240.90.209.80      192.44.136.113.39653   ESTABLISHED
 tcp        0      0  208.240.90.209.80      192.44.136.113.39650   ESTABLISHED
 tcp        0      0  208.240.90.209.80      192.44.136.113.39598   ESTABLISHED
 
 Their mod_status page confirms that 3 child processes are still in the
 Keepalive state for this IP address.  The browser is
 "Mozilla/4.5 [en] (Win98; I)".  That address is pingable:
 
 root@platform-31: ping 192.44.136.113
 PING 192.44.136.113 (192.44.136.113): 56 data bytes
 64 bytes from 192.44.136.113: icmp_seq=0 ttl=246 time=23.961 ms
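
 The netstat check above can be repeated across many clients by
 counting ESTABLISHED connections per remote address.  A minimal Python
 sketch follows; the function name is hypothetical, and it assumes the
 BSD-style "address.port" column layout shown in the netstat output
 above.

```python
from collections import Counter


def established_by_remote_ip(netstat_output: str) -> Counter:
    """Count ESTABLISHED TCP connections per remote IP address in
    BSD-style `netstat -an` output (addresses written as ip.port)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        if not fields[0].startswith("tcp") or fields[5] != "ESTABLISHED":
            continue
        # The remote address is the fifth column; the port follows
        # the last dot.
        remote_ip, _, _port = fields[4].rpartition(".")
        counts[remote_ip] += 1
    return counts


# The three lines from the netstat output quoted above.
sample = """\
tcp        0      0  208.240.90.209.80      192.44.136.113.39653   ESTABLISHED
tcp        0      0  208.240.90.209.80      192.44.136.113.39650   ESTABLISHED
tcp        0      0  208.240.90.209.80      192.44.136.113.39598   ESTABLISHED
"""
per_client = established_by_remote_ip(sample)
```

 Addresses holding several connections long after their last request
 are the ones to compare against the status page.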
 
 So the problem doesn't seem to depend on the browser (Netscape or
 MSIE).  I've seen it with clients on Windows 95/98 (mainly) and MacOS,
 and I think on NT.
 
 Most or all of the requests involved have been for static content.
 The affected servers aren't using PHP.
 
 Some of our SSL servers aren't showing the problem, but they are doing
 little volume.  Late this afternoon I temporarily turned Keepalive off
 for the two worst-affected servers, which keep failing our monitoring
 because all of their child processes are in use.  They went from 40-60
 child processes in use simultaneously to 2-13, though this wasn't
 during the busiest part of the day.
 
 I also found this comment on Slashdot from a year ago,
 at http://slashdot.org/apache/99/12/22/1711203.shtml:
 
            I've tried both, and while admittedly mod_ssl looks cleaner,
            is easier to set up, and is updated more frequently, we mad
            several problems with Microsoft and AOL clients connecting
            via SSL.  All of these problems went away once we moved
            over to Apache-SSL. We tried fiddling with the keepalive
            and "unclean shutdown" settings to no avail with mod_ssl
            but it didn't seem to do any good.
 
 I haven't tried Apache-SSL yet.